Array Database Assessment Recommendations

  Array Database Assessment Working Group

Recommendation Title: Array Databases: Concepts, Standards, Implementations

DOI: dx.doi.org/10.15497/RDA00024

Authors: Peter Baumann (1), Dimitar Misev (1), Vlad Merticariu (1), Bang Pham Huu (1), Brennan Bell (1), Kwo-Sen Kuo (2)

(1) Jacobs University, Large-Scale Scientific Information Systems Research Group, Bremen, Germany

(2) Bayesics, LLC / NASA, USA

Contributors: RDA Array Database Assessment Working Group members

Executive Summary 

Multi-dimensional arrays (also known as raster data or gridded data) play a core role in many, if not all, science and engineering domains, where they typically represent spatio-temporal sensor, image, simulation output, or statistics “datacubes”. However, as classic database technology does not support arrays adequately, such data today are maintained mostly in silo solutions, with architectures that tend to erode and have difficulties keeping up with the increasing requirements on service quality.

Array Database systems attempt to close this gap by providing declarative query support for flexible ad-hoc analytics on large n-D arrays, similar to what SQL offers on set-oriented data, XQuery on hierarchical data, and SPARQL or Cypher on graph data. Today, Petascale Array Database installations exist, employing massive parallelism and distributed processing. Hence, questions arise about the technology and standards available, usability, and overall maturity.

To elicit the state of the art in Array Databases, Research Data Alliance (RDA) has established the Array Database Assessment Working Group (ADA:WG) as a spin-off from the Big Data Interest Group. Between September 2016 and March 2018, the ADA:WG has established an introduction to Array Database technology, a comparison of Array Database systems and related technology, a list of pertinent standards with tutorials, and comparative benchmarks to essentially answer the question: how can data scientists and engineers benefit from Array Database technology?

Investigation of altogether 19 systems shows that there is a lively ecosystem of technology with increasing uptake, and proven array analytics standards are in place. Tools, though, vary greatly in functionality, performance, and maturity. On one end of the spectrum we find Petascale-proven systems which parallelize across 1,000+ cloud nodes; on the other end, some systems appear as lab prototypes which still have to find their way into large-scale practice. In comparison to other array services (MapReduce-type systems, command line tools, libraries, etc.) Array Databases can excel in aspects like service friendliness to both users and administrators, standards adherence, and often performance. As it turns out, Array Databases can offer significant advantages in terms of flexibility, functionality, extensibility, as well as performance and scalability – in total, their approach of offering analysis-ready “datacubes” heralds a new level of service quality. Consequently, they have to be considered as a serious option for “Big DataCube” services in science, engineering and beyond.

The outcome of this investigation, a unique compilation and in-depth analysis of the state of the art in Array Databases, is supposed to provide beneficial insight for both technologists and decision makers considering “Big Array Data” services in both academic and industrial environments.

Review period: Tuesday, 20 March, 2018 to Friday, 20 April, 2018
  • Author: Rainer Stotzka

    Date: 12 Apr, 2018

    (The thoughts I am describing in this comment reflect my personal opinions as a RDA member.)

    Type of output

    Dear members of the RDA WG Array Database Assessment,

    Thank you very much for your report on Array Databases. Array databases, their technologies and implementations are an important topic for RDA and data sharing.

    The report describes the state-of-the-art and compares various systems systematically from various perspectives. It shows a snapshot of the current situation and concludes with the need for further research.

    I am not an expert in array databases, but I have the feeling that the report depicts a scientific study which I would recommend as excellent reading material for newcomers in this field. Considering the types of RDA outputs (https://rd-alliance.org/recommendations-outputs) I would label the report as a “supporting output” or “other output” rather than an “RDA recommendation”.

  • Author: Rainer Stotzka

    Date: 13 Apr, 2018

    (The thoughts I am describing in this comment reflect my personal opinions as a RDA member.)

    Consensus

    The research field of array databases is very narrow, making it hard to bring together the expertise from various continents in RDA. This was also reflected in the very low participation in the last plenary meetings and the email communications of the WG.

    The authors of the report consist of five researchers from Jacobs University and one from Bayesics, LLC / NASA. It seems that at least three authors are not RDA members at all.

    To my knowledge, we don’t have a clear definition in RDA of how and when a consensus is reached that is sufficient for a balanced RDA output.

    I would feel more comfortable if the report listed a couple more authors from other locations who ideally have also contributed to the development of a variety of array database systems.

  • Author: Lesley Wyborn

    Date: 17 Apr, 2018

    I have read the Array Database Assessment Working Group final report. It is a very good summary of the current state of play in Big Data analytical systems and will provide reference material to anyone new to the field and even to more experienced people. It notes that there is a lively ecosystem of technologies available, and with 19 systems covered it is one of the more comprehensive reviews that are available.

    However, I feel that there are some issues that need to be clarified in this report, as follows:

    1. Is this final report an RDA State of the Art report or a recommendation? Given that there are no actual recommendations specified in either the Executive Summary or in the Summary in Section 8 on page 68, I have trouble seeing how this report can be endorsed as an RDA ‘recommendation’. I cannot see what RDA would be recommending in this report. It is much more a report on the State of the Art of Array Database systems. This view is endorsed by the statement in the final paragraph on page 69 which quotes “no one size meets all”.
    2. I note that Section 6 – Publicly Accessible Array Services (pp 31-32) lists a ‘selection of publicly accessible services (In RDA terminology: adopters)’ – but what have they adopted as part of this RDA report? Each organisation on pages 31-32 is listed as running rasdaman, but I know that several of these sites also support other systems listed in this report. For example, NCI supports rasdaman enterprise, OPeNDAP and the Open Data Cube, but the report has NCI listed as only having rasdaman. Further, the majority of the organisations listed in Section 6 are also participants in the EarthServer 2 H2020 project and, as I understand it, they installed rasdaman enterprise as part of that project. Does this then make them adopters of an RDA Recommendation? By this logic anyone who installs rasdaman (or any of the other 18 Array Database Systems reviewed) could be interpreted as an ‘adopter’ of this RDA ‘recommendation’.
    3. As noted on page 23, there are two editions of rasdaman: the open source rasdaman community edition (www.rasdaman.org) and the proprietary rasdaman enterprise edition (www.rasdaman.com). Given that rasdaman enterprise has differing functionalities from the open source rasdaman community edition, it would be useful if the terms rasdaman enterprise vs rasdaman open source were used consistently in the report to differentiate between the two. For example, it is not explicitly clear which edition of rasdaman has been used in the assessment tables in pp 37-59, nor in the Performance Comparison section in pp 62-67.
    4. I have concerns about the level of participation in the development of, and in reaching consensus on, the report from two perspectives: firstly, within the RDA community itself, and secondly, with the developers/experts of each of the 18 systems other than rasdaman reviewed in this report.
    5. Within the RDA community, and specifically in the Array Database Assessment Working Group, it was not widely known that this report had been completed and was about to be submitted to council for endorsement. For example, on the Array Database Assessment Working Group mailing list archive there is no evidence that the report was emailed to the group prior to its submission to council (https://www.rd-alliance.org/node/50325/archive-post-mailinglist). Although there was a session on the Array Database Assessment Working Group at RDA Plenary 11 in Berlin where the report was released (https://www.rd-alliance.org/wg-array-database-assessment-rda-11th-plenary-meeting), I believe that fewer than 10 people attended this session. Also, given that the Array Database Assessment Working Group is a spin-off from the Big Data Interest Group, I was surprised it was not publicised there, as there could be additional expertise in the Big Data Interest Group capable of contributing to this report. Further, as noted on the Array Database Assessment Working Group wiki, “the wiki pages have been copied into an MS-Word document to produce the final PDF formatted result, and the wiki pages below are obsoleted.” This made it very hard for members of the Working Group who were unable to attend RDA Berlin to contribute to the report prior to its submission to council for review.
    6. Externally to RDA, were groups such as the Open Data Cube, Ophidia, SciDB, etc. given time to review not only the references on which the assessment is based, to ensure that the most appropriate references were being used for the assessment, but also the summary tables, performance comparison and benchmarks in the report? If this was not done, then I would recommend that, before this report is distributed under the RDA banner, each be given the opportunity to validate what has been written on their system: some are also members of RDA, so this should not be too onerous.
    7. In the Performance Comparison section (pp 62-67) the details of how the benchmarking was undertaken are not clear. It states on page 62 that “the Benchmark code is available as part of the Rasdaman source code at www.rasdaman.org”, but this was not easy to find. If this benchmarking was done as part of an RDA project, I would expect the code to be more accessible and not part of the rasdaman source code. I would also expect it to be more readily available so that the benchmarks could be independently tested, replicated, and reapplied as more systems come on board.
    8. Also in the Performance Comparison section, only 4 of the 19 systems were included in the benchmarking: these were chosen as ‘representative’. (Note that Section 7.6.1, under systems tested (page 62), says that “three systems have been measured”, but it then lists four: rasdaman, SciDB, PostGIS Raster and the Open Data Cube.) It does not explicitly specify which version of rasdaman (enterprise or open) was used in the benchmarking. It also states that “These represent three (sic) Array DBMSs with different implementation paradigms; hence, the choice can be considered representative for the field. Open Data Cue (sic) was chosen as a representative of array tools based on scripting languages. Not present are MapReduce - type systems, due to resource constraints – this is left for future investigation”. This means that the graph shown in Figure 8 on page 66 is also only representative and not necessarily a definitive benchmark of all 19 systems, and that more work needs to be done to complete the performance comparison.

    In view of the issues raised (some of which have also been raised by Rainer), I feel that this report needs more revision and more exposure to the RDA Array Database Assessment Working Group, as well as to the groups whose systems have been reviewed in this report. In addition, there are typographic errors that need to be addressed.

     

  • Author: Peter Baumann

    Date: 17 Apr, 2018

    Dear all,

    thank you for your detailed feedback, which allows me to respond to several items (disclaimer: I am only speaking for myself here). As I need to phase this into a full agenda I will do it piecemeal and - apologies - likely with some time delay. So this post is the first of a series, thereby trying to disentangle the discussion.

    First, consensus: as stated by Rainer, rules about consensus seem to be nonexistent in RDA at this time. That's fine; building up such an organization is always a stepwise process, as I know from my own experience. Hence, this is maybe a good occasion to initiate a discussion so that rules can be agreed for the future to close this gap.

    However, it should be a matter of fairness not to apply rules in retrospect - first let people invest substantial work for 1.5 years, watch closely, and at submission time tell them "that's not what we want".

    RDA is very much carried by volunteers, and this precious resource should not be wasted.

    cheers,

    Peter

  • Author: Peter Baumann

    Date: 17 Apr, 2018

    Back again. It turns out that this was our fault, and I feel that, as coordinating author, I am to blame in the first place: findings should have been phrased as recommendations, syntactically. What the report should express is, in a quick & likely dirty shot:

    1 - For services on massive multi-dimensional arrays ("datacubes"), it is recommended to use array databases - they have proven mature and scalable to Petabytes, and further offer the advantage of "any query, any time" flexibility through their query languages.

    2 - For the decision on a particular system, various aspects are relevant, including functionality, standards conformance, flexibility, scalability, performance. It is recommended to make a weighted decision based on the information provided in this report, rather than looking at any one criterion in isolation.

    3 - As tuning can make a significant difference in performance, it is recommended to use the tuning parameters of array databases, based on the listing for the systems in this report, together with the further literature referenced.

    4 - Due to the remarkable variety of datacube interfaces found it is recommended to base services on open standards so as to avoid vendor lock-in.

    5 - Array services are trending under the keyword "datacubes", hence the landscape of tools is developing quickly. It is recommended to continuously watch it, and also to extend the benchmarks which, due to resource reasons, necessarily could not cover all tools.
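
    The weighted decision mentioned in item 2 could be sketched roughly as follows. This is only an illustration: the criteria are the ones named in item 2, while the weights, the candidate names (system_a, system_b) and all scores are hypothetical placeholders, not findings of the report.

```python
# Hypothetical weighted decision matrix. All weights and scores are
# placeholder values for illustration, NOT results from the report.
from typing import Dict, List

# Criteria named in item 2; the weights are an assumption and should
# reflect the priorities of the service being planned.
WEIGHTS: Dict[str, float] = {
    "functionality": 0.25,
    "standards_conformance": 0.20,
    "flexibility": 0.15,
    "scalability": 0.20,
    "performance": 0.20,
}

# Placeholder candidates with made-up scores on a 1 (poor) to 5 (good) scale.
CANDIDATES: Dict[str, Dict[str, float]] = {
    "system_a": {"functionality": 4, "standards_conformance": 5,
                 "flexibility": 3, "scalability": 4, "performance": 3},
    "system_b": {"functionality": 3, "standards_conformance": 2,
                 "flexibility": 4, "scalability": 5, "performance": 5},
}


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return the weight-normalised sum of the per-criterion scores."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[c] * scores[c] for c in weights) / total_weight


def rank(candidates: Dict[str, Dict[str, float]],
         weights: Dict[str, float]) -> List[str]:
    """Return candidate names ordered from highest to lowest weighted score."""
    return sorted(candidates,
                  key=lambda name: weighted_score(candidates[name], weights),
                  reverse=True)


if __name__ == "__main__":
    for name in rank(CANDIDATES, WEIGHTS):
        print(name, round(weighted_score(CANDIDATES[name], WEIGHTS), 2))
```

    With these placeholder numbers, system_b wins on scalability and performance alone, yet system_a ranks first overall once all five criteria are weighted, which is the point of not looking at one criterion in isolation.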

    best,

    Peter

  • Author: Peter Baumann

    Date: 17 Apr, 2018

    Lesley, concerning adoption: given that the core question, as per Charter, was "can Array Databases be used?", an adoption obviously means: Array Databases are used. This is what the report collects. Of course there are zillions of services with whatever solution, but that was out of scope as per Charter.

    What's your point against research projects using Array Databases? I guess in Research Data Alliance we mainly rely on those in our work.

    You write "By this logic anyone who installs rasdaman (or any of the other 18 Array Database Systems reviewed) could be interpreted as an ‘adopter’ of this RDA ‘recommendation’." Absolutely! Any large-scale installation of any Array DBMS, in conjunction with this report, is a proof of concept for usability. Again, see the Charter, where this has been stated clearly: to seek real-life, large-scale installations.

    -Peter

     

     

  • Author: Peter Baumann

    Date: 17 Apr, 2018

    Participation:

    The Charter was published widely. We had plenaries with open discussions. We have 40 members who have subscribed willingly, thereby expressing interest. So there was ample opportunity to contribute and/or review. Of course, there is always more that can be done, but (i) volunteer resources are limited, unfortunately, and (ii) I had the belief that RDA itself would spread the word - which did not happen, as I learnt only later.

    Those complaining about missing participation I would cordially invite to implement it giving a shining example - engagement is the fuel of RDA. A few of us have taken action, and IMHO my co-contributors deserve that their work gets acknowledged by both activists and spectators.

    So next time let's get all hands in for a joint endeavour!

    -Peter

  • Author: Peter Baumann

    Date: 17 Apr, 2018

    Support. RDA claims to offer an environment supportive to scientists. Unfortunately, this is not always the case to the extent desirable. Examples include:

    - the Wiki is not configured correctly, which makes collaboration difficult. And we tentatively used the wiki until the very last phase of translation to Word/PDF, despite these difficulties. My various requests to the maintainers remained unheard, unfortunately.

    - I was submitting the report in the confidence that RDA would take all necessary steps, including informing relevant audiences. As I learnt yesterday this has not been done.

    Of course, we can do it all ourselves - in theory. In practice, we have resource constraints. Now I will send out an email to those people having expressed interest by joining this WG plus the Big Data IG. But it is not entirely satisfying that RDA misses important tasks and we get credited for that.

    -Peter

  • Author: Peter Baumann

    Date: 17 Apr, 2018

    Lesley, you write: Further, as noted on the Array Database Assessment Working Group WIKI, “the wiki pages have been copied into an MS-Word document to produce the final PDF formatted result, and the wiki pages below are obsoleted.” This makes it very hard for members of the Working Group who were unable to attend RDA Berlin to have been able to contribute to the report, prior to its submission to council for review.

    If you read on you find the report uploaded and accessible on that page, so I fail to see how someone cannot contribute. Further, the Wiki was available for 1.5 years (!) for contribution - we tentatively did it the hard way, through a misconfigured Wiki, to be open for any and all contributions.

    -Peter

  • Author: Peter Baumann

    Date: 17 Apr, 2018

    Lesley, you observe that we wrote about 3 systems where it was 4. Indeed, this is a mistake (in fact, a coordination issue); I take responsibility and will fix it. To be exact: 3 Array DBMSs (rasdaman, PostGIS Raster, SciDB) + 1 related tool (Open Data Cube, not an Array DBMS) = 4 systems have been benchmarked.

    -Peter

     

  • Author: Ben Evans

    Date: 18 Apr, 2018

    A few comments on the document.

    The document is quite useful and an interesting read due to some detailed work to capture an interesting survey perspective on a class of datacube-style systems. I use the word Survey because I think it's a better description at the moment than a recommendation. The document asserts that Arrays are motivated to be the solution to a wide range of problems. But it's not clear that Arrays are equal to solving the scientific problems. It's hard to be definitive about this, since the other software makers would need to have commented on their approach.

    Even though the technologies reviewed have some similarities, it comes through that it's not a uniform landscape. Unfortunately it's not so clear whether, or how broadly, any independent client software is using these array standards as an interface. It could be because there is no well-known client software taking this approach, though some may be doing bits. I also can't see the case for interoperability based on standards without it.

    I also suggest that the benchmark results are interesting, but it is not easy to be convinced by them. This area is *hard* work, and I think it's really beyond what should be expected of this document. However, I think this should be re-cast to be just a proposal saying "here is a first go at a test methodology" for Arrays, and then the document could go on to describe that better. I don't easily see the relationship to Big Data problems based on the results, so it's not as interesting as it first seems. The results themselves, especially those trying to compare all the different solutions, unfortunately can't be easily cited without more work and resolving some of the ambiguities.

    Anyway, I would like to see some way that the document is actually resolved to be something without needing wholesale rewrites. It's a substantial contribution and effort to bring this to light, and it helps to more clearly ask questions about the nature of the various datacube approaches being used and perhaps where they are going.

     

  • Author: Simon Cox

    Date: 19 Apr, 2018

    Peter - 

    There is no question that engagement was enabled, to the extent that the RDA infrastructure allowed. While the absence of other contributions might be taken to signify consent, it might also show lack of time or interest, and definitely does not satisfy realistic expectations as evidence of consensus. I agree that RDA's procedures do not provide an explicit threshold or mechanism to demonstrate consensus. But a report where 5/6 of the authors are from one research team, and the one who isn't is tagged on as the last author, does not make a compelling case.

    I strongly agree with the other commenters that this is a significant piece of work, and should be published. But not as a 'Recommendation'. 

  • Author: David Gavin

    Date: 19 Apr, 2018

    Hello, I am the technical lead for Digital Earth Australia, a principal contributor to the Open Data Cube (ODC) software and initiative. First off, I would like to thank you for your work within this working group in raising awareness of Array Database concepts and for the inclusion of ODC within your study. As your paper indicates, ODC is not an Array Database, but a Python-based scripting interface which maps user queries made via its API onto the respective datasets residing on file systems with the help of a relational database, returning the resulting geodata as Python xarrays. There are three core paradigms of ODC which we would appreciate seeing reflected in this paper:

    - The focus on providing a scalable platform for scientific work across multiple compute platforms, ranging from desktop to cloud to super-computer workloads;

    - The ability to index and access data without the need for ingestion. Ingestion is a data transform step to reformat from the source format to a custodian-managed format or a compute-optimised format. Indexing creates the necessary database records and retains the source format;

    - The Python environment, which was chosen for its wide applicability in the science community. This allows users and developers to connect to additional data analysis libraries and develop new features.

    We understand that our existing documentation does not convey ODC’s ability to access and work with un-ingested data and seems to imply that data ingestion is mandatory when it is in fact an optional step.

    As we are always seeking to improve both the performance and the impact of our software, we are keen to understand and ultimately replicate the benchmarks that you have performed as part of this paper. To that end, we would deeply appreciate it if your paper could include:

    - The size, format and internal structure of the data/area of interest you included as part of your benchmark;

    - As ODC typically stores data on file systems or within cloud object stores, details around the type of filesystem and underlying storage hardware used;

    - An appendix including the exact Python scripts used for each of the tests, as well as the versions of any additional Python modules used;

    - Correct and consistent references to the OpenDataCube codebase (https://github.com/opendatacube) and documentation (https://datacube-core.readthedocs.io/en/latest/).
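
    As an illustration of the kind of detail requested above, a benchmark run could record its own context roughly as follows. This is a minimal sketch, not the working group's benchmark code: the function names, field names and the "unknown" placeholder values are hypothetical, and the timed callable is a stand-in for a real query.

```python
# Sketch of a self-describing benchmark record (hypothetical field names).
import json
import platform
import sys
import time
from importlib import metadata


def module_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def benchmark_record(label, func, modules, dataset, storage):
    """Time one call of func and bundle the timing with its context."""
    start = time.perf_counter()
    func()
    elapsed = time.perf_counter() - start
    return {
        "label": label,
        "elapsed_seconds": elapsed,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "modules": module_versions(modules),
        "dataset": dataset,  # size, format, internal structure
        "storage": storage,  # filesystem type, underlying hardware
    }


if __name__ == "__main__":
    record = benchmark_record(
        label="placeholder query",
        func=lambda: sum(range(1000)),  # stand-in for a real query
        modules=["datacube", "xarray", "numpy"],
        dataset={"size": "unknown", "format": "unknown", "chunking": "unknown"},
        storage={"filesystem": "unknown", "hardware": "unknown"},
    )
    print(json.dumps(record, indent=2))
```

    Publishing such a record next to each timing would let others check whether a replication attempt used comparable data, storage and module versions.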

    I would like to make myself and other members of ODC community available to you if there is any assistance or further details we can provide.
