
Array Systems

    Peter Baumann

    This page collects technology for handling massive multi-dimensional arrays. While the emphasis is on Array Databases, other technologies addressing arrays are mentioned as well, as long as substantial array support can be evidenced. Please observe the Etiquette (see bottom).

    Array databases naturally can do the “heavy lifting” in multi-dimensional access and processing, but arrays in practice never come alone; rather, they are ornamented with application-specific metadata that are critical for understanding the array data and for querying them appropriately. For example, in geo datacubes querying is typically done on geographic coordinates, such as latitude and longitude; the system needs to be able to translate queries in geo coordinates into the native Cartesian index coordinates of arrays. In all applications using timeseries, users will want to utilize date formats – such as ISO 8601, supporting syntax like “2018-02-20” – rather than index counting. For cell types, it is not sufficient to just know about integer versus floating-point numbers; it is also important to know about units of measure and null values (note that sensor data do not just deliver one null value, as traditional databases support, but multiple null values with individual semantics).
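    The coordinate translation described above can be sketched in a few lines of plain Python; grid origin, resolution, and epoch below are illustrative assumptions, not tied to any particular engine:

    ```python
    from datetime import date

    def geo_to_index(lat, lon, lat0=90.0, lon0=-180.0, resolution=0.25):
        """Map (lat, lon) to (row, col) on a regular grid whose north-west
        corner is (lat0, lon0), with the given cell size in degrees."""
        row = int((lat0 - lat) / resolution)
        col = int((lon - lon0) / resolution)
        return row, col

    def date_to_index(d, epoch=date(2018, 1, 1)):
        """Map an ISO 8601 date string to a time-axis index (days since epoch)."""
        return (date.fromisoformat(d) - epoch).days

    # A query on "2018-02-20" at 48.0 N, 11.25 E becomes pure index arithmetic:
    t = date_to_index("2018-02-20")   # -> 50
    r, c = geo_to_index(48.0, 11.25)  # -> (168, 765)
    ```

    A datacube engine performs this kind of translation internally, so that users never see the index coordinates.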

    Coupling array queries with metadata query capabilities, therefore, is of high practical importance; ISO SQL/MDA, with its integration of arrays into the rich existing framework of the SQL language, shows one possible way. Where that appears too complex to implement, silo solutions with datacube support have become established. Specifically in the Earth science domain, an explosion of domain-specific “datacube” solutions can be observed recently (see, e.g., the EGU 2018 datacube session), usually implemented in Python using existing array libraries. We therefore also look at domain-specific “datacube” tools.

    This state-of-the-art review on array service implementations is organised as follows. First, Array Databases are inspected, which offer generic query and architectural support for n-D arrays. Next, known object-relational emulations of arrays are listed. MapReduce-type systems follow as a substantially different category of data systems, which however is often mentioned in the context of Big Data. After that, systems are listed which do not fall into any of the above categories. Finally, we list libraries (as opposed to the aforementioned complete engines) and n-D array data formats.

    Array Databases

    rasdaman (“raster data manager”)

    • Description: Rasdaman has pioneered the field of Array Databases, with publications since 1992. This array engine allows declarative querying of massive multi-dimensional arrays, including distributed array joins. Server-side processing relies on effective optimization, parallelization, and use of heterogeneous hardware for retrieval, extraction, aggregation, and fusion on distributed arrays. The architecture resembles a peer federation without a single point of failure. Arrays can be stored in the optimized rasdaman array store or in standard databases; further, rasdaman can operate directly on any pre-existing archive structure. Single rasdaman databases exceed 250 TB, and queries have been split successfully across more than 1,000 cloud nodes. The rasdaman technology has coined the research field of Array Databases and is the blueprint for several Big Data standards, such as the ISO SQL/MDA (Multi-Dimensional Arrays) candidate standard and the OGC Web Coverage Service (WCS) “Big Geo Data” suite with its geo datacube query language, Web Coverage Processing Service (WCPS).
    • Source code: http://www.rasdaman.org/Download for open-source rasdaman community edition; see http://www.rasdaman.com for the proprietary rasdaman enterprise edition.
    • Public demo site and further information:
      • http://standards.rasdaman.com
      • publications, in particular:
        • P. Baumann: A Database Array Algebra for Spatio-Temporal Data and Beyond. Proc. Intl. Workshop on Next Generation Information Technologies and Systems (NGITS ’99), July 5-7, 1999, Zikhron Yaakov, Israel, Springer LNCS 1649
        • Peter Baumann: On the Management of Multidimensional Discrete Data. VLDB Journal 4(3)1994, Special Issue on Spatial Database Systems, pp. 401 – 444
        • Peter Baumann: Language Support for Raster Image Manipulation in Databases. Proc. Int. Workshop on Graphics Modeling, Visualization in Science & Technology, Darmstadt/Germany, April 13 – 14, 1992

    SciDB

    • Description: SciDB is an Array DBMS following the tradition of rasdaman. SciDB employs its own query interface offering two languages, AQL (Array Query Language) and AFL (Array Functional Language). Its architecture is based on a modified Postgres kernel at its center, plus UDFs (User-Defined Functions) effecting parallelization.
    • Website
    • Source code:

    SciQL

    • Description: SciQL was a case study extending the column-store DBMS MonetDB with array-specific operators. As such, n-D arrays were mapped internally to (1-D) tables (i.e., there is no dedicated storage and processing engine).
    • Website: https://projects.cwi.nl/scilens/content/platform.html
    • Source code: (could not find it yet – not with MonetDB)
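    The table mapping SciQL performs can be illustrated with a sketch (an assumption about the general technique, not SciQL’s actual internals): when n-D arrays are emulated on top of 1-D tables, each n-D cell index must be linearized to a table position, typically in row-major order:

    ```python
    def linearize(index, shape):
        """Row-major offset of an n-D index within an array of the given shape."""
        offset = 0
        for i, n in zip(index, shape):
            assert 0 <= i < n, "index out of bounds"
            offset = offset * n + i
        return offset

    # A 3-D array of shape (4, 5, 6) maps onto 120 table rows:
    print(linearize((0, 0, 0), (4, 5, 6)))  # 0
    print(linearize((1, 2, 3), (4, 5, 6)))  # 1*30 + 2*6 + 3 = 45
    ```

    Without a dedicated array engine underneath, every array operation has to be rewritten into such offset arithmetic over tables.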

    Extascid

    • Description:
    • Website
    • Source code:

    Object-Relational Database Extensions

    Object-relational capabilities in relational DBMSs allow users (usually: administrators) to define new data types as well as new operators. Such data types can be used for column definitions, and the corresponding operators can be used in queries. While this approach has been implemented by several systems (see below) it encounters two main shortcomings:

    • An array is not a data type, but a data type constructor (sometimes called a “template”). An instructive example is a stack: likewise, it is not a datatype but a template which needs to be instantiated with some element datatype to form a concrete datatype – for example, by instantiating Stack with String – often denoted as Stack<String> – one particular datatype is obtained; Stack<Integer> would be another one. An array template is parametrized with the dimension and extent as well as the cell (“pixel”, “voxel”) datatype; following the previously introduced syntax, this might be written as Array<dimension, extent, cellType>. Hence, object-relational systems cannot provide the array abstraction as such, but only individual instantiated array datatypes. Further, as the SQL syntax as such cannot be extended, such array support needs to introduce some separate array expression language. Generic datatypes like the rasdaman n-D array constructor become difficult at best. Finally, this approach typically implies particular implementation restrictions (see next).
    • Due to the genericity of such object-relational mechanisms there is no dedicated internal support for storage management (in particular: for efficient spatial clustering, but also for array sizes), indexing, and query optimization.
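    The type-versus-type-constructor distinction drawn above can be made concrete with Python’s generics (an illustrative sketch; class and variable names are ours):

    ```python
    from typing import Generic, TypeVar

    T = TypeVar("T")

    class Stack(Generic[T]):
        """Stack is a template, not a datatype: it must be instantiated
        with an element type before it denotes a concrete type."""
        def __init__(self):
            self.items = []
        def push(self, item: T):
            self.items.append(item)
        def pop(self) -> T:
            return self.items.pop()

    StringStack = Stack[str]   # Stack<String>: one concrete datatype
    IntStack = Stack[int]      # Stack<Integer>: a different concrete datatype

    s: Stack[str] = Stack()
    s.push("hello")
    print(s.pop())             # hello
    ```

    An object-relational extension can only register concrete types like `StringStack`; the constructor `Stack` itself stays outside the type system, which is exactly the limitation described above for arrays.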

    PostGIS Raster

    • Description: “Raster” is a PostGIS type for storing and analyzing geo raster data. Like PostGIS in general, it is implemented using the extension capabilities of the PostgreSQL object-relational DBMS. Internally, raster processing relies heavily on GDAL. Currently, PostGIS Raster supports 2D and, to some extent, 3D rasters. It allows raster expressions which, however, are not integrated with the PostgreSQL query language but passed to a raster object as strings written in a separate Map Algebra language. Large objects have to be partitioned by the user and distributed over tuples in a table’s raster column; queries have to be written in a way that achieves a proper recombination of larger rasters from the partitions stored in one tuple each. A recommended partition size is 100×100 pixels.
    • Website: http://postgis.net/docs/manual-2.1/RT_reference.html
    • Source code: https://trac.osgeo.org/postgis/wiki/DevWikiMain
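    The user-side partitioning described above can be sketched as follows (a hypothetical helper in plain Python, not part of the PostGIS API), cutting a large raster into the recommended 100×100-pixel tiles that are then stored one per tuple:

    ```python
    def tile_bounds(width, height, tile=100):
        """Yield (x0, y0, x1, y1) pixel bounds of each tile, row by row;
        edge tiles are clipped to the raster extent."""
        for y0 in range(0, height, tile):
            for x0 in range(0, width, tile):
                yield x0, y0, min(x0 + tile, width), min(y0 + tile, height)

    # A 250x130 raster becomes 3 x 2 = 6 tiles, the edge tiles being smaller:
    tiles = list(tile_bounds(250, 130))
    print(len(tiles))   # 6
    print(tiles[-1])    # (200, 100, 250, 130)
    ```

    Queries then must reassemble the full raster from these tiles, which is exactly the recombination burden the description places on the query writer.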

    Oracle GeoRaster

    • Description: GeoRaster is a feature of Oracle Spatial that lets you store, index, query, analyze, and deliver raster image and gridded data and its associated metadata. GeoRaster provides Oracle spatial data types and an object-relational schema. You can use these data types and schema objects to store multidimensional grid layers and digital images that can be referenced to positions on the Earth’s surface or in a local coordinate system. If the data is georeferenced, you can find the location on Earth for a cell in an image; or given a location on Earth, you can find the cell in an image associated with that location. There is no particular raster query language underneath, nor a specific array-centric architecture.
    • Website: http://docs.oracle.com/cd/B19306_01/appdev.102/b14254/geor_intro.htm
    • Source code: n.a. (closed source, proprietary)

    Teradata Arrays

    • Description: Teradata has recently added arrays as a datatype, also following an object-relational approach. There are some fundamental operations such as subsetting; however, overall the operators do not reach the expressive power of genuine Array DBMSs. Further, arrays are mapped to 64 kB blobs, so that the maximum size of a single array (considering the array metadata stored in each blob) seems to be around 40 kB. There are further severe restrictions: only one element of the array can be updated at a time, and it is unclear whether array joins are supported.
    • Website: https://developer.teradata.com/database/reference/array-data-type-scenario, http://info.teradata.com/htmlpubs/DB_TTU_16_00/index.html#page/SQL_Refer
    • Source code: n.a. (closed source, proprietary)

    MapReduce-Type Systems

    Overview

    MapReduce offers a general parallel programming paradigm which is based on two user-implemented functions, Map() and Reduce(). While Map() performs filtering and sorting, Reduce() acts as an aggregator. Both functions are instantiated multiple times for massive parallelization; the MapReduce engine manages the process instances as well as their communication.
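    The paradigm can be sketched in a few lines of plain Python (the classic word-count example; real engines run many Map() and Reduce() instances in parallel across nodes):

    ```python
    from collections import defaultdict

    def map_fn(record):
        """Map(): emit (key, value) pairs - here, one per word."""
        for word in record.split():
            yield word, 1

    def reduce_fn(key, values):
        """Reduce(): aggregate all values collected for one key."""
        return key, sum(values)

    def map_reduce(records):
        groups = defaultdict(list)
        for record in records:                 # engines parallelize this loop
            for key, value in map_fn(record):
                groups[key].append(value)      # shuffle: group by key
        return dict(reduce_fn(k, vs) for k, vs in groups.items())

    print(map_reduce(["a b a", "b a"]))        # {'a': 3, 'b': 2}
    ```

    Note that the engine only sees opaque keys and values; this is why, as discussed next, it cannot exploit the n-dimensional neighborhood structure of array partitions.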

    Implementations of the MapReduce paradigm – such as Hadoop, Spark, and Flink – typically use Java or Scala for the Map() and Reduce() coding. While these languages offer array primitives for processing multi-dimensional arrays locally within a Map() or Reduce() invocation, there is no particular support for arrays exceeding local server main memory; in particular, the MapReduce engines are not aware of the spatial n-dimensional proximity of array partitions. Hence, the common MapReduce optimizations cannot exploit the array semantics. Essentially, MapReduce is particularly well suited for unstructured data like sets. “Since it was not originally designed to leverage the structure, its performance is suboptimal.” [Daniel Abadi]

    That said, attempts have been made to implement partitioned array management and processing on top of MapReduce. Below some major approaches are listed.

    SciHadoop

    • Description: SciHadoop is a Hadoop plugin allowing scientists to specify logical queries over array-based data models. SciHadoop executes queries as map/reduce programs defined over the logical data model. A SciHadoop prototype has been implemented for NetCDF data sets.
    • Website: SciHadoop is being developed by the DAMASC group
    • Source code: https://github.com/four2five/SciHadoop

    GeoTrellis

    • Description: GeoTrellis is a geographic data processing engine for high performance applications. GeoTrellis provides data types for working with rasters in the Scala language, as well as fast reading and writing of these data types to disk.
    • Website: http://geotrellis.io/
    • Source code: https://github.com/geotrellis

    MrGeo

    • Description: MrGeo (pronounced “Mister Geo”) is an open source geospatial toolkit designed to provide raster-based geospatial processing capabilities performed at scale. MrGeo enables global geospatial big data image processing and analytics. MrGeo is built upon the Apache Spark distributed processing framework.
    • Website: https://github.com/ngageoint/mrgeo/wiki
    • Source code: https://github.com/ngageoint/mrgeo

    Unclassified

    Google Earth Engine

    • Description: Google Earth Engine builds on the tradition of Grid systems with files; there is no datacube paradigm. Based on a functional programming language, users can submit code which is executed transparently in Google’s own distributed environment, with a worldwide private network. Parallelization is straightforward. After discussions of the developers with the rasdaman team, Google has additionally added a declarative “Map Algebra” interface which resembles a subset of the rasdaman query language. In a face-to-face conversation at the “Big Data from Space” conference 2016, Noel Gorelick (Earth Engine Chief Architect) explained that Earth Engine relies on Google’s massive hardware rather than on algorithmic elaboration. At its heart is a functional programming language which offers neither specific array primitives like rasdaman nor any comparable optimization.
    • Website: https://earthengine.google.com/
    • Source code: n.a., closed-source, proprietary system

    OPeNDAP

    • Description: OPeNDAP (“Open-source Project for a Network Data Access Protocol”) is a data transport architecture and protocol for earth scientists. OPeNDAP includes standards for encapsulating structured data, annotating the data with attributes, and adding semantics that describe the data. An OPeNDAP client could be an ordinary browser, although this gives limited functionality. Usually, an OPeNDAP client is a graphics program (like GrADS, Ferret or ncBrowse) or a web application (like DChart) linked with an OPeNDAP library. An OPeNDAP client sends requests to an OPeNDAP server and receives various types of documents or binary data as a response. One such document is called a DDS (received when a DDS request is sent), which describes the structure of a data set. (Wikipedia)

      An Array is a one-dimensional indexed data structure similar to that defined by ANSI C. An Array’s member variable MAY be of any DAP data type. Array indexes MUST start at zero. Multidimensional Arrays are defined as Arrays of Arrays. Multi-dimensional Arrays MUST be stored in row-major order (as is the case with ANSI C). The size of each Array’s dimensions MUST be given. The number of elements in an Array is fixed as that given by the size(s) of its dimension(s). A constraint expression provides a way for DAP client programs to request certain variables, or parts of certain variables, from a data source. A constraint expression MAY also use functions executed by the server. These can appear in a selection or in a projection, although there are restrictions about the data types functions can return. See this source for details.

    • Website: http://www.opendap.org/
    • Source code: http://www.opendap.org/software/hyrax-data-server (Hyrax)
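    The constraint expressions mentioned above use a `[start:stride:stop]` hyperslab notation with an inclusive stop index; the following plain-Python sketch (illustrative only, not the DAP wire protocol) shows how such a constraint selects part of a variable:

    ```python
    def hyperslab(values, start, stride, stop):
        """Apply a DAP-style [start:stride:stop] constraint.
        Unlike Python slicing, the stop index is inclusive."""
        return values[start : stop + 1 : stride]

    temperature = list(range(10, 20))        # a 1-D variable with 10 elements
    print(hyperslab(temperature, 0, 2, 8))   # [10, 12, 14, 18][0:2:8] -> every 2nd up to index 8
    ```

    Because multi-dimensional arrays are stored in row-major order, a server can translate such index constraints directly into file offsets and return only the requested subset.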

    TensorFlow

    • tbd

    Open Data Cube

    Tools and Libraries

    Ophidia

    • Description: The Ophidia framework is a CMCC Foundation research effort addressing big data challenges in several scientific domains (e.g. mainly climate, but also astrophysics and downstream communities). It provides a full software stack for data analytics and management of big scientific datasets exploiting a hierarchically distributed storage along with parallel, in-memory computation techniques and a server-side approach. The Ophidia data model implements the data cube abstraction to support the processing of multi-dimensional (array-based) data. A wide set of operators provides functionalities to run data analytics and metadata management: e.g. data sub-setting, reduction, statistical analysis, mathematical computations, and much more. So far about 50 operators are provided in the current release, jointly with about 100 primitives covering a large set of array-based functions. The framework provides support for executing workflows with various sizes and complexities, and an end-user terminal. A programmatic Python interface is also available for developers.
    • Website: http://ophidia.cmcc.it/ 
    • Source code: The Ophidia code is available on GitHub under GPLv3 license at https://github.com/OphidiaBigData
    • YouTube channel: https://www.youtube.com/user/OphidiaBigData
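    The kind of dimension reduction an Ophidia operator performs can be sketched in plain Python (an illustration of the concept, not the Ophidia API): collapsing the time axis of a (time, lat, lon) cube to per-cell averages.

    ```python
    def reduce_time_mean(cube):
        """Average a cube, given as a list of 2-D time slices, over time."""
        steps = len(cube)
        rows, cols = len(cube[0]), len(cube[0][0])
        return [[sum(cube[t][r][c] for t in range(steps)) / steps
                 for c in range(cols)]
                for r in range(rows)]

    cube = [[[1, 2], [3, 4]],      # t = 0
            [[3, 4], [5, 6]]]      # t = 1
    print(reduce_time_mean(cube))  # [[2.0, 3.0], [4.0, 5.0]]
    ```

    In Ophidia, such a reduction runs server-side over the hierarchically distributed storage rather than in client memory.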

    xarray

    • Description: xarray (formerly xray) is an open source project and Python package that aims to bring the labeled data power of pandas to the physical sciences, by providing N-dimensional variants of the core pandas data structures. The goal is to provide a pandas-like and pandas-compatible toolkit for analytics on multi-dimensional arrays, rather than the tabular data for which pandas excels. Our approach adopts the Common Data Model for self-describing scientific data in widespread use in the Earth sciences: xarray.Dataset is an in-memory representation of a netCDF file. [source: xarray.pydata.org/en/stable/]
    • Website: http://xarray.pydata.org/en/stable/

    xtensor

    • Description: xtensor is a C++ library meant for numerical analysis with multi-dimensional array expressions. xtensor provides an extensible expression system enabling lazy broadcasting, an API following the idioms of the C++ standard library, and tools to manipulate array expressions and build upon xtensor. Containers of xtensor are inspired by NumPy, the Python array programming library. Adaptors for existing data structures to be plugged into our expression system can easily be written. In fact, xtensor can be used to process numpy data structures inplace using Python’s buffer protocol. For more details on the numpy bindings, check out the xtensor-python project. (source: website)
    • Website: http://quantstack.net/xtensor
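    The lazy-expression idea behind xtensor can be sketched in plain Python (a conceptual illustration, not the xtensor API): building an expression merely records the computation, and elements are evaluated only when accessed, so no temporary arrays materialize.

    ```python
    class Lazy:
        """A 1-D lazy array expression: fn(i) yields element i on demand."""
        def __init__(self, fn, n):
            self.fn, self.n = fn, n
        def __add__(self, other):
            # Composing expressions creates a new expression; nothing is computed.
            return Lazy(lambda i: self.fn(i) + other.fn(i), self.n)
        def __getitem__(self, i):
            return self.fn(i)          # evaluation happens here, on access
        def evaluate(self):
            return [self.fn(i) for i in range(self.n)]

    a = Lazy(lambda i: i, 5)           # represents [0, 1, 2, 3, 4]
    b = Lazy(lambda i: 10, 5)          # represents [10, 10, 10, 10, 10]
    expr = a + b                       # still nothing computed
    print(expr[3])                     # 13 - a single element, computed lazily
    print(expr.evaluate())             # [10, 11, 12, 13, 14]
    ```

    xtensor does this in C++ with expression templates, adding broadcasting and compile-time optimization on top of the same principle.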

    wendelin.core

    • Description: Wendelin.core allows you to work with arrays bigger than RAM and local disk. Bigarrays are persisted to storage and can be changed in a transactional manner. In other words, bigarrays are something like numpy.memmap for numpy.ndarray and OS files, but with support for transactions and files bigger than disk. The whole bigarray cannot generally be used as a drop-in replacement for numpy arrays, but bigarray slices are real ndarrays and can be used everywhere an ndarray can be used, including in C/Cython/Fortran code. Slice size is limited by the virtual address-space size, i.e., a maximum of about 127 TB on Linux/amd64. (source: website)
    • Website: https://lab.nexedi.com/nexedi/wendelin.core
    • Source code: https://lab.nexedi.com/nexedi/wendelin.core

    TileDB

    • Description: tbd
    • Website: tbd
    • Source code: tbd

    Data Formats

    Basically, data formats are out of the scope of this investigation, as any good array technology allows ingesting and delivering individually encoded data, as chosen by the user. Nevertheless, for completeness, we mention some of the major multi-dimensional data formats here.

    • HDF
    • NetCDF: NetCDF-4 is based on HDF5.

    Etiquette

    Know a system not listed? Feel free to add it, adhering to the following etiquette:
    –    For every system mentioned, clearly indicate whether it is open-source or proprietary
    –    For every feature mentioned, clearly indicate whether it is available in the open-source or the proprietary edition
    –    Avoid marketing lingo
    –    Be crisp; descriptions must not exceed 100 characters including whitespace. There is ample space for feature and performance description in the assessment section.
    –    Be complete, providing information for each item
    –    Systems listed do not have to be Array Databases in the strict sense, but they must be capable of handling multi-dimensional arrays.
    Entries violating this etiquette run the risk of being shifted into the Dungeon Zone.
