Array Database Assessment WG Activity Overview Summary

Summary

Creator

Discussion
May 21, 2016 at 1:51 pm #137880
Peter Baumann
Participant
Arrays form a basic data category next to sets, hierarchies, and general graphs. It may be possible to emulate array management and processing on top of other data models, but such architectures typically result in reduced functionality, constrained performance, and limited scalability. Adding native array support to data management has proven feasible and has effectively created a new technology class, Array Databases. Such systems can give meaningful support to all array-intensive domains, in particular those working with sensor, image, simulation, and statistics data. It is fair to say that a large part of today’s “Big Data” can be represented meaningfully as arrays of some dimensionality, possibly augmented with metadata (such as georeferencing).

Benchmarks have shown superiority over standard technology such as relational and MapReduce-type systems, and scalability has been proven through operational services. Standardization has picked up on the topic as well, which will be an additional stimulus for both open-source and proprietary tool developers to adopt this trending technology.

Analytics Support

Today’s Array DBMSs support efficient data extraction (sometimes paired with support for domain-specific data formats) plus server-side processing (“ship code to data”). This already makes it possible to establish practically relevant service functionality, as shown by the large-scale installations available today and by the existing and emerging “datacube” standards. In this sense, the “Big Data” promise can be considered fulfilled. The “Big Data Analytics” quest, however, makes higher-level functionality desirable, in particular general Linear Algebra (tensor math) and safe iterations (“loops”). Neither is currently available, but both are under active research in the community.
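
The difference between data extraction and server-side processing can be illustrated with a minimal sketch. It is a toy, in-memory stand-in and not the API of any actual Array DBMS: the class ToyArrayStore, its methods extract and process, and the sample grid are all hypothetical names chosen for illustration. The point is only that extract returns a sub-array to the caller, whereas process evaluates a function next to the data and returns just the (small) result.

```python
from typing import Callable, List, Tuple

Grid = List[List[float]]


class ToyArrayStore:
    """Hypothetical in-memory stand-in for an array server (not a real DBMS API)."""

    def __init__(self, grid: Grid) -> None:
        self._grid = grid

    def extract(self, rows: Tuple[int, int], cols: Tuple[int, int]) -> Grid:
        """Data extraction: return the sub-array for half-open index ranges."""
        r0, r1 = rows
        c0, c1 = cols
        return [row[c0:c1] for row in self._grid[r0:r1]]

    def process(
        self,
        rows: Tuple[int, int],
        cols: Tuple[int, int],
        reducer: Callable[[List[float]], float],
    ) -> float:
        """Server-side processing: run the reducer where the data lives
        and hand back only its scalar result ("ship code to data")."""
        window = self.extract(rows, cols)
        cells = [value for row in window for value in row]
        return reducer(cells)


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


# Assumed sample data: a 3 x 4 grid holding the values 0.0 .. 11.0.
store = ToyArrayStore([[float(r * 4 + c) for c in range(4)] for r in range(3)])

subset = store.extract((0, 2), (1, 3))        # the caller receives the cells
average = store.process((0, 2), (1, 3), mean)  # the caller receives one number
```

In a real deployment the reducer would be expressed in the system's declarative query language rather than passed as a Python function; the function argument here only mimics that idea.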

Data Integration

An active area of research is querying databases that hold data of different information categories (so-called “polystores”), such as mixed set and array data. In practice, such combinations appear when combining data and metadata – for example, tabular metadata plus arrays, or hierarchical XML metadata plus arrays. Traditionally, arrays have been kept in separate silos because general-purpose data management systems could not provide adequate support for large arrays. Array databases offer a path to integration through their declarative query language approach, which is compatible with metadata query languages. The different architectures underneath can be hidden from users through appropriate mediator technology. Work on array integration has been done for:
- sets: the ISO SQL/MDA standard, which is based on the rasdaman query language, integrates multi-dimensional arrays into SQL [2];
- hierarchies: the xWCPS language extends the OGC WCPS geo array language with metadata retrieval [3];
- (knowledge) graphs: first research has been done on integrating arrays into RDF/SPARQL databases [1].
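
The combination of tabular metadata and arrays described above can be sketched in a few lines. This is a minimal illustration in plain Python, not SQL/MDA, xWCPS, or SPARQL syntax: the metadata rows, the array contents, and the function max_per_matching_array are all assumed for the example. It shows a set-style selection on metadata driving an array aggregation, which is the kind of query a mediator would hide behind a single interface.

```python
from typing import Dict, List

# Hypothetical metadata table: one row per stored array.
metadata: List[Dict[str, object]] = [
    {"id": "scene_a", "sensor": "optical", "year": 2014},
    {"id": "scene_b", "sensor": "radar", "year": 2015},
    {"id": "scene_c", "sensor": "optical", "year": 2015},
]

# Hypothetical array store: 2-D grids keyed by the same ids.
arrays: Dict[str, List[List[float]]] = {
    "scene_a": [[1.0, 2.0], [3.0, 4.0]],
    "scene_b": [[10.0, 20.0], [30.0, 40.0]],
    "scene_c": [[5.0, 6.0], [7.0, 8.0]],
}


def max_per_matching_array(sensor: str, year_from: int) -> Dict[str, float]:
    """Select rows by metadata predicates, then aggregate each matching array."""
    result: Dict[str, float] = {}
    for row in metadata:
        if row["sensor"] == sensor and row["year"] >= year_from:
            grid = arrays[row["id"]]
            result[row["id"]] = max(value for line in grid for value in line)
    return result


recent_optical = max_per_matching_array("optical", 2015)
```

Here the metadata filter and the array aggregation live in one function; in a polystore setting they would typically run in different engines, with the mediator joining the two on the shared identifier.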
Summary

tbd: Strengths and weaknesses, domains and applications benefitting most, any other guidance.

References
1. A. Andrejev, P. Baumann, D. Misev, T. Risch: Spatio-Temporal Gridded Data Processing on the Semantic Web. 2015 IEEE Intl. Conf. on Data Science and Data Intensive Systems (DSDIS 2015), Sydney, Australia, December 11-13, 2015.
2. D. Misev, P. Baumann: Enhancing Science Support in SQL. Proc. Workshop Data and Computational Science Technologies for Earth Science Research (co-located with IEEE Big Data), Santa Clara, US, October 29, 2015.
3. P. Liakos, P. Koltsida, G. Kakaletris, P. Baumann: xWCPS: Bridging the Gap Between Array and Semi-structured Data. Proc. Knowledge Engineering and Knowledge Management, Springer 2015.