Mapping Large Scale Research Metadata to Linked Data: A Performance Comparison of HBase, CSV and XML
OpenAIRE, the Open Access Infrastructure for Research in Europe, comprises a
database of all EC FP7 and H2020 funded research projects, including metadata
of their results (publications and datasets). These data are stored in an HBase
NoSQL database, post-processed, and exposed as HTML for human consumption, and
as XML through a web service interface. As an intermediate format to facilitate
statistical computations, CSV is generated internally. To interlink the
OpenAIRE data with related data on the Web, we aim at exporting them as Linked
Open Data (LOD). The LOD export is required to integrate into the overall data
processing workflow, where derived data are regenerated from the base data
every day. We thus faced the challenge of identifying the best-performing
conversion approach. We evaluated the performance of creating LOD by a
MapReduce job on top of HBase, by mapping the intermediate CSV files, and by
mapping the XML output. Comment: Accepted in 0th Metadata and Semantics Research Conference
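To make the CSV route mentioned in the abstract above more concrete, here is a minimal sketch of turning CSV rows into N-Triples lines. It is not the authors' implementation: the base URI, the `id` column, the sample record, and the function names are all hypothetical and chosen only for illustration.

```python
import csv
import io

# Hypothetical base URI for illustration; not an OpenAIRE namespace.
BASE = "http://example.org/"


def escape_literal(value):
    """Escape backslashes, quotes, and line breaks for an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def csv_to_ntriples(csv_text, id_column="id"):
    """Yield one N-Triples line per non-empty cell, keyed by the id column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        subject = f"<{BASE}result/{row[id_column]}>"
        for column, value in row.items():
            if column == id_column or not value:
                continue  # the id becomes the subject; empty cells are skipped
            predicate = f"<{BASE}property/{column}>"
            yield f'{subject} {predicate} "{escape_literal(value)}" .'


# Hypothetical sample record, not taken from the OpenAIRE data.
sample = "id,title,year\n42,Linked Data Export,2015\n"
triples = list(csv_to_ntriples(sample))
for line in triples:
    print(line)
```

Every cell is emitted as a plain string literal here; a real mapping would presumably assign datatypes and reuse existing vocabularies instead of minting one property per column.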
The Forgotten Document-Oriented Database Management Systems: An Overview and Benchmark of Native XML DODBMSes in Comparison with JSON DODBMSes
In the current context of Big Data, a multitude of new NoSQL solutions for
storing, managing, and extracting information and patterns from semi-structured
data have been proposed and implemented. These solutions were developed to
relieve the issue of rigid data structures present in relational databases, by
introducing semi-structured and flexible schema design. As current data
generated by different sources and devices, especially from IoT sensors and
actuators, use either XML or JSON format, depending on the application,
database technologies that store and query semi-structured data in XML format
are needed. Thus, Native XML Databases, which were initially designed to
manipulate XML data using standardized querying languages, i.e., XQuery and
XPath, were rebranded as NoSQL Document-Oriented Database Systems. Currently,
the majority of these solutions have been replaced with the more modern JSON
based Database Management Systems. However, we believe that XML-based solutions
can still deliver performance in executing complex queries on heterogeneous
collections. Unfortunately, current research lacks a clear comparison of the
scalability and performance for database technologies that store and query
documents in XML versus the more modern JSON format. Moreover, to the best of
our knowledge, there are no Big Data-compliant benchmarks for such database
technologies. In this paper, we present a comparison of selected
Document-Oriented Database Systems that either use the XML format to encode
documents, i.e., BaseX, eXist-db, and Sedna, or the JSON format, i.e., MongoDB,
CouchDB, and Couchbase. To underline the performance differences, we also
propose a benchmark that uses a heterogeneous complex schema on a large DBLP
corpus. Comment: 28 pages, 6 figures, 7 tables
An introduction to Graph Data Management
A graph database is a database where the data structures for the schema
and/or instances are modeled as a (labeled) (directed) graph or generalizations
of it, and where querying is expressed by graph-oriented operations and type
constructors. In this article we present the basic notions of graph databases,
give a historical overview of their main developments, and study the main current
systems that implement them.
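As a small illustration of the definition above, the following sketch models data as a labeled directed graph and answers a graph-oriented query (reachability along edges with a given label). The class name, method names, and sample data are hypothetical; the sketch does not reflect any particular system surveyed in the article.

```python
from collections import defaultdict, deque


class LabeledGraph:
    """A tiny in-memory labeled directed graph (illustrative only)."""

    def __init__(self):
        # Maps each source node to a list of (label, target) pairs.
        self._out = defaultdict(list)

    def add_edge(self, source, label, target):
        self._out[source].append((label, target))

    def neighbors(self, node, label):
        """Return the targets of edges leaving `node` that carry `label`."""
        return [t for (edge_label, t) in self._out.get(node, []) if edge_label == label]

    def reachable(self, start, label):
        """Breadth-first search following only edges that carry `label`."""
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in self.neighbors(node, label):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen


# Hypothetical sample data.
g = LabeledGraph()
g.add_edge("alice", "knows", "bob")
g.add_edge("bob", "knows", "carol")
g.add_edge("alice", "works_at", "acme")

print(sorted(g.reachable("alice", "knows")))
```

The query is phrased in terms of paths and edge labels rather than joins over tables, which is the sense of "graph-oriented operations" in the definition.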
Translation of Heterogeneous Databases into RDF, and Application to the Construction of a SKOS Taxonomical Reference
While the data deluge accelerates, most of the data produced remains locked in deep Web databases. For the linked open data to benefit from the potential represented by this huge amount of data, it is crucial to come up with solutions to expose heterogeneous databases as linked data. The xR2RML mapping language is an endeavor towards this goal: it is designed to map various types of databases to RDF, by flexibly adapting to heterogeneous query languages and data models while remaining free from any specific language. It extends R2RML, the W3C recommendation for the mapping of relational databases to RDF, and relies on RML for the handling of various data formats. In this paper we present xR2RML, we analyse data models of several modern databases as well as the format in which query results are returned, and we show how xR2RML translates any result data element into RDF, relying on existing languages such as XPath and JSONPath when necessary. We illustrate some features of xR2RML such as the generation of RDF collections and containers, and the ability to deal with mixed data formats. We also describe a real-world use case in which we applied xR2RML to build a SKOS thesaurus aimed at supporting studies on History of Zoology, Archaeozoology and Conservation Biology.
PAXQuery: A Massively Parallel XQuery Processor
We present a novel approach for parallelizing the execution of queries over XML documents, implemented within our system PAXQuery. We compile a rich subset of XQuery into plans expressed in the PArallelization ConTracts (PACT) programming model. These plans are then optimized and executed in parallel by the Stratosphere system. We demonstrate the efficiency and scalability of our approach through experiments on hundreds of GB of XML data.
Yedalog: Exploring Knowledge at Scale
With huge progress on data processing frameworks, human programmers are frequently the bottleneck when analyzing large repositories of data. We introduce Yedalog, a declarative programming language that allows programmers to mix data-parallel pipelines and computation seamlessly in a single language. By contrast, most existing tools for data-parallel computation embed a sublanguage of data-parallel pipelines in a general-purpose language, or vice versa. Yedalog extends Datalog, incorporating not only computational features from logic programming, but also features for working with data structured as nested records. Yedalog programs can run both on a single machine and distributed across a cluster, in batch and interactive modes, allowing programmers to mix different modes of execution easily.