1,292 research outputs found
Static and dynamic semantics of NoSQL languages
We present a calculus for processing semistructured data that spans
differences of application area among several novel query languages, broadly
categorized as "NoSQL". This calculus lets users define their own operators,
capturing a wider range of data processing capabilities, whilst providing a
typing precision so far typical only of primitive hard-coded operators. The
type inference algorithm is based on semantic type checking, resulting in type
information that is both precise, and flexible enough to handle structured and
semistructured data. We illustrate the use of this calculus by encoding a large
fragment of Jaql, including operations and iterators over JSON, embedded SQL
expressions, and co-grouping, and show how the encoding directly yields a
typing discipline for Jaql as it is, namely without the addition of any type
definition or type annotation in the code
Rumble: Data Independence for Large Messy Data Sets
This paper introduces Rumble, an engine that executes JSONiq queries on
large, heterogeneous and nested collections of JSON objects, leveraging the
parallel capabilities of Spark so as to provide a high degree of data
independence. The design is based on two key insights: (i) how to map JSONiq
expressions to Spark transformations on RDDs and (ii) how to map JSONiq FLWOR
clauses to Spark SQL on DataFrames. We have developed a working implementation
of these mappings showing that JSONiq can efficiently run on Spark to query
billions of objects into, at least, the TB range. The JSONiq code is concise in
comparison to Spark's host languages while seamlessly supporting the nested,
heterogeneous data sets that Spark SQL does not. The ability to process this
kind of input, commonly found, is paramount for data cleaning and curation. The
experimental analysis indicates that there is no excessive performance loss,
occasionally even a gain, over Spark SQL for structured data, and a performance
gain over PySpark. This demonstrates that a language such as JSONiq is a simple
and viable approach to large-scale querying of denormalized, heterogeneous,
arborescent data sets, in the same way as SQL can be leveraged for structured
data sets. The results also illustrate that Codd's concept of data independence
makes as much sense for heterogeneous, nested data sets as it does on highly
structured tables.Comment: Preprint, 9 page
Mapping Large Scale Research Metadata to Linked Data: A Performance Comparison of HBase, CSV and XML
OpenAIRE, the Open Access Infrastructure for Research in Europe, comprises a
database of all EC FP7 and H2020 funded research projects, including metadata
of their results (publications and datasets). These data are stored in an HBase
NoSQL database, post-processed, and exposed as HTML for human consumption, and
as XML through a web service interface. As an intermediate format to facilitate
statistical computations, CSV is generated internally. To interlink the
OpenAIRE data with related data on the Web, we aim at exporting them as Linked
Open Data (LOD). The LOD export is required to integrate into the overall data
processing workflow, where derived data are regenerated from the base data
every day. We thus faced the challenge of identifying the best-performing
conversion approach.We evaluated the performances of creating LOD by a
MapReduce job on top of HBase, by mapping the intermediate CSV files, and by
mapping the XML output.Comment: Accepted in 0th Metadata and Semantics Research Conferenc
Deterministic Automata for Unordered Trees
Automata for unordered unranked trees are relevant for defining schemas and
queries for data trees in Json or Xml format. While the existing notions are
well-investigated concerning expressiveness, they all lack a proper notion of
determinism, which makes it difficult to distinguish subclasses of automata for
which problems such as inclusion, equivalence, and minimization can be solved
efficiently. In this paper, we propose and investigate different notions of
"horizontal determinism", starting from automata for unranked trees in which
the horizontal evaluation is performed by finite state automata. We show that a
restriction to confluent horizontal evaluation leads to polynomial-time
emptiness and universality, but still suffers from coNP-completeness of the
emptiness of binary intersections. Finally, efficient algorithms can be
obtained by imposing an order of horizontal evaluation globally for all
automata in the class. Depending on the choice of the order, we obtain
different classes of automata, each of which has the same expressiveness as
CMso.Comment: In Proceedings GandALF 2014, arXiv:1408.556
A Brief Study of Open Source Graph Databases
With the proliferation of large irregular sparse relational datasets, new
storage and analysis platforms have arisen to fill gaps in performance and
capability left by conventional approaches built on traditional database
technologies and query languages. Many of these platforms apply graph
structures and analysis techniques to enable users to ingest, update, query and
compute on the topological structure of these relationships represented as
set(s) of edges between set(s) of vertices. To store and process Facebook-scale
datasets, they must be able to support data sources with billions of edges,
update rates of millions of updates per second, and complex analysis kernels.
These platforms must provide intuitive interfaces that enable graph experts and
novice programmers to write implementations of common graph algorithms. In this
paper, we explore a variety of graph analysis and storage platforms. We compare
their capabil- ities, interfaces, and performance by implementing and computing
a set of real-world graph algorithms on synthetic graphs with up to 256 million
edges. In the spirit of full disclosure, several authors are affiliated with
the development of STINGER.Comment: WSSSPE13, 4 Pages, 18 Pages with Appendix, 25 figure
Storage Solutions for Big Data Systems: A Qualitative Study and Comparison
Big data systems development is full of challenges in view of the variety of
application areas and domains that this technology promises to serve.
Typically, fundamental design decisions involved in big data systems design
include choosing appropriate storage and computing infrastructures. In this age
of heterogeneous systems that integrate different technologies for optimized
solution to a specific real world problem, big data system are not an exception
to any such rule. As far as the storage aspect of any big data system is
concerned, the primary facet in this regard is a storage infrastructure and
NoSQL seems to be the right technology that fulfills its requirements. However,
every big data application has variable data characteristics and thus, the
corresponding data fits into a different data model. This paper presents
feature and use case analysis and comparison of the four main data models
namely document oriented, key value, graph and wide column. Moreover, a feature
analysis of 80 NoSQL solutions has been provided, elaborating on the criteria
and points that a developer must consider while making a possible choice.
Typically, big data storage needs to communicate with the execution engine and
other processing and visualization technologies to create a comprehensive
solution. This brings forth second facet of big data storage, big data file
formats, into picture. The second half of the research paper compares the
advantages, shortcomings and possible use cases of available big data file
formats for Hadoop, which is the foundation for most big data computing
technologies. Decentralized storage and blockchain are seen as the next
generation of big data storage and its challenges and future prospects have
also been discussed
Blazes: Coordination Analysis for Distributed Programs
Distributed consistency is perhaps the most discussed topic in distributed
systems today. Coordination protocols can ensure consistency, but in practice
they cause undesirable performance unless used judiciously. Scalable
distributed architectures avoid coordination whenever possible, but
under-coordinated systems can exhibit behavioral anomalies under fault, which
are often extremely difficult to debug. This raises significant challenges for
distributed system architects and developers. In this paper we present Blazes,
a cross-platform program analysis framework that (a) identifies program
locations that require coordination to ensure consistent executions, and (b)
automatically synthesizes application-specific coordination code that can
significantly outperform general-purpose techniques. We present two case
studies, one using annotated programs in the Twitter Storm system, and another
using the Bloom declarative language.Comment: Updated to include additional materials from the original technical
report: derivation rules, output stream label
Kevoree Modeling Framework (KMF): Efficient modeling techniques for runtime use
The creation of Domain Specific Languages(DSL) counts as one of the main
goals in the field of Model-Driven Software Engineering (MDSE). The main
purpose of these DSLs is to facilitate the manipulation of domain specific
concepts, by providing developers with specific tools for their domain of
expertise. A natural approach to create DSLs is to reuse existing modeling
standards and tools. In this area, the Eclipse Modeling Framework (EMF) has
rapidly become the defacto standard in the MDSE for building Domain Specific
Languages (DSL) and tools based on generative techniques. However, the use of
EMF generated tools in domains like Internet of Things (IoT), Cloud Computing
or Models@Runtime reaches several limitations. In this paper, we identify
several properties the generated tools must comply with to be usable in other
domains than desktop-based software systems. We then challenge EMF on these
properties and describe our approach to overcome the limitations. Our approach,
implemented in the Kevoree Modeling Framework (KMF), is finally evaluated
according to the identified properties and compared to EMF.Comment: ISBN 978-2-87971-131-7; N° TR-SnT-2014-11 (2014
- …