898 research outputs found
Rumble: Data Independence for Large Messy Data Sets
This paper introduces Rumble, an engine that executes JSONiq queries on
large, heterogeneous and nested collections of JSON objects, leveraging the
parallel capabilities of Spark so as to provide a high degree of data
independence. The design is based on two key insights: (i) how to map JSONiq
expressions to Spark transformations on RDDs and (ii) how to map JSONiq FLWOR
clauses to Spark SQL on DataFrames. We have developed a working implementation
of these mappings showing that JSONiq can efficiently run on Spark to query
billions of objects into, at least, the TB range. The JSONiq code is concise in
comparison to Spark's host languages while seamlessly supporting the nested,
heterogeneous data sets that Spark SQL does not. The ability to process this
kind of input, commonly found, is paramount for data cleaning and curation. The
experimental analysis indicates that there is no excessive performance loss,
occasionally even a gain, over Spark SQL for structured data, and a performance
gain over PySpark. This demonstrates that a language such as JSONiq is a simple
and viable approach to large-scale querying of denormalized, heterogeneous,
arborescent data sets, in the same way as SQL can be leveraged for structured
data sets. The results also illustrate that Codd's concept of data independence
makes as much sense for heterogeneous, nested data sets as it does on highly
structured tables.Comment: Preprint, 9 page
The Family of MapReduce and Large Scale Data Processing Systems
In the last two decades, the continuous increase of computational power has
produced an overwhelming flow of data which has called for a paradigm shift in
the computing architecture and large scale data processing mechanisms.
MapReduce is a simple and powerful programming model that enables easy
development of scalable parallel applications to process vast amounts of data
on large clusters of commodity machines. It isolates the application from the
details of running a distributed program such as issues on data distribution,
scheduling and fault tolerance. However, the original implementation of the
MapReduce framework had some limitations that have been tackled by many
research efforts in several followup works after its introduction. This article
provides a comprehensive survey for a family of approaches and mechanisms of
large scale data processing mechanisms that have been implemented based on the
original idea of the MapReduce framework and are currently gaining a lot of
momentum in both research and industrial communities. We also cover a set of
introduced systems that have been implemented to provide declarative
programming interfaces on top of the MapReduce framework. In addition, we
review several large scale data processing systems that resemble some of the
ideas of the MapReduce framework for different purposes and application
scenarios. Finally, we discuss some of the future research directions for
implementing the next generation of MapReduce-like solutions.Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other author
Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture
We present the architecture behind Twitter's real-time related query
suggestion and spelling correction service. Although these tasks have received
much attention in the web search literature, the Twitter context introduces a
real-time "twist": after significant breaking news events, we aim to provide
relevant results within minutes. This paper provides a case study illustrating
the challenges of real-time data processing in the era of "big data". We tell
the story of how our system was built twice: our first implementation was built
on a typical Hadoop-based analytics stack, but was later replaced because it
did not meet the latency requirements necessary to generate meaningful
real-time results. The second implementation, which is the system deployed in
production, is a custom in-memory processing engine specifically designed for
the task. This experience taught us that the current typical usage of Hadoop as
a "big data" platform, while great for experimentation, is not well suited to
low-latency processing, and points the way to future work on data analytics
platforms that can handle "big" as well as "fast" data
- …