Object Graph Programming
We introduce Object Graph Programming (OGO), which enables reading and
modifying an object graph (i.e., the entire state of the object heap) via
declarative queries. OGO models the objects and their relations in the heap as
an object graph, thereby treating the heap as a graph database: each node in the
graph is an object (e.g., an instance of a class or an instance of a metadata
class) and each edge is a relation between objects (e.g., a field of one object
references another object). We leverage Cypher, the most popular query language
for graph databases, as OGO's query language. Unlike LINQ, which uses
collections (e.g., List) as a source of data, OGO views the entire object graph
as a single "collection". OGO is ideal for querying collections (just like
LINQ), introspecting the runtime system state (e.g., finding all instances of a
given class or accessing fields via reflection), and writing assertions that
have access to the entire program state. We prototyped OGO for Java in two
ways: (a) by translating an object graph into a Neo4j database on which we run
Cypher queries, and (b) by implementing our own in-memory graph query engine
that directly queries the object heap. We used OGO to rewrite hundreds of
statements in large open-source projects into OGO queries. We report on our
experience and on the performance of our prototypes.
Comment: 13 pages, ICSE 202
Analytical Engines With Context-Rich Processing: Towards Efficient Next-Generation Analytics
As modern data pipelines continue to collect, produce, and store a variety of
data formats, extracting and combining value from traditional and context-rich
sources such as strings, text, video, audio, and logs becomes a manual process,
because such formats are unsuitable for an RDBMS. To tap into this dark data,
domain experts analyze the sources, extract insights, and integrate them into
the data repositories. This process can involve out-of-DBMS, ad-hoc analysis
and processing, resulting in ETL overhead, engineering effort, and suboptimal
performance. While AI systems based on ML models can automate the analysis,
they often generate further context-rich answers. Using multiple sources of
truth, either for training the models or in the form of knowledge bases,
further exacerbates the problem of consolidating the data of interest.
We envision an analytical engine co-optimized with components that enable
context-rich analysis. Firstly, as the data coming from different sources or
produced by model inference cannot be cleaned ahead of time, we propose online
data integration via model-assisted similarity operations. Secondly, we aim
for holistic, cost- and rule-based pipeline optimization across relational and
model-based operators. Thirdly, with increasingly heterogeneous hardware and
equally heterogeneous workloads, ranging from traditional relational analytics
to generative model inference, we envision a system that adapts just-in-time
to complex analytical query requirements. To solve increasingly complex
analytical problems, ML offers attractive solutions that must be combined with
traditional analytical processing, and that can benefit from decades of
database community research, to achieve scalability and performance that are
effortless for the end user.
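As a sketch of what a model-assisted similarity operation could look like, the
following minimal Java example matches records from two sources by cosine
similarity of precomputed model embeddings instead of exact keys. All names,
vectors, and the 0.9 threshold are illustrative; a real engine would plan this
operator together with relational ones.

    import java.util.Map;

    public class SimilarityJoinSketch {
        // Cosine similarity between two embedding vectors.
        static double cosine(double[] a, double[] b) {
            double dot = 0, na = 0, nb = 0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                na += a[i] * a[i];
                nb += b[i] * b[i];
            }
            return dot / (Math.sqrt(na) * Math.sqrt(nb));
        }

        // Pair up entries whose embeddings are close enough; this stands in
        // for online integration of dirty data from heterogeneous sources.
        static void similarityJoin(Map<String, double[]> left,
                                   Map<String, double[]> right,
                                   double threshold) {
            for (var l : left.entrySet())
                for (var r : right.entrySet())
                    if (cosine(l.getValue(), r.getValue()) >= threshold)
                        System.out.println(l.getKey() + " ~ " + r.getKey());
        }

        public static void main(String[] args) {
            similarityJoin(
                Map.of("Intl. Business Machines", new double[]{0.9, 0.1}),
                Map.of("IBM", new double[]{0.88, 0.12}),
                0.9);
        }
    }
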
Building Efficient Query Engines in a High-Level Language
Abstraction without regret refers to the vision of using high-level
programming languages for systems development without experiencing a negative
impact on performance. A database system designed according to this vision
offers both increased productivity and high performance, instead of sacrificing
the former for the latter as is the case with existing, monolithic
implementations that are hard to maintain and extend. In this article, we
realize this vision in the domain of analytical query processing. We present
LegoBase, a query engine written in the high-level language Scala. The key
technique to regain efficiency is to apply generative programming: LegoBase
performs source-to-source compilation and optimizes the entire query engine by
converting the high-level Scala code to specialized, low-level C code. We show
how generative programming makes it easy to implement a wide spectrum of
optimizations, such as introducing data partitioning or switching from a row to
a column data layout, which are difficult to achieve with existing low-level
query compilers that handle only queries. We demonstrate that sufficiently
powerful abstractions are essential for dealing with the complexity of the
optimization effort, shielding developers from compiler internals and
decoupling individual optimizations from each other. We evaluate our approach
with the TPC-H benchmark and show that: (a) With all optimizations enabled,
LegoBase significantly outperforms a commercial database and an existing query
compiler. (b) Programmers need to provide just a few hundred lines of
high-level code for implementing the optimizations, instead of complicated
low-level code that is required by existing query compilation approaches. (c)
The compilation overhead is low compared to the overall execution time, thus
making our approach usable in practice for compiling query engines.
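The generative-programming idea can be sketched in a few lines. LegoBase
itself is written in Scala and emits C, so the Java below is only an
illustration of source-to-source specialization: the engine emits low-level
code specialized to the query and the chosen data layout, rather than
interpreting a generic plan. The emitted snippet is invented for the example.

    public class CodegenSketch {
        // Emit a specialized scan-and-count over one column of a table; the
        // layout switch shows how a row-to-column optimization changes the
        // generated code without touching the high-level engine code.
        static String compileScan(String table, String column, boolean columnar) {
            StringBuilder c = new StringBuilder();
            c.append("// specialized scan over ").append(table).append("\n");
            if (columnar) {
                // Column layout: tight loop over one contiguous array.
                c.append("for (int i = 0; i < n; i++)\n")
                 .append("  if (").append(column).append("_col[i] > 0) count++;\n");
            } else {
                // Row layout: stride through records, reading one field each.
                c.append("for (int i = 0; i < n; i++)\n")
                 .append("  if (rows[i].").append(column).append(" > 0) count++;\n");
            }
            return c.toString();
        }

        public static void main(String[] args) {
            System.out.println(compileScan("lineitem", "quantity", true));
        }
    }
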
Parallel programming paradigms and frameworks in big data era
With Cloud Computing emerging as a promising new approach for ad-hoc parallel data processing, major companies have started to integrate frameworks for parallel data processing into their product portfolios, making it easy for customers to access these services and to deploy their programs. We have entered the era of Big Data. The explosion and profusion of available data in a wide range of application domains raise new challenges and opportunities in a plethora of disciplines, ranging from science and engineering to biology and business. One major challenge is how to take advantage of the unprecedented scale of data, typically of a heterogeneous nature, in order to acquire further insights and knowledge for improving the quality of the offered services. To exploit this new resource, we need to scale up and scale out both our infrastructures and our standard techniques. Our society is already data-rich, but the question remains whether or not we have the conceptual tools to handle it. In this paper we discuss and analyze opportunities and challenges for efficient parallel data processing. Big Data is the next frontier for innovation, competition, and productivity, and many solutions continue to appear, partly supported by the considerable enthusiasm around the MapReduce paradigm for large-scale data analysis. We review various parallel and distributed programming paradigms, analyzing how they fit into the Big Data era, and present modern emerging paradigms and frameworks. To better support practitioners interested in this domain, we end with an analysis of ongoing research challenges towards a truly fourth-generation data-intensive science.
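As a minimal illustration of the MapReduce paradigm the survey builds on, the
following single-machine Java sketch expresses word count as a map phase
(split lines into words) followed by a reduce phase (aggregate counts per
key); distributed frameworks execute the same two phases across a cluster.

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class WordCountSketch {
        public static void main(String[] args) {
            List<String> lines = List.of("big data big insights", "data everywhere");
            Map<String, Long> counts = lines.stream()
                // Map phase: emit one word per token.
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                // Reduce phase: group by word and count occurrences.
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
            // Counts per word, e.g. big=2, data=2 (map iteration order unspecified).
            System.out.println(counts);
        }
    }
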
Rumble: Data Independence for Large Messy Data Sets
This paper introduces Rumble, an engine that executes JSONiq queries on
large, heterogeneous and nested collections of JSON objects, leveraging the
parallel capabilities of Spark so as to provide a high degree of data
independence. The design is based on two key insights: (i) how to map JSONiq
expressions to Spark transformations on RDDs and (ii) how to map JSONiq FLWOR
clauses to Spark SQL on DataFrames. We have developed a working implementation
of these mappings, showing that JSONiq can run efficiently on Spark to query
billions of objects, at least into the terabyte range. The JSONiq code is
concise in comparison to Spark's host languages, while seamlessly supporting
the nested, heterogeneous data sets that Spark SQL does not. The ability to
process this kind of input, which is commonly found in practice, is paramount
for data cleaning and curation. The
experimental analysis indicates that there is no excessive performance loss,
occasionally even a gain, over Spark SQL for structured data, and a performance
gain over PySpark. This demonstrates that a language such as JSONiq is a simple
and viable approach to large-scale querying of denormalized, heterogeneous,
arborescent data sets, in the same way as SQL can be leveraged for structured
data sets. The results also illustrate that Codd's concept of data independence
makes as much sense for heterogeneous, nested data sets as it does on highly
structured tables.
Comment: Preprint, 9 pages
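For illustration, here is a standard JSONiq FLWOR query of the kind Rumble
executes, wrapped in a small Java harness so the snippet is self-contained.
The file path and field names are invented; the comments indicate the mapping
described in the paper (FLWOR clauses to Spark SQL on DataFrames, other
expressions to RDD transformations).

    public class JsoniqExample {
        public static void main(String[] args) {
            String query =
                "for $e in json-file(\"hdfs:///logs/events.json\")\n" // -> DataFrame scan
              + "where $e.status eq \"error\"\n"                      // -> filter
              + "group by $svc := $e.service\n"                       // -> groupBy
              + "return { \"service\": $svc, \"errors\": count($e) }";// -> aggregate
            // The query would be handed to Rumble, which runs it on Spark.
            System.out.println(query);
        }
    }
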
Managed Query Processing within the SAP HANA Database Platform
The SAP HANA database extends the scope of traditional database engines as it supports data models beyond regular tables, e.g., text, graphs, or hierarchies. Moreover, SAP HANA also gives developers more fine-grained control to define their database application logic, e.g., by exposing specific operators that are difficult to express in SQL. Finally, the SAP HANA database implements efficient communication with dedicated client applications using more effective communication mechanisms than are available with standard interfaces like JDBC or ODBC. These features of the HANA database are complemented by the extended scripting engine, an application server for server-side JavaScript applications that is tightly integrated into query processing and application lifecycle management. As a result, the HANA platform offers more concise models and code for working with the platform and provides superior runtime performance. This paper describes how these specific capabilities of the HANA platform can be consumed, and gives a holistic overview of the platform from query modeling to deployment and efficient execution. As a distinctive feature, the HANA platform integrates most steps of the application lifecycle, and thus makes sure that all relevant artifacts stay consistent whenever they are modified. The HANA platform also covers transport facilities to deploy and undeploy applications in a complex system landscape.
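As a point of reference for the standard interfaces the paper contrasts with
HANA's dedicated client mechanisms, a plain JDBC round trip looks roughly as
follows. The host, port, credentials, and table are placeholders; the
jdbc:sap:// URL scheme is the one used by the SAP HANA JDBC driver.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HanaJdbcSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder connection data; a real system would use its own
            // host, port, and credentials.
            String url = "jdbc:sap://hana-host:30015/";
            try (Connection conn = DriverManager.getConnection(url, "USER", "SECRET");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM SALES")) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1)); // row count of the table
                }
            }
        }
    }
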