146 research outputs found
Tupleware: Redefining Modern Analytics
There is a fundamental discrepancy between the targeted and actual users of
current analytics frameworks. Most systems are designed for the data and
infrastructure of the Googles and Facebooks of the world---petabytes of data
distributed across large cloud deployments consisting of thousands of cheap
commodity machines. Yet, the vast majority of users operate clusters ranging
from a few to a few dozen nodes, analyze relatively small datasets of up to a
few terabytes, and perform primarily compute-intensive operations. Targeting
these users fundamentally changes the way we should build analytics systems.
This paper describes the design of Tupleware, a new system specifically aimed
at the challenges faced by the typical user. Tupleware's architecture brings
together ideas from the database, compiler, and programming languages
communities to create a powerful end-to-end solution for data analysis. We
propose novel techniques that consider the data, computations, and hardware
together to achieve maximum performance on a case-by-case basis. Our
experimental evaluation quantifies the impact of our novel techniques and shows
orders of magnitude performance improvement over alternative systems.
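The core idea of considering the computation as a whole rather than executing operators one at a time can be illustrated with a small sketch. The function names below are illustrative, not Tupleware's actual API; the sketch only shows the principle of fusing a pipeline of per-tuple operations into a single pass, avoiding materialization of intermediate results between operators.

```python
# Hypothetical sketch: fuse a list of per-tuple map/filter steps
# into one loop, so each tuple flows through the whole pipeline
# without intermediate collections being materialized.

def compile_pipeline(ops):
    """Fuse (kind, fn) steps into a single-pass function."""
    def fused(tuples):
        out = []
        for t in tuples:
            for kind, fn in ops:
                if kind == "map":
                    t = fn(t)
                elif kind == "filter" and not fn(t):
                    t = None
                    break
            if t is not None:
                out.append(t)
        return out
    return fused

pipeline = compile_pipeline([
    ("map", lambda x: x * 2),     # compute-intensive step
    ("filter", lambda x: x > 4),  # selective step
])
print(pipeline([1, 2, 3, 4]))  # [6, 8]
```

A real system would emit and compile native code for the fused loop rather than interpret it, but the fusion decision itself is the same.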
ArcaDB: A Container-based Disaggregated Query Engine for Heterogeneous Computational Environments
Modern enterprises rely on data management systems to collect, store, and
analyze vast amounts of data related to their operations. Nowadays, clusters
and hardware accelerators (e.g., GPUs, TPUs) have become a necessity to scale
with the data processing demands in many applications related to social media,
bioinformatics, surveillance systems, remote sensing, and medical informatics.
Given this new scenario, the architecture of data analytics engines must evolve
to take advantage of these new technological trends. In this paper, we present
ArcaDB: a disaggregated query engine that leverages container technology to
place operators at compute nodes that fit their performance profile. In ArcaDB,
a query plan is dispatched to worker nodes that have different computing
characteristics. Each operator is annotated with the preferred type of compute
node for execution, and ArcaDB ensures that the operator gets picked up by the
appropriate workers. We have implemented a prototype version of ArcaDB using
Java, Python, and Docker containers. We have also completed a preliminary
performance study of this prototype, using images and scientific data. This
study shows that ArcaDB can speed up query performance by a factor of 3.5 in
comparison with a shared-nothing, symmetric arrangement.
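The dispatch scheme described above can be sketched in a few lines. The class and function names here are illustrative assumptions, not ArcaDB's actual interface; the sketch only shows the idea of annotating each operator with a preferred compute-node type and matching it to a worker with that profile.

```python
# Illustrative sketch of ArcaDB-style dispatch: operators carry a
# preferred node type, and each is assigned to a worker whose
# hardware profile matches that annotation.

from dataclasses import dataclass

@dataclass
class Operator:
    name: str
    preferred: str  # e.g. "cpu" or "gpu"

@dataclass
class Worker:
    node_id: str
    profile: str

def dispatch(plan, workers):
    """Assign each operator to the first worker matching its profile."""
    assignment = {}
    for op in plan:
        for w in workers:
            if w.profile == op.preferred:
                assignment[op.name] = w.node_id
                break
    return assignment

plan = [Operator("scan", "cpu"), Operator("cnn_infer", "gpu")]
workers = [Worker("w1", "cpu"), Worker("w2", "gpu")]
print(dispatch(plan, workers))  # {'scan': 'w1', 'cnn_infer': 'w2'}
```

In the actual system the workers are containers pulling work rather than a central assigner pushing it, but the matching rule is the same: an operator only runs on a node whose profile fits its annotation.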
Formal Representation of the SS-DB Benchmark and Experimental Evaluation in EXTASCID
Evaluating the performance of scientific data processing systems is a
difficult task considering the plethora of application-specific solutions
available in this landscape and the lack of a generally-accepted benchmark. The
dual structure of scientific data coupled with the complex nature of processing
complicates the evaluation procedure further. SS-DB is the first attempt to
define a general benchmark for complex scientific processing over raw and
derived data. It has failed to draw sufficient attention, though, because of its
ambiguous plain-language specification and the extraordinary SciDB results. In
this paper, we remedy the shortcomings of the original SS-DB specification by
providing a formal representation in terms of ArrayQL algebra operators and
ArrayQL/SciQL constructs. These are the first formal representations of the
SS-DB benchmark. Starting from the formal representation, we give a reference
implementation and present benchmark results in EXTASCID, a novel system for
scientific data processing. EXTASCID is complete in providing native support
for both array and relational data and extensible in executing any user code
inside the system by means of a configurable metaoperator. These features
result in an order of magnitude improvement over SciDB at data loading,
extracting derived data, and operations over derived data.
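The notion of a configurable metaoperator can be sketched as follows. This is an illustrative assumption about the general pattern, not EXTASCID's actual interface: the engine owns iteration over array chunks, while arbitrary user code is plugged in per chunk and the partial results are combined.

```python
# Hypothetical sketch of a configurable metaoperator: the engine
# iterates over array chunks; the user supplies the per-chunk code
# and a combiner for the partial results.

def metaoperator(chunks, user_fn, combine):
    """Apply user code to each chunk, then fold the partials."""
    partials = [user_fn(chunk) for chunk in chunks]
    result = partials[0]
    for p in partials[1:]:
        result = combine(result, p)
    return result

# Example: global max and sum over a chunked 1-D array.
chunks = [[3, 1, 4], [1, 5, 9], [2, 6]]
print(metaoperator(chunks, max, max))                    # 9
print(metaoperator(chunks, sum, lambda a, b: a + b))     # 31
```

The extensibility claim follows from the shape of the interface: any user function with a compatible combiner can run inside the engine without changes to the engine itself.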
Declarative Data Analytics: a Survey
The area of declarative data analytics explores the application of the
declarative paradigm on data science and machine learning. It proposes
declarative languages for expressing data analysis tasks and develops systems
which optimize programs written in those languages. The execution engine can be
either centralized or distributed, as the declarative paradigm advocates
independence from particular physical implementations. The survey explores a
wide range of declarative data analysis frameworks by examining both the
programming model and the optimization techniques used, in order to provide
conclusions on the current state of the art in the area and identify open
challenges.
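The separation the survey describes, stating *what* to compute while the engine decides *how*, can be shown with a toy example. Nothing here reflects any surveyed system's API; it only illustrates a query expressed as data and an engine that chooses the evaluation order (here, filtering before projecting so less data is touched) without changing the result.

```python
# Toy sketch of declarative evaluation: the query is a description,
# and the engine picks the physical order of operations.

def evaluate(rows, query):
    """rows: list of dicts; query: {'select': [cols], 'where': fn}."""
    # Engine-side choice: apply the WHERE predicate first so the
    # projection only touches qualifying rows.
    kept = [r for r in rows if query["where"](r)]
    return [{c: r[c] for c in query["select"]} for r in kept]

rows = [{"id": 1, "x": 10}, {"id": 2, "x": 3}]
query = {"select": ["id"], "where": lambda r: r["x"] > 5}
print(evaluate(rows, query))  # [{'id': 1}]
```

Because the program never fixes an execution order, a centralized and a distributed engine can both evaluate the same query description, which is the physical independence the declarative paradigm advocates.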