Data Management for Data Science - Towards Embedded Analytics
The rise of Data Science has caused an influx of new users in need of data management solutions. However, instead of utilizing existing RDBMS solutions they are opting to use a stack of independent solutions for data storage and processing glued together by scripting languages. This is not because they do not need the functionality that an integrated RDBMS provides, but rather because existing RDBMS implementations do not cater to their use case. To solve these issues, we propose a new class of data management systems: embedded analytical systems. These systems are tightly integrated with analytical tools, and provide fast and efficient access to the data stored within them. In this work, we describe the unique challenges and opportunities w.r.t. workloads, resilience and cooperation that are faced by this new class of systems and the steps we have taken towards addressing them in the DuckDB system.
Highlighting the performance diversity of analytical queries using VOILA
Hardware architecture has long influenced software architecture,
and notably so in analytical database systems. Currently, we see
a new trend emerging: A "tectonic shift" away from X86-based
platforms. Little is (yet) known on how this shift affects database
system performance and, consequently, should influence the design
choices made. In this paper, we investigate the performance characteristics of X86, POWER, ARM and RISC-V hardware on micro- as well as macro-benchmarks on a variety of analytical database engine designs. Our tool to do so is VOILA: a new database engine generator framework that from a single specification can generate
hundreds of different database architecture engines (called "flavors"), including well-known design points such as vectorized and data-centric execution.
We found that performance on different queries by different
flavors varies significantly, with no single best flavor overall, and
per query different flavors winning, depending on the hardware. We
think this "performance diversity" motivates a redesign of existing
– inflexible – engines towards hardware- and query-adaptive ones.
Additionally, we found that modern ARM platforms can beat X86
in terms of overall performance by up to 2×, provide up to 11.6×
lower cost per instance, and up to 4.4× lower cost per query run.
This is an early indication that the best days of X86 are over.
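The two design points named in the abstract above, data-centric and vectorized execution, can be contrasted with a minimal sketch. This is not VOILA output or VOILA syntax; the function names, the `VECTOR_SIZE` value, and the sample query (sum of discounted prices under a discount filter) are assumptions chosen only for illustration.

```python
from typing import List

VECTOR_SIZE = 1024  # assumed chunk size for the vectorized flavor


def sum_data_centric(prices: List[float], discounts: List[float],
                     max_discount: float) -> float:
    """Data-centric style: one fused loop, each tuple passes through all operators."""
    total = 0.0
    for price, discount in zip(prices, discounts):
        if discount <= max_discount:           # filter
            total += price * (1.0 - discount)  # projection + aggregation
    return total


def sum_vectorized(prices: List[float], discounts: List[float],
                   max_discount: float) -> float:
    """Vectorized style: each operator processes a whole chunk before the next runs."""
    total = 0.0
    for start in range(0, len(prices), VECTOR_SIZE):
        p = prices[start:start + VECTOR_SIZE]
        d = discounts[start:start + VECTOR_SIZE]
        # Filter operator: produce a selection vector of qualifying positions.
        selected = [i for i, x in enumerate(d) if x <= max_discount]
        # Projection operator: compute revenue for the selected positions only.
        revenue = [p[i] * (1.0 - d[i]) for i in selected]
        # Aggregation operator: fold the chunk's result into the running sum.
        total += sum(revenue)
    return total


# Hypothetical sample data; both flavors answer the same query.
sample_prices = [float(i % 100) for i in range(5000)]
sample_discounts = [(i % 4) * 0.25 for i in range(5000)]
same_answer = abs(
    sum_data_centric(sample_prices, sample_discounts, 0.5)
    - sum_vectorized(sample_prices, sample_discounts, 0.5)
) < 1e-6
```

Both functions compute the same result; they differ only in how work is scheduled, which is the kind of design choice whose performance the abstract reports as hardware- and query-dependent.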
Analytical Queries: A Comprehensive Survey
Modern hardware heterogeneity brings efficiency and performance opportunities
for analytical query processing. In the presence of continuous data volume and
complexity growth, bridging the gap between recent hardware advancements and
the data processing tools ecosystem is paramount for improving the speed of ETL
and model development. In this paper, we present a comprehensive overview of
existing analytical query processing approaches as well as the use and design
of systems that use heterogeneous hardware for the task. We then analyze
state-of-the-art solutions and identify missing pieces. The last two chapters
discuss the identified problems and present our view on how the ecosystem
should evolve.
Make the most out of your SIMD investments: counter control flow divergence in compiled query pipelines
Increasing single instruction multiple data (SIMD) capabilities in modern hardware allows for the compilation of data-parallel query pipelines. This means GPU-like challenges arise: control flow divergence causes the underutilization of vector-processing units. In this paper, we present efficient algorithms for the AVX-512 architecture to address this issue. These algorithms allow for the fine-grained assignment of new tuples to idle SIMD lanes. Furthermore, we present strategies for their integration with compiled query pipelines so that tuples are never evicted from registers. We evaluate our approach with three query types: (i) a table scan query based on TPC-H Query 1, which performs up to 34% faster when addressing underutilization, (ii) a hash join query, where we observe up to 25% higher performance, and (iii) an approximate geospatial join query, which shows performance improvements of up to 30%.
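The idea of assigning new tuples to idle lanes can be illustrated with a scalar simulation of lane utilization. This is not AVX-512 code and not the paper's algorithms; it only counts how full the "vector" is when the stage after a filter runs. The function names and the 25% selectivity in the example are assumptions.

```python
from itertools import islice
from typing import Callable, List

LANES = 16  # a 512-bit register holds sixteen 32-bit values


def utilization_plain(tuples: List[int], keep: Callable[[int], bool],
                      lanes: int = LANES) -> float:
    """Filter each full vector, then run the next stage on whatever survives."""
    runs = active = 0
    for start in range(0, len(tuples), lanes):
        survivors = [t for t in tuples[start:start + lanes] if keep(t)]
        if survivors:
            runs += 1                 # next stage executes with idle lanes
            active += len(survivors)
    return active / (runs * lanes) if runs else 0.0


def utilization_refill(tuples: List[int], keep: Callable[[int], bool],
                       lanes: int = LANES) -> float:
    """Refill idle lanes with fresh tuples before running the next stage."""
    source = iter(tuples)
    held: List[int] = []              # survivors that stay in their lanes
    exhausted = False
    runs = active = 0
    while not exhausted or held:
        while len(held) < lanes and not exhausted:
            idle = lanes - len(held)
            fresh = list(islice(source, idle))   # load only into idle lanes
            if len(fresh) < idle:
                exhausted = True
            held.extend(t for t in fresh if keep(t))
        if held:
            runs += 1                 # next stage runs on a (mostly) full vector
            active += len(held)
            held = []
    return active / (runs * lanes) if runs else 0.0


# Hypothetical workload: a filter that keeps one tuple in four.
data = list(range(1600))
plain = utilization_plain(data, lambda t: t % 4 == 0)
refill = utilization_refill(data, lambda t: t % 4 == 0)
```

In this simulation the refill variant keeps the downstream stage's lanes far fuller than the plain variant at the same selectivity. It says nothing about the cost of the refill step itself, which is what the AVX-512 algorithms in the paper address.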
PyTond: Efficient Python data science on the shoulders of databases
Python data science libraries such as Pandas and NumPy have recently gained immense popularity. Although these libraries are feature-rich and easy to use, their scalability limitations require more robust computational resources. In this paper, we present PyTond, an efficient approach to push the processing of data science workloads down into the database engines that are already known for their big data handling capabilities. Compared to previous work, by introducing TondIR, our approach can capture a more comprehensive set of workloads and data layouts. Moreover, by doing IR-level optimizations, we generate better SQL code that improves the query processing by the underlying database engine. Our evaluation results show promising performance improvements compared to Python and other alternatives for diverse data science workloads.
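The general idea of pushing a dataframe-style operation down into a database can be sketched as follows. This is not PyTond or TondIR: the `GroupSum` node and the `to_sql` and `run` helpers are hypothetical, and SQLite from the Python standard library stands in for the database engine.

```python
import sqlite3
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class GroupSum:
    """Hypothetical mini-IR node: filter rows, then sum one column per group."""
    table: str
    key: str
    value: str
    min_value: Optional[float] = None  # keep rows where value > min_value


def to_sql(node: GroupSum) -> str:
    """Translate the node into SQL text.

    Identifiers are interpolated directly, so they must come from trusted
    code; the filter constant is passed as a bound parameter instead.
    """
    where = f" WHERE {node.value} > ?" if node.min_value is not None else ""
    return (
        f"SELECT {node.key}, SUM({node.value}) FROM {node.table}"
        f"{where} GROUP BY {node.key} ORDER BY {node.key}"
    )


def run(conn: sqlite3.Connection, node: GroupSum) -> List[Tuple[str, float]]:
    """Execute the generated SQL inside the database and fetch the result."""
    params = () if node.min_value is None else (node.min_value,)
    return conn.execute(to_sql(node), params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("north", 5.0), ("south", 7.0), ("south", 1.0)],
)

# Roughly what df[df.amount > 2].groupby("region").amount.sum() asks for.
result = run(conn, GroupSum("sales", "region", "amount", min_value=2.0))
```

The filtering and aggregation happen inside the engine and only the grouped result crosses back into Python, which is the motivation the abstract gives for generating SQL from data science code.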
- …