66 research outputs found

Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask

Author: Boncz P.A. (Peter)
Kemper A. (Alfons)
Kersten T. (Timo)
Leis V. (Viktor)
Neumann T. (Thomas)
Pavlo A. (Andrew)
Publication date: 01/01/2018

CWI's Institutional Repository

Everything you always wanted to know about compiled and vectorized queries but were afraid to ask

Author: Agarwal S.
Anikiej K.
Boncz P.
Boncz P. A.
Freedman C.
Gubner T.
Kersten T.
Lorie R. A.
Neumann T.
Palkar S.
Wanderman-Milne S.
Publication venue: 'VLDB Endowment'

Crossref

Data Management for Data Science - Towards Embedded Analytics

Author: Mühleisen H.F. (Hannes)
Raasveldt M. (Mark)
Publication date: 12/01/2020

The rise of Data Science has caused an influx of new users in need of data management solutions. However, instead of utilizing existing RDBMS solutions they are opting to use a stack of independent solutions for data storage and processing glued together by scripting languages. This is not because they do not need the functionality that an integrated RDBMS provides, but rather because existing RDBMS implementations do not cater to their use case. To solve these issues, we propose a new class of data management systems: embedded analytical systems. These systems are tightly integrated with analytical tools, and provide fast and efficient access to the data stored within them. In this work, we describe the unique challenges and opportunities w.r.t. workloads, resilience and cooperation that are faced by this new class of systems and the steps we have taken towards addressing them in the DuckDB system.

CWI's Institutional Repository

Highlighting the performance diversity of analytical queries using VOILA

Author: Boncz P.A. (Peter)
Gubner T.K. (Tim)
Publication date: 16/08/2021

Hardware architecture has long influenced software architecture, and notably so in analytical database systems. Currently, we see a new trend emerging: A "tectonic shift" away from X86-based platforms. Little is (yet) known on how this shift affects database system performance and, consequently, should influence the design choices made. In this paper, we investigate the performance characteristics of X86, POWER, ARM and RISC-V hardware on micro- as well as macro-benchmarks on a variety of analytical database engine designs. Our tool to do so is VOILA: a new database engine generator framework that from a single specification can generate hundreds of different database architecture engines (called "flavors"), among which well-known design points such as vectorized and data-centric execution. We found that performance on different queries by different flavors varies significantly, with no single best flavor overall, and per query different flavors winning, depending on the hardware. We think this "performance diversity" motivates a redesign of existing – inflexible – engines towards hardware- and query-adaptive ones. Additionally, we found that modern ARM platforms can beat X86 in terms of overall performance by up to 2×, provide up to 11.6× lower cost per instance, and up to 4.4× lower cost per query run. This is an early indication that the best days of X86 are over.

CWI's Institutional Repository

Analytical Queries: A Comprehensive Survey

Author: Kurapov Petr
Melik-Adamyan Areg
Publication date: 27/11/2023

Modern hardware heterogeneity brings efficiency and performance opportunities for analytical query processing. In the presence of continuous data volume and complexity growth, bridging the gap between recent hardware advancements and the data processing tools ecosystem is paramount for improving the speed of ETL and model development. In this paper, we present a comprehensive overview of existing analytical query processing approaches as well as the use and design of systems that use heterogeneous hardware for the task. We then analyze state-of-the-art solutions and identify missing pieces. The last two chapters discuss the identified problems and present our view on how the ecosystem should evolve.

arXiv.org e-Print Archive

Make the most out of your SIMD investments: counter control flow divergence in compiled query pipelines

Author: Boncz P.A. (Peter)
Kemper A. (Alfons)
Kipf A. (Andreas)
Lang H. (Harald)
Neumann T. (Thomas)
Passing L.K. (Linnea)
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 16/07/2019

Increasing single instruction multiple data (SIMD) capabilities in modern hardware allows for the compilation of data-parallel query pipelines. This means GPU-alike challenges arise: control flow divergence causes the underutilization of vector-processing units. In this paper, we present efficient algorithms for the AVX-512 architecture to address this issue. These algorithms allow for the fine-grained assignment of new tuples to idle SIMD lanes. Furthermore, we present strategies for their integration with compiled query pipelines so that tuples are never evicted from registers. We evaluate our approach with three query types: (i) a table scan query based on TPC-H Query 1, which performs up to 34% faster when addressing underutilization, (ii) a hash join query, where we observe up to 25% higher performance, and (iii) an approximate geospatial join query, which shows performance improvements of up to 30%.

CWI's Institutional Repository

PyTond: Efficient Python data science on the shoulders of databases

Author: Ghorbani Mahdi
Kaboli Amirali
Shahrokhi Hesam
Shaikhha Amir
Publication date: 23/07/2024

Python data science libraries such as Pandas and NumPy have recently gained immense popularity. Although these libraries are feature-rich and easy to use, their scalability limitations require more robust computational resources. In this paper, we present PyTond, an efficient approach to push the processing of data science workloads down into the database engines that are already known for their big data handling capabilities. Compared to the previous work, by introducing TondIR, our approach can capture a more comprehensive set of workloads and data layouts. Moreover, by doing IR-level optimizations, we generate better SQL code that improves the query processing by the underlying database engine. Our evaluation results show promising performance improvement compared to Python and other alternatives for diverse data science workloads.

Edinburgh Research Explorer