66 research outputs found

    Data Management for Data Science - Towards Embedded Analytics

    The rise of Data Science has caused an influx of new users in need of data management solutions. However, instead of utilizing existing RDBMS solutions, they are opting to use a stack of independent solutions for data storage and processing glued together by scripting languages. This is not because they do not need the functionality that an integrated RDBMS provides, but rather because existing RDBMS implementations do not cater to their use case. To solve these issues, we propose a new class of data management systems: embedded analytical systems. These systems are tightly integrated with analytical tools, and provide fast and efficient access to the data stored within them. In this work, we describe the unique challenges and opportunities w.r.t. workloads, resilience and cooperation that are faced by this new class of systems and the steps we have taken towards addressing them in the DuckDB system

    Highlighting the performance diversity of analytical queries using VOILA

    Hardware architecture has long influenced software architecture, and notably so in analytical database systems. Currently, we see a new trend emerging: A "tectonic shift" away from X86-based platforms. Little is (yet) known on how this shift affects database system performance and, consequently, should influence the design choices made. In this paper, we investigate the performance characteristics of X86, POWER, ARM and RISC-V hardware on micro- as well as macro-benchmarks on a variety of analytical database engine designs. Our tool to do so is VOILA: a new database engine generator framework that from a single specification can generate hundreds of different database architecture engines (called "flavors"), among which well-known design points such as vectorized and data-centric execution. We found that performance on different queries by different flavors varies significantly, with no single best flavor overall, and per query different flavors winning, depending on the hardware. We think this "performance diversity" motivates a redesign of existing – inflexible – engines towards hardware- and query-adaptive ones. Additionally, we found that modern ARM platforms can beat X86 in terms of overall performance by up to 2×, provide up to 11.6× lower cost per instance, and up to 4.4× lower cost per query run. This is an early indication that the best days of X86 are over

    Analytical Queries: A Comprehensive Survey

    Modern hardware heterogeneity brings efficiency and performance opportunities for analytical query processing. In the presence of continuous data volume and complexity growth, bridging the gap between recent hardware advancements and the data processing tools ecosystem is paramount for improving the speed of ETL and model development. In this paper, we present a comprehensive overview of existing analytical query processing approaches as well as the use and design of systems that use heterogeneous hardware for the task. We then analyze state-of-the-art solutions and identify missing pieces. The last two chapters discuss the identified problems and present our view on how the ecosystem should evolve

    Make the most out of your SIMD investments: counter control flow divergence in compiled query pipelines

    Increasing single instruction multiple data (SIMD) capabilities in modern hardware allows for the compilation of data-parallel query pipelines. This means GPU-alike challenges arise: control flow divergence causes the underutilization of vector-processing units. In this paper, we present efficient algorithms for the AVX-512 architecture to address this issue. These algorithms allow for the fine-grained assignment of new tuples to idle SIMD lanes. Furthermore, we present strategies for their integration with compiled query pipelines so that tuples are never evicted from registers. We evaluate our approach with three query types: (i) a table scan query based on TPC-H Query 1, that performs up to 34% faster when addressing underutilization, (ii) a hashjoin query, where we observe up to 25% higher performance, and (iii) an approximate geospatial join query, which shows performance improvements of up to 30%

    PyTond: Efficient Python data science on the shoulders of databases

    Python data science libraries such as Pandas and NumPy have recently gained immense popularity. Although these libraries are feature-rich and easy to use, their scalability limitations require more robust computational resources. In this paper, we present PyTond, an efficient approach to push the processing of data science workloads down into the database engines that are already known for their big data handling capabilities. Compared to the previous work, by introducing TondIR, our approach can capture a more comprehensive set of workloads and data layouts. Moreover, by doing IR-level optimizations, we generate better SQL code that improves the query processing by the underlying database engine. Our evaluation results show promising performance improvement compared to Python and other alternatives for diverse data science workloads