Machine Learning over Static and Dynamic Relational Data
This tutorial gives an overview of the principles behind recent work on training
and maintaining machine learning models over relational data, with an emphasis
on exploiting the relational structure of the data to improve the runtime
performance of the learning task.
The tutorial has the following parts:
1) Database research for data science
2) Three main ideas to achieve performance improvements
2.1) Turn the ML problem into a DB problem
2.2) Exploit structure of the data and problem
2.3) Exploit engineering tools of a DB researcher
3) Avenues for future research
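Idea 2.1, turning the ML problem into a DB problem, can be illustrated with a minimal sketch (the schema and data here are hypothetical, not from the tutorial): instead of materializing a join and exporting it to an ML library, the sufficient statistics of a least-squares fit are computed as a single aggregate query inside the database.

```python
import sqlite3

# Hypothetical two-relation schema whose join defines the training data.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales(store INTEGER, x REAL);
CREATE TABLE stores(store INTEGER, y REAL);
INSERT INTO sales  VALUES (1, 1.0), (1, 2.0), (2, 3.0);
INSERT INTO stores VALUES (1, 10.0), (2, 20.0);
""")

# Push the sufficient statistics of y = a*x + b (count, sums, cross
# products) down to the database as one aggregate over the join, rather
# than exporting the joined table to an ML library.
n, sx, sy, sxx, sxy = con.execute("""
    SELECT COUNT(*), SUM(x), SUM(y), SUM(x*x), SUM(x*y)
    FROM sales NATURAL JOIN stores
""").fetchone()

# Closed-form least-squares solution from the aggregates alone.
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n
print(a, b)
```

The learning task is reduced to a handful of scalars, so the (possibly much larger) join result never needs to leave the database.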
PyTond: Efficient Python data science on the shoulders of databases
Python data science libraries such as Pandas and NumPy have recently gained immense popularity. Although these libraries are feature-rich and easy to use, their limited scalability demands ever more computational resources. In this paper, we present PyTond, an efficient approach that pushes the processing of data science workloads down into database engines, which are already known for their big-data handling capabilities. Compared to prior work, by introducing TondIR, our approach captures a more comprehensive set of workloads and data layouts. Moreover, by performing IR-level optimizations, we generate better SQL code, which improves query processing in the underlying database engine. Our evaluation shows promising performance improvements over Python and other alternatives on diverse data science workloads.
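TondIR and PyTond's generated SQL are not reproduced here; as a minimal sketch of the pushdown idea under those caveats, the same filter-and-aggregate workload is written once in Pandas (executed in the Python runtime) and once as SQL shipped to SQLite:

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"dept": ["a", "a", "b"], "salary": [10.0, 20.0, 30.0]})

# Pandas version: evaluated by the Python runtime.
pandas_result = df[df.salary > 15].groupby("dept").salary.sum().to_dict()

# Pushed-down version: the same workload expressed as SQL and executed
# by the database engine, which brings its own optimizer and storage layer.
con = sqlite3.connect(":memory:")
df.to_sql("emp", con, index=False)
sql_result = dict(con.execute(
    "SELECT dept, SUM(salary) FROM emp WHERE salary > 15 GROUP BY dept"
))
print(sql_result)  # {'a': 20.0, 'b': 30.0}
```

A system like PyTond automates this translation, so the user keeps the Pandas-style surface syntax while execution happens in the database.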
Functional Collection Programming with Semi-Ring Dictionaries
This paper introduces semi-ring dictionaries, a powerful class of
compositional and purely functional collections that subsume other collection
types such as sets, multisets, arrays, vectors, and matrices. We developed
SDQL, a statically typed language that can express relational algebra with
aggregations, linear algebra, and functional collections over data such as
relations and matrices using semi-ring dictionaries. Furthermore, thanks to the
algebraic structure behind these dictionaries, SDQL unifies a wide range of
optimizations commonly used in databases (DB) and linear algebra (LA). As a
result, SDQL enables efficient processing of hybrid DB and LA workloads, by
putting together optimizations that are otherwise confined to either DB systems
or LA frameworks. We show experimentally that a handful of DB and LA workloads
can take advantage of the SDQL language and optimizations. Overall, we observe
that SDQL achieves competitive performance relative to Typer and Tectorwise,
which are state-of-the-art in-memory DB systems for (flat, not nested)
relational data, and achieves an average 2x speedup over SciPy for LA
workloads. For hybrid workloads involving LA processing, SDQL achieves up to
one order of magnitude speedup over Trance, a state-of-the-art nested
relational engine for nested biomedical data, and gives an average 40% speedup
over LMFAO, a state-of-the-art in-DB machine learning engine for two (flat)
relational real-world retail datasets.
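The core notion can be sketched in a few lines (names and operations here are illustrative; SDQL's actual surface syntax and semi-ring machinery are not reproduced): a semi-ring dictionary is a mapping whose values form a semi-ring, with addition merging keys pointwise, so sparse vectors and multisets fall out as special cases.

```python
# Illustrative sketch of a semi-ring dictionary: addition merges keys by
# adding their values; scaling acts pointwise on values.
class SRDict(dict):
    def __add__(self, other):
        out = SRDict(self)
        for k, v in other.items():
            out[k] = out.get(k, 0) + v
        return out

    def scale(self, c):
        return SRDict({k: c * v for k, v in self.items()})

# A sparse vector is a dictionary from index to number...
v1 = SRDict({0: 1.0, 2: 3.0})
v2 = SRDict({2: 1.0, 5: 4.0})
print(v1 + v2)   # {0: 1.0, 2: 4.0, 5: 4.0}

# ...and a multiset (bag) is a dictionary from element to count.
bag = SRDict({"x": 2}) + SRDict({"x": 1, "y": 1})
print(bag)       # {'x': 3, 'y': 1}
```

Because vectors, matrices (dictionaries keyed by index pairs), and relations (dictionaries from tuples to multiplicities) all share this one structure, the same algebraic rewrites can optimize DB-style and LA-style operations uniformly.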