Machine Learning over Static and Dynamic Relational Data
This tutorial gives an overview of the principles behind recent work on training
and maintaining machine learning models over relational data, with an emphasis
on exploiting the relational structure of the data to improve the runtime
performance of the learning task.
The tutorial has the following parts:
1) Database research for data science
2) Three main ideas to achieve performance improvements
2.1) Turn the ML problem into a DB problem
2.2) Exploit structure of the data and problem
2.3) Exploit engineering tools of a DB researcher
3) Avenues for future research
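Idea 2.1, turning the ML problem into a DB problem, can be illustrated with a minimal sketch (the schema and data here are hypothetical, not from the tutorial): instead of materializing a join and exporting it to an ML library, the sufficient statistics of a least-squares fit are computed as a single aggregate query inside the database.

```python
import sqlite3

# Hypothetical two-relation schema whose join defines the training data.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales(store INTEGER, x REAL);
CREATE TABLE stores(store INTEGER, y REAL);
INSERT INTO sales  VALUES (1, 1.0), (1, 2.0), (2, 3.0);
INSERT INTO stores VALUES (1, 10.0), (2, 20.0);
""")

# Push the sufficient statistics of y = a*x + b (count, sums, cross
# products) down to the database as one aggregate over the join, rather
# than exporting the joined table to an ML library.
n, sx, sy, sxx, sxy = con.execute("""
    SELECT COUNT(*), SUM(x), SUM(y), SUM(x*x), SUM(x*y)
    FROM sales NATURAL JOIN stores
""").fetchone()

# Closed-form least-squares solution from the aggregates alone.
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n
print(a, b)
```

The learning task is reduced to a handful of scalars, so the (possibly much larger) join result never needs to leave the database.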
PyTond: Efficient Python data science on the shoulders of databases
Python data science libraries such as Pandas and NumPy have recently gained immense popularity. Although these libraries are feature-rich and easy to use, their limited scalability demands ever more computational resources. In this paper, we present PyTond, an efficient approach that pushes the processing of data science workloads down into database engines, which are already known for their big-data handling capabilities. Compared to prior work, by introducing TondIR, our approach captures a more comprehensive set of workloads and data layouts. Moreover, by performing IR-level optimizations, we generate better SQL code, which improves query processing in the underlying database engine. Our evaluation shows promising performance improvements over Python and other alternatives on diverse data science workloads.
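TondIR and PyTond's generated SQL are not reproduced here; as a minimal sketch of the pushdown idea under those caveats, the same filter-and-aggregate workload is written once in Pandas (executed in the Python runtime) and once as SQL shipped to SQLite:

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"dept": ["a", "a", "b"], "salary": [10.0, 20.0, 30.0]})

# Pandas version: evaluated by the Python runtime.
pandas_result = df[df.salary > 15].groupby("dept").salary.sum().to_dict()

# Pushed-down version: the same workload expressed as SQL and executed
# by the database engine, which brings its own optimizer and storage layer.
con = sqlite3.connect(":memory:")
df.to_sql("emp", con, index=False)
sql_result = dict(con.execute(
    "SELECT dept, SUM(salary) FROM emp WHERE salary > 15 GROUP BY dept"
))
print(sql_result)  # {'a': 20.0, 'b': 30.0}
```

A system like PyTond automates this translation, so the user keeps the Pandas-style surface syntax while execution happens in the database.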
Functional Collection Programming with Semi-Ring Dictionaries
This paper introduces semi-ring dictionaries, a powerful class of
compositional and purely functional collections that subsume other collection
types such as sets, multisets, arrays, vectors, and matrices. We developed
SDQL, a statically typed language that can express relational algebra with
aggregations, linear algebra, and functional collections over data such as
relations and matrices using semi-ring dictionaries. Furthermore, thanks to the
algebraic structure behind these dictionaries, SDQL unifies a wide range of
optimizations commonly used in databases (DB) and linear algebra (LA). As a
result, SDQL enables efficient processing of hybrid DB and LA workloads, by
putting together optimizations that are otherwise confined to either DB systems
or LA frameworks. We show experimentally that a handful of DB and LA workloads
can take advantage of the SDQL language and optimizations. Overall, we observe
that SDQL achieves competitive performance relative to Typer and Tectorwise,
which are state-of-the-art in-memory DB systems for (flat, not nested)
relational data, and achieves an average 2x speedup over SciPy for LA
workloads. For hybrid workloads involving LA processing, SDQL achieves up to
one order of magnitude speedup over Trance, a state-of-the-art nested
relational engine for nested biomedical data, and gives an average 40% speedup
over LMFAO, a state-of-the-art in-DB machine learning engine for two (flat)
relational real-world retail datasets.
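The core notion can be sketched in a few lines (names and operations here are illustrative; SDQL's actual surface syntax and semi-ring machinery are not reproduced): a semi-ring dictionary is a mapping whose values form a semi-ring, with addition merging keys pointwise, so sparse vectors and multisets fall out as special cases.

```python
# Illustrative sketch of a semi-ring dictionary: addition merges keys by
# adding their values; scaling acts pointwise on values.
class SRDict(dict):
    def __add__(self, other):
        out = SRDict(self)
        for k, v in other.items():
            out[k] = out.get(k, 0) + v
        return out

    def scale(self, c):
        return SRDict({k: c * v for k, v in self.items()})

# A sparse vector is a dictionary from index to number...
v1 = SRDict({0: 1.0, 2: 3.0})
v2 = SRDict({2: 1.0, 5: 4.0})
print(v1 + v2)   # {0: 1.0, 2: 4.0, 5: 4.0}

# ...and a multiset (bag) is a dictionary from element to count.
bag = SRDict({"x": 2}) + SRDict({"x": 1, "y": 1})
print(bag)       # {'x': 3, 'y': 1}
```

Because vectors, matrices (dictionaries keyed by index pairs), and relations (dictionaries from tuples to multiplicities) all share this one structure, the same algebraic rewrites can optimize DB-style and LA-style operations uniformly.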