Functional Collection Programming with Semi-Ring Dictionaries
This paper introduces semi-ring dictionaries, a powerful class of
compositional and purely functional collections that subsume other collection
types such as sets, multisets, arrays, vectors, and matrices. We developed
SDQL, a statically typed language that can express relational algebra with
aggregations, linear algebra, and functional collections over data such as
relations and matrices using semi-ring dictionaries. Furthermore, thanks to the
algebraic structure behind these dictionaries, SDQL unifies a wide range of
optimizations commonly used in databases (DB) and linear algebra (LA). As a
result, SDQL enables efficient processing of hybrid DB and LA workloads, by
putting together optimizations that are otherwise confined to either DB systems
or LA frameworks. We show experimentally that a handful of DB and LA workloads
can take advantage of the SDQL language and optimizations. Overall, we observe
that SDQL achieves competitive performance relative to Typer and Tectorwise,
which are state-of-the-art in-memory DB systems for (flat, not nested)
relational data, and achieves an average 2x speedup over SciPy for LA
workloads. For hybrid workloads involving LA processing, SDQL achieves up to
one order of magnitude speedup over Trance, a state-of-the-art nested
relational engine for nested biomedical data, and gives an average 40% speedup
over LMFAO, a state-of-the-art in-DB machine learning engine for two (flat)
relational real-world retail datasets.
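To make the idea concrete, here is a minimal sketch (in plain Python, not SDQL syntax; the names `dict_add` and `matvec` are illustrative) of how semi-ring dictionaries subsume both multisets and sparse matrices: a dictionary maps keys to values in a semi-ring, dictionary addition merges entries by adding values, and nested dictionaries yield linear-algebra objects.

```python
# Illustrative sketch of the semi-ring dictionary idea (not SDQL itself).

def dict_add(a, b):
    """Semi-ring dictionary addition: merge entries, summing shared keys."""
    out = dict(a)
    for k, v in b.items():
        out[k] = out.get(k, 0) + v
    return out

# A multiset is a dictionary into the counting semi-ring (integers):
bag = dict_add({"x": 2}, {"x": 1, "y": 4})   # {"x": 3, "y": 4}

# A sparse matrix is a dictionary of dictionaries (row -> col -> value),
# so matrix-vector multiplication is just a nested aggregation:
def matvec(m, v):
    out = {}
    for i, row in m.items():
        s = sum(val * v.get(j, 0) for j, val in row.items())
        if s:
            out[i] = s
    return out

A = {0: {0: 1.0, 1: 2.0}, 1: {1: 3.0}}
x = {0: 1.0, 1: 1.0}
matvec(A, x)   # {0: 3.0, 1: 3.0}
```

The same `dict_add` also expresses relational aggregation (grouping keys, summing payloads), which is what lets one optimizer treat DB and LA operations uniformly.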
A Language-Based Approach to Programming with Serialized Data
Thesis (Ph.D.) - Indiana University, School of Informatics, Computing, and Engineering, 2021. In a typical data-processing application, the representation of data in memory is distinct from its representation in a serialized form on disk. The former has pointers and an arbitrary, sparse layout, facilitating easier manipulation by a program, while the latter is packed contiguously, facilitating easier I/O. I propose a programming language, LoCal, that unifies the in-memory and on-disk representations of data. LoCal extends prior work on region calculi into a location calculus, employing a type system that tracks the byte-addressed layout of all heap values. I present the formal semantics of LoCal and prove type safety, and show how to infer LoCal programs from unannotated source terms. Then, I demonstrate how to efficiently implement LoCal in a practical compiler that produces code competitive with hand-written C.
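The core idea can be illustrated with a toy sketch (in Python, not LoCal; the tag bytes and function names are assumptions for illustration): a binary tree is packed contiguously in preorder, and computations traverse the packed bytes directly, with no pointers and no deserialization step.

```python
# Toy sketch of computing directly over a serialized representation
# (illustrative only; LoCal's actual layouts are richer).
LEAF, NODE = 0, 1

def pack(tree):
    """Serialize a tree like ('node', l, r) / ('leaf', n) in preorder."""
    if tree[0] == "leaf":
        return bytes([LEAF, tree[1]])
    return bytes([NODE]) + pack(tree[1]) + pack(tree[2])

def sum_leaves(buf, i=0):
    """Sum leaf values straight off the packed buffer.
    Returns (sum, offset just past this subtree)."""
    if buf[i] == LEAF:
        return buf[i + 1], i + 2
    left, j = sum_leaves(buf, i + 1)
    right, k = sum_leaves(buf, j)
    return left + right, k

t = ("node", ("leaf", 1), ("node", ("leaf", 2), ("leaf", 3)))
sum_leaves(pack(t))[0]   # 6
```

Because the traversal threads a byte offset instead of chasing pointers, the "in-memory" and "on-disk" forms coincide; LoCal's type system is what makes such offset-threading safe in general.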
Compilation and Code Optimization for Data Analytics
The trade-offs between the use of modern high-level and low-level programming languages in constructing complex software artifacts are well known. High-level languages allow for greater programmer productivity: abstraction and genericity allow for the same functionality to be implemented with significantly less code compared to low-level languages. Modularity, object-orientation, functional programming, and powerful type systems allow programmers not only to create clean abstractions and protect them from leaking, but also to define code units that are reusable and easily composable, and software architectures that are adaptable and extensible. The abstraction, succinctness, and modularity of high-level code help to avoid software bugs and facilitate debugging and maintenance.
The use of high-level languages comes at a performance cost: increased indirection due to abstraction, virtualization, and interpretation, and superfluous work, particularly in the form of temporary memory allocation and deallocation to support objects and encapsulation.
As a result, the cost of high-level languages for performance-critical systems may seem prohibitive.
The vision of abstraction without regret argues that it is possible to use high-level languages for building performance-critical systems that allow for both productivity and high performance, instead of trading off the former for the latter. In this thesis, we realize this vision for building different types of data analytics systems. Our means of achieving this is by employing compilation. The goal is to compile away expensive language features -- to compile high-level code down to efficient low-level code.
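The "compile away expensive language features" approach can be sketched in miniature (a toy in Python, not the thesis's actual compiler; `interpret` and `compile_pipeline` are illustrative names): instead of interpreting a pipeline of small closures with one indirect call per operation per element, generate a single fused loop as source code and execute that.

```python
# Toy sketch of removing abstraction overhead by code generation.

def interpret(ops, xs):
    """High-level but slow: one indirect call per op per element."""
    out = []
    for x in xs:
        for op in ops:
            x = op(x)
        out.append(x)
    return out

def compile_pipeline(exprs):
    """Fuse expression templates into one generated function."""
    body = "x"
    for e in exprs:
        body = e.format(body)
    src = f"def fused(xs):\n    return [{body} for x in xs]"
    ns = {}
    exec(src, ns)
    return ns["fused"]

ops = [lambda x: x + 1, lambda x: x * 2]
exprs = ["({} + 1)", "({} * 2)"]
fused = compile_pipeline(exprs)
interpret(ops, [1, 2, 3])   # [4, 6, 8]
fused([1, 2, 3])            # [4, 6, 8]
```

Both produce the same result, but the generated `fused` has no per-element closure dispatch -- the same shape of win, at a much larger scale, that compiling high-level analytics code down to low-level code provides.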