
    Analytics for Everyone

    Analyzing relational data typically involves tasks that build familiarity with the data and yield insights, findings, or conclusions. This process is usually carried out by data experts, such as data scientists, who share their output with a potentially less expert audience (everyone). Our goal is to enable everyone to participate in analyzing data rather than passively consuming its outputs (analytics democratization). With today's increasing availability of data on the internet (data democratization), combined with already widespread personal computing capabilities, this goal is becoming more attainable. With the recent growth of public data, i.e., Open Data, users without a technical background are keener than ever to analyze new data sets that are relevant to wide sectors of society. An important example of Open Data is the data released by governments all over the world, i.e., Open Government data.

    This dissertation focuses on two main challenges that face data exploration scenarios such as exploring open data found on the web. First, the infrastructure necessary for interactive data exploration is costly and hard to manage, especially for users without technical knowledge. Second, target users need guidance through data exploration, since there are too many possible starting points.

    To eliminate the challenges of managing infrastructure, we propose a serverless, in-browser SQL engine, i.e., a portable database, which we call Afterburner. Afterburner achieves performance comparable to native SQL engines given the same resources on modestly sized data sets. It uses code generation techniques that target an optimization-amenable subset of JavaScript and employs typed arrays for its columnar in-memory storage. In addition, for databases that are too large for the browser, we propose a hybrid architecture that accelerates data exploration tasks: a one-time SQL query runs on the backend, and subsequent SQL queries run in the browser as the user interacts. Based on a simple hint from the user, Afterburner automatically splits a query into two parts: a backend query that generates a materialized view, which is shipped to the browser, and a frontend query per subsequent interaction that runs locally against this view. Optimizing queries against local materialized views inside the browser reduces query latency without adding any complexity to the backend or the frontend.

    One common theme among many data exploration tasks is navigating the many different ways to group the data, i.e., exploring the data cube. To guide the user through data exploration, we apply an information-theoretic technique, called explanation tables, that picks the most informative parts of the entire data cube of a relational table. We evaluate the efficiency and effectiveness of a sampling-based technique for generating explanation tables that achieves quality comparable to an exhaustive technique considering the entire data cube, with a significant reduction in run time. In addition, we introduce optimizations that let explanation tables fit the modest resources available in the browser without any external dependencies.

    In sum, we present an SQL engine and a data exploration guidance tool that run entirely in the browser. We view the techniques and experiments presented here as a fully functional, open-source proof of the viability of our proposal.
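    A minimal sketch can make the storage-plus-code-generation idea concrete. The schema, data, and function below are hypothetical, and the kernel is a hand-written stand-in for the kind of code Afterburner generates, not the system's actual output:

    // Each column lives in its own contiguous typed array, so a scan is a
    // tight numeric loop over flat memory that JavaScript JITs optimize well.
    const lineitem = {
      quantity: new Float64Array([17, 36, 8, 28, 24]),
      price:    new Float64Array([100.5, 86.0, 45.9, 13.5, 28.7]),
      discount: new Float64Array([0.04, 0.09, 0.10, 0.07, 0.02]),
    };

    // Hand-written analogue of a generated kernel for:
    //   SELECT SUM(price * (1 - discount)) FROM lineitem WHERE quantity < 25
    function sumDiscountedPrice(t) {
      let sum = 0;
      for (let i = 0; i < t.quantity.length; i++) {
        if (t.quantity[i] < 25) {
          sum += t.price[i] * (1 - t.discount[i]);
        }
      }
      return sum;
    }

    console.log(sumDiscountedPrice(lineitem)); // rows 0, 2, and 4 pass the filter

    Because the loop allocates no per-row objects and touches only flat numeric arrays, it gives the JIT compiler optimization opportunities similar to a native columnar scan, which is what makes near-native in-browser performance plausible.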
    Our analytical stack is portable and works entirely in the browser. We show that SQL and exploration guidance can be as accessible as a web page, which opens the opportunity for more people to analyze data sets. Facilitating data exploration for everyone is one step closer to analytics democratization, where everyone can participate in data exploration, not just the experts.
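    To make the data cube framing concrete, the sketch below enumerates every grouping set of a toy table's dimension attributes and scores each one by how well its group means explain an outcome column. The table, attributes, and squared-error score are all hypothetical simplifications of our own; explanation tables themselves rank patterns with an information-theoretic criterion rather than this stand-in:

    // Toy table: which groupings best explain the `returned` outcome?
    const rows = [
      { dept: "toys",  region: "east", returned: 1 },
      { dept: "toys",  region: "west", returned: 0 },
      { dept: "books", region: "east", returned: 0 },
      { dept: "books", region: "west", returned: 0 },
    ];
    const dims = ["dept", "region"];

    // All subsets of the dimension attributes, i.e., the cube's grouping sets.
    function groupingSets(attrs) {
      return attrs.reduce(
        (sets, a) => sets.concat(sets.map(s => s.concat(a))),
        [[]]
      );
    }

    // Sum of squared errors when each row is predicted by its group's mean;
    // lower means the grouping explains the outcome better.
    function sse(data, attrs) {
      const groups = new Map();
      for (const r of data) {
        const key = attrs.map(a => r[a]).join("|");
        if (!groups.has(key)) groups.set(key, []);
        groups.get(key).push(r.returned);
      }
      let err = 0;
      for (const vals of groups.values()) {
        const mean = vals.reduce((x, y) => x + y, 0) / vals.length;
        for (const v of vals) err += (v - mean) ** 2;
      }
      return err;
    }

    for (const set of groupingSets(dims)) {
      console.log(set.join(",") || "(all)", sse(rows, set));
    }

    Even this toy cube has four grouping sets; real tables have exponentially many, which is why sampling and careful pattern selection matter.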

    Compilation and Code Optimization for Data Analytics

    The trade-offs between the use of modern high-level and low-level programming languages in constructing complex software artifacts are well known. High-level languages allow for greater programmer productivity: abstraction and genericity allow the same functionality to be implemented with significantly less code than in low-level languages. Modularity, object orientation, functional programming, and powerful type systems allow programmers not only to create clean abstractions and protect them from leaking, but also to define code units that are reusable and easily composable, and software architectures that are adaptable and extensible. The abstraction, succinctness, and modularity of high-level code help to avoid software bugs and facilitate debugging and maintenance. The use of high-level languages comes at a performance cost: increased indirection due to abstraction, virtualization, and interpretation, and superfluous work, particularly in the form of temporary memory allocation and deallocation to support objects and encapsulation. As a result, the cost of high-level languages for performance-critical systems may seem prohibitive. The vision of abstraction without regret argues that it is possible to use high-level languages to build performance-critical systems that allow for both productivity and high performance, instead of trading off the former for the latter. In this thesis, we realize this vision for building different types of data analytics systems. Our means of achieving this is compilation. The goal is to compile away expensive language features -- to compile high-level code down to efficient low-level code.
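    The abstract does not tie this idea to a particular language; as a toy illustration in the same JavaScript setting as the listing above (all names here are hypothetical), a generic filter can be specialized by generating its loop as source text and compiling it once, so the abstraction cost is paid at compile time rather than per element:

    // Hypothetical sketch of compiling away abstraction: rather than invoking
    // a generic predicate object per element, generate a specialized loop as
    // source text and compile it once with the Function constructor.
    function compileFilter(field, op, constant) {
      const body = `
        const out = [];
        for (let i = 0; i < rows.length; i++) {
          if (rows[i].${field} ${op} ${JSON.stringify(constant)}) out.push(rows[i]);
        }
        return out;`;
      return new Function("rows", body);
    }

    const over30 = compileFilter("age", ">", 30);
    console.log(over30([{ age: 25 }, { age: 42 }, { age: 31 }])); // [{age: 42}, {age: 31}]

    The compiled function is a single monomorphic loop with the predicate inlined, the shape of code a programmer would otherwise have to write by hand in a low-level style.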
