The MADlib Analytics Library or MAD Skills, the SQL
MADlib is a free, open source library of in-database analytic methods. It
provides an evolving suite of SQL-based algorithms for machine learning, data
mining and statistics that run at scale within a database engine, with no need
for data import/export to other tools. The goal is for MADlib to eventually
serve a role for scalable database systems that is similar to the CRAN library
for R: a community repository of statistical methods, this time written with
scale and parallelism in mind. In this paper we introduce the MADlib project,
including the background that led to its beginnings, and the motivation for its
open source nature. We provide an overview of the library's architecture and
design patterns, and provide a description of various statistical methods in
that context. We include performance and speedup results of a core design
pattern from one of those methods over the Greenplum parallel DBMS on a
modest-sized test cluster. We then report on two initial efforts at
incorporating academic research into MADlib, which is one of the project's
goals. MADlib is freely available at http://madlib.net, and the project is open
for contributions of both new methods, and ports to additional database
platforms.
Comment: VLDB201
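MADlib's core design pattern is to run statistical computation inside the database engine, typically as user-defined aggregates, so the data never leaves the system. As a minimal sketch of that idea (using Python's stdlib sqlite3 as a stand-in for a parallel DBMS such as Greenplum; the class and table names are illustrative, not MADlib's API), one can register a variance aggregate and invoke it from SQL:

```python
import sqlite3

# An in-database user-defined aggregate computing the sample variance,
# in the spirit of MADlib's design pattern. sqlite3 stands in here for
# a parallel engine; names are illustrative.
class Variance:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford's method)

    def step(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def finalize(self):
        return self.m2 / (self.n - 1) if self.n > 1 else None

conn = sqlite3.connect(":memory:")
conn.create_aggregate("variance", 1, Variance)
conn.execute("CREATE TABLE t (x REAL)")
conn.executemany("INSERT INTO t VALUES (?)", [(1.0,), (2.0,), (3.0,)])
var, = conn.execute("SELECT variance(x) FROM t").fetchone()
print(var)  # sample variance of 1, 2, 3 -> 1.0
```

The single-pass `step`/`finalize` shape is what lets such aggregates parallelize across segments of a distributed engine.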
Efficient and Accurate In-Database Machine Learning with SQL Code Generation in Python
Following an analysis of the advantages of SQL-based Machine Learning (ML)
and a short literature survey of the field, we describe a novel method for
In-Database Machine Learning (IDBML). We contribute a process for SQL-code
generation in Python using template macros in Jinja2 as well as the prototype
implementation of the process. We describe our implementation of the process to
compute multidimensional histogram (MDH) probability estimation in SQL. For
this, we contribute and implement a novel discretization method called equal
quantized rank binning (EQRB), as well as equal-width binning (EWB). Based on this, we
provide data gathered in a benchmarking experiment for the quantitative
empirical evaluation of our method and system using the Covertype dataset. We
measured accuracy and computation time and compared them to state-of-the-art
classification algorithms from Scikit-learn. Using EWB, our multidimensional probability
estimation was the fastest of all tested algorithms, while being only 1-2% less
accurate than the best state of the art methods found (decision trees and
random forests). Our method was significantly more accurate than Naive Bayes,
which assumes independent one-dimensional probabilities and/or densities. Also,
our method was significantly more accurate and faster than logistic regression.
This motivates further research on accuracy improvement and on IDBML with
SQL code generation for big data and larger-than-memory datasets.
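The generation step described above can be sketched with the stdlib string.Template standing in for the Jinja2 macros used in the paper (table, column, and parameter names are illustrative): Python fills in a query template for equal-width binning, and the database executes the resulting SQL:

```python
import sqlite3
from string import Template

# Template-driven SQL generation: string.Template stands in for Jinja2.
# The generated query assigns each row to an equal-width bin (EWB) and
# counts rows per bin entirely inside the database.
ewb = Template(
    "SELECT MIN(CAST(($col - $lo) / $width AS INTEGER), $maxbin) AS bin, "
    "COUNT(*) AS freq FROM $tbl GROUP BY bin ORDER BY bin"
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x REAL)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [(0.5,), (1.5,), (3.0,), (9.9,), (10.0,)])

# 5 equal-width bins over [0, 10): width = 2; MIN(..., 4) clamps x = 10
# into the last bin.
query = ewb.substitute(col="x", lo=0.0, width=2.0, maxbin=4, tbl="t")
rows = conn.execute(query).fetchall()
print(rows)  # [(0, 2), (1, 1), (4, 2)]
```

Keeping the binning logic in a template makes the same histogram computation portable across columns and tables by substitution alone, which is the essence of the code-generation approach.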
In-RDBMS Hardware Acceleration of Advanced Analytics
The data revolution is fueled by advances in machine learning, databases, and
hardware design. Programmable accelerators are making their way into each of
these areas independently. As such, there is a void of solutions that enable
hardware acceleration at the intersection of these disjoint fields. This paper
sets out to be the initial step towards a unifying solution for in-Database
Acceleration of Advanced Analytics (DAnA). Deploying specialized hardware, such
as FPGAs, for in-database analytics currently requires hand-designing the
hardware and manually routing the data. Instead, DAnA automatically maps a
high-level specification of advanced analytics queries to an FPGA accelerator.
The accelerator implementation is generated for a User Defined Function (UDF),
expressed as a part of an SQL query using a Python-embedded Domain-Specific
Language (DSL). To realize an efficient in-database integration, DAnA
accelerators contain a novel hardware structure, Striders, that directly
interface with the buffer pool of the database. Striders extract, cleanse, and
process the training data tuples that are consumed by a multi-threaded FPGA
engine that executes the analytics algorithm. We integrate DAnA with PostgreSQL
to generate hardware accelerators for a range of real-world and synthetic
datasets running diverse ML algorithms. Results show that DAnA-enhanced
PostgreSQL provides, on average, 8.3x end-to-end speedup for real datasets,
with a maximum of 28.2x. Moreover, DAnA-enhanced PostgreSQL is, on average,
4.0x faster than the multi-threaded Apache MADlib running on Greenplum. DAnA
provides these benefits while hiding the complexity of hardware design from
data scientists and allowing them to express the algorithm in ≈30-60 lines of
Python.
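What a Python-embedded DSL looks like mechanically can be sketched as follows (the names below are illustrative, not DAnA's actual API): operator overloading records the algorithm as an expression tree, which a compiler can then lower to an accelerator design instead of executing it directly:

```python
# Illustrative sketch of a Python-embedded DSL: arithmetic on Expr nodes
# builds a tree rather than computing immediately, which is how such DSLs
# capture an algorithm for later compilation (e.g., to an FPGA design).
class Expr:
    def __init__(self, op, *args):
        self.op, self.args = op, args

    def __add__(self, o): return Expr("+", self, _lift(o))
    def __radd__(self, o): return Expr("+", _lift(o), self)
    def __mul__(self, o): return Expr("*", self, _lift(o))
    def __rmul__(self, o): return Expr("*", _lift(o), self)

def _lift(x):
    return x if isinstance(x, Expr) else Expr("const", x)

def var(name):
    return Expr("var", name)

def evaluate(e, env):
    """Reference interpreter; a real backend would emit hardware instead."""
    if e.op == "const":
        return e.args[0]
    if e.op == "var":
        return env[e.args[0]]
    a, b = (evaluate(x, env) for x in e.args)
    return a + b if e.op == "+" else a * b

# Capture "w * x + b" as a tree, then evaluate it as a sanity check.
w, x, b = var("w"), var("x"), var("b")
model = w * x + b
result = evaluate(model, {"w": 2.0, "x": 3.0, "b": 1.0})
print(result)  # 7.0
```

Because the tree is plain data, a backend is free to choose a hardware schedule for it, which is what keeps the hardware-design complexity hidden from the data scientist.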
AC/DC: In-Database Learning Thunderstruck
We report on the design and implementation of the AC/DC gradient descent
solver for a class of optimization problems over normalized databases. AC/DC
decomposes an optimization problem into a set of aggregates over the join of
the database relations. It then uses the answers to these aggregates to
iteratively improve the solution to the problem until it converges.
The challenges faced by AC/DC are the large database size, the mixture of
continuous and categorical features, and the large number of aggregates to
compute. AC/DC addresses these challenges by employing a sparse data
representation, factorized computation, problem reparameterization under
functional dependencies, and a data structure that supports shared computation
of aggregates.
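The decomposition into aggregates can be illustrated on the simplest possible case, one-feature least squares (a simplified sketch, not AC/DC's implementation): the gradient depends on the data only through two aggregates, so one pass computes them and the iterations never rescan the (joined) data:

```python
# Aggregate-then-iterate: for least-squares with a single feature, the
# gradient of 0.5 * sum((w*x - y)^2) w.r.t. w equals w*Sxx - Sxy, where
# Sxx = sum(x^2) and Sxy = sum(x*y). AC/DC generalizes this to many
# features over joins, with factorized aggregate computation.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.0)]  # (x, y) pairs

# One pass over the data: the only aggregates the solver needs.
sxx = sum(x * x for x, _ in data)
sxy = sum(x * y for x, y in data)

# Gradient descent using the aggregates alone; no further data access.
w, lr = 0.0, 0.01
for _ in range(1000):
    w -= lr * (w * sxx - sxy)
print(w)  # converges to the least-squares slope sxy / sxx
```

Separating the one data pass from the many solver iterations is what makes the database size and the iteration count independent costs.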
To train polynomial regression models and factorization machines of up to
154K features over the natural join of all relations from a real-world dataset
of up to 86M tuples, AC/DC needs up to 30 minutes on one core of a commodity
machine. This is up to three orders of magnitude faster than its competitors R,
MADlib, libFM, and TensorFlow, whenever they finish at all, i.e., do not exceed
memory limits, the 24-hour timeout, or internal design limitations.
Comment: 10 pages, 3 figure
Big Data Systems Meet Machine Learning Challenges: Towards Big Data Science as a Service
Recently, we have been witnessing huge advancements in the scale of data we
routinely generate and collect in pretty much everything we do, as well as our
ability to exploit modern technologies to process, analyze and understand this
data. The intersection of these trends is what is nowadays called Big Data
Science. Cloud computing represents a practical and cost-effective solution for
supporting Big Data storage, processing and for sophisticated analytics
applications. We analyze in detail the building blocks of the software stack
for supporting big data science as a commodity service for data scientists. We
provide various insights about the latest ongoing developments and open
challenges in this domain.
SQL for SRL: Structure Learning Inside a Database System
The position we advocate in this paper is that relational algebra can provide
a unified language for both representing and computing with
statistical-relational objects, much as linear algebra does for traditional
single-table machine learning. Relational algebra is implemented in the
Structured Query Language (SQL), which is the basis of relational database
management systems. To support our position, we have developed the FACTORBASE
system, which uses SQL as a high-level scripting language for
statistical-relational learning of a graphical model structure. The design
philosophy of FACTORBASE is to manage statistical models as first-class
citizens inside a database. Our implementation shows how our SQL constructs in
FACTORBASE facilitate fast, modular, and reliable program development.
Empirical evidence from six benchmark databases indicates that leveraging
database system capabilities achieves scalable model structure learning.
Comment: 3 pages, 1 figure, Position Paper of the Fifth International Workshop
on Statistical Relational AI at UAI 201
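The kind of SQL construct this position relies on can be sketched with the stdlib sqlite3 module (table, column, and value names are illustrative, not FACTORBASE's actual schema): the sufficient statistics for scoring a candidate edge in a graphical model structure are plain GROUP BY counts, computed inside the database:

```python
import sqlite3

# Sufficient statistics for structure learning as SQL: the contingency
# table n(intelligence, grade) that a learner scores when deciding
# whether intelligence -> grade is a good edge is just a GROUP BY count.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (intelligence TEXT, grade TEXT)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?)",
    [("hi", "A"), ("hi", "A"), ("hi", "B"), ("lo", "B"), ("lo", "C")],
)
counts = {
    (i, g): n
    for i, g, n in conn.execute(
        "SELECT intelligence, grade, COUNT(*) "
        "FROM student GROUP BY intelligence, grade"
    )
}
print(counts[("hi", "A")])  # 2
```

Because every candidate edge reduces to such a query, the database's query optimizer and storage layer do the heavy lifting for the learner.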
Declarative Data Analytics: a Survey
The area of declarative data analytics explores the application of the
declarative paradigm on data science and machine learning. It proposes
declarative languages for expressing data analysis tasks and develops systems
which optimize programs written in those languages. The execution engine can be
either centralized or distributed, as the declarative paradigm advocates
independence from particular physical implementations. The survey explores a
wide range of declarative data analysis frameworks by examining both the
programming model and the optimization techniques used, in order to provide
conclusions on the current state of the art in the area and identify open
challenges.
Comment: 36 pages, 2 figure
FactorBase: SQL for Learning A Multi-Relational Graphical Model
We describe FactorBase, a new SQL-based framework that leverages a relational
database management system to support multi-relational model discovery. A
multi-relational statistical model provides an integrated analysis of the
heterogeneous and interdependent data resources in the database. We adopt the
BayesStore design philosophy: statistical models are stored and managed as
first-class citizens inside a database. Whereas previous systems like
BayesStore support multi-relational inference, FactorBase supports
multi-relational learning. A case study on six benchmark databases evaluates
how our system supports a challenging machine learning application, namely
learning a first-order Bayesian network model for an entire database. Model
learning in this setting has to examine a large number of potential statistical
associations across data tables. Our implementation shows how the SQL
constructs in FactorBase facilitate the fast, modular, and reliable development
of highly scalable model learning systems.
Comment: 14 pages, 10 figures, 10 tables, Published at the 2015 IEEE
International Conference on Data Science and Advanced Analytics (IEEE
DSAA'2015), Oct 19-21, 2015, Paris, Franc