The MADlib Analytics Library or MAD Skills, the SQL
MADlib is a free, open source library of in-database analytic methods. It
provides an evolving suite of SQL-based algorithms for machine learning, data
mining and statistics that run at scale within a database engine, with no need
for data import/export to other tools. The goal is for MADlib to eventually
serve a role for scalable database systems that is similar to the CRAN library
for R: a community repository of statistical methods, this time written with
scale and parallelism in mind. In this paper we introduce the MADlib project,
including the background that led to its beginnings, and the motivation for its
open source nature. We provide an overview of the library's architecture and
design patterns, and provide a description of various statistical methods in
that context. We include performance and speedup results of a core design
pattern from one of those methods over the Greenplum parallel DBMS on a
modest-sized test cluster. We then report on two initial efforts at
incorporating academic research into MADlib, which is one of the project's
goals. MADlib is freely available at http://madlib.net, and the project is open
for contributions of both new methods, and ports to additional database
platforms.
Comment: VLDB201
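MADlib's core design pattern is to run statistical computation inside the database engine, typically as user-defined aggregates, so the data never leaves the system. As a minimal sketch of that idea (using Python's stdlib sqlite3 as a stand-in for a parallel DBMS such as Greenplum; the class and table names are illustrative, not MADlib's API), one can register a variance aggregate and invoke it from SQL:

```python
import sqlite3

# An in-database user-defined aggregate computing the sample variance,
# in the spirit of MADlib's design pattern. sqlite3 stands in here for
# a parallel engine; names are illustrative.
class Variance:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford's method)

    def step(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def finalize(self):
        return self.m2 / (self.n - 1) if self.n > 1 else None

conn = sqlite3.connect(":memory:")
conn.create_aggregate("variance", 1, Variance)
conn.execute("CREATE TABLE t (x REAL)")
conn.executemany("INSERT INTO t VALUES (?)", [(1.0,), (2.0,), (3.0,)])
var, = conn.execute("SELECT variance(x) FROM t").fetchone()
print(var)  # sample variance of 1, 2, 3 -> 1.0
```

The single-pass `step`/`finalize` shape is what lets such aggregates parallelize across segments of a distributed engine.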
Efficient and Accurate In-Database Machine Learning with SQL Code Generation in Python
Following an analysis of the advantages of SQL-based Machine Learning (ML)
and a short literature survey of the field, we describe a novel method for
In-Database Machine Learning (IDBML). We contribute a process for SQL-code
generation in Python using template macros in Jinja2 as well as the prototype
implementation of the process. We describe our implementation of the process to
compute multidimensional histogram (MDH) probability estimation in SQL. For
this, we contribute and implement a novel discretization method called equal
quantized rank binning (EQRB), as well as equal-width binning (EWB). Based on this, we
provide data gathered in a benchmarking experiment for the quantitative
empirical evaluation of our method and system using the Covertype dataset. We
measured accuracy and computation time and compared them to state-of-the-art
classification algorithms from Scikit-learn. Using EWB, our multidimensional probability
estimation was the fastest of all tested algorithms, while being only 1-2% less
accurate than the best state of the art methods found (decision trees and
random forests). Our method was significantly more accurate than Naive Bayes,
which assumes independent one-dimensional probabilities and/or densities. Also,
our method was significantly more accurate and faster than logistic regression.
This motivates further research on accuracy improvement and on IDBML with
SQL code generation for big data and larger-than-memory datasets.
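The generation step described above can be sketched with the stdlib string.Template standing in for the Jinja2 macros used in the paper (table, column, and parameter names are illustrative): Python fills in a query template for equal-width binning, and the database executes the resulting SQL:

```python
import sqlite3
from string import Template

# Template-driven SQL generation: string.Template stands in for Jinja2.
# The generated query assigns each row to an equal-width bin (EWB) and
# counts rows per bin entirely inside the database.
ewb = Template(
    "SELECT MIN(CAST(($col - $lo) / $width AS INTEGER), $maxbin) AS bin, "
    "COUNT(*) AS freq FROM $tbl GROUP BY bin ORDER BY bin"
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x REAL)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [(0.5,), (1.5,), (3.0,), (9.9,), (10.0,)])

# 5 equal-width bins over [0, 10): width = 2; MIN(..., 4) clamps x = 10
# into the last bin.
query = ewb.substitute(col="x", lo=0.0, width=2.0, maxbin=4, tbl="t")
rows = conn.execute(query).fetchall()
print(rows)  # [(0, 2), (1, 1), (4, 2)]
```

Keeping the binning logic in a template makes the same histogram computation portable across columns and tables by substitution alone, which is the essence of the code-generation approach.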
In-RDBMS Hardware Acceleration of Advanced Analytics
The data revolution is fueled by advances in machine learning, databases, and
hardware design. Programmable accelerators are making their way into each of
these areas independently. As such, there is a void of solutions that enable
hardware acceleration at the intersection of these disjoint fields. This paper
sets out to be the initial step towards a unifying solution for in-Database
Acceleration of Advanced Analytics (DAnA). Deploying specialized hardware, such
as FPGAs, for in-database analytics currently requires hand-designing the
hardware and manually routing the data. Instead, DAnA automatically maps a
high-level specification of advanced analytics queries to an FPGA accelerator.
The accelerator implementation is generated for a User Defined Function (UDF),
expressed as a part of an SQL query using a Python-embedded Domain-Specific
Language (DSL). To realize an efficient in-database integration, DAnA
accelerators contain a novel hardware structure, Striders, that directly
interface with the buffer pool of the database. Striders extract, cleanse, and
process the training data tuples that are consumed by a multi-threaded FPGA
engine that executes the analytics algorithm. We integrate DAnA with PostgreSQL
to generate hardware accelerators for a range of real-world and synthetic
datasets running diverse ML algorithms. Results show that DAnA-enhanced
PostgreSQL provides, on average, 8.3x end-to-end speedup for real datasets,
with a maximum of 28.2x. Moreover, DAnA-enhanced PostgreSQL is, on average,
4.0x faster than the multi-threaded Apache MADlib running on Greenplum. DAnA
provides these benefits while hiding the complexity of hardware design from
data scientists and allowing them to express the algorithm in ≈30-60 lines of
Python.
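What a Python-embedded DSL looks like mechanically can be sketched as follows (the names below are illustrative, not DAnA's actual API): operator overloading records the algorithm as an expression tree, which a compiler can then lower to an accelerator design instead of executing it directly:

```python
# Illustrative sketch of a Python-embedded DSL: arithmetic on Expr nodes
# builds a tree rather than computing immediately, which is how such DSLs
# capture an algorithm for later compilation (e.g., to an FPGA design).
class Expr:
    def __init__(self, op, *args):
        self.op, self.args = op, args

    def __add__(self, o): return Expr("+", self, _lift(o))
    def __radd__(self, o): return Expr("+", _lift(o), self)
    def __mul__(self, o): return Expr("*", self, _lift(o))
    def __rmul__(self, o): return Expr("*", _lift(o), self)

def _lift(x):
    return x if isinstance(x, Expr) else Expr("const", x)

def var(name):
    return Expr("var", name)

def evaluate(e, env):
    """Reference interpreter; a real backend would emit hardware instead."""
    if e.op == "const":
        return e.args[0]
    if e.op == "var":
        return env[e.args[0]]
    a, b = (evaluate(x, env) for x in e.args)
    return a + b if e.op == "+" else a * b

# Capture "w * x + b" as a tree, then evaluate it as a sanity check.
w, x, b = var("w"), var("x"), var("b")
model = w * x + b
result = evaluate(model, {"w": 2.0, "x": 3.0, "b": 1.0})
print(result)  # 7.0
```

Because the tree is plain data, a backend is free to choose a hardware schedule for it, which is what keeps the hardware-design complexity hidden from the data scientist.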
AC/DC: In-Database Learning Thunderstruck
We report on the design and implementation of the AC/DC gradient descent
solver for a class of optimization problems over normalized databases. AC/DC
decomposes an optimization problem into a set of aggregates over the join of
the database relations. It then uses the answers to these aggregates to
iteratively improve the solution to the problem until it converges.
The challenges faced by AC/DC are the large database size, the mixture of
continuous and categorical features, and the large number of aggregates to
compute. AC/DC addresses these challenges by employing a sparse data
representation, factorized computation, problem reparameterization under
functional dependencies, and a data structure that supports shared computation
of aggregates.
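The decomposition into aggregates can be illustrated on the simplest possible case, one-feature least squares (a simplified sketch, not AC/DC's implementation): the gradient depends on the data only through two aggregates, so one pass computes them and the iterations never rescan the (joined) data:

```python
# Aggregate-then-iterate: for least-squares with a single feature, the
# gradient of 0.5 * sum((w*x - y)^2) w.r.t. w equals w*Sxx - Sxy, where
# Sxx = sum(x^2) and Sxy = sum(x*y). AC/DC generalizes this to many
# features over joins, with factorized aggregate computation.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.0)]  # (x, y) pairs

# One pass over the data: the only aggregates the solver needs.
sxx = sum(x * x for x, _ in data)
sxy = sum(x * y for x, y in data)

# Gradient descent using the aggregates alone; no further data access.
w, lr = 0.0, 0.01
for _ in range(1000):
    w -= lr * (w * sxx - sxy)
print(w)  # converges to the least-squares slope sxy / sxx
```

Separating the one data pass from the many solver iterations is what makes the database size and the iteration count independent costs.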
To train polynomial regression models and factorization machines of up to
154K features over the natural join of all relations from a real-world dataset
of up to 86M tuples, AC/DC needs up to 30 minutes on one core of a commodity
machine. This is up to three orders of magnitude faster than its competitors R,
MADlib, libFM, and TensorFlow, whenever they finish at all, i.e., do not exceed
memory limits, the 24-hour timeout, or internal design limitations.
Comment: 10 pages, 3 figure
Big Data Systems Meet Machine Learning Challenges: Towards Big Data Science as a Service
Recently, we have been witnessing huge advancements in the scale of data we
routinely generate and collect in pretty much everything we do, as well as our
ability to exploit modern technologies to process, analyze and understand this
data. The intersection of these trends is what is nowadays called Big Data
Science. Cloud computing represents a practical and cost-effective solution for
supporting Big Data storage, processing and for sophisticated analytics
applications. We analyze in detail the building blocks of the software stack
for supporting big data science as a commodity service for data scientists. We
provide various insights about the latest ongoing developments and open
challenges in this domain.
SQL for SRL: Structure Learning Inside a Database System
The position we advocate in this paper is that relational algebra can provide
a unified language for both representing and computing with
statistical-relational objects, much as linear algebra does for traditional
single-table machine learning. Relational algebra is implemented in the
Structured Query Language (SQL), which is the basis of relational database
management systems. To support our position, we have developed the FACTORBASE
system, which uses SQL as a high-level scripting language for
statistical-relational learning of a graphical model structure. The design
philosophy of FACTORBASE is to manage statistical models as first-class
citizens inside a database. Our implementation shows how our SQL constructs in
FACTORBASE facilitate fast, modular, and reliable program development.
Empirical evidence from six benchmark databases indicates that leveraging
database system capabilities achieves scalable model structure learning.
Comment: 3 pages, 1 figure, Position Paper of the Fifth International Workshop
on Statistical Relational AI at UAI 201
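The kind of SQL construct this position relies on can be sketched with the stdlib sqlite3 module (table, column, and value names are illustrative, not FACTORBASE's actual schema): the sufficient statistics for scoring a candidate edge in a graphical model structure are plain GROUP BY counts, computed inside the database:

```python
import sqlite3

# Sufficient statistics for structure learning as SQL: the contingency
# table n(intelligence, grade) that a learner scores when deciding
# whether intelligence -> grade is a good edge is just a GROUP BY count.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (intelligence TEXT, grade TEXT)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?)",
    [("hi", "A"), ("hi", "A"), ("hi", "B"), ("lo", "B"), ("lo", "C")],
)
counts = {
    (i, g): n
    for i, g, n in conn.execute(
        "SELECT intelligence, grade, COUNT(*) "
        "FROM student GROUP BY intelligence, grade"
    )
}
print(counts[("hi", "A")])  # 2
```

Because every candidate edge reduces to such a query, the database's query optimizer and storage layer do the heavy lifting for the learner.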
Declarative Data Analytics: a Survey
The area of declarative data analytics explores the application of the
declarative paradigm on data science and machine learning. It proposes
declarative languages for expressing data analysis tasks and develops systems
which optimize programs written in those languages. The execution engine can be
either centralized or distributed, as the declarative paradigm advocates
independence from particular physical implementations. The survey explores a
wide range of declarative data analysis frameworks by examining both the
programming model and the optimization techniques used, in order to provide
conclusions on the current state of the art in the area and identify open
challenges.
Comment: 36 pages, 2 figure
FactorBase: SQL for Learning A Multi-Relational Graphical Model
We describe FactorBase, a new SQL-based framework that leverages a relational
database management system to support multi-relational model discovery. A
multi-relational statistical model provides an integrated analysis of the
heterogeneous and interdependent data resources in the database. We adopt the
BayesStore design philosophy: statistical models are stored and managed as
first-class citizens inside a database. Whereas previous systems like
BayesStore support multi-relational inference, FactorBase supports
multi-relational learning. A case study on six benchmark databases evaluates
how our system supports a challenging machine learning application, namely
learning a first-order Bayesian network model for an entire database. Model
learning in this setting has to examine a large number of potential statistical
associations across data tables. Our implementation shows how the SQL
constructs in FactorBase facilitate the fast, modular, and reliable development
of highly scalable model learning systems.
Comment: 14 pages, 10 figures, 10 tables, Published at the 2015 IEEE
International Conference on Data Science and Advanced Analytics (IEEE
DSAA'2015), Oct 19-21, 2015, Paris, Franc