Empowering In-Memory Relational Database Engines with Native Graph Processing
The plethora of graph and relational data gives rise to many interesting
graph-relational queries in various domains, e.g., finding related proteins
satisfying relational predicates in a biological network. The maturity of
RDBMSs motivated academia and industry to invest efforts in leveraging RDBMSs
for graph processing, where efficiency has been demonstrated for important graph queries.
However, none of these efforts process graphs natively inside the RDBMS, which
is particularly challenging due to the impedance mismatch between the
relational and the graph models. In this paper, we propose to treat graphs as
first-class citizens inside the relational engine so that operations on graphs
are executed natively inside the RDBMS. We realize our approach inside VoltDB,
an open-source in-memory relational database, and name this realization
GRFusion. The SQL and the query engine of GRFusion are empowered to
declaratively define graphs and execute cross-data-model query plans formed by
graph and relational operators, resulting in up to four orders of magnitude
of query-time speedup over state-of-the-art approaches.
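The cross-data-model queries described above can be illustrated with a small sketch. This is not GRFusion's actual engine or API; the protein table, edge list, and function names below are hypothetical, chosen only to show a graph traversal interleaved with a relational predicate:

```python
from collections import deque

# Hypothetical toy data: a relational table of proteins and a graph of
# interactions between them (schema and names are illustrative only).
proteins = {
    "P1": {"organism": "human", "mass": 52.0},
    "P2": {"organism": "human", "mass": 12.5},
    "P3": {"organism": "yeast", "mass": 33.1},
    "P4": {"organism": "human", "mass": 78.9},
}
edges = {"P1": ["P2", "P3"], "P2": ["P4"], "P3": [], "P4": []}

def related_proteins(start, max_hops, predicate):
    """BFS over the graph, keeping vertices whose relational
    attributes satisfy the predicate (the cross-model part)."""
    seen, result = {start}, []
    frontier = deque([(start, 0)])
    while frontier:
        v, d = frontier.popleft()
        if d == max_hops:
            continue
        for w in edges.get(v, []):
            if w not in seen:
                seen.add(w)
                if predicate(proteins[w]):  # relational filter during traversal
                    result.append(w)
                frontier.append((w, d + 1))
    return result

# Human proteins within 2 hops of P1:
print(related_proteins("P1", 2, lambda row: row["organism"] == "human"))
# → ['P2', 'P4']
```

In a native engine such as the one the paper proposes, the traversal and the predicate would be operators in a single query plan rather than application code.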
Declarative Data Analytics: a Survey
The area of declarative data analytics explores the application of the
declarative paradigm on data science and machine learning. It proposes
declarative languages for expressing data analysis tasks and develops systems
which optimize programs written in those languages. The execution engine can be
either centralized or distributed, as the declarative paradigm advocates
independence from particular physical implementations. The survey explores a
wide range of declarative data analysis frameworks by examining both the
programming model and the optimization techniques used, in order to provide
conclusions on the current state of the art in the area and identify open
challenges.
Comment: 36 pages, 2 figures.
The Family of MapReduce and Large Scale Data Processing Systems
In the last two decades, the continuous increase of computational power has
produced an overwhelming flow of data which has called for a paradigm shift in
the computing architecture and large scale data processing mechanisms.
MapReduce is a simple and powerful programming model that enables easy
development of scalable parallel applications to process vast amounts of data
on large clusters of commodity machines. It isolates the application from the
details of running a distributed program such as issues on data distribution,
scheduling and fault tolerance. However, the original implementation of the
MapReduce framework had some limitations that have been tackled by many
research efforts in several followup works after its introduction. This article
provides a comprehensive survey of a family of approaches and mechanisms for
large-scale data processing that have been implemented based on the
original idea of the MapReduce framework and are currently gaining a lot of
momentum in both research and industrial communities. We also cover a set of
introduced systems that have been implemented to provide declarative
programming interfaces on top of the MapReduce framework. In addition, we
review several large scale data processing systems that resemble some of the
ideas of the MapReduce framework for different purposes and application
scenarios. Finally, we discuss some of the future research directions for
implementing the next generation of MapReduce-like solutions.
Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other authors.
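The programming model the abstract describes, where the user writes only a map and a reduce function and the framework handles grouping and distribution, can be sketched in-process. This is a minimal single-machine illustration, not the Hadoop or Google implementation; `run_mapreduce` and the word-count functions are illustrative:

```python
from collections import defaultdict

# User code: only map() and reduce() are supplied.
def map_fn(doc):
    for word in doc.split():
        yield word, 1

def reduce_fn(word, counts):
    return word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    """The 'framework' part: it runs the map phase, groups intermediate
    pairs by key (the shuffle), then runs the reduce phase, hiding
    data distribution, scheduling, and fault tolerance from the user."""
    groups = defaultdict(list)
    for record in inputs:                      # map phase
        for key, value in map_fn(record):
            groups[key].append(value)          # shuffle: group by key
    return dict(reduce_fn(k, vs) for k, vs in groups.items())  # reduce phase

docs = ["big data big clusters", "data processing"]
print(run_mapreduce(docs, map_fn, reduce_fn))
# → {'big': 2, 'data': 2, 'clusters': 1, 'processing': 1}
```

In a real deployment the shuffle is a distributed sort and the map/reduce phases run on separate cluster nodes, but the user-visible contract is exactly this pair of functions.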
Scaling Datalog for Machine Learning on Big Data
In this paper, we present the case for a declarative foundation for
data-intensive machine learning systems. Instead of creating a new system for
each specific flavor of machine learning task, or hardcoding new optimizations,
we argue for the use of recursive queries to program a variety of machine
learning systems. By taking this approach, database query optimization
techniques can be utilized to identify effective execution plans, and the
resulting runtime plans can be executed on a single unified data-parallel query
processing engine. As a proof of concept, we consider two programming
models, Pregel and Iterative Map-Reduce-Update, from the machine learning
domain, and show how they can be captured in Datalog, tuned for a specific
task, and then compiled into an optimized physical plan. Experiments performed
on a large computing cluster with real data demonstrate that this declarative
approach can provide very good performance while offering both increased
generality and programming ease.
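The recursive-query style the paper advocates can be shown with the classic Datalog example, transitive closure (path(X,Y) :- edge(X,Y). path(X,Z) :- path(X,Y), edge(Y,Z).), evaluated semi-naively, the iterate-to-fixpoint strategy a Datalog engine would compile to. This is a hand-written Python sketch of that evaluation strategy, not the paper's system:

```python
def transitive_closure(edges):
    """Semi-naive fixpoint evaluation: each round joins only the facts
    derived in the previous round (delta) against the base edges,
    mirroring how a Datalog engine evaluates the recursive path rule."""
    path = set(edges)      # path(X,Y) :- edge(X,Y).
    delta = set(edges)
    while delta:           # iterate until no new facts are derived
        new = {(x, z) for (x, y) in delta
                      for (y2, z) in edges if y == y2} - path
        path |= new
        delta = new        # semi-naive: only the new facts feed the next round
    return path

edges = {(1, 2), (2, 3), (3, 4)}
print(sorted(transitive_closure(edges)))
# → [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```

Iterative ML algorithms fit the same shape: a model (the recursive relation) is updated each round from aggregated statistics until a fixpoint or convergence test, which is what lets a query optimizer plan the whole loop.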
Scientific Workflows and Provenance: Introduction and Research Opportunities
Scientific workflows are becoming increasingly popular for compute-intensive
and data-intensive scientific applications. The vision and promise of
scientific workflows includes rapid, easy workflow design, reuse, scalable
execution, and other advantages, e.g., to facilitate "reproducible science"
through provenance (e.g., data lineage) support. However, as described in the
paper, important research challenges remain. While the database community has
studied (business) workflow technologies extensively in the past, most current
work in scientific workflows seems to be done outside of the database
community, e.g., by practitioners and researchers in the computational sciences
and eScience. We provide a brief introduction to scientific workflows and
provenance, and identify areas and problems that suggest new opportunities for
database research.
Comment: 12 pages, 2 figures.
SOFA: An Extensible Logical Optimizer for UDF-heavy Dataflows
Recent years have seen an increased interest in large-scale analytical
dataflows on non-relational data. These dataflows are compiled into execution
graphs scheduled on large compute clusters. In many novel application areas the
predominant building blocks of such dataflows are user-defined predicates or
functions (UDFs). However, the heavy use of UDFs is not well taken into account
for dataflow optimization in current systems.
SOFA is a novel and extensible optimizer for UDF-heavy dataflows. It builds
on a concise set of properties for describing the semantics of Map/Reduce-style
UDFs and a small set of rewrite rules, which use these properties to find a
much larger number of semantically equivalent plan rewrites than possible with
traditional techniques. A salient feature of our approach is extensibility: We
arrange user-defined operators and their properties into a subsumption
hierarchy, which considerably eases integration and optimization of new
operators. We evaluate SOFA on a selection of UDF-heavy dataflows from
different domains and compare its performance to three other algorithms for
dataflow optimization. Our experiments reveal that SOFA finds efficient plans,
outperforming the best plans found by its competitors by a factor of up to 6.
Information Integration and Computational Logic
Information Integration is a young and exciting field with enormous research
and commercial significance in the new world of the Information Society. It
stands at the crossroad of Databases and Artificial Intelligence requiring
novel techniques that bring together different methods from these fields.
Information from disparate heterogeneous sources often with no a-priori common
schema needs to be synthesized in a flexible, transparent and intelligent way
in order to respond to the demands of a query thus enabling a more informed
decision by the user or application program. The field, although relatively
young, has already found many practical applications, particularly for
integrating information over the World Wide Web. This paper gives a brief
introduction of the field highlighting some of the main current and future
research issues and application areas. It attempts to evaluate the current and
potential role of Computational Logic in this field and suggests some of the
problems where logic-based techniques could be used.
Comment: 53 pages.
Big Data Systems Meet Machine Learning Challenges: Towards Big Data Science as a Service
Recently, we have been witnessing huge advancements in the scale of data we
routinely generate and collect in nearly everything we do, as well as in our
ability to exploit modern technologies to process, analyze, and understand this
data. The intersection of these trends is what is now called Big Data
Science. Cloud computing represents a practical and cost-effective solution for
supporting Big Data storage, processing and for sophisticated analytics
applications. We analyze in detail the building blocks of the software stack
for supporting big data science as a commodity service for data scientists. We
provide various insights about the latest ongoing developments and open
challenges in this domain.
Unicast and Multicast QoS Routing with Soft Constraint Logic Programming
We present a formal model to represent and solve the unicast/multicast
routing problem in networks with Quality of Service (QoS) requirements. To
attain this, first we translate the network into a weighted graph (unicast) or
an and-or graph (multicast), where the weight on a connector
corresponds to the multidimensional cost of sending a packet on the related
network link: each component of the weights vector represents a different QoS
metric value (e.g. bandwidth, cost, delay, packet loss). The second step
consists in writing this graph as a program in Soft Constraint Logic
Programming (SCLP): the engine of this framework is then able to find the best
paths/trees by optimizing their costs and solving the constraints imposed on
them (e.g. delay < 40msec), thus finding a solution to QoS routing problems.
Moreover, c-semiring structures are a convenient tool to model QoS metrics. At
last, we provide an implementation of the framework over scale-free networks
and we suggest how the performance can be improved.
Comment: 45 pages.
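The two steps the abstract describes, vector-valued edge weights combined component-wise along a path, plus a constraint that prunes infeasible paths, can be sketched without the SCLP machinery. This is an illustrative best-first search, not the paper's semiring framework; the graph, metric pair (monetary cost, delay in msec), and function name are assumptions:

```python
import heapq
from itertools import count

# Each edge carries a weight vector (cost, delay_ms); weights add
# component-wise along a path, as in the paper's multidimensional costs.
graph = {
    "a": [("b", (1, 30)), ("c", (4, 5))],
    "b": [("d", (1, 20))],
    "c": [("d", (1, 10))],
    "d": [],
}

def best_path(src, dst, max_delay):
    """Best-first search: minimize monetary cost among paths whose
    accumulated delay satisfies the constraint delay < max_delay."""
    tie = count()  # tie-breaker so heap never compares paths directly
    heap = [((0, 0), next(tie), src, [src])]
    while heap:
        (cost, delay), _, node, path = heapq.heappop(heap)
        if delay >= max_delay:
            continue                  # constraint pruning: path infeasible
        if node == dst:
            return path, (cost, delay)
        for nxt, (c, d) in graph[node]:
            if nxt not in path:       # simple cycle avoidance
                heapq.heappush(heap, ((cost + c, delay + d), next(tie),
                                      nxt, path + [nxt]))
    return None

# Cheapest a->d path with delay < 40 msec: the direct-looking route
# via b is pruned (50 msec), so the search returns the route via c.
print(best_path("a", "d", 40))
# → (['a', 'c', 'd'], (5, 15))
```

The c-semiring formulation in the paper generalizes exactly this: the component-wise addition and the "best" comparison become the semiring's combine and compare operations, so different QoS metrics plug into one solver.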
AppLP: A Dialogue on Applications of Logic Programming
This document describes the contributions of the 2016 Applications of Logic
Programming Workshop (AppLP), which was held on October 17 and associated with
the International Conference on Logic Programming (ICLP) in Flushing, New York
City.
Comment: David S. Warren and Yanhong A. Liu (Editors). 33 pages. Including
summaries by Christopher Kane and abstracts or position papers by M. Aref, J.
Rosenwald, I. Cervesato, E.S.L. Lam, M. Balduccini, J. Lobo, A. Russo, E.
Lupu, N. Leone, F. Ricca, G. Gupta, K. Marple, E. Salazar, Z. Chen, A. Sobhi,
S. Srirangapalli, C.R. Ramakrishnan, N. Bjørner, N.P. Lopes, A.
Rybalchenko, and P. Tara