Learning Models over Relational Data using Sparse Tensors and Functional Dependencies
Integrated solutions for analytics over relational databases are of great
practical importance as they avoid the costly repeated loop data scientists
have to deal with on a daily basis: select features from data residing in
relational databases using feature extraction queries involving joins,
projections, and aggregations; export the training dataset defined by such
queries; convert this dataset into the format of an external learning tool; and
train the desired model using this tool. These integrated solutions are also a
fertile ground of theoretically fundamental and challenging problems at the
intersection of relational and statistical data models.
This article introduces a unified framework for training and evaluating a
class of statistical learning models over relational databases. This class
includes ridge linear regression, polynomial regression, factorization
machines, and principal component analysis. We show that, by synergizing key
tools from database theory such as schema information, query structure,
functional dependencies, recent advances in query evaluation algorithms, and
from linear algebra such as tensor and matrix operations, one can formulate
relational analytics problems and design efficient (query and data)
structure-aware algorithms to solve them.
This theoretical development informed the design and implementation of the
AC/DC system for structure-aware learning. We benchmark the performance of
AC/DC against R, MADlib, libFM, and TensorFlow. For typical retail forecasting
and advertisement planning applications, AC/DC can learn polynomial regression
models and factorization machines with at least the same accuracy as its
competitors and up to three orders of magnitude faster, whenever the
competitors do not run out of memory, exceed the 24-hour timeout, or encounter
internal design limitations. Comment: 61 pages, 9 figures, 2 tables
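As a rough illustration of what "structure-aware" can mean here, the sketch below computes the sums that a least-squares fit needs (count, SUM(a), SUM(b), SUM(a*a), SUM(a*b), SUM(b*b)) over the join of two relations without materializing the join. This is not the AC/DC algorithm; it is a toy example of pushing aggregates past a join, and the relation layout `(key, value)` and all function names are assumptions made for illustration.

```python
from collections import defaultdict


def per_key_stats(rows):
    """Group (key, value) rows by key; return {key: [count, sum, sum_of_squares]}."""
    stats = defaultdict(lambda: [0, 0.0, 0.0])
    for key, value in rows:
        entry = stats[key]
        entry[0] += 1
        entry[1] += value
        entry[2] += value * value
    return stats


def join_aggregates(r_rows, s_rows):
    """Aggregates over R(key, a) JOIN S(key, b), computed from per-key statistics.

    Each joined pair is never enumerated: for a given key, every R row pairs
    with every S row, so the sums factor into products of per-relation sums.
    """
    r_stats = per_key_stats(r_rows)
    s_stats = per_key_stats(s_rows)
    agg = {"n": 0, "a": 0.0, "b": 0.0, "aa": 0.0, "ab": 0.0, "bb": 0.0}
    for key, (count_r, sum_a, sum_aa) in r_stats.items():
        if key not in s_stats:
            continue  # key has no join partner in S
        count_s, sum_b, sum_bb = s_stats[key]
        agg["n"] += count_r * count_s
        agg["a"] += sum_a * count_s
        agg["b"] += count_r * sum_b
        agg["aa"] += sum_aa * count_s
        agg["ab"] += sum_a * sum_b
        agg["bb"] += count_r * sum_bb
    return agg


def materialized_aggregates(r_rows, s_rows):
    """Reference version: enumerate the join explicitly, then aggregate."""
    agg = {"n": 0, "a": 0.0, "b": 0.0, "aa": 0.0, "ab": 0.0, "bb": 0.0}
    for key_r, a in r_rows:
        for key_s, b in s_rows:
            if key_r == key_s:
                agg["n"] += 1
                agg["a"] += a
                agg["b"] += b
                agg["aa"] += a * a
                agg["ab"] += a * b
                agg["bb"] += b * b
    return agg


# Hypothetical toy relations: R(key, a) and S(key, b).
R = [(1, 2.0), (1, 3.0), (2, 5.0)]
S = [(1, 10.0), (1, 20.0), (3, 7.0)]
factorized = join_aggregates(R, S)
reference = materialized_aggregates(R, S)
```

Both functions return the same aggregates, but the first one does work proportional to the sizes of the input relations rather than to the size of the join result.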
NOSQL design for analytical workloads: Variability matters
Big Data has recently gained popularity and has strongly questioned relational databases as universal storage systems, especially in the presence of analytical workloads. As a result, co-relational alternatives, commonly known as NOSQL (Not Only SQL) databases, are extensively used for Big Data. As the primary focus of NOSQL is on performance, NOSQL databases are directly designed at the physical level, and consequently the resulting schema is tailored to the dataset and access patterns of the problem at hand. However, we believe that NOSQL design can also benefit from traditional design approaches. In this paper we present a method to design databases for analytical workloads. Starting from the conceptual model and adopting the classical 3-phase design used for relational databases, we propose a novel design method considering the new features brought by NOSQL and encompassing relational and co-relational design altogether. Peer Reviewed. Postprint (author's final draft)
The P3 platform: an approach and software system for developing diagrammatic model-based methods in design research
Many issues in design and design management have been explored by building models which capture the relationships between different aspects of the problem at hand. These models require computer support to construct and analyse. However, appropriate modelling tools can be time-consuming to develop in a research environment. Reflecting upon five design research projects, this paper proposes that such projects can be facilitated by recognising the iterative and tightly-coupled nature of research and tool development, and by attempting to minimise the effort of solution prototyping within this process. Our approach is enabled by a software platform which can be rapidly configured to implement many conceivable modelling approaches. This configurability is complemented by an emerging library of modelling and analysis approaches tailored to explore design process systems. The platform-based approach enables any mix of modelling concepts to be easily created. We propose it could thus help researchers to explore a wide range of questions without being constrained to existing conventions for modelling – or for model integration
Automatic view schema generation in object-oriented databases
An object-oriented data schema is a complex structure of classes interrelated via generalization and property decomposition relationships. We define an object-oriented view to be a virtual schema graph with possibly restructured generalization and decomposition hierarchies - rather than just one individual virtual class as proposed in the literature. In this paper, we propose a methodology, called MultiView, for supporting multiple such view schemata. MultiView is anchored on the following complementary ideas: (a) the view definer derives virtual classes and then integrates them into one consistent global schema graph and (b) the view definer specifies arbitrarily complex view schemata on this augmented global schema. The focus of this paper is, however, on the second, less explored, issue. This part of the view definition is performed using the following two steps: (1) view class selection and (2) view schema graph generation. For the first, we have developed a view definition language that can be used by the view definer to specify the selection of the desired view classes from the global schema. For the second, we have developed two algorithms that automatically augment the set of selected view classes to generate a complete, minimal and consistent view class generalization hierarchy. The first algorithm has linear complexity but it assumes that the global schema graph is a tree. The second algorithm overcomes this restricting assumption and thus allows for multiple inheritance, but it does so at the cost of a higher complexity
A Progressive Clustering Algorithm to Group the XML Data by Structural and Semantic Similarity
Since XML became popular for data representation and exchange over the Web, the number of XML documents in circulation has increased rapidly. It has become a challenge for researchers to turn these documents into a more useful information utility. In this paper, we introduce a novel clustering algorithm, PCXSS, that groups heterogeneous XML documents according to their structural and semantic similarity. We develop a global criterion function, CPSim, that progressively measures the similarity between an XML document and existing clusters, avoiding the need to compute the similarity between two individual documents. The experimental analysis shows the method to be fast and accurate.
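The idea of comparing each incoming document against existing clusters, rather than against every other document, can be sketched as follows. This is not PCXSS or CPSim: the representation of a document as a set of tag paths, the Jaccard similarity, the threshold, and all names below are assumptions chosen only to show the progressive (single-pass) control flow, and the sketch covers structural similarity only.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def progressive_cluster(docs, threshold=0.5):
    """Assign each document to the most similar existing cluster, or open a new one.

    docs: iterable of (doc_id, tag_paths) pairs, where tag_paths is a
    collection of path strings such as "book/title" (a hypothetical encoding).
    Each cluster keeps the union of its members' paths as its summary, so a
    new document is compared with one summary per cluster, never with
    individual documents.
    """
    clusters = []  # each cluster: {"paths": set of paths, "members": [doc ids]}
    for doc_id, tag_paths in docs:
        paths = set(tag_paths)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = jaccard(paths, cluster["paths"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(doc_id)
            best["paths"] |= paths  # update the cluster summary in place
        else:
            clusters.append({"paths": set(paths), "members": [doc_id]})
    return clusters


# Hypothetical input: three documents described by their tag paths.
documents = [
    ("d1", {"book/title", "book/author"}),
    ("d2", {"book/title", "book/author", "book/year"}),
    ("d3", {"order/item", "order/price"}),
]
result = progressive_cluster(documents, threshold=0.5)
```

With these inputs, `d1` and `d2` share two of three paths and end up in one cluster, while `d3` shares none and opens its own. The number of comparisons per document grows with the number of clusters, not with the number of documents seen so far.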
Merging process models and plant topology
The paper discusses the merging of first principles process models with plant topology derived in an automated way from a process drawing. The resulting structural models should make it easier for a range of methods from the literature to be applied to industrial-scale problems in process operation and design. © 2011 Zhejiang University
D4M 3.0: Extended Database and Language Capabilities
The D4M tool was developed to address many of today's data needs. This tool
is used by hundreds of researchers to perform complex analytics on unstructured
data. Over the past few years, the D4M toolbox has evolved to support
connectivity with a variety of new database engines, including SciDB.
D4M-Graphulo provides the ability to do graph analytics in the Apache Accumulo
database. Finally, an implementation using the Julia programming language is
also now available. In this article, we describe some of our latest additions
to the D4M toolbox and our upcoming D4M 3.0 release. We show through
benchmarking and scaling results that we can achieve fast SciDB ingest using
the D4M-SciDB connector, that using Graphulo can enable graph algorithms on
scales that can be memory limited, and that the Julia implementation of D4M
matches or exceeds the performance of the existing MATLAB(R)
implementation. Comment: IEEE HPEC 201