Schema Independent Relational Learning
Learning novel concepts and relations from relational databases is an
important problem with many applications in database systems and machine
learning. Relational learning algorithms learn the definition of a new relation
in terms of existing relations in the database. Nevertheless, the same data set
may be represented under different schemas for various reasons, such as
efficiency, data quality, and usability. Unfortunately, the output of current
relational learning algorithms tends to vary quite substantially over the
choice of schema, both in terms of learning accuracy and efficiency. This
variation complicates their off-the-shelf application. In this paper, we
introduce and formalize the property of schema independence of relational
learning algorithms, and study both the theoretical and empirical dependence of
existing algorithms on the common class of (de)composition schema
transformations. We study both sample-based learning algorithms, which learn
from sets of labeled examples, and query-based algorithms, which learn by
asking queries to an oracle. We prove that current relational learning
algorithms are generally not schema independent. For query-based learning
algorithms we show that the (de)composition transformations influence their
query complexity. We propose Castor, a sample-based relational learning
algorithm that achieves schema independence by leveraging data dependencies. We
support the theoretical results with an empirical study that demonstrates the
schema dependence/independence of several algorithms on existing benchmark and
real-world datasets under (de)compositions.
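As a concrete illustration of a vertical (de)composition, the same data set can be stored as one relation or split into projections joined on a key. The sketch below is a toy illustration in Python; the relation and attribute names are invented and are not Castor's API:

```python
# Toy illustration of a vertical (de)composition: the same data set
# under a composed schema (one relation) and a decomposed one (two
# projections sharing a key). All names are invented for this sketch.

def decompose(rows, key, attrs1, attrs2):
    """Split a relation into two projections that share `key`."""
    r1 = [{a: row[a] for a in [key] + attrs1} for row in rows]
    r2 = [{a: row[a] for a in [key] + attrs2} for row in rows]
    return r1, r2

def natural_join(r1, r2, key):
    """Recompose by joining the two projections on `key`."""
    index = {row[key]: row for row in r2}
    return [{**row, **index[row[key]]} for row in r1 if row[key] in index]

# student(id, name, advisor) stored composed ...
student = [
    {"id": 1, "name": "ann", "advisor": "bob"},
    {"id": 2, "name": "eve", "advisor": "carol"},
]
# ... and decomposed into student_name(id, name), student_advisor(id, advisor).
names, advisors = decompose(student, "id", ["name"], ["advisor"])
# The decomposition is lossless: joining recovers the original relation,
# yet a learner sees two relations instead of one.
assert natural_join(names, advisors, "id") == student
```

A schema-independent learner should produce equivalent definitions whether it is given `student` or the pair `names`/`advisors`.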
The tractability frontier of well-designed SPARQL queries
We study the complexity of query evaluation of SPARQL queries. We focus on
the fundamental fragment of well-designed SPARQL restricted to the AND,
OPTIONAL and UNION operators. Our main result is a structural characterisation
of the classes of well-designed queries that can be evaluated in polynomial
time. In particular, we introduce a new notion of width called domination
width, which relies on the well-known notion of treewidth. We show that, under
some complexity theoretic assumptions, the classes of well-designed queries
that can be evaluated in polynomial time are precisely those of bounded
domination width.
Distributed XML Design
A distributed XML document is an XML document that spans several machines. We
assume that a distribution design of the document tree is given, consisting of
an XML kernel-document T[f1,...,fn] where some leaves are "docking points" for
external resources providing XML subtrees (f1,...,fn, standing, e.g., for Web
services or peers at remote locations). The top-down design problem consists
in, given a type (a schema document that may vary from a DTD to a tree
automaton) for the distributed document, "propagating" this type locally into a
collection of types, which we call a typing, while preserving desirable
properties. We also consider the bottom-up design problem, which consists in, given a
type for each external resource, exhibiting a global type that is enforced by
the local types, again with natural desirable properties. In the article, we
lay out the fundamentals of a theory of distributed XML design, analyze
problems concerning typing issues in this setting, and study their complexity.
Ontology Alignment at the Instance and Schema Level
We present PARIS, an approach for the automatic alignment of ontologies.
PARIS aligns not only instances, but also relations and classes. Alignments at
the instance-level cross-fertilize with alignments at the schema-level.
Thereby, our system provides a truly holistic solution to the problem of
ontology alignment. The heart of the approach is probabilistic. This allows
PARIS to run without any parameter tuning. We demonstrate the efficiency of the
algorithm and its precision through extensive experiments. In particular, we
obtain a precision of around 90% in experiments with two of the world's largest
ontologies. (Technical report: INRIA RT-040.)
Context-Free Path Queries on RDF Graphs
Navigational graph queries are an important class of queries that can extract
implicit binary relations over the nodes of input graphs. Most of the
navigational query languages used in the RDF community, e.g. property paths in
W3C SPARQL 1.1 and nested regular expressions in nSPARQL, are based on
regular expressions. It is known that regular expressions have limited
expressivity; for instance, some natural queries, like same-generation queries,
are not expressible with regular expressions. To overcome this limitation, in
this paper, we present cfSPARQL, an extension of SPARQL query language equipped
with context-free grammars. The cfSPARQL language is strictly more expressive
than property paths and nested regular expressions. The additional expressivity can be
used for modelling graph similarities, graph summarization and ontology
alignment. Despite the increased expressivity, we show that cfSPARQL still
enjoys a low computational complexity and can be evaluated efficiently.
Beyond Worst-Case Analysis for Joins with Minesweeper
We describe a new algorithm, Minesweeper, that is able to satisfy stronger
runtime guarantees than previous join algorithms (colloquially, `beyond
worst-case guarantees') for data in indexed search trees. Our first
contribution is developing a framework to measure this stronger notion of
complexity, which we call certificate complexity and which extends notions of
Barbay et al. and Demaine et al.; a certificate is a set of propositional
formulae that certifies that the output is correct. This notion captures a
natural class of join algorithms. In addition, the certificate allows us to
define a strictly stronger notion of runtime complexity than traditional
worst-case guarantees. Our second contribution is to develop a dichotomy
theorem for the certificate-based notion of complexity. Roughly, we show that
Minesweeper evaluates β-acyclic queries in time linear in the certificate
plus the output size, while for any β-cyclic query there is some instance
that takes superlinear time in the certificate (and for which the output is no
larger than the certificate size). We also extend our certificate-complexity
analysis to queries with bounded treewidth and the triangle query. (This is the full version of our PODS 2014 paper.)
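Minesweeper itself builds a constraint-propagation data structure, but the underlying intuition — that sorted indexes let a join skip over regions a short certificate proves empty — can be illustrated by a leapfrog-style intersection of sorted lists. This is a hedged toy analogue, not the paper's algorithm:

```python
from bisect import bisect_left

def intersect_sorted(lists):
    """Leapfrog-style intersection of sorted lists: each list is
    repeatedly galloped forward to the current maximum, so whole runs
    of values are skipped once they are certified not to match. The
    work tracks the number of such 'gap' certificates rather than the
    total input size -- a toy analogue of certificate complexity."""
    pos = [0] * len(lists)
    out = []
    while all(pos[i] < len(l) for i, l in enumerate(lists)):
        vals = [l[pos[i]] for i, l in enumerate(lists)]
        hi = max(vals)
        if min(vals) == hi:      # all lists agree: emit an answer
            out.append(hi)
            pos = [p + 1 for p in pos]
        else:                    # gallop each list forward to hi
            pos = [bisect_left(l, hi, pos[i]) for i, l in enumerate(lists)]
    return out

# Long runs of non-matching values are jumped over in O(log n) each.
assert intersect_sorted([[1, 3, 5, 100], [2, 3, 100], [3, 50, 100]]) == [3, 100]
```

Here three comparisons certify that nothing between 5 and 100 can be an answer, regardless of how many values a larger instance would place in that gap.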
Answering Conjunctive Queries under Updates
We consider the task of enumerating and counting answers to k-ary
conjunctive queries against relational databases that may be updated by
inserting or deleting tuples. We exhibit a new notion of q-hierarchical
conjunctive queries and show that these can be maintained efficiently in the
following sense. During a linear time preprocessing phase, we can build a data
structure that enables constant delay enumeration of the query results; and
when the database is updated, we can update the data structure and restart the
enumeration phase within constant time. For the special case of self-join free
conjunctive queries we obtain a dichotomy: if a query is not q-hierarchical,
then query enumeration with sublinear delay and sublinear update time
(and arbitrary preprocessing time) is impossible.
For answering Boolean conjunctive queries and for the more general problem of
counting the number of solutions of k-ary queries we obtain complete
dichotomies: if the query's homomorphic core is q-hierarchical, then the size
of the query result can be computed in linear time and maintained with
constant update time. Otherwise, the size of the query result cannot be
maintained with sublinear update time. All our lower bounds rely on the
OMv-conjecture, a conjecture on the hardness of online matrix-vector
multiplication that has recently emerged in the field of fine-grained
complexity to characterise the hardness of dynamic problems. The lower bound
for the counting problem additionally relies on the orthogonal vectors
conjecture, which in turn is implied by the strong exponential time hypothesis.
By sublinear we mean O(n^(1-ε)) for some ε > 0, where n is the size of the
active domain of the current database.
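A minimal instance of such constant-time maintenance, for the q-hierarchical self-join-free query Q(x) :- R(x), S(x), keeps per-value multiplicities and a running answer count. This is a toy sketch with invented names, not the paper's data structure:

```python
from collections import Counter

class JoinCounter:
    """Maintain the number of answers to Q(x) :- R(x), S(x) under tuple
    insertions and deletions, with O(1) work per update. Toy sketch;
    the class and relation names are invented for illustration."""

    def __init__(self):
        self.mult = {"R": Counter(), "S": Counter()}
        self.answers = 0  # |{x : x occurs in both R and S}|

    def insert(self, rel, x):
        other = self.mult["S" if rel == "R" else "R"]
        self.mult[rel][x] += 1
        if self.mult[rel][x] == 1 and other[x] > 0:
            self.answers += 1  # x just became an answer

    def delete(self, rel, x):
        other = self.mult["S" if rel == "R" else "R"]
        self.mult[rel][x] -= 1
        if self.mult[rel][x] == 0 and other[x] > 0:
            self.answers -= 1  # x just stopped being an answer

db = JoinCounter()
db.insert("R", 7); db.insert("S", 7)   # 7 is now an answer
db.insert("R", 8)                      # 8 occurs only in R
assert db.answers == 1
db.delete("R", 7)                      # 7 leaves R, hence leaves Q
assert db.answers == 0
```

Each update touches a constant number of counters, which is the flavour of constant update time the dichotomy promises for q-hierarchical queries.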
Reasoning about embedded dependencies using inclusion dependencies
The implication problem for the class of embedded dependencies is
undecidable. However, this does not imply the lack of a proof procedure, as
exemplified by the chase algorithm. In this paper we present a complete
axiomatization of embedded dependencies that is based on the chase and uses
inclusion dependencies and implicit existential quantification in the
intermediate steps of deductions.
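A single chase step for an inclusion dependency can be sketched as follows: any value the dependency requires but that is missing from the target relation is added in a fresh tuple padded with labelled nulls, which is where the implicit existential quantification enters. This is toy code with a positional-attribute convention assumed for the sketch:

```python
import itertools

_fresh = itertools.count()  # supplies labelled nulls _n0, _n1, ...

def chase_ind(r, s, r_attr, s_attr, s_arity):
    """One chase round for the inclusion dependency R[r_attr] ⊆ S[s_attr]:
    each value in column r_attr of R that is missing from column s_attr
    of S triggers a new S-tuple, padded with fresh labelled nulls.
    Toy sketch; tuples are plain Python tuples and attributes are
    positional indices."""
    present = {t[s_attr] for t in s}
    for t in r:
        v = t[r_attr]
        if v not in present:
            s.append(tuple(v if i == s_attr else f"_n{next(_fresh)}"
                           for i in range(s_arity)))
            present.add(v)
    return s

# R(emp, dept) with R[dept] ⊆ S[dept] for S(dept, manager): the chase
# invents a manager null for the department missing from S.
r = [("ann", "d1"), ("bob", "d2")]
s = [("d1", "carol")]
chase_ind(r, s, r_attr=1, s_attr=0, s_arity=2)
assert s[1][0] == "d2" and s[1][1].startswith("_n")
```

Repeating such steps until no dependency is violated is the chase; the axiomatization records these intermediate states as inclusion dependencies.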
XML Reconstruction View Selection in XML Databases: Complexity Analysis and Approximation Scheme
Query evaluation in an XML database requires reconstructing XML subtrees
rooted at nodes found by an XML query. Since XML subtree reconstruction can be
expensive, one approach to improve query response time is to use reconstruction
views - materialized XML subtrees of an XML document, whose nodes are
frequently accessed by XML queries. For this approach to be efficient, the
principal requirement is a framework for view selection. In this work, we are
the first to formalize and study the problem of XML reconstruction view
selection. The input is a tree T, in which every node v has a size s(v)
and a profit p(v), and a size limit L. The goal is to find a subset of
subtrees rooted at nodes v_1, ..., v_k such that s(v_1) + ... + s(v_k) ≤ L
and p(v_1) + ... + p(v_k) is maximal.
Furthermore, there is no overlap between any two subtrees selected in the
solution. We prove that this problem is NP-hard and present a fully
polynomial-time approximation scheme (FPTAS) as a solution.
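For intuition, the selection problem can be solved exactly by a small recursive search on toy instances; the NP-hardness and the FPTAS concern the general case, and the names below are illustrative:

```python
def best_profit(children, size, profit, root, limit):
    """Exact search for the subtree-selection problem above: pick a set
    of pairwise non-overlapping subtree roots whose sizes sum to at most
    `limit`, maximising total profit. Exponential-time toy solver for
    small trees only; the paper's FPTAS handles the general case."""
    def solve(frontier, budget):
        if not frontier:
            return 0
        v, rest = frontier[0], frontier[1:]
        # Option 1: do not select v; its children become selectable.
        best = solve(rest + children[v], budget)
        # Option 2: select the subtree rooted at v; since subtrees must
        # not overlap, none of v's descendants may then be selected.
        if size[v] <= budget:
            best = max(best, profit[v] + solve(rest, budget - size[v]))
        return best
    return solve([root], limit)

# Tiny tree: node 0 is the root with children 1 and 2.
children = {0: [1, 2], 1: [], 2: []}
size = {0: 5, 1: 2, 2: 2}
profit = {0: 4, 1: 3, 2: 3}
assert best_profit(children, size, profit, 0, limit=4) == 6  # pick 1 and 2
assert best_profit(children, size, profit, 0, limit=3) == 3  # room for one
```

The non-overlap constraint is what makes this harder than plain knapsack and motivates the tree-structured approximation scheme.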