Adding Logical Operators to Tree Pattern Queries on Graph-Structured Data
As data are increasingly modeled as graphs for expressing complex
relationships, the tree pattern query on graph-structured data becomes an
important type of query in real-world applications. Most practical query
languages, such as XQuery and SPARQL, support logical expressions using
logical-AND/OR/NOT operators to define structural constraints of tree patterns.
In this paper, (1) we propose generalized tree pattern queries (GTPQs) over
graph-structured data, which fully support propositional logic of structural
constraints. (2) We make a thorough study of fundamental problems including
satisfiability, containment and minimization, and analyze the computational
complexity and the decision procedures of these problems. (3) We propose a
compact graph representation of intermediate results and a pruning approach to
reduce the size of intermediate results and the number of join operations --
two factors that often impair the efficiency of traditional algorithms for
evaluating tree pattern queries. (4) We present an efficient algorithm for
evaluating GTPQs using 3-hop as the underlying reachability index. (5)
Experiments on both real-life and synthetic data sets demonstrate the
effectiveness and efficiency of our algorithm, which is from several times to
orders of magnitude faster than state-of-the-art algorithms in terms of
evaluation time, even for traditional tree pattern queries with only
conjunctive operations. Comment: 16 pages
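As a rough illustration of what a structural constraint built from AND/OR/NOT can look like, the sketch below evaluates such a constraint on a small directed graph. Everything in it is an assumption made for illustration: the tuple encoding of patterns, the toy `GRAPH`, and the names `matches`, `holds` and `descendants` are hypothetical, and reachability is computed by a plain depth-first search rather than the 3-hop index, compact intermediate-result representation or pruning described in the abstract.

```python
# Hypothetical toy data graph: node id -> (label, list of successor ids).
GRAPH = {
    1: ("a", [2, 3]),
    2: ("b", [4]),
    3: ("c", []),
    4: ("d", []),
}


def descendants(v):
    """Return all nodes reachable from v by one or more edges (plain DFS)."""
    seen, stack = set(), list(GRAPH[v][1])
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(GRAPH[u][1])
    return seen


# Assumed encoding:
#   pattern := ("node", label, formula)
#   formula := None | pattern | ("and", f, g) | ("or", f, g) | ("not", f)
def holds(formula, v):
    """Check a propositional structural constraint at data node v."""
    if formula is None:
        return True
    op = formula[0]
    if op == "and":
        return holds(formula[1], v) and holds(formula[2], v)
    if op == "or":
        return holds(formula[1], v) or holds(formula[2], v)
    if op == "not":
        return not holds(formula[1], v)
    # op == "node": some descendant of v must match the child pattern.
    return any(matches(formula, u) for u in descendants(v))


def matches(pattern, v):
    """A data node matches if its label agrees and its constraint holds."""
    _, label, formula = pattern
    return GRAPH[v][0] == label and holds(formula, v)


# "a" with a descendant "b" and no descendant "e".
q1 = ("node", "a", ("and", ("node", "b", None), ("not", ("node", "e", None))))
# "a" with a descendant "c" that itself has a descendant "d".
q2 = ("node", "a", ("node", "c", ("node", "d", None)))
print(matches(q1, 1), matches(q2, 1))
```

This naive evaluator recomputes reachability for every check, which is the kind of cost a reachability index is meant to avoid.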
Processing techniques for partial tree-pattern queries on XML data
In recent years, eXtensible Markup Language (XML) has become a de facto standard for exporting and exchanging data on the Web. XML structures data as trees. Querying capabilities are provided through patterns matched against the XML trees. Research on the processing of XML queries has focused mainly on tree-pattern queries. Tree-pattern queries are not appropriate for querying XML data sources whose structure is not fully known to the user, or for querying multiple data sources that structure information differently. Recently, a class of queries called Partial Tree-Pattern Queries (PTPQs) was identified. A central feature of PTPQs is that the structure can be specified fully, partially, or not at all in a query. For this reason, PTPQs can be used for flexibly querying XML data sources.
This thesis deals with processing techniques for PTPQs. In particular, it addresses the satisfiability, containment and minimization problems for PTPQs. In order to cope with structural expression derivation issues and to compare PTPQs, a set of inference rules is suggested and a canonical form for PTPQs that comprises all derived structural expressions is defined. This canonical form is used for determining necessary and sufficient conditions for PTPQ satisfiability.
The containment problem is studied both in the absence and in the presence of structural summaries of data called dimension graphs. It is shown that this problem cannot be characterized by homomorphisms between PTPQs, even when PTPQs are put in canonical form. In both cases of the problem, necessary and sufficient conditions for PTPQ containment are provided in terms of homomorphisms between PTPQs and (a possibly exponential number of) tree-pattern queries. This result is used to identify a subclass of PTPQs that strictly contains tree-pattern queries for which the containment problem can be fully characterized through the existence of homomorphisms. To cope with the high complexity of PTPQ containment, heuristic approaches for this problem are designed that trade accuracy for speed. The heuristic approaches add structural expressions to PTPQs, preserving equivalence, in order to increase the possibility that a homomorphism exists between two contained PTPQs. An implementation and extensive experimental evaluation of these heuristics show that they are useful in practice, and that they can be efficiently implemented in a query optimizer.
The goal of PTPQ minimization is to produce an equivalent PTPQ which is syntactically smaller in size. This problem is studied in the absence of structural summaries. It is shown that PTPQs cannot be minimized by removing redundant parts as is the case with certain classes of tree-pattern queries. It is also shown that, in general, a PTPQ does not have a unique minimal equivalent PTPQ. Finally, sound, but not complete, heuristic approaches for PTPQ minimization are presented. These approaches gradually trade execution time for accuracy.
Pattern tree-based XOLAP rollup operator for XML complex hierarchies
With the rise of XML as a standard for representing business data, XML data
warehousing appears as a suitable solution for decision-support applications.
In this context, it is necessary to allow OLAP analyses on XML data cubes.
Thus, XQuery extensions are needed. To define a formal framework and allow
much-needed performance optimizations on analytical queries expressed in
XQuery, defining an algebra is desirable. However, XML-OLAP (XOLAP) algebras
from the literature still largely rely on the relational model. Hence, we
propose in this paper a rollup operator based on a pattern tree in order to
handle multidimensional XML data expressed within complex hierarchies.
Completing Queries: Rewriting of Incomplete Web Queries under Schema Constraints
Reactive Web systems, Web services, and Web-based publish/
subscribe systems communicate events as XML messages, and in
many cases require composite event detection: it is not sufficient to react
to single event messages, but events have to be considered in relation to
other events that are received over time.
Emphasizing language design and formal semantics, we describe the
rule-based query language XChangeEQ for detecting composite events.
XChangeEQ is designed to completely cover and integrate the four complementary
querying dimensions: event data, event composition, temporal
relationships, and event accumulation. Semantics are provided as
model and fixpoint theories; while this is an established approach for rule
languages, it has not been applied for event queries before.
Training linear ranking SVMs in linearithmic time using red-black trees
We introduce an efficient method for training the linear ranking support
vector machine. The method combines cutting plane optimization with a red-black
tree based approach to subgradient calculations, and has O(m*s+m*log(m)) time
complexity, where m is the number of training examples, and s the average
number of non-zero features per example. Best previously known training
algorithms achieve the same efficiency only for restricted special cases,
whereas the proposed approach allows any real valued utility scores in the
training data. Experiments demonstrate the superior scalability of the proposed
approach, when compared to the fastest existing RankSVM implementations. Comment: 20 pages, 4 figures
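To give a feel for where an m*log(m) term can come from, the sketch below counts misordered pairs (a lower-utility example scored strictly higher than a higher-utility one) without enumerating all pairs. It is only an illustration under stated assumptions: it uses a Fenwick tree (binary indexed tree) in place of the red-black tree named in the abstract, it counts violations rather than computing the paper's subgradient, and the function name `count_misordered_pairs` is hypothetical.

```python
def count_misordered_pairs(scores, utilities):
    """Count pairs (i, j) with utilities[i] < utilities[j] but scores[i] > scores[j].

    Runs in O(m log m) by sweeping examples in order of utility and keeping
    the scores seen so far in a Fenwick tree indexed by score rank.
    """
    # Rank-compress scores so they can index the tree (ranks start at 1).
    ranks = {s: r for r, s in enumerate(sorted(set(scores)), start=1)}
    n = len(ranks)
    tree = [0] * (n + 1)

    def add(i):
        while i <= n:
            tree[i] += 1
            i += i & -i

    def prefix(i):
        # Number of inserted scores with rank <= i.
        total = 0
        while i > 0:
            total += tree[i]
            i -= i & -i
        return total

    order = sorted(range(len(scores)), key=lambda k: utilities[k])
    misordered, inserted, start = 0, 0, 0
    while start < len(order):
        # Process examples with equal utility as one group, so that ties in
        # utility are never counted as misordered.
        end = start
        while end < len(order) and utilities[order[end]] == utilities[order[start]]:
            end += 1
        for k in order[start:end]:
            # Inserted examples all have strictly lower utility; count those
            # whose score is strictly higher than the current one.
            misordered += inserted - prefix(ranks[scores[k]])
        for k in order[start:end]:
            add(ranks[scores[k]])
            inserted += 1
        start = end
    return misordered


print(count_misordered_pairs([3, 1, 2, 2, 5], [1, 2, 2, 3, 1]))  # prints 6
```

The utilities can be arbitrary real values here, since only their sort order and equality are used.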
Minimizing the average distance to a closest leaf in a phylogenetic tree
When performing an analysis on a collection of molecular sequences, it can be
convenient to reduce the number of sequences under consideration while
maintaining some characteristic of a larger collection of sequences. For
example, one may wish to select a subset of high-quality sequences that
represent the diversity of a larger collection of sequences. One may also wish
to specialize a large database of characterized "reference sequences" to a
smaller subset that is as close as possible on average to a collection of
"query sequences" of interest. Such a representative subset can be useful
whenever one wishes to find a set of reference sequences that is appropriate to
use for comparative analysis of environmentally-derived sequences, such as for
selecting "reference tree" sequences for phylogenetic placement of metagenomic
reads. In this paper we formalize these problems in terms of the minimization
of the Average Distance to the Closest Leaf (ADCL) and investigate algorithms
to perform the relevant minimization. We show that the greedy algorithm is not
effective, show that a variant of the Partitioning Among Medoids (PAM)
heuristic gets stuck in local minima, and develop an exact dynamic programming
approach. Using this exact program we note that the performance of PAM appears
to be good for simulated trees, and is faster than the exact algorithm for
small trees. On the other hand, the exact program gives solutions for all
numbers of leaves less than or equal to the given desired number of leaves,
while PAM only gives a solution for the pre-specified number of leaves. Via
application to real data, we show that the ADCL criterion chooses chimeric
sequences less often than random subsets, while the maximization of
phylogenetic diversity chooses them more often than random. These algorithms
have been implemented in publicly available software. Comment: Please contact us with any comments or questions
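The ADCL objective can be made concrete with a small sketch. It assumes that ADCL for a chosen subset is the mean, over all leaves, of the distance to the nearest chosen leaf, and that pairwise leaf distances are already available as a matrix. The matrix `dist` and the names `adcl` and `best_subset` are hypothetical. The search is exhaustive and therefore exponential in general; it is neither the exact dynamic program nor the PAM variant mentioned in the abstract.

```python
from itertools import combinations


def adcl(dist, selected):
    """Mean distance from every leaf to its closest leaf in `selected`."""
    return sum(min(dist[i][j] for j in selected) for i in range(len(dist))) / len(dist)


def best_subset(dist, k):
    """Brute-force search for a size-k subset of leaves minimizing ADCL."""
    return min(combinations(range(len(dist)), k), key=lambda s: adcl(dist, s))


# Hypothetical pairwise leaf distances: two tight pairs, far from each other.
dist = [
    [0, 2, 6, 6],
    [2, 0, 6, 6],
    [6, 6, 0, 2],
    [6, 6, 2, 0],
]

print(adcl(dist, (0,)))        # prints 3.5
chosen = best_subset(dist, 2)
print(chosen, adcl(dist, chosen))  # prints (0, 2) 1.0
```

With two leaves to keep, the minimizer picks one leaf from each tight pair, which matches the intuition that the subset should represent the diversity of the full collection.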
Capturing Topology in Graph Pattern Matching
Graph pattern matching is often defined in terms of subgraph isomorphism, an
NP-complete problem. To lower its complexity, various extensions of graph
simulation have been considered instead. These extensions allow pattern
matching to be conducted in cubic time. However, they fall short of capturing
the topology of data graphs, i.e., graphs may have a structure drastically
different from pattern graphs they match, and the matches found are often too
large to understand and analyze. To rectify these problems, this paper proposes
a notion of strong simulation, a revision of graph simulation, for graph
pattern matching. (1) We identify a set of criteria for preserving the topology
of graphs matched. We show that strong simulation preserves the topology of
data graphs and finds a bounded number of matches. (2) We show that strong
simulation retains the same complexity as earlier extensions of simulation, by
providing a cubic-time algorithm for computing strong simulation. (3) We
present the locality property of strong simulation, which allows us to
effectively conduct pattern matching on distributed graphs. (4) We
experimentally verify the effectiveness and efficiency of these algorithms,
using real-life data and synthetic data. Comment: VLDB201
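For context, plain graph simulation, the notion that strong simulation revises, can be computed by a simple fixpoint refinement. The sketch below is a naive version under stated assumptions: the input encoding (label dictionaries plus edge lists) and the name `graph_simulation` are hypothetical, no attempt is made to reach the cubic bound, and none of the additional criteria of strong simulation are implemented.

```python
def graph_simulation(q_labels, q_edges, g_labels, g_edges):
    """Return the maximal simulation as {pattern node: set of data nodes}.

    A data node v stays a candidate for pattern node u only if, for every
    pattern edge (u, u2), v has a successor that is a candidate for u2.
    If any pattern node ends with no candidates, there is no match and
    every set is returned empty.
    """
    succ = {v: set() for v in g_labels}
    for v, w in g_edges:
        succ[v].add(w)

    # Start from all label-compatible candidates, then refine to a fixpoint.
    sim = {u: {v for v, lab in g_labels.items() if lab == q_labels[u]}
           for u in q_labels}
    changed = True
    while changed:
        changed = False
        for u, u2 in q_edges:
            keep = {v for v in sim[u] if succ[v] & sim[u2]}
            if keep != sim[u]:
                sim[u] = keep
                changed = True

    if any(not candidates for candidates in sim.values()):
        return {u: set() for u in q_labels}
    return sim


# Pattern: an "A" node with an edge to a "B" node.
q_labels = {"x": "A", "y": "B"}
q_edges = [("x", "y")]
# Data graph: node 3 is labelled "A" but has no "B" successor.
g_labels = {1: "A", 2: "B", 3: "A"}
g_edges = [(1, 2)]

print(graph_simulation(q_labels, q_edges, g_labels, g_edges))
```

Because simulation only checks successors, a match can relate pattern nodes to data nodes in a structure quite unlike the pattern, which is the shortcoming the abstract sets out to address.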