Distributed Processing of Generalized Graph-Pattern Queries in SPARQL 1.1
We propose an efficient and scalable architecture for processing generalized
graph-pattern queries as they are specified by the current W3C recommendation
of the SPARQL 1.1 "Query Language" component. Specifically, the class of
queries we consider consists of sets of SPARQL triple patterns with labeled
property paths. From a relational perspective, this class resolves to
conjunctive queries of relational joins with additional graph-reachability
predicates. For the scalable, i.e., distributed, processing of queries of this
kind over very large RDF collections, we develop a suitable partitioning and
indexing scheme, which allows us to shard the RDF triples over an entire
cluster of compute nodes and to process an incoming SPARQL query over all of
the relevant graph partitions (and thus compute nodes) in parallel. Unlike most
prior works in this field, we specifically aim at the unified optimization and
distributed processing of queries consisting of both relational joins and
graph-reachability predicates. All communication among the compute nodes is
established via a proprietary, asynchronous communication protocol based on the
Message Passing Interface.
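As a rough illustration of what "relational joins with additional graph-reachability predicates" means, the following minimal single-machine sketch evaluates a query of the shape `{ start path_label+ ?y . ?y join_label ?z }` over an in-memory triple list. It is not the distributed architecture described in the abstract; the function names, the sample triples, and the restriction to a single `+` path from a fixed start node are assumptions made purely for illustration.

```python
from collections import defaultdict, deque


def reachable(triples, label, start):
    """Nodes reachable from `start` via one or more `label` edges (a `label+` path)."""
    adjacency = defaultdict(list)
    for s, p, o in triples:
        if p == label:
            adjacency[s].append(o)
    seen, queue = set(), deque(adjacency[start])
    while queue:
        node = queue.popleft()
        if node not in seen:  # the `seen` set also guards against cycles
            seen.add(node)
            queue.extend(adjacency[node])
    return seen


def path_join(triples, start, path_label, join_label):
    """Answer { start path_label+ ?y . ?y join_label ?z } as sorted (y, z) pairs."""
    targets = reachable(triples, path_label, start)  # reachability predicate
    # Relational join: keep join_label triples whose subject is a reachable node.
    return sorted((s, o) for s, p, o in triples if p == join_label and s in targets)


# Hypothetical sample data, not taken from the paper.
TRIPLES = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("carol", "worksAt", "acme"),
    ("bob", "worksAt", "initech"),
    ("dave", "worksAt", "acme"),
]
```

Calling `path_join(TRIPLES, "alice", "knows", "worksAt")` joins the nodes reachable from `alice` with their `worksAt` triples; `dave` is excluded because no `knows` path leads to him.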
Learning Tuple Probabilities
Learning the parameters of complex probabilistic-relational models from
labeled training data is a standard technique in machine learning, which has
been intensively studied in the subfield of Statistical Relational Learning
(SRL), but so far it remains an under-investigated topic in the context
of Probabilistic Databases (PDBs). In this paper, we focus on learning the
probability values of base tuples in a PDB from labeled lineage formulas. The
resulting learning problem can be viewed as the inverse problem to confidence
computations in PDBs: given a set of labeled query answers, learn the
probability values of the base tuples, such that the marginal probabilities of
the query answers again yield the assigned probability labels. We analyze
the learning problem from a theoretical perspective, cast it into an
optimization problem, and provide an algorithm based on stochastic gradient
descent. Finally, we conclude with an experimental evaluation on three real-world
datasets and one synthetic dataset, comparing our approach to various techniques
from SRL, reasoning in information extraction, and optimization.
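The learning problem stated above can be sketched in a few lines, under strong simplifying assumptions that are not taken from the paper: lineage formulas are read-once (each base tuple occurs at most once per formula, so marginals can be computed compositionally under tuple independence), the objective is the squared error between marginal and label, and the formula encoding, function names, and hyperparameters are hypothetical. Because a marginal is multilinear in each tuple probability, its partial derivative is the difference between the marginal with that tuple set to 1 and to 0.

```python
import random


def marginal(formula, probs):
    """Marginal probability of a read-once formula over independent tuples."""
    if isinstance(formula, str):
        return probs[formula]
    op, *args = formula
    if op == "not":
        return 1.0 - marginal(args[0], probs)
    values = [marginal(arg, probs) for arg in args]
    product = 1.0
    if op == "and":
        for value in values:
            product *= value
        return product
    if op == "or":
        for value in values:
            product *= 1.0 - value
        return 1.0 - product
    raise ValueError(f"unknown operator: {op}")


def variables(formula):
    """Set of base-tuple names occurring in a formula."""
    if isinstance(formula, str):
        return {formula}
    return set().union(*(variables(arg) for arg in formula[1:]))


def learn(examples, steps=20000, lr=0.1, seed=0):
    """SGD on squared error between lineage marginals and their labels."""
    rng = random.Random(seed)
    names = sorted(set().union(*(variables(f) for f, _ in examples)))
    probs = {name: 0.5 for name in names}
    for _ in range(steps):
        formula, label = rng.choice(examples)  # one labeled lineage per step
        error = marginal(formula, probs) - label
        # Multilinearity: dP/dp_i = P(p_i = 1) - P(p_i = 0).
        grads = {
            name: marginal(formula, {**probs, name: 1.0})
            - marginal(formula, {**probs, name: 0.0})
            for name in variables(formula)
        }
        for name, grad in grads.items():
            step = probs[name] - lr * 2.0 * error * grad
            probs[name] = min(1.0, max(0.0, step))  # keep a valid probability
    return probs


# Hypothetical labeled lineage formulas: (formula, target marginal).
EXAMPLES = [("t1", 0.8), (("and", "t1", "t2"), 0.4), (("or", "t2", "t3"), 0.6)]
```

For these consistent labels, the only tuple probabilities that reproduce all three marginals are t1 = 0.8, t2 = 0.5, and t3 = 0.2, so `learn(EXAMPLES)` should move toward those values. General lineage formulas with repeated tuples would need a proper confidence computation (for example, Shannon expansion) in place of `marginal`.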
Generalized Lineage-Aware Temporal Windows: Supporting Outer and Anti Joins in Temporal-Probabilistic Databases
The result of a temporal-probabilistic (TP) join with negation includes, at
each time point, the probability with which a tuple of a positive relation p
matches none of the tuples in a negative relation n, for a
given join condition θ. TP outer and anti joins thus resemble the
characteristics of relational outer and anti joins also in the case when there
exist time points at which input tuples from p have non-zero
probabilities to be true and input tuples from n have non-zero
probabilities to be false, respectively. For the computation of TP joins with
negation, we introduce generalized lineage-aware temporal windows, a mechanism
that binds an output interval to the lineages of all the matching valid tuples
of each input relation. We group the windows of two TP relations into three
disjoint sets based on the way attributes, lineage expressions and intervals
are produced. We compute all windows in an incremental manner, and we show that
pipelined computations allow for the direct integration of our approach into
PostgreSQL. We thereby alleviate the prevalent redundancies in the interval
computations of existing approaches, as confirmed by an extensive
experimental evaluation with real-world datasets.
A General Framework for Anytime Approximation in Probabilistic Databases
Anytime approximation algorithms that compute the probabilities of queries
over probabilistic databases can be of great use to statistical learning tasks.
So far, these approaches have been based on either (i) sampling or (ii)
branch-and-bound with model-based bounds. We present here a more general
branch-and-bound framework that extends the possible bounds by using
'dissociation', which yields tighter bounds.
Comment: 3 pages, 2 figures, submitted to StarAI 2018 Workshop
Learning word-to-concept mappings for automatic text classification
For both classification and retrieval of natural language text documents, the standard document representation is a term vector where a term is simply a morphological normal form of the corresponding word. A potentially better approach would be to map every word onto a concept, i.e., the proper word sense, and to use this additional information in the learning process. In this paper we address the problem of automatically classifying natural language text documents. We investigate the effect of word-to-concept mappings and word sense disambiguation techniques on improving classification accuracy. We use the WordNet thesaurus as a background knowledge base and propose a generative language model approach to document classification. We show experimental results comparing the performance of our model with Naive Bayes and SVM classifiers.
Extraction of Conditional Probabilities of the Relationships Between Drugs, Diseases, and Genes from PubMed Guided by Relationships in PharmGKB
Guided by curated associations between genes, treatments (i.e., drugs), and diseases in PharmGKB, we constructed n-way Bayesian networks based on conditional probability tables (CPTs) extracted from co-occurrence statistics over the entire PubMed corpus, producing a broad-coverage analysis of the relationships between these biological entities. The networks suggest hypotheses regarding drug mechanisms, treatment biomarkers, and/or potential markers of genetic disease. The CPTs enable Trio, an inferential database, to query indirect (inferred) relationships via an SQL-like query language.
Lineage-Aware Temporal Windows: Supporting Set Operations in Temporal-Probabilistic Databases
In temporal-probabilistic (TP) databases, the combination of the temporal and
the probabilistic dimension adds significant overhead to the computation of set
operations. Although set queries are guaranteed to yield linearly sized output
relations, existing solutions exhibit quadratic runtime complexity. They suffer
from redundant interval comparisons and additional joins for the formation of
lineage expressions. In this paper, we formally define the semantics of set
operations in TP databases and study their properties. For their efficient
computation, we introduce the lineage-aware temporal window, a mechanism that
directly binds intervals with lineage expressions. We suggest the lineage-aware
window advancer (LAWA) for producing the windows of two TP relations in
linearithmic time, and we implement all TP set operations based on LAWA. By
exploiting the flexibility of lineage-aware temporal windows, we perform direct
filtering of irrelevant intervals and finalization of output lineage
expressions and thus guarantee that no additional computational cost or buffer
space is needed. A series of experiments over both synthetic and real-world
datasets show that (a) our approach has predictable performance, depending only
on the input size and not on the number of time intervals per fact or their
overlap, and that (b) it outperforms state-of-the-art approaches in both
temporal and probabilistic databases.
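The idea of binding intervals directly to lineage expressions can be illustrated with a small sweep over the interval boundaries of two TP relations, restricted to a single fact. This is a simplified sketch and not the LAWA algorithm itself: the tuple layout `(start, end, lineage)`, the half-open intervals, the string encoding of lineage, and the function names are all assumptions. The sort dominates the cost, so the sketch runs in linearithmic time in the number of input intervals.

```python
def windows(r, s):
    """Yield-ready list of (start, end, lineage_r, lineage_s) windows.

    r and s hold (start, end, lineage) tuples with half-open intervals that
    do not overlap within the same relation; a lineage is None when the
    relation has no valid tuple in the window.
    """
    r, s = sorted(r), sorted(s)
    points = sorted({t for start, end, _ in r + s for t in (start, end)})
    out, i, j = [], 0, 0
    for lo, hi in zip(points, points[1:]):
        # Advance past intervals that ended at or before this window starts.
        while i < len(r) and r[i][1] <= lo:
            i += 1
        while j < len(s) and s[j][1] <= lo:
            j += 1
        lin_r = r[i][2] if i < len(r) and r[i][0] <= lo else None
        lin_s = s[j][2] if j < len(s) and s[j][0] <= lo else None
        if lin_r is not None or lin_s is not None:
            out.append((lo, hi, lin_r, lin_s))
    return out


def set_op(r, s, op):
    """TP union, intersection, or difference built on top of the windows."""
    result = []
    for lo, hi, a, b in windows(r, s):
        if op == "union":
            lineage = a if b is None else b if a is None else f"({a} OR {b})"
        elif op == "intersect":
            lineage = f"({a} AND {b})" if a is not None and b is not None else None
        elif op == "except":
            lineage = None if a is None else a if b is None else f"({a} AND NOT {b})"
        else:
            raise ValueError(f"unknown set operation: {op}")
        if lineage is not None:  # filter windows irrelevant to this operation
            result.append((lo, hi, lineage))
    return result


# Hypothetical input: one fact, valid over [1, 5) in r and [3, 8) in s.
R = [(1, 5, "r1")]
S = [(3, 8, "s1")]
```

For example, `set_op(R, S, "except")` keeps the fact over `[1, 3)` with lineage `r1` and over `[3, 5)` with lineage `(r1 AND NOT s1)`. Each output lineage is finalized in the same pass that produces its interval, so no separate join is needed to assemble it. Adjacent output windows with identical lineage are not coalesced in this sketch.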