Querying Incomplete Numerical Data: Between Certain and Possible Answers
Coping with Incomplete Data: Recent Advances
Handling incomplete data in a correct manner is a notoriously hard problem in databases. Theoretical approaches rely on the computationally hard notion of certain answers, while practical solutions rely on ad hoc query evaluation techniques based on three-valued logic. Can we find a middle ground, and produce correct answers efficiently? The paper surveys results of the last few years motivated by this question. We re-examine the notion of certainty itself, and show that it is much more varied than previously thought. We identify cases when certain answers can be computed efficiently and, short of that, provide deterministic and probabilistic approximation schemes for them. We look at the role of three-valued logic as used in SQL query evaluation, and discuss the correctness of the choice, as well as the necessity of such a logic for producing query answers.
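The gap between SQL's three-valued evaluation and certain answers can be seen on a two-table toy instance. The sketch below is not taken from the paper: the tables `r` and `s`, their contents, and the helper names are assumptions made for illustration. It uses Python's built-in sqlite3 module and computes the certain answers to the difference query by brute force, substituting each value of a small assumed domain for the null.

```python
import itertools
import sqlite3

# Hypothetical toy instance: R = {1}, S = {NULL} (None stands for SQL NULL).
R = [1]
S = [None]


def sql_answers():
    """Evaluate 'R minus S' in SQL, written once with NOT EXISTS and once with NOT IN."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE r (a INTEGER)")
    conn.execute("CREATE TABLE s (a INTEGER)")
    conn.executemany("INSERT INTO r VALUES (?)", [(v,) for v in R])
    conn.executemany("INSERT INTO s VALUES (?)", [(v,) for v in S])
    not_exists = conn.execute(
        "SELECT a FROM r WHERE NOT EXISTS (SELECT 1 FROM s WHERE s.a = r.a)"
    ).fetchall()
    not_in = conn.execute(
        "SELECT a FROM r WHERE a NOT IN (SELECT a FROM s)"
    ).fetchall()
    conn.close()
    return [row[0] for row in not_exists], [row[0] for row in not_in]


def certain_difference(r, s, domain):
    """Brute-force certain answers to R - S: intersect the results over
    every way of replacing the nulls in s by a value from `domain`."""
    null_positions = [i for i, v in enumerate(s) if v is None]
    answers = None
    for values in itertools.product(domain, repeat=len(null_positions)):
        world = list(s)
        for pos, val in zip(null_positions, values):
            world[pos] = val
        result = set(r) - set(world)
        answers = result if answers is None else answers & result
    return answers


if __name__ == "__main__":
    print(sql_answers())                      # NOT EXISTS vs. NOT IN
    print(certain_difference(R, S, [1, 2]))   # certain answers over domain {1, 2}
```

On this instance the NOT EXISTS formulation returns 1, although 1 is not a certain answer (the null could stand for 1), while the NOT IN formulation returns nothing. The brute-force enumeration is exponential in the number of nulls and only serves to make the definition concrete.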
Four Lessons in Versatility or How Query Languages Adapt to the Web
Exposing not only human-centered information, but machine-processable data on the Web is one of the commonalities of recent Web trends. It has enabled a new kind of applications and businesses where the data is used in ways not foreseen by the data providers. Yet this exposition has fractured the Web into islands of data, each in different Web formats: Some providers choose XML, others RDF, again others JSON or OWL, for their data, even in similar domains. This fracturing stifles innovation as application builders have to cope not only with one Web stack (e.g., XML technology) but with several ones, each of considerable complexity. With Xcerpt we have developed a rule- and pattern-based query language that aims to shield application builders from much of this complexity: In a single query language XML and RDF data can be accessed, processed, combined, and re-published. Though the need for combined access to XML and RDF data has been recognized in previous work (including the W3C’s GRDDL), our approach differs in four main aspects: (1) We provide a single language (rather than two separate or embedded languages), thus minimizing the conceptual overhead of dealing with disparate data formats. (2) Both the declarative (logic-based) and the operational semantics are unified in that they apply to querying XML and RDF in the same way. (3) We show that the resulting query language can be implemented reusing traditional database technology, if desirable. Nevertheless, we also give a unified evaluation approach based on interval labelings of graphs that is at least as fast as existing approaches for tree-shaped XML data, yet provides linear time and space querying also for many RDF graphs. We believe that Web query languages are the right tool for declarative data access in Web applications and that Xcerpt is a significant step towards a more convenient, yet highly efficient data access in a “Web of Data”.
On the Implementation of the Probabilistic Logic Programming Language ProbLog
The past few years have seen a surge of interest in the field of
probabilistic logic learning and statistical relational learning. In this
endeavor, many probabilistic logics have been developed. ProbLog is a recent
probabilistic extension of Prolog motivated by the mining of large biological
networks. In ProbLog, facts can be labeled with probabilities. These facts are
treated as mutually independent random variables that indicate whether these
facts belong to a randomly sampled program. Different kinds of queries can be
posed to ProbLog programs. We introduce algorithms that allow the efficient
execution of these queries, discuss their implementation on top of the
YAP-Prolog system, and evaluate their performance in the context of large
networks of biological entities. Comment: 28 pages; to appear in Theory and Practice of Logic Programming
(TPLP)
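To make the semantics described in this abstract concrete, here is a minimal Python sketch that computes the success probability of a path query by enumerating every randomly sampled program (possible world). The edge facts, their probabilities, and the function names are hypothetical. The exhaustive enumeration is exponential in the number of facts and is not one of the efficient algorithms the paper introduces; it only illustrates the independence assumption.

```python
import itertools

# Hypothetical probabilistic facts: edge (source, target) -> probability.
EDGES = {("a", "b"): 0.8, ("b", "c"): 0.5, ("a", "c"): 0.1}


def has_path(edges, src, dst):
    """Depth-first reachability over the edges present in one sampled world."""
    frontier, seen = [src], {src}
    while frontier:
        node = frontier.pop()
        if node == dst:
            return True
        for u, v in edges:
            if u == node and v not in seen:
                seen.add(v)
                frontier.append(v)
    return False


def success_probability(prob_edges, src, dst):
    """Sum the weights of all worlds in which the query path(src, dst) succeeds.
    Each fact is included independently with its labeled probability."""
    facts = list(prob_edges)
    total = 0.0
    for included in itertools.product([False, True], repeat=len(facts)):
        weight = 1.0
        for fact, keep in zip(facts, included):
            weight *= prob_edges[fact] if keep else 1.0 - prob_edges[fact]
        world = [fact for fact, keep in zip(facts, included) if keep]
        if has_path(world, src, dst):
            total += weight
    return total


if __name__ == "__main__":
    print(success_probability(EDGES, "a", "c"))
```

For these three facts the query succeeds through the direct edge or through the two-step path, so the result should agree with 1 - (1 - 0.1)(1 - 0.8 * 0.5) = 0.46 up to floating-point rounding.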
Classification-Aware Hidden-Web Text Database Selection
Many valuable text databases on the web have noncrawlable contents that are “hidden” behind
search interfaces. Metasearchers are helpful tools for searching over multiple such “hidden-web”
text databases at once through a unified query interface. An important step in the metasearching
process is database selection, or determining which databases are the most relevant for a given
user query. The state-of-the-art database selection techniques rely on statistical summaries of the
database contents, generally including the database vocabulary and associated word frequencies.
Unfortunately, hidden-web text databases typically do not export such summaries, so previous research
has developed algorithms for constructing approximate content summaries from document
samples extracted from the databases via querying. We present a novel “focused-probing” sampling
algorithm that detects the topics covered in a database and adaptively extracts documents that
are representative of the topic coverage of the database. Our algorithm is the first to construct
content summaries that include the frequencies of the words in the database. Unfortunately, Zipf’s
law practically guarantees that for any relatively large database, content summaries built from
moderately sized document samples will fail to cover many low-frequency words; in turn, incomplete
content summaries might negatively affect the database selection process, especially for short
queries with infrequent words. To enhance the sparse document samples and improve the database
selection decisions, we exploit the fact that topically similar databases tend to have similar
vocabularies, so samples extracted from databases with a similar topical focus can complement
each other. We have developed two database selection algorithms that exploit this observation.
The first algorithm proceeds hierarchically and selects the best categories for a query, and then
sends the query to the appropriate databases in the chosen categories. The second algorithm uses “shrinkage,” a statistical technique for improving parameter estimation in the face of sparse data,
to enhance the database content summaries with category-specific words. We describe how to modify
existing database selection algorithms to adaptively decide (at runtime) whether shrinkage is
beneficial for a query. A thorough evaluation over a variety of databases, including 315 real web databases
as well as TREC data, suggests that the proposed sampling methods generate high-quality
content summaries and that the database selection algorithms produce significantly more relevant
database selection decisions and overall search results than existing algorithms.
NYU, Stern School of Business, IOMS Department, Center for Digital Economy Research
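The shrinkage idea mentioned in this abstract can be sketched as a mixture of a database's sample-based word distribution with that of its topic category. The sketch below is an illustration under stated assumptions, not the authors' algorithm: the function name, the example word probabilities, and the fixed mixing weight `lam` are all hypothetical, and the abstract does not say how the weights are chosen.

```python
def shrink_summary(db_summary, category_summary, lam=0.7):
    """Mix a sparse database summary with its category summary.

    db_summary, category_summary: dicts mapping word -> estimated probability.
    lam: assumed fixed weight on the database's own estimate (hypothetical).
    Words absent from the document sample receive a nonzero estimate
    whenever the category summary contains them.
    """
    vocabulary = set(db_summary) | set(category_summary)
    return {
        word: lam * db_summary.get(word, 0.0)
        + (1.0 - lam) * category_summary.get(word, 0.0)
        for word in vocabulary
    }


if __name__ == "__main__":
    # Hypothetical summaries: the sample missed the rare word "hemophilia".
    sample_summary = {"heart": 0.5, "blood": 0.5}
    health_category = {"heart": 0.4, "blood": 0.4, "hemophilia": 0.2}
    print(shrink_summary(sample_summary, health_category))
```

With these inputs the low-frequency word that the sample failed to cover ends up with a small positive estimate, which is the effect the abstract attributes to shrinkage for short queries with infrequent words.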
Learning Tuple Probabilities
Learning the parameters of complex probabilistic-relational models from
labeled training data is a standard technique in machine learning, which has
been intensively studied in the subfield of Statistical Relational Learning
(SRL), but---so far---this is still an under-investigated topic in the context
of Probabilistic Databases (PDBs). In this paper, we focus on learning the
probability values of base tuples in a PDB from labeled lineage formulas. The
resulting learning problem can be viewed as the inverse problem to confidence
computations in PDBs: given a set of labeled query answers, learn the
probability values of the base tuples, such that the marginal probabilities of
the query answers again yield the assigned probability labels. We analyze
the learning problem from a theoretical perspective, cast it into an
optimization problem, and provide an algorithm based on stochastic gradient
descent. Finally, we conclude with an experimental evaluation on three real-world
and one synthetic dataset, thus comparing our approach to various techniques
from SRL, reasoning in information extraction, and optimization.
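The inverse problem described in this abstract can be illustrated with a small stochastic gradient descent sketch. It rests on simplifying assumptions that are not from the paper: each lineage is a conjunction of distinct, independent base tuples (so its marginal is a plain product), the loss is squared error, and the tuple names, labels, learning rate, and step count are hypothetical.

```python
import random


def marginal(lineage, probs):
    """Marginal probability of a conjunction of distinct independent tuples."""
    result = 1.0
    for t in lineage:
        result *= probs[t]
    return result


def loss(examples, probs):
    """Sum of squared differences between marginals and their labels."""
    return sum((marginal(lin, probs) - label) ** 2 for lin, label in examples)


def learn(examples, tuples, steps=2000, rate=0.1, seed=0):
    """Stochastic gradient descent on the squared error, one example per step."""
    rng = random.Random(seed)
    probs = {t: 0.5 for t in tuples}
    for _ in range(steps):
        lineage, label = rng.choice(examples)
        error = marginal(lineage, probs) - label
        # Compute all partial derivatives before updating any probability.
        grads = {}
        for t in lineage:
            others = 1.0
            for u in lineage:
                if u != t:
                    others *= probs[u]
            grads[t] = 2.0 * error * others
        for t, g in grads.items():
            # Clip so every value remains a valid probability.
            probs[t] = min(1.0, max(0.0, probs[t] - rate * g))
    return probs


if __name__ == "__main__":
    # Hypothetical labeled lineages, consistent with t1=0.9, t2=0.5, t3=0.2.
    examples = [(("t1",), 0.9), (("t1", "t2"), 0.45), (("t2", "t3"), 0.1)]
    learned = learn(examples, ["t1", "t2", "t3"])
    print(learned, loss(examples, learned))
```

General lineage formulas mix conjunction, disjunction, and negation, and computing their marginals is harder than the product used here; the sketch only shows the shape of the optimization problem.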
Top-k Querying of Unknown Values under Order Constraints
Many practical scenarios make it necessary to evaluate top-k queries over data items with partially unknown values. This paper considers a setting where the values are taken from a numerical domain, and where some partial order constraints are given over known and unknown values: under these constraints, we assume that all possible worlds are equally likely.
Our work is the first to propose a principled scheme to derive the value distributions and expected values of unknown items in this setting, with the goal of computing estimated top-k results by interpolating the unknown values from the known ones. We study the complexity of this general task, and show tight complexity bounds, proving that the problem is intractable, but
can be tractably approximated. We then consider the case of tree-shaped partial orders, where we show a constructive PTIME solution. We also compare our problem setting to other top-k definitions on uncertain data.
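A very restricted special case shows what interpolating unknown values from known ones can look like. Assume, for illustration only, that the unknown items form a single chain strictly between two known values. When all possible worlds are equally likely, the unknowns are distributed like the order statistics of independent uniform draws on that interval, so their expected values are evenly spaced. The function names and item labels below are hypothetical, and the sketch does not cover general or tree-shaped partial orders.

```python
def expected_values_on_chain(lower, upper, n_unknown):
    """Expected values of n unknowns constrained by
    lower < x_1 < ... < x_n < upper, under the uniform distribution
    over all worlds satisfying the chain: E[x_i] = lower + (upper - lower) * i / (n + 1).
    """
    step = (upper - lower) / (n_unknown + 1)
    return [lower + step * i for i in range(1, n_unknown + 1)]


def estimated_top_k(known, estimated, k):
    """Rank known and estimated items together and keep the k largest."""
    merged = {**known, **estimated}
    return sorted(merged, key=merged.get, reverse=True)[:k]


if __name__ == "__main__":
    # Hypothetical items: two known values and a chain of three unknowns between them.
    known = {"low": 0.0, "high": 1.0}
    estimates = expected_values_on_chain(known["low"], known["high"], 3)
    estimated = dict(zip(["x1", "x2", "x3"], estimates))
    print(estimated)
    print(estimated_top_k(known, estimated, 2))
```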