Duplicate Detection in Probabilistic Data
Collected data often contain uncertainties. Probabilistic databases have been proposed to manage uncertain data. To combine data from multiple autonomous probabilistic databases, an integration of probabilistic data has to be performed. Until now, however, data integration approaches have focused on the integration of certain (relational or XML) source data; there is no work so far on integrating uncertain, especially probabilistic, source data. In this paper, we present a first step towards a concise consolidation of probabilistic data. We focus on duplicate detection as a representative and essential step in an integration process. We present techniques for identifying multiple probabilistic representations of the same real-world entities. Furthermore, to increase the efficiency of the duplicate detection process, we introduce search-space reduction methods adapted to probabilistic data.
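A minimal sketch of the idea, not the paper's algorithm: each entity carries per-attribute value distributions, pairs are compared by their expected similarity over all value pairs, and a blocking key (here, the first letter of the most probable name, a hypothetical choice) reduces the search space.

```python
# Hedged sketch of duplicate detection over probabilistic tuples.
# Entities are modeled as per-attribute (value, probability) lists;
# similarity is the expected string similarity over all value pairs.
from difflib import SequenceMatcher
from itertools import combinations

def expected_similarity(dist_a, dist_b):
    """Expected similarity of two probabilistic attributes,
    each given as a list of (value, probability) pairs."""
    return sum(pa * pb * SequenceMatcher(None, va, vb).ratio()
               for va, pa in dist_a for vb, pb in dist_b)

def blocking_key(entity):
    # Search-space reduction: only compare entities whose most
    # probable 'name' value starts with the same letter.
    value, _ = max(entity["name"], key=lambda vp: vp[1])
    return value[0].lower()

def find_duplicates(entities, threshold=0.8):
    blocks = {}
    for e in entities:
        blocks.setdefault(blocking_key(e), []).append(e)
    pairs = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if expected_similarity(a["name"], b["name"]) >= threshold:
                pairs.append((a["id"], b["id"]))
    return pairs

entities = [
    {"id": 1, "name": [("John Smith", 0.7), ("Jon Smith", 0.3)]},
    {"id": 2, "name": [("John Smith", 0.9), ("J. Smith", 0.1)]},
    {"id": 3, "name": [("Mary Jones", 1.0)]},
]
print(find_duplicates(entities))  # -> [(1, 2)]
```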
Probabilistic data flow analysis: a linear equational approach
Speculative optimisation relies on the estimation of the probabilities that
certain properties of the control flow are fulfilled. Concrete or estimated
branch probabilities can be used for searching and constructing advantageous
speculative and bookkeeping transformations.
We present a probabilistic extension of the classical equational approach to
data-flow analysis that can be used to this purpose. More precisely, we show
how the probabilistic information introduced in a control flow graph by branch
prediction can be used to extract a system of linear equations from a program
and present a method for calculating correct (numerical) solutions.
Comment: In Proceedings GandALF 2013, arXiv:1307.416
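To make the equational approach concrete (a sketch under assumed numbers, not the paper's exact formalism): given branch probabilities on a control flow graph, the expected execution frequency of each node satisfies a linear system that a numerical solver handles directly.

```python
# Sketch: derive expected node-visit frequencies from branch
# probabilities by solving the linear system x = b + P^T x,
# i.e. (I - P^T) x = b.
import numpy as np

# Hypothetical CFG: 0=entry, 1=loop test, 2=then, 3=else, 4=join.
n = 5
P = np.zeros((n, n))          # P[i, j] = probability of edge i -> j
P[0, 1] = 1.0                 # entry -> test
P[1, 2], P[1, 3] = 0.6, 0.4   # predicted branch probabilities
P[2, 4] = P[3, 4] = 1.0       # both arms reach the join
P[4, 1] = 0.5                 # loop back with probability 0.5
                              # (the remaining 0.5 exits)

b = np.zeros(n)
b[0] = 1.0                    # the entry node runs exactly once

x = np.linalg.solve(np.eye(n) - P.T, b)
for node, freq in enumerate(x):
    print(f"node {node}: expected executions = {freq:.2f}")
# The test and join run twice on average; the arms split 0.6 / 0.4.
```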
A probabilistic data-driven model for planar pushing
This paper presents a data-driven approach to model planar pushing
interaction to predict both the most likely outcome of a push and its expected
variability. The learned models rely on a variation of Gaussian processes with
input-dependent noise called Variational Heteroscedastic Gaussian processes
(VHGP) that capture the mean and variance of a stochastic function. We show
that we can learn accurate models that outperform analytical models after less
than 100 samples and saturate in performance with less than 1000 samples. We
validate the results against a collected dataset of repeated trajectories, and
use the learned models to study questions such as the nature of the variability
in pushing, and the validity of the quasi-static assumption.
Comment: 8 pages, 11 figures, ICRA 201
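A simplified stand-in for the idea of input-dependent noise (not the paper's VHGP, and on synthetic data): one standard GP models the mean outcome, a second GP models the log of the squared residuals, so the predicted variability depends on the input.

```python
# Hedged approximation of a heteroscedastic GP using two ordinary GPs.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))             # hypothetical push parameter
noise_std = 0.05 + 0.3 * np.abs(X[:, 0])          # input-dependent variability
y = np.sin(3 * X[:, 0]) + rng.normal(0, noise_std)

kernel = RBF(length_scale=0.5) + WhiteKernel(1e-2)
gp_mean = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

# A second GP on the log squared residuals approximates the noise level.
resid2 = (y - gp_mean.predict(X)) ** 2
gp_var = GaussianProcessRegressor(kernel=RBF(0.5) + WhiteKernel(1e-2)).fit(
    X, np.log(resid2 + 1e-8))

X_test = np.linspace(-1, 1, 5).reshape(-1, 1)
mu = gp_mean.predict(X_test)
sigma = np.sqrt(np.exp(gp_var.predict(X_test)))
for x, m, s in zip(X_test[:, 0], mu, sigma):
    print(f"x={x:+.2f}: predicted outcome {m:+.3f} +/- {s:.3f}")
```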
Implications of probabilistic data modeling for rule mining
Mining association rules is an important technique for discovering meaningful patterns in transaction databases. In the current literature, the properties of algorithms to mine associations are discussed in great detail. In this paper we investigate properties of transaction data sets from a probabilistic point of view. We present a simple probabilistic framework for transaction data and its implementation using the R statistical computing environment. The framework can be used to simulate transaction data when no associations are present. We use such data to explore the ability of confidence and lift, two popular interest measures used for rule mining, to filter noise. Based on the framework we develop the measure hyperlift, and we compare this new measure to lift using simulated data and a real-world grocery database.
Series: Research Report Series / Department of Statistics and Mathematics
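The framework's noise experiment can be sketched in a few lines (the paper's implementation is in R; the Python below is a hedged approximation). The hyperlift definition used here (observed co-occurrence count divided by a high quantile of the hypergeometric distribution expected under independence) is my reading of the measure; treat the exact formula as an assumption.

```python
# Simulate transactions with independent items (no true associations)
# and compare lift with a hyperlift-style measure.
import numpy as np
from scipy.stats import hypergeom

rng = np.random.default_rng(1)
n_trans, n_items = 10_000, 50
item_probs = rng.uniform(0.005, 0.1, n_items)     # independent items
trans = rng.random((n_trans, n_items)) < item_probs

def lift(a, b):
    c_a, c_b = trans[:, a].sum(), trans[:, b].sum()
    c_ab = (trans[:, a] & trans[:, b]).sum()
    return (c_ab / n_trans) / ((c_a / n_trans) * (c_b / n_trans))

def hyperlift(a, b, delta=0.99):
    c_a, c_b = trans[:, a].sum(), trans[:, b].sum()
    c_ab = (trans[:, a] & trans[:, b]).sum()
    # delta-quantile of co-occurrence counts expected by pure chance
    q = hypergeom.ppf(delta, n_trans, c_b, c_a)
    return c_ab / max(q, 1)

# With no real associations, lift for rare item pairs can fluctuate
# well above 1, while the hyperlift-style measure stays near or below 1.
for a, b in [(0, 1), (2, 3), (4, 5)]:
    print(f"items ({a},{b}): lift={lift(a, b):.2f}  "
          f"hyperlift={hyperlift(a, b):.2f}")
```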
Qualitative Effects of Knowledge Rules in Probabilistic Data Integration
One of the problems in data integration is data overlap: different data sources hold data on the same real-world entities. Much development time in data integration projects is devoted to entity resolution. Advanced similarity measurement techniques are often used to remove semantic duplicates from the integration result or to solve other semantic conflicts, but it proves impossible to get rid of all semantic problems in data integration. An often-used rule of thumb states that about 90% of the development effort is devoted to solving the remaining 10% of hard cases. In an attempt to significantly decrease human effort at data integration time, we have proposed an approach that stores any remaining semantic uncertainty and conflicts in a probabilistic database, so that the result can already be used meaningfully. The main development effort in our approach is devoted to defining and tuning knowledge rules and thresholds. Rules and thresholds directly impact the size and quality of the integration result. We measure integration quality indirectly by measuring the quality of answers to queries on the integrated data set in an information-retrieval-like way. The main contribution of this report is an experimental investigation of the effects and sensitivity of rule definition and threshold tuning on integration quality. It shows that our approach indeed reduces development effort (rather than merely shifting it to rule definition and threshold tuning): setting rough safe thresholds and defining only a few rules suffices to produce a "good enough" integration that can be meaningfully used.
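An illustrative sketch of the rules-and-thresholds mechanism (not the report's actual system; record fields and the rule below are hypothetical): two thresholds split candidate pairs into certain matches, certain non-matches, and an uncertain band that is stored probabilistically instead of being resolved by hand, while knowledge rules can force a decision.

```python
def integrate(candidates, rules, t_low=0.3, t_high=0.9):
    """candidates: list of (pair, similarity); rules: predicates
    returning 'match', 'no-match', or None for a pair."""
    result = []
    for pair, sim in candidates:
        verdicts = [r(pair) for r in rules]
        if "no-match" in verdicts:
            continue                          # a rule resolves the conflict
        if "match" in verdicts or sim >= t_high:
            result.append((pair, 1.0))        # certain match
        elif sim >= t_low:
            result.append((pair, sim))        # kept as probabilistic
    return result

# Hypothetical knowledge rule: records cannot match if their years differ.
same_year = lambda pair: None if pair[0]["year"] == pair[1]["year"] else "no-match"

a = {"title": "Foo", "year": 2005}
b = {"title": "Foo?", "year": 2005}
c = {"title": "Foo", "year": 2001}
print(integrate([((a, b), 0.75), ((a, c), 0.75)], [same_year]))
# Only the (a, b) pair survives, stored with probability 0.75.
```

Lowering t_low or dropping a rule grows the probabilistic result; tightening them shrinks it, which is exactly the size/quality trade-off the report measures.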
Challenges for Efficient Query Evaluation on Structured Probabilistic Data
Query answering over probabilistic data is an important task but is generally
intractable. However, a new approach for this problem has recently been
proposed, based on structural decompositions of input databases, following,
e.g., tree decompositions. This paper presents a vision for a database
management system for probabilistic data built following this structural
approach. We review our existing and ongoing work on this topic and highlight
many theoretical and practical challenges that remain to be addressed.
Comment: 9 pages, 1 figure, 23 references. Accepted for publication at SUM 201
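A minimal sketch of why the general problem is intractable: on a tuple-independent probabilistic database, the probability of a Boolean query is a sum over all 2^n possible worlds. Structural, tree-decomposition-based approaches aim to avoid exactly this blow-up on well-behaved instances; the brute-force version looks like this (with a hypothetical instance).

```python
from itertools import product

def query_probability(tuples, probs, query):
    """tuples: list of facts; probs[i] = marginal probability of
    tuples[i]; query: predicate over a set of present facts."""
    total = 0.0
    for world in product([False, True], repeat=len(tuples)):
        p = 1.0
        for present, pr in zip(world, probs):
            p *= pr if present else (1.0 - pr)
        if query({t for t, present in zip(tuples, world) if present}):
            total += p
    return total

# Hypothetical instance: does any edge leave node 'a'?
edges = [("a", "b"), ("a", "c"), ("b", "c")]
probs = [0.5, 0.4, 0.9]
q = lambda world: any(src == "a" for src, _ in world)
print(query_probability(edges, probs, q))  # 1 - 0.5*0.6 = 0.7
```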
Probabilistic data types
Integrated master's dissertation (Mestrado Integrado) in Engenharia Informática.
Conflict-Free Replicated Data Types (CRDTs) provide deterministic outcomes from concurrent
executions. The conflict resolution mechanism uses information on the ordering of the last
operations performed, which indicates if a given operation is known by a replica, typically
using some variant of version vectors. This thesis will explore the construction of CRDTs
that use a novel stochastic mechanism that can track with high accuracy knowledge of the
occurrence of recently performed operations and with less accuracy for older operations.
The aim is to obtain better scaling properties and to avoid metadata that grows linearly with the number of replicas.
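A purely illustrative sketch of the recency-biased idea, not the thesis's mechanism: a fixed-size, hash-indexed summary of delivered operation ids. The most recently recorded operations are almost always reported correctly, while older operations are progressively overwritten and forgotten, and the metadata footprint stays constant rather than linear in the number of replicas.

```python
import hashlib

class StochasticOpTracker:
    """Hypothetical fixed-size tracker of recently seen operations."""

    def __init__(self, slots=64):
        self.slots = [None] * slots       # constant metadata footprint

    def _index(self, op_id):
        digest = hashlib.sha256(op_id.encode()).digest()
        return int.from_bytes(digest[:4], "big") % len(self.slots)

    def record(self, op_id):
        self.slots[self._index(op_id)] = op_id

    def probably_knows(self, op_id):
        # True answers are always correct (the full id is stored);
        # False may mean the op was overwritten by a newer one, so
        # knowledge of old operations decays stochastically.
        return self.slots[self._index(op_id)] == op_id

t = StochasticOpTracker()
for i in range(1000):
    t.record(f"replica1:{i}")
recent = sum(t.probably_knows(f"replica1:{i}") for i in range(990, 1000))
old = sum(t.probably_knows(f"replica1:{i}") for i in range(0, 10))
print(f"recent ops tracked: {recent}/10, old ops tracked: {old}/10")
```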