Duplicate Detection in Probabilistic Data
Collected data often contain uncertainties. Probabilistic databases have been proposed to manage uncertain data. To combine data from multiple autonomous probabilistic databases, an integration of probabilistic data has to be performed. Until now, however, data integration approaches have focused on the integration of certain (relational or XML) source data; there is no work so far on integrating uncertain, especially probabilistic, source data. In this paper, we present a first step towards a concise consolidation of probabilistic data. We focus on duplicate detection as a representative and essential step in an integration process. We present techniques for identifying multiple probabilistic representations of the same real-world entities. Furthermore, to increase the efficiency of the duplicate detection process, we introduce search-space reduction methods adapted to probabilistic data.
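A minimal sketch of the idea, not the paper's algorithm: each entity carries per-attribute value distributions, pairs are compared by their expected similarity over all value pairs, and a blocking key (here, the first letter of the most probable name, a hypothetical choice) reduces the search space.

```python
# Hedged sketch of duplicate detection over probabilistic tuples.
# Entities are modeled as per-attribute (value, probability) lists;
# similarity is the expected string similarity over all value pairs.
from difflib import SequenceMatcher
from itertools import combinations

def expected_similarity(dist_a, dist_b):
    """Expected similarity of two probabilistic attributes,
    each given as a list of (value, probability) pairs."""
    return sum(pa * pb * SequenceMatcher(None, va, vb).ratio()
               for va, pa in dist_a for vb, pb in dist_b)

def blocking_key(entity):
    # Search-space reduction: only compare entities whose most
    # probable 'name' value starts with the same letter.
    value, _ = max(entity["name"], key=lambda vp: vp[1])
    return value[0].lower()

def find_duplicates(entities, threshold=0.8):
    blocks = {}
    for e in entities:
        blocks.setdefault(blocking_key(e), []).append(e)
    pairs = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if expected_similarity(a["name"], b["name"]) >= threshold:
                pairs.append((a["id"], b["id"]))
    return pairs

entities = [
    {"id": 1, "name": [("John Smith", 0.7), ("Jon Smith", 0.3)]},
    {"id": 2, "name": [("John Smith", 0.9), ("J. Smith", 0.1)]},
    {"id": 3, "name": [("Mary Jones", 1.0)]},
]
print(find_duplicates(entities))  # -> [(1, 2)]
```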
Probabilistic data flow analysis: a linear equational approach
Speculative optimisation relies on the estimation of the probabilities that
certain properties of the control flow are fulfilled. Concrete or estimated
branch probabilities can be used for searching and constructing advantageous
speculative and bookkeeping transformations.
We present a probabilistic extension of the classical equational approach to
data-flow analysis that can be used to this purpose. More precisely, we show
how the probabilistic information introduced in a control flow graph by branch
prediction can be used to extract a system of linear equations from a program
and present a method for calculating correct (numerical) solutions.
Comment: In Proceedings GandALF 2013, arXiv:1307.416
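To make the equational approach concrete (a sketch under assumed numbers, not the paper's exact formalism): given branch probabilities on a control flow graph, the expected execution frequency of each node satisfies a linear system that a numerical solver handles directly.

```python
# Sketch: derive expected node-visit frequencies from branch
# probabilities by solving the linear system x = b + P^T x,
# i.e. (I - P^T) x = b.
import numpy as np

# Hypothetical CFG: 0=entry, 1=loop test, 2=then, 3=else, 4=join.
n = 5
P = np.zeros((n, n))          # P[i, j] = probability of edge i -> j
P[0, 1] = 1.0                 # entry -> test
P[1, 2], P[1, 3] = 0.6, 0.4   # predicted branch probabilities
P[2, 4] = P[3, 4] = 1.0       # both arms reach the join
P[4, 1] = 0.5                 # loop back with probability 0.5
                              # (the remaining 0.5 exits)

b = np.zeros(n)
b[0] = 1.0                    # the entry node runs exactly once

x = np.linalg.solve(np.eye(n) - P.T, b)
for node, freq in enumerate(x):
    print(f"node {node}: expected executions = {freq:.2f}")
# The test and join run twice on average; the arms split 0.6 / 0.4.
```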
A probabilistic data-driven model for planar pushing
This paper presents a data-driven approach to model planar pushing
interaction to predict both the most likely outcome of a push and its expected
variability. The learned models rely on a variation of Gaussian processes with
input-dependent noise called Variational Heteroscedastic Gaussian processes
(VHGP) that capture the mean and variance of a stochastic function. We show
that we can learn accurate models that outperform analytical models after less
than 100 samples and saturate in performance with less than 1000 samples. We
validate the results against a collected dataset of repeated trajectories, and
use the learned models to study questions such as the nature of the variability
in pushing, and the validity of the quasi-static assumption.
Comment: 8 pages, 11 figures, ICRA 201
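A simplified stand-in for the idea of input-dependent noise (not the paper's VHGP, and on synthetic data): one standard GP models the mean outcome, a second GP models the log of the squared residuals, so the predicted variability depends on the input.

```python
# Hedged approximation of a heteroscedastic GP using two ordinary GPs.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))             # hypothetical push parameter
noise_std = 0.05 + 0.3 * np.abs(X[:, 0])          # input-dependent variability
y = np.sin(3 * X[:, 0]) + rng.normal(0, noise_std)

kernel = RBF(length_scale=0.5) + WhiteKernel(1e-2)
gp_mean = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

# A second GP on the log squared residuals approximates the noise level.
resid2 = (y - gp_mean.predict(X)) ** 2
gp_var = GaussianProcessRegressor(kernel=RBF(0.5) + WhiteKernel(1e-2)).fit(
    X, np.log(resid2 + 1e-8))

X_test = np.linspace(-1, 1, 5).reshape(-1, 1)
mu = gp_mean.predict(X_test)
sigma = np.sqrt(np.exp(gp_var.predict(X_test)))
for x, m, s in zip(X_test[:, 0], mu, sigma):
    print(f"x={x:+.2f}: predicted outcome {m:+.3f} +/- {s:.3f}")
```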
Implications of probabilistic data modeling for rule mining
Mining association rules is an important technique for discovering meaningful patterns in transaction databases. In the current literature, the properties of algorithms to mine associations are discussed in great detail. In this paper we investigate properties of transaction data sets from a probabilistic point of view. We present a simple probabilistic framework for transaction data and its implementation using the R statistical computing environment. The framework can be used to simulate transaction data when no associations are present. We use such data to explore the ability of confidence and lift, two popular interest measures used for rule mining, to filter noise. Based on the framework we develop the measure hyperlift, and we compare this new measure to lift using simulated data and a real-world grocery database.
Series: Research Report Series / Department of Statistics and Mathematics
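The framework's noise experiment can be sketched in a few lines (the paper's implementation is in R; the Python below is a hedged approximation). The hyperlift definition used here (observed co-occurrence count divided by a high quantile of the hypergeometric distribution expected under independence) is my reading of the measure; treat the exact formula as an assumption.

```python
# Simulate transactions with independent items (no true associations)
# and compare lift with a hyperlift-style measure.
import numpy as np
from scipy.stats import hypergeom

rng = np.random.default_rng(1)
n_trans, n_items = 10_000, 50
item_probs = rng.uniform(0.005, 0.1, n_items)     # independent items
trans = rng.random((n_trans, n_items)) < item_probs

def lift(a, b):
    c_a, c_b = trans[:, a].sum(), trans[:, b].sum()
    c_ab = (trans[:, a] & trans[:, b]).sum()
    return (c_ab / n_trans) / ((c_a / n_trans) * (c_b / n_trans))

def hyperlift(a, b, delta=0.99):
    c_a, c_b = trans[:, a].sum(), trans[:, b].sum()
    c_ab = (trans[:, a] & trans[:, b]).sum()
    # delta-quantile of co-occurrence counts expected by pure chance
    q = hypergeom.ppf(delta, n_trans, c_b, c_a)
    return c_ab / max(q, 1)

# With no real associations, lift for rare item pairs can fluctuate
# well above 1, while the hyperlift-style measure stays near or below 1.
for a, b in [(0, 1), (2, 3), (4, 5)]:
    print(f"items ({a},{b}): lift={lift(a, b):.2f}  "
          f"hyperlift={hyperlift(a, b):.2f}")
```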
Qualitative Effects of Knowledge Rules in Probabilistic Data Integration
One of the problems in data integration is data overlap: different data sources hold data on the same real-world entities. Much development time in data integration projects is devoted to entity resolution. Advanced similarity measurement techniques are often used to remove semantic duplicates from the integration result or to solve other semantic conflicts, but it proves impossible to get rid of all semantic problems in data integration. An often-used rule of thumb states that about 90% of the development effort is devoted to solving the remaining 10% of hard cases. In an attempt to significantly decrease human effort at data integration time, we have proposed an approach that stores any remaining semantic uncertainty and conflicts in a probabilistic database, so that the result can already be used meaningfully. The main development effort in our approach is devoted to defining and tuning knowledge rules and thresholds. Rules and thresholds directly impact the size and quality of the integration result. We measure integration quality indirectly by measuring the quality of answers to queries on the integrated data set in an information-retrieval-like way. The main contribution of this report is an experimental investigation of the effects and sensitivity of rule definition and threshold tuning on integration quality. It shows that our approach indeed reduces development effort (rather than merely shifting it to rule definition and threshold tuning): setting rough safe thresholds and defining only a few rules suffices to produce a "good enough" integration that can be meaningfully used.
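An illustrative sketch of the rules-and-thresholds mechanism (not the report's actual system; record fields and the rule below are hypothetical): two thresholds split candidate pairs into certain matches, certain non-matches, and an uncertain band that is stored probabilistically instead of being resolved by hand, while knowledge rules can force a decision.

```python
def integrate(candidates, rules, t_low=0.3, t_high=0.9):
    """candidates: list of (pair, similarity); rules: predicates
    returning 'match', 'no-match', or None for a pair."""
    result = []
    for pair, sim in candidates:
        verdicts = [r(pair) for r in rules]
        if "no-match" in verdicts:
            continue                          # a rule resolves the conflict
        if "match" in verdicts or sim >= t_high:
            result.append((pair, 1.0))        # certain match
        elif sim >= t_low:
            result.append((pair, sim))        # kept as probabilistic
    return result

# Hypothetical knowledge rule: records cannot match if their years differ.
same_year = lambda pair: None if pair[0]["year"] == pair[1]["year"] else "no-match"

a = {"title": "Foo", "year": 2005}
b = {"title": "Foo?", "year": 2005}
c = {"title": "Foo", "year": 2001}
print(integrate([((a, b), 0.75), ((a, c), 0.75)], [same_year]))
# Only the (a, b) pair survives, stored with probability 0.75.
```

Lowering t_low or dropping a rule grows the probabilistic result; tightening them shrinks it, which is exactly the size/quality trade-off the report measures.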
Challenges for Efficient Query Evaluation on Structured Probabilistic Data
Query answering over probabilistic data is an important task but is generally
intractable. However, a new approach for this problem has recently been
proposed, based on structural decompositions of input databases, following,
e.g., tree decompositions. This paper presents a vision for a database
management system for probabilistic data built following this structural
approach. We review our existing and ongoing work on this topic and highlight
many theoretical and practical challenges that remain to be addressed.
Comment: 9 pages, 1 figure, 23 references. Accepted for publication at SUM 201
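A minimal sketch of why the general problem is intractable: on a tuple-independent probabilistic database, the probability of a Boolean query is a sum over all 2^n possible worlds. Structural, tree-decomposition-based approaches aim to avoid exactly this blow-up on well-behaved instances; the brute-force version looks like this (with a hypothetical instance).

```python
from itertools import product

def query_probability(tuples, probs, query):
    """tuples: list of facts; probs[i] = marginal probability of
    tuples[i]; query: predicate over a set of present facts."""
    total = 0.0
    for world in product([False, True], repeat=len(tuples)):
        p = 1.0
        for present, pr in zip(world, probs):
            p *= pr if present else (1.0 - pr)
        if query({t for t, present in zip(tuples, world) if present}):
            total += p
    return total

# Hypothetical instance: does any edge leave node 'a'?
edges = [("a", "b"), ("a", "c"), ("b", "c")]
probs = [0.5, 0.4, 0.9]
q = lambda world: any(src == "a" for src, _ in world)
print(query_probability(edges, probs, q))  # 1 - 0.5*0.6 = 0.7
```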
Probabilistic data types
Integrated master's dissertation (Mestrado Integrado) in Engenharia Informática.
Conflict-Free Replicated Data Types (CRDTs) provide deterministic outcomes from concurrent
executions. The conflict resolution mechanism uses information on the ordering of the last
operations performed, which indicates if a given operation is known by a replica, typically
using some variant of version vectors. This thesis will explore the construction of CRDTs
that use a novel stochastic mechanism that can track with high accuracy knowledge of the
occurrence of recently performed operations and with less accuracy for older operations.
The aim is to obtain better scaling properties and to avoid metadata that grows linearly with the number of replicas.
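A purely illustrative sketch of the recency-biased idea, not the thesis's mechanism: a fixed-size, hash-indexed summary of delivered operation ids. The most recently recorded operations are almost always reported correctly, while older operations are progressively overwritten and forgotten, and the metadata footprint stays constant rather than linear in the number of replicas.

```python
import hashlib

class StochasticOpTracker:
    """Hypothetical fixed-size tracker of recently seen operations."""

    def __init__(self, slots=64):
        self.slots = [None] * slots       # constant metadata footprint

    def _index(self, op_id):
        digest = hashlib.sha256(op_id.encode()).digest()
        return int.from_bytes(digest[:4], "big") % len(self.slots)

    def record(self, op_id):
        self.slots[self._index(op_id)] = op_id

    def probably_knows(self, op_id):
        # True answers are always correct (the full id is stored);
        # False may mean the op was overwritten by a newer one, so
        # knowledge of old operations decays stochastically.
        return self.slots[self._index(op_id)] == op_id

t = StochasticOpTracker()
for i in range(1000):
    t.record(f"replica1:{i}")
recent = sum(t.probably_knows(f"replica1:{i}") for i in range(990, 1000))
old = sum(t.probably_knows(f"replica1:{i}") for i in range(0, 10))
print(f"recent ops tracked: {recent}/10, old ops tracked: {old}/10")
```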