    Probabilistic Inference from Arbitrary Uncertainty using Mixtures of Factorized Generalized Gaussians

    This paper presents a general and efficient framework for probabilistic inference and learning from arbitrary uncertain information. It exploits the calculation properties of finite mixture models, conjugate families and factorization. Both the joint probability density of the variables and the likelihood function of the (objective or subjective) observation are approximated by a special mixture model, in such a way that any desired conditional distribution can be directly obtained without numerical integration. We have developed an extended version of the expectation maximization (EM) algorithm to estimate the parameters of mixture models from uncertain training examples (indirect observations). As a consequence, any piece of exact or uncertain information about both input and output values is consistently handled in the inference and learning stages. This ability, extremely useful in certain situations, is not found in most alternative methods. The proposed framework is formally justified from standard probabilistic principles and illustrative examples are provided in the fields of nonparametric pattern classification, nonlinear regression and pattern completion. Finally, experiments on a real application and comparative results over standard databases provide empirical evidence of the utility of the method in a wide range of applications
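The claim that conditional distributions follow "without numerical integration" can be illustrated with a minimal sketch. It assumes plain univariate Gaussians rather than the paper's generalized Gaussians, one input `x` and one output `y`, and an exact observation of `x`; the function names and the component layout are hypothetical and not taken from the paper.

```python
import math


def gauss_pdf(x, mean, std):
    """Density of a univariate Gaussian evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def conditional_mean(components, x_obs):
    """Return E[y | x = x_obs] for a mixture of factorized Gaussians.

    Each component is (weight, (mean_x, std_x), (mean_y, std_y)).
    Because every component factorizes as p_k(x) * p_k(y), conditioning
    on x only reweights the components, so no integration is required.
    """
    # Unnormalized posterior weight of each component given the observed x.
    resp = [w * gauss_pdf(x_obs, mx, sx) for w, (mx, sx), _ in components]
    total = sum(resp)
    # Average the per-component means of y with the normalized weights.
    return sum(r * comp[2][0] for r, comp in zip(resp, components)) / total


# Illustrative two-component mixture (made-up parameters).
components = [
    (0.5, (0.0, 1.0), (0.0, 1.0)),
    (0.5, (4.0, 1.0), (10.0, 1.0)),
]
```

Calling `conditional_mean(components, 2.0)` weights both components equally, since `x = 2.0` lies midway between the two input means; an observation near `0.0` pulls the estimate towards the first component's output mean.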

    Taming Data Explosion in Probabilistic Information Integration

    Data integration has been a challenging problem for decades. In an ambient environment, where many autonomous devices have their own information sources and network connectivity is ad hoc and peer-to-peer, it even becomes a serious bottleneck. To enable devices to exchange information without the need for interaction with a user at data integration time and without the need for extensive semantic annotations, a probabilistic approach seems rather promising. It simply teaches the device how to cope with the uncertainty occurring during data integration. Unfortunately, without any kind of world knowledge, almost everything becomes uncertain, hence maintaining all possibilities produces huge integrated information sources. In this paper, we claim that only very simple and generic rules are enough world knowledge to drastically reduce the amount of uncertainty, hence to tame the data explosion to a manageable size

    Uncertain data integration with probabilities

    Real world applications that deal with information extraction, such as business intelligence software or sensor data management, must often process data provided with varying degrees of uncertainty. Uncertainty can result from multiple or inconsistent sources, as well as approximate schema mappings. Modeling, managing and integrating uncertain data from multiple sources has been an active area of research in recent years. In particular, data integration systems free the user from the tedious tasks of finding relevant data sources, interacting with each source in isolation using its corresponding interface and combining data from multiple sources by providing a uniform query interface to gain access to the integrated information. Previous work has integrated uncertain data using representation models such as the possible worlds and probabilistic relations. We extend this work by determining the probabilities of possible worlds of an extended probabilistic relation. We also present an algorithm to determine when a given extended probabilistic relation can be obtained by the integration of two probabilistic relations and give the decomposed pairs of probabilistic relations
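As a rough illustration of the possible-worlds representation mentioned above, the sketch below enumerates the worlds of a small relation and their probabilities. It assumes the simpler tuple-independent model, in which each tuple exists independently with a given probability; it is not the extended probabilistic relation model of the paper, and the relation contents are made up.

```python
from itertools import product


def possible_worlds(relation):
    """Enumerate (world, probability) pairs for a tuple-independent relation.

    `relation` maps each tuple to its existence probability. A world is the
    set of tuples that are present; its probability is the product of p for
    present tuples and (1 - p) for absent ones.
    """
    tuples = list(relation)
    worlds = []
    for mask in product([False, True], repeat=len(tuples)):
        prob = 1.0
        for present, t in zip(mask, tuples):
            p = relation[t]
            prob *= p if present else (1.0 - p)
        world = frozenset(t for present, t in zip(mask, tuples) if present)
        worlds.append((world, prob))
    return worlds


# Hypothetical two-tuple relation.
relation = {("alice", "NYC"): 0.8, ("bob", "LA"): 0.5}
worlds = dict(possible_worlds(relation))
```

The enumeration is exponential in the number of tuples, so it is only meant to make the semantics concrete for very small inputs.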

    An Answer Explanation Model for Probabilistic Database Queries

    With huge amounts of uncertain data becoming available in recent years from a diverse range of applications, such as sensors, machine learning or mining approaches, and information extraction and integration, we have seen a revival of interest in probabilistic databases. Queries over these databases result in probabilistic answers. As the process of arriving at these answers is based on the underlying stored uncertain data, we argue that from the standpoint of an end user, it is helpful for such a system to give an explanation of how it arrives at an answer and on which uncertainty assumptions the derived answer is based. In this way, the user with his/her own knowledge can decide how much confidence to place in this probabilistic answer. The aim of this paper is to design such an answer explanation model for probabilistic database queries. We report our design principles and show the methods to compute the answer explanations. One of the main contributions of our model is that it fills the gap between giving only the answer probability, and giving the full derivation. Furthermore, we show how to balance verifiability and influence of explanation components through the concept of verifiable views. The behavior of the model and its computational efficiency are demonstrated through an extensive performance study.

    Probabilistic Shortest Time Queries Over Uncertain Road Networks

    In many real applications such as location-based services (LBS), map utilities, trip planning, and transportation systems, it is important to provide query services over spatial road networks. Nowadays we can easily obtain rich traffic information such as the speeds of vehicles on roads. However, due to the inaccuracy of devices or integration inconsistencies, the traffic data (i.e., speeds) are often imprecise and uncertain. In this paper, we model road networks by uncertain graphs, which contain edges that are associated with probabilistic velocities. We formalize the problem of the probabilistic shortest time query (PSTQ), and we propose time bound pruning and probabilistic bound pruning to filter out false alarms. Moreover, we design offline pre-computation to facilitate PSTQ processing.
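To make the notion of probabilistic velocities concrete, the sketch below computes the probability that a fixed path can be traversed within a time budget. It assumes each edge has a small discrete velocity distribution and that edges are independent, and it uses brute-force enumeration; it does not implement the pruning or pre-computation techniques of the paper, and all names and numbers are hypothetical.

```python
from itertools import product


def prob_time_within(path_edges, budget):
    """Return P(total travel time <= budget) along a path.

    Each edge is (length, [(velocity, probability), ...]). Edges are
    assumed independent, so the probability of a velocity combination
    is the product of the per-edge probabilities.
    """
    total = 0.0
    for combo in product(*(dist for _, dist in path_edges)):
        # Travel time for this combination: sum of length / velocity.
        time = sum(length / v for (length, _), (v, _) in zip(path_edges, combo))
        if time <= budget:
            p = 1.0
            for _, pv in combo:
                p *= pv
            total += p
    return total


# Hypothetical two-edge path: (length, velocity distribution).
path = [
    (10.0, [(10.0, 0.5), (5.0, 0.5)]),
    (20.0, [(20.0, 0.8), (10.0, 0.2)]),
]
```

A query of this kind would then compare candidate paths by such probabilities; the enumeration grows exponentially with path length, which is one reason bound-based pruning is attractive.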

    Probabilistic constraint reasoning

    Dissertation presented for the degree of Doctor in Informatics Engineering (Engenharia Informática) at Universidade Nova de Lisboa, Faculdade de Ciências e Tecnologia. The continuous constraint paradigm has often been used to model safe reasoning in applications where uncertainty arises. Constraint propagation propagates intervals of uncertainty among the variables of the problem, eliminating values that do not belong to any solution. However, constraint programming is very conservative: if initial intervals are wide (reflecting large uncertainty), the obtained safe enclosure of all consistent scenarios may be inadequately wide for decision support. Since all scenarios are considered equally likely, insufficient pruning leads to great inefficiency if some costly decisions may be justified by very unlikely scenarios. Even when probabilistic information is available for the variables of the problem, the continuous constraint paradigm is unable to incorporate and reason with such information. Therefore, it is incapable of distinguishing between different scenarios based on their likelihoods. This thesis presents a probabilistic continuous constraint paradigm that associates a probabilistic space with the variables of the problem, enabling probabilistic reasoning to complement the underlying constraint reasoning. Such reasoning is used to address probabilistic queries and requires the computation of multi-dimensional integrals on possibly nonlinear integration regions. Suitable algorithms for such queries are developed, using safe or approximate integration techniques and relying on methods from continuous constraint programming in order to compute safe covers of the integration region.
    The thesis illustrates the adequacy of the probabilistic continuous constraint framework for decision support in nonlinear continuous problems with uncertain information, namely on inverse and reliability problems, two different types of engineering problems where the developed framework is particularly adequate to support decision makers.

    Scalable Statistical Modeling and Query Processing over Large Scale Uncertain Databases

    The past decade has witnessed a large number of novel applications that generate imprecise, uncertain and incomplete data. Examples include monitoring infrastructures such as RFIDs, sensor networks and web-based applications such as information extraction, data integration, social networking and so on. In my dissertation, I addressed several challenges in managing such data and developed algorithms for efficiently executing queries over large volumes of such data. Specifically, I focused on the following challenges. First, for meaningful analysis of such data, we need the ability to remove noise and infer useful information from uncertain data. To address this challenge, I first developed a declarative system for applying dynamic probabilistic models to databases and data streams. The output of such probabilistic modeling is probabilistic data, i.e., data annotated with probabilities of correctness/existence. Often, the data also exhibits strong correlations. Although there is prior work in managing and querying such probabilistic data using probabilistic databases, those approaches largely assume independence and cannot handle probabilistic data with rich correlation structures. Hence, I built a probabilistic database system that can manage large-scale correlations and developed algorithms for efficient query evaluation. Our system allows users to provide uncertain data as input and to specify arbitrary correlations among the entries in the database. In the back end, we represent correlations as a forest of junction trees, an alternative representation for probabilistic graphical models (PGM). We execute queries over the probabilistic database by transforming them into message passing algorithms (inference) over the junction tree. However, traditional algorithms over junction trees typically require accessing the entire tree, even for small queries. 
    Hence, I developed an index data structure over the junction tree called INDSEP that allows us to circumvent this process and thereby scalably evaluate inference queries, aggregation queries and SQL queries over the probabilistic database. Finally, query evaluation in probabilistic databases typically returns output tuples along with their probability values. However, the existing query evaluation model provides very little intuition to the users: for instance, a user might want to know "Why is this tuple in my result?", "Why does this output tuple have such high probability?" or "Which are the most influential input tuples for my query?" Hence, I designed a query evaluation model, and a suite of algorithms, that provide users with explanations for query results, and enable users to perform sensitivity analysis to better understand the query results.

    Qualitative Effects of Knowledge Rules in Probabilistic Data Integration

    One of the problems in data integration is data overlap: the fact that different data sources have data on the same real world entities. Much development time in data integration projects is devoted to entity resolution. Often advanced similarity measurement techniques are used to remove semantic duplicates from the integration result or solve other semantic conflicts, but it proves impossible to get rid of all semantic problems in data integration. An often-used rule of thumb states that about 90% of the development effort is devoted to solving the remaining 10% hard cases. In an attempt to significantly decrease human effort at data integration time, we have proposed an approach that stores any remaining semantic uncertainty and conflicts in a probabilistic database, so that the integration result can already be meaningfully used. The main development effort in our approach is devoted to defining and tuning knowledge rules and thresholds. Rules and thresholds directly impact the size and quality of the integration result. We measure integration quality indirectly by measuring the quality of answers to queries on the integrated data set in an information retrieval-like way. The main contribution of this report is an experimental investigation of the effects and sensitivity of rule definition and threshold tuning on the integration quality. This proves that our approach indeed reduces development effort, rather than merely shifting the effort to rule definition and threshold tuning, by showing that setting rough safe thresholds and defining only a few rules suffices to produce a ‘good enough’ integration that can be meaningfully used.

    Duplicate Detection in Probabilistic Data

    Collected data often contains uncertainties. Probabilistic databases have been proposed to manage uncertain data. To combine data from multiple autonomous probabilistic databases, an integration of probabilistic data has to be performed. Until now, however, data integration approaches have focused on the integration of certain source data (relational or XML). There is no work on the integration of uncertain (esp. probabilistic) source data so far. In this paper, we present a first step towards a concise consolidation of probabilistic data. We focus on duplicate detection as a representative and essential step in an integration process. We present techniques for identifying multiple probabilistic representations of the same real-world entities. Furthermore, for increasing the efficiency of the duplicate detection process we introduce search space reduction methods adapted to probabilistic data