    Deriving Probabilistic Databases with Inference Ensembles

    Many real-world applications deal with uncertain or missing data, prompting a surge of activity in the area of probabilistic databases. A shortcoming of prior work is the assumption that an appropriate probabilistic model, along with the necessary probability distributions, is given. We address this shortcoming by presenting a framework for learning a set of inference ensembles, termed meta-rule semi-lattices, or MRSL, from the complete portion of the data. We use the MRSL to infer probability distributions for missing data, and demonstrate experimentally that high accuracy is achieved when a single attribute value is missing per tuple. We next propose an inference algorithm based on Gibbs sampling that accurately predicts the probability distribution for multiple missing values. We also develop an optimization that greatly improves performance of multi-attribute inference for collections of tuples, while maintaining high accuracy. Finally, we develop an experimental framework to evaluate the efficiency and accuracy of our approach.
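
    The MRSL ensembles themselves are not specified in the abstract, so the sketch below only illustrates the Gibbs-sampling step for multiple missing values: one missing attribute at a time is resampled from its conditional distribution given the current values of the others, and the visited states are counted to estimate the joint distribution. The `cond_dist` callable is a hypothetical stand-in for the learned inference ensembles.

```python
import random
from collections import defaultdict

def gibbs_impute(observed, missing_attrs, cond_dist, n_iter=2000, burn_in=500):
    """Estimate the joint distribution over several missing attribute values
    of one tuple by Gibbs sampling.

    cond_dist(attr, context) -> {value: probability} is a hypothetical
    stand-in for the learned inference ensembles (MRSL) of the paper.
    """
    state = dict(observed)
    for a in missing_attrs:  # crude initialisation from the partial context
        dist = cond_dist(a, state)
        state[a] = random.choices(list(dist), weights=list(dist.values()))[0]
    counts = defaultdict(int)
    for it in range(n_iter):
        for a in missing_attrs:
            # Resample attribute a conditioned on all other current values.
            context = {k: v for k, v in state.items() if k != a}
            dist = cond_dist(a, context)
            state[a] = random.choices(list(dist), weights=list(dist.values()))[0]
        if it >= burn_in:  # discard burn-in samples, then count joint states
            counts[tuple(state[a] for a in missing_attrs)] += 1
    total = sum(counts.values())
    return {vals: c / total for vals, c in counts.items()}
```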

    Learning Tuple Probabilities

    Learning the parameters of complex probabilistic-relational models from labeled training data is a standard technique in machine learning, which has been intensively studied in the subfield of Statistical Relational Learning (SRL), but so far it remains an under-investigated topic in the context of Probabilistic Databases (PDBs). In this paper, we focus on learning the probability values of base tuples in a PDB from labeled lineage formulas. The resulting learning problem can be viewed as the inverse problem to confidence computations in PDBs: given a set of labeled query answers, learn the probability values of the base tuples, such that the marginal probabilities of the query answers again yield the assigned probability labels. We analyze the learning problem from a theoretical perspective, cast it into an optimization problem, and provide an algorithm based on stochastic gradient descent. Finally, we conclude with an experimental evaluation on three real-world datasets and one synthetic dataset, comparing our approach to various techniques from SRL, reasoning in information extraction, and optimization.
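
    A minimal sketch of casting this learning problem as stochastic gradient descent follows, under a strong simplifying assumption that the paper does not make: each lineage formula is a plain conjunction of independent base tuples, so its marginal probability is just the product of the tuple probabilities. The function name `sgd_tuple_probs` and its interface are hypothetical.

```python
import math
import random

def sgd_tuple_probs(lineages, labels, tuple_ids, epochs=500, lr=0.5):
    """Learn base-tuple probabilities by SGD so that the marginal probability
    of each labeled lineage formula matches its probability label.

    Simplification: every lineage is a conjunction of independent tuples;
    the paper handles general lineage formulas.
    """
    sig = lambda x: 1.0 / (1.0 + math.exp(-x))
    w = {t: 0.0 for t in tuple_ids}  # sigmoid(w) keeps each prob in (0, 1)
    data = list(zip(lineages, labels))
    for _ in range(epochs):
        random.shuffle(data)
        for lin, y in data:
            p = {t: sig(w[t]) for t in lin}
            prob = math.prod(p.values())  # P(t1 AND t2 AND ...)
            for t in lin:
                # Squared loss; d(prob)/dw_t = (prob / p_t) * p_t * (1 - p_t)
                w[t] -= lr * 2 * (prob - y) * (prob / p[t]) * p[t] * (1 - p[t])
    return {t: sig(v) for t, v in w.items()}
```

    Under these assumptions, a call such as `sgd_tuple_probs([("t1", "t2"), ("t2",)], [0.35, 0.7], ["t1", "t2"])` should drive the learned probabilities toward roughly 0.5 and 0.7, since 0.5 * 0.7 = 0.35.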

    Defining and characterising structural uncertainty in decision analytic models

    An inappropriate structure for a decision analytic model can potentially invalidate estimates of cost-effectiveness and estimates of the value of further research. However, there are often a number of alternative and credible structural assumptions that can be made. Although it is common practice to acknowledge potential limitations in model structure, there is a lack of clarity about methods to characterise the uncertainty surrounding alternative structural assumptions and their contribution to decision uncertainty. A review of decision models commissioned by the NHS Health Technology Programme was undertaken to identify the types of model uncertainty described in the literature. A second review was undertaken to identify approaches to characterising these uncertainties. The assessment of structural uncertainty has received little attention in the health economics literature. A common method of characterising structural uncertainty is to compute results for each alternative model specification and to present the alternative results as scenario analyses; it is then left to the decision maker to assess the credibility of the alternative structures when interpreting the range of results. The review of methods to explicitly characterise structural uncertainty identified two approaches: 1) model averaging, where alternative models with different specifications are built and their results averaged using explicit prior distributions, often based on expert opinion, and 2) model selection on the basis of prediction performance or goodness of fit. For a number of reasons, these methods are neither appropriate nor desirable for characterising structural uncertainty in decision analytic models. When faced with a choice between multiple models, another method can be employed that allows structural uncertainty to be explicitly considered and does not ignore potentially relevant model structures: uncertainty can be directly characterised (or parameterised) in the model itself. This method is analogous to model averaging over individual or sets of model inputs, but it also allows the value of information associated with resolving structural uncertainties to be estimated.
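
    As a loose sketch of that last point (not the paper's own formulation), parameterising structural uncertainty inside the model can be pictured as drawing a structural "switch" from an explicit prior on every simulation iteration, so that structural and parameter uncertainty propagate through together instead of being averaged after the fact. The names `model_variants` and `prior_weights` below are hypothetical.

```python
import random

def simulate_with_structural_uncertainty(model_variants, prior_weights,
                                         n_sims=10000):
    """On each simulation iteration, sample one structural variant from an
    explicit prior and run it, so structural uncertainty is propagated
    alongside parameter uncertainty.

    model_variants: hypothetical list of callables, each returning one
    simulated outcome (e.g. an incremental net benefit draw).
    """
    draws = []
    for _ in range(n_sims):
        variant = random.choices(model_variants, weights=prior_weights)[0]
        draws.append(variant())
    return sum(draws) / len(draws), draws  # mean outcome plus raw draws
```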

    Learning in imbalanced relational data

    Traditional learning techniques learn from flat data files under the assumption that each class has a similar number of examples. However, most real-world data are stored in relational systems with imbalanced class distributions, where one class is over-represented compared with the others. We propose to extend a relational learning technique, Probabilistic Relational Models (PRMs), to deal with the imbalanced-class problem. We address learning from imbalanced relational data using an ensemble of PRMs and propose a new model: PRMs-IM. We demonstrate the performance of PRMs-IM on a real university relational database, identifying students at risk.
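
    The abstract does not spell out how the ensemble is assembled; one common reading, sketched below under that assumption, trains each sub-model on the full minority class plus an equally sized random subsample of the majority class and combines predictions by majority vote. The PRM learner itself is abstracted into a hypothetical `train_model` callable.

```python
import random

def train_balanced_ensemble(majority, minority, train_model, n_models=5):
    """Train several sub-models, each on a balanced subsample, and predict
    by majority vote.

    train_model: hypothetical callable that fits one classifier (a PRM in
    the paper) on a training set and returns a predict function.
    """
    models = [
        # Full minority class + equally sized random majority subsample.
        train_model(random.sample(majority, len(minority)) + minority)
        for _ in range(n_models)
    ]
    def predict(x):
        votes = [m(x) for m in models]
        return max(set(votes), key=votes.count)  # most common vote wins
    return predict
```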

    Time Series Cluster Kernel for Learning Similarities between Multivariate Time Series with Missing Data

    Similarity-based approaches represent a promising direction for time series analysis. However, many such methods rely on parameter tuning, and some have shortcomings if the time series are multivariate (MTS), due to dependencies between attributes, or if the time series contain missing data. In this paper, we address these challenges within the powerful context of kernel methods by proposing the robust time series cluster kernel (TCK). The approach leverages the missing-data handling properties of Gaussian mixture models (GMMs) augmented with informative prior distributions. An ensemble learning approach is exploited to ensure robustness to parameter choices by combining the clustering results of many GMMs to form the final kernel. We evaluate the TCK on synthetic and real data and compare it to other state-of-the-art techniques. The experimental results demonstrate that the TCK is robust to parameter choices, provides competitive results for MTS without missing data, and gives outstanding results when data are missing.
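
    As a loose illustration of the ensemble idea only (not the full TCK, which fits GMMs with informative priors directly to multivariate time series with missing values), the sketch below builds a kernel matrix by counting how often two samples are assigned to the same cluster across many randomised GMM runs; here each series is assumed to be flattened into a fully observed fixed-length row of `X`.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def ensemble_cluster_kernel(X, n_runs=30, max_components=10, seed=0):
    """Build a similarity (kernel) matrix from co-cluster assignments
    across many GMM runs with randomised numbers of components.

    X: (n_samples, n_features) array of fully observed, flattened series;
    the real TCK handles missing values via GMMs with informative priors.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    K = np.zeros((n, n))
    for _ in range(n_runs):
        gmm = GaussianMixture(
            n_components=int(rng.integers(2, max_components + 1)),
            random_state=int(rng.integers(1_000_000)),
        ).fit(X)
        labels = gmm.predict(X)
        K += labels[:, None] == labels[None, :]  # co-assignment indicator
    return K / n_runs
```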

    Fast Search-By-Classification for Large-Scale Databases Using Index-Aware Decision Trees and Random Forests

    The vast amounts of data collected in various domains pose great challenges to modern data exploration and analysis. To find "interesting" objects in large databases, users typically define a query using positive and negative example objects and train a classification model to identify the objects of interest in the entire data catalog. However, this approach requires a scan of all the data to apply the classification model to each instance in the data catalog, making it prohibitively expensive to employ in large-scale databases serving many users and queries interactively. In this work, we propose a novel framework for such search-by-classification scenarios that allows users to interactively search for target objects by specifying queries through a small set of positive and negative examples. Unlike previous approaches, our framework can rapidly answer such queries at low cost without scanning the entire database. Our framework is based on an index-aware construction scheme for decision trees and random forests that transforms the inference phase of these classification models into a set of range queries, which in turn can be efficiently executed by leveraging multidimensional indexing structures. Our experiments show that queries over large data catalogs with hundreds of millions of objects can be processed in a few seconds using a single server, compared to hours needed by classical scanning-based approaches.
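
    The core transformation can be illustrated as follows: each root-to-leaf path of an axis-aligned decision tree constrains every split feature to an interval, so each positive leaf defines a hyper-rectangle, and classifying the whole catalog reduces to one range query per positive leaf. The sketch below extracts those hyper-rectangles from a hypothetical nested-dict tree representation; the multidimensional index that would actually execute the range queries is out of scope here.

```python
def positive_leaf_boxes(tree, n_features):
    """Turn an axis-aligned decision tree into the list of hyper-rectangles
    covered by its positive leaves; each box [(lo, hi), ...] can then be
    issued as a single range query against a multidimensional index.

    tree: hypothetical nested dict with 'feature'/'threshold'/'left'/'right'
    keys at internal nodes and a 'label' key at leaves.
    """
    def recurse(node, box):
        if "label" in node:                # leaf reached
            if node["label"] == 1:         # keep positive leaves only
                yield [tuple(b) for b in box]
            return
        f, t = node["feature"], node["threshold"]
        left = [list(b) for b in box]
        left[f][1] = min(left[f][1], t)    # left branch: x_f <= t
        yield from recurse(node["left"], left)
        right = [list(b) for b in box]
        right[f][0] = max(right[f][0], t)  # right branch: x_f > t
        yield from recurse(node["right"], right)
    start = [[float("-inf"), float("inf")] for _ in range(n_features)]
    return list(recurse(tree, start))
```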