Deriving Probabilistic Databases with Inference Ensembles
Many real-world applications deal with uncertain or missing data, prompting a surge of activity in the area of probabilistic databases. A shortcoming of prior work is the assumption that an appropriate probabilistic model, along with the necessary probability distributions, is given. We address this shortcoming by presenting a framework for learning a set of inference ensembles, termed meta-rule semi-lattices, or MRSL, from the complete portion of the data. We use the MRSL to infer probability distributions for missing data, and demonstrate experimentally that high accuracy is achieved when a single attribute value is missing per tuple. We next propose an inference algorithm based on Gibbs sampling that accurately predicts the probability distribution for multiple missing values. We also develop an optimization that greatly improves the performance of multi-attribute inference for collections of tuples, while maintaining high accuracy. Finally, we develop an experimental framework to evaluate the efficiency and accuracy of our approach.
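As a rough illustration of the Gibbs sampling idea mentioned in this abstract, the sketch below estimates the joint distribution of two missing binary attributes by repeatedly resampling each one given the other. It is not the paper's algorithm: the table `JOINT` and the helper `conditional` are hypothetical stand-ins for the conditional distributions that a learned model such as the MRSL would supply.

```python
import random
from collections import Counter

# Hypothetical "true" joint over two missing binary attributes (a, b).
# It is used only to derive the conditionals that the sampler consumes.
JOINT = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}


def conditional(position, other_value):
    """Return P(attribute at `position` = 1 | other attribute = other_value)."""
    if position == 0:
        p0, p1 = JOINT[(0, other_value)], JOINT[(1, other_value)]
    else:
        p0, p1 = JOINT[(other_value, 0)], JOINT[(other_value, 1)]
    return p1 / (p0 + p1)


def gibbs(n_samples=20000, burn_in=1000, seed=0):
    """Estimate the joint of (a, b) by alternately resampling each attribute."""
    rng = random.Random(seed)
    a, b = 0, 0  # arbitrary starting assignment
    counts = Counter()
    for step in range(burn_in + n_samples):
        a = 1 if rng.random() < conditional(0, b) else 0
        b = 1 if rng.random() < conditional(1, a) else 0
        if step >= burn_in:  # discard early samples
            counts[(a, b)] += 1
    return {key: counts[key] / n_samples for key in JOINT}


estimate = gibbs()
```

With enough samples, `estimate` should approach the values in `JOINT`; the sampler itself only ever queries the single-attribute conditionals.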
Learning Tuple Probabilities
Learning the parameters of complex probabilistic-relational models from
labeled training data is a standard technique in machine learning, which has
been intensively studied in the subfield of Statistical Relational Learning
(SRL), but so far this is still an under-investigated topic in the context
of Probabilistic Databases (PDBs). In this paper, we focus on learning the
probability values of base tuples in a PDB from labeled lineage formulas. The
resulting learning problem can be viewed as the inverse problem to confidence
computations in PDBs: given a set of labeled query answers, learn the
probability values of the base tuples, such that the marginal probabilities of
the query answers again yield the assigned probability labels. We analyze
the learning problem from a theoretical perspective, cast it into an
optimization problem, and provide an algorithm based on stochastic gradient
descent. Finally, we conclude with an experimental evaluation on three real-world
datasets and one synthetic dataset, comparing our approach to various techniques
from SRL, reasoning in information extraction, and optimization.
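The inverse problem described in this abstract can be sketched in a few lines under strong simplifying assumptions. The example below assumes independent base tuples and lineage formulas that are either a pure conjunction or a pure disjunction, so that marginals have a closed form; general lineage is much harder to evaluate. The squared-error loss, the sigmoid parameterisation, and all names (`marginal`, `learn_tuple_probs`, `EXAMPLES`) are illustrative choices, not taken from the paper.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def marginal(kind, ids, p):
    """Marginal of an AND/OR lineage over independent base tuples."""
    if kind == "and":
        return math.prod(p[i] for i in ids)
    return 1.0 - math.prod(1.0 - p[i] for i in ids)


def learn_tuple_probs(examples, n_tuples, lr=0.5, epochs=2000, seed=0):
    """Fit base-tuple probabilities to labelled lineages with SGD."""
    rng = random.Random(seed)
    data = list(examples)       # copy so the caller's list is not reordered
    theta = [0.0] * n_tuples    # p_i = sigmoid(theta_i) stays inside (0, 1)
    for _ in range(epochs):
        rng.shuffle(data)
        for kind, ids, label in data:
            p = [sigmoid(t) for t in theta]
            err = marginal(kind, ids, p) - label
            for i in ids:
                others = [j for j in ids if j != i]
                # Partial derivative of the marginal with respect to p_i.
                if kind == "and":
                    d_marg = math.prod(p[j] for j in others)
                else:
                    d_marg = math.prod(1.0 - p[j] for j in others)
                # Chain rule: squared error -> marginal -> p_i -> theta_i.
                theta[i] -= lr * 2.0 * err * d_marg * p[i] * (1.0 - p[i])
    return [sigmoid(t) for t in theta]


# Hypothetical labelled query answers: (lineage kind, base tuple ids, label).
EXAMPLES = [("and", [0], 0.8), ("and", [0, 1], 0.4), ("or", [0, 1], 0.9)]
probs = learn_tuple_probs(EXAMPLES, n_tuples=2)
```

The three labels here are mutually consistent, so the learned values should reproduce each label when plugged back into `marginal`.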
Defining and characterising structural uncertainty in decision analytic models
An inappropriate structure for a decision analytic model can potentially invalidate estimates of cost-effectiveness and estimates of the value of further research. However, there are often a number of alternative and credible structural assumptions which can be made. Although it is common practice to acknowledge potential limitations in model structure, there is a lack of clarity about methods to characterise the uncertainty surrounding alternative structural assumptions and their contribution to decision uncertainty. A review of decision models commissioned by the NHS Health Technology Programme was undertaken to identify the types of model uncertainties described in the literature. A second review was undertaken to identify approaches to characterise these uncertainties. The assessment of structural uncertainty has received little attention in the health economics literature. A common method to characterise structural uncertainty is to compute results for each alternative model specification, and to present alternative results as scenario analyses. It is then left to the decision maker to assess the credibility of the alternative structures in interpreting the range of results. The review of methods to explicitly characterise structural uncertainty identified two methods: 1) model averaging, where alternative models, with different specifications, are built, and their results averaged, using explicit prior distributions often based on expert opinion, and 2) model selection on the basis of prediction performance or goodness of fit. For a number of reasons these methods are neither appropriate nor desirable methods to characterise structural uncertainty in decision analytic models. When faced with a choice between multiple models, another method can be employed which allows structural uncertainty to be explicitly considered and does not ignore potentially relevant model structures. Uncertainty can be directly characterised (or parameterised) in the model itself. This method is analogous to model averaging on individual or sets of model inputs, but also allows the value of information associated with structural uncertainties to be resolved.
Learning in imbalanced relational data
Traditional learning techniques learn from flat data files with the assumption that each class has a similar number of examples. However, the majority of real-world data are stored as relational systems with imbalanced data distribution, where one class of data is over-represented as compared with other classes. We propose to extend a relational learning technique called Probabilistic Relational Models (PRMs) to deal with the imbalanced class problem. We address learning from imbalanced relational data using an ensemble of PRMs and propose a new model: the PRMs-IM. We show the performance of PRMs-IM on a real university relational database to identify students at risk.
Time Series Cluster Kernel for Learning Similarities between Multivariate Time Series with Missing Data
Similarity-based approaches represent a promising direction for time series
analysis. However, many such methods rely on parameter tuning, and some have
shortcomings if the time series are multivariate (MTS), due to dependencies
between attributes, or the time series contain missing data. In this paper, we
address these challenges within the powerful context of kernel methods by
proposing the robust time series cluster kernel (TCK). The approach
taken leverages the missing data handling properties of Gaussian mixture models
(GMMs) augmented with informative prior distributions. An ensemble learning
approach is exploited to ensure robustness to parameters by combining the
clustering results of many GMMs to form the final kernel.
We evaluate the TCK on synthetic and real data and compare to other
state-of-the-art techniques. The experimental results demonstrate that the TCK
is robust to parameter choices, provides competitive results for MTS without
missing data and outstanding results for missing data. Comment: 23 pages, 6 figures
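One plausible way to combine the clustering results of many mixture models into a kernel is to sum, over ensemble members, the inner products of the soft cluster assignments of each pair of series. The sketch below assumes that combination rule and uses hand-written, hypothetical posteriors in `POSTERIORS`; fitting the GMMs and handling missing values are left out entirely.

```python
def ensemble_kernel(posteriors_per_model):
    """Sum inner products of soft cluster assignments over all models.

    posteriors_per_model[q][i] is the posterior cluster-membership vector
    of series i under ensemble member q (models may differ in cluster count).
    """
    n = len(posteriors_per_model[0])
    kernel = [[0.0] * n for _ in range(n)]
    for post in posteriors_per_model:
        for i in range(n):
            for j in range(n):
                kernel[i][j] += sum(a * b for a, b in zip(post[i], post[j]))
    return kernel


# Hypothetical soft assignments of three series under two mixture models.
POSTERIORS = [
    [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]],
    [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]],
]
K = ensemble_kernel(POSTERIORS)
```

Series that tend to fall in the same clusters across ensemble members (here the first two) receive a larger kernel value than series that are usually separated.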
Fast Search-By-Classification for Large-Scale Databases Using Index-Aware Decision Trees and Random Forests
The vast amounts of data collected in various domains pose great challenges
to modern data exploration and analysis. To find "interesting" objects in large
databases, users typically define a query using positive and negative example
objects and train a classification model to identify the objects of interest in
the entire data catalog. However, this approach requires a scan of all the data
to apply the classification model to each instance in the data catalog, making
this method prohibitively expensive to be employed in large-scale databases
serving many users and queries interactively. In this work, we propose a novel
framework for such search-by-classification scenarios that allows users to
interactively search for target objects by specifying queries through a small
set of positive and negative examples. Unlike previous approaches, our
framework can rapidly answer such queries at low cost without scanning the
entire database. Our framework is based on an index-aware construction scheme
for decision trees and random forests that transforms the inference phase of
these classification models into a set of range queries, which in turn can be
efficiently executed by leveraging multidimensional indexing structures. Our
experiments show that queries over large data catalogs with hundreds of
millions of objects can be processed in a few seconds using a single server,
compared to hours needed by classical scanning-based approaches.
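The step of turning tree inference into range queries can be illustrated with a small sketch: every leaf of an axis-aligned decision tree corresponds to a hyper-rectangle, so the positive leaves define a set of boxes that an index could answer as range queries. The tuple-based tree format, `TREE`, and the helper names below are hypothetical, and the sketch covers only the leaf-to-box conversion, not the index-aware construction scheme described in the abstract.

```python
import math

# Hypothetical hand-written tree. Internal nodes are
# (feature_index, threshold, left_subtree, right_subtree), where the left
# branch is taken when x[feature_index] <= threshold. Leaves are labels 0 or 1.
TREE = (0, 5.0,
        (1, 2.0, 0, 1),
        (1, 7.0, 1, 0))


def predict(node, x):
    """Classify x by walking the tree from the root to a leaf."""
    while isinstance(node, tuple):
        feat, thr, left, right = node
        node = left if x[feat] <= thr else right
    return node


def positive_boxes(node, n_features, box=None):
    """Collect one box per positive leaf as a list of (low, high] intervals."""
    if box is None:
        box = [(-math.inf, math.inf)] * n_features
    if not isinstance(node, tuple):
        return [list(box)] if node == 1 else []
    feat, thr, left, right = node
    low, high = box[feat]
    left_box = list(box)
    left_box[feat] = (low, min(high, thr))    # x[feat] <= thr
    right_box = list(box)
    right_box[feat] = (max(low, thr), high)   # x[feat] > thr
    return (positive_boxes(left, n_features, left_box)
            + positive_boxes(right, n_features, right_box))


def in_box(x, box):
    """Range-query membership test: low < value <= high on every axis."""
    return all(low < value <= high for value, (low, high) in zip(x, box))


BOXES = positive_boxes(TREE, n_features=2)
```

A point is classified as positive by `predict` exactly when it falls inside one of `BOXES`, which is what lets a multidimensional index stand in for a full scan.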