Search CORE

1,944 research outputs found

Deriving Probabilistic Databases with Inference Ensembles

Author: Davidson Susan B
Milo Tova
Stoyanovich Julia
Tannen Val
Publication venue: ScholarlyCommons
Publication date: 01/01/2011
Field of study

Many real-world applications deal with uncertain or missing data, prompting a surge of activity in the area of probabilistic databases. A shortcoming of prior work is the assumption that an appropriate probabilistic model, along with the necessary probability distributions, is given. We address this shortcoming by presenting a framework for learning a set of inference ensembles, termed meta-rule semi-lattices, or MRSL, from the complete portion of the data. We use the MRSL to infer probability distributions for missing data, and demonstrate experimentally that high accuracy is achieved when a single attribute value is missing per tuple. We next propose an inference algorithm based on Gibbs sampling that accurately predicts the probability distribution for multiple missing values. We also develop an optimization that greatly improves performance of multi-attribute inference for collections of tuples, while maintaining high accuracy. Finally, we develop an experimental framework to evaluate the efficiency and accuracy of our approach

Crossref

ScholarlyCommons@Penn

Learning Tuple Probabilities

Author: Dylla Maximilian
Theobald Martin
Publication venue
Publication date: 01/01/2016
Field of study

Learning the parameters of complex probabilistic-relational models from labeled training data is a standard technique in machine learning, which has been intensively studied in the subfield of Statistical Relational Learning (SRL), but---so far---this is still an under-investigated topic in the context of Probabilistic Databases (PDBs). In this paper, we focus on learning the probability values of base tuples in a PDB from labeled lineage formulas. The resulting learning problem can be viewed as the inverse problem to confidence computations in PDBs: given a set of labeled query answers, learn the probability values of the base tuples, such that the marginal probabilities of the query answers again yield in the assigned probability labels. We analyze the learning problem from a theoretical perspective, cast it into an optimization problem, and provide an algorithm based on stochastic gradient descent. Finally, we conclude by an experimental evaluation on three real-world and one synthetic dataset, thus comparing our approach to various techniques from SRL, reasoning in information extraction, and optimization

arXiv.org e-Print Archive

Open Repository and Bibliography - Luxembourg

Defining and characterising structural uncertainty in decision analytic models

Author: Karl Claxton
Laura Bojke
Mark Sculpher
Stephen Palmer
Publication venue
Publication date
Field of study

An inappropriate structure for a decision analytic model can potentially invalidate estimates of cost-effectiveness and estimates of the value of further research. However, there are often a number of alternative and credible structural assumptions which can be made. Although it is common practice to acknowledge potential limitations in model structure, there is a lack of clarity about methods to characterize the uncertainty surrounding alternative structural assumptions and their contribution to decision uncertainty. A review of decision models commissioned by the NHS Health Technology Programme was undertaken to identify the types of model uncertainties described in the literature. A second review was undertaken to identify approaches to characterise these uncertainties. The assessment of structural uncertainty has received little attention in the health economics literature. A common method to characterise structural uncertainty is to compute results for each alternative model specification, and to present alternative results as scenario analyses. It is then left to decision maker to assess the credibility of the alternative structures in interpreting the range of results. The review of methods to explicitly characterise structural uncertainty identified two methods: 1) model averaging, where alternative models, with different specifications, are built, and their results averaged, using explicit prior distributions often based on expert opinion and 2) Model selection on the basis of prediction performance or goodness of fit. For a number of reasons these methods are neither appropriate nor desirable methods to characterize structural uncertainty in decision analytic models. When faced with a choice between multiple models, another method can be employed which allows structural uncertainty to be explicitly considered and does not ignore potentially relevant model structures. Uncertainty can be directly characterised (or parameterised) in the model itself. This method is analogous to model averaging on individual or sets of model inputs, but also allows the value of information associated with structural uncertainties to be resolved.

Research Papers in Economics

Learning in imbalanced relational data

Author: Ghanem Amal S.
Venkatesh Svetha
West Geoff
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2008
Field of study

Traditional learning techniques learn from flat data files with the assumption that each class has a similar number of examples. However, the majority of real-world data are stored as relational systems with imbalanced data distribution, where one class of data is over-represented as compared with other classes. We propose to extend a relational learning technique called Probabilistic Relational Models (PRMs) to deal with the imbalanced class problem. We address learning from imbalanced relational data using an ensemble of PRMs and propose a new model: the PRMs-IM. We show the performance of PRMs-IM on a real university relational database to identify students at risk

Deakin Research Online

Crossref

espace@Curtin

Time Series Cluster Kernel for Learning Similarities between Multivariate Time Series with Missing Data

Author: Bianchi Filippo Maria
Jenssen Robert
Mikalsen Karl Øyvind
Soguero-Ruiz Cristina
Publication venue
Publication date: 01/01/2017
Field of study

Similarity-based approaches represent a promising direction for time series analysis. However, many such methods rely on parameter tuning, and some have shortcomings if the time series are multivariate (MTS), due to dependencies between attributes, or the time series contain missing data. In this paper, we address these challenges within the powerful context of kernel methods by proposing the robust \emph{time series cluster kernel} (TCK). The approach taken leverages the missing data handling properties of Gaussian mixture models (GMM) augmented with informative prior distributions. An ensemble learning approach is exploited to ensure robustness to parameters by combining the clustering results of many GMM to form the final kernel. We evaluate the TCK on synthetic and real data and compare to other state-of-the-art techniques. The experimental results demonstrate that the TCK is robust to parameter choices, provides competitive results for MTS without missing data and outstanding results for missing data.Comment: 23 pages, 6 figure

arXiv.org e-Print Archive

Munin - Open Research Archive

NORA - Norwegian Open Research Archives

Fast Search-By-Classification for Large-Scale Databases Using Index-Aware Decision Trees and Random Forests

Author: Gieseke Fabian
Lülf Christian
Martins Denis Mayr Lima
Salles Marcos Antonio Vaz
Zhou Yongluan
Publication venue
Publication date: 31/07/2023
Field of study

The vast amounts of data collected in various domains pose great challenges to modern data exploration and analysis. To find "interesting" objects in large databases, users typically define a query using positive and negative example objects and train a classification model to identify the objects of interest in the entire data catalog. However, this approach requires a scan of all the data to apply the classification model to each instance in the data catalog, making this method prohibitively expensive to be employed in large-scale databases serving many users and queries interactively. In this work, we propose a novel framework for such search-by-classification scenarios that allows users to interactively search for target objects by specifying queries through a small set of positive and negative examples. Unlike previous approaches, our framework can rapidly answer such queries at low cost without scanning the entire database. Our framework is based on an index-aware construction scheme for decision trees and random forests that transforms the inference phase of these classification models into a set of range queries, which in turn can be efficiently executed by leveraging multidimensional indexing structures. Our experiments show that queries over large data catalogs with hundreds of millions of objects can be processed in a few seconds using a single server, compared to hours needed by classical scanning-based approaches

arXiv.org e-Print Archive