Automatic Bayesian Density Analysis
Making sense of a dataset in an automatic and unsupervised fashion is a
challenging problem in statistics and AI. Classical approaches for exploratory
data analysis are usually not flexible enough to deal with the uncertainty
inherent to real-world data: they are often restricted to fixed latent
interaction models and homogeneous likelihoods; they are sensitive to missing,
corrupt and anomalous data; moreover, their expressiveness generally comes at
the price of intractable inference. As a result, supervision from statisticians
is usually needed to find the right model for the data. However, since domain
experts are not necessarily also experts in statistics, we propose Automatic
Bayesian Density Analysis (ABDA) to make exploratory data analysis accessible
at large. Specifically, ABDA allows for automatic and efficient missing value
estimation, statistical data type and likelihood discovery, anomaly detection
and dependency structure mining, on top of providing accurate density
estimation. Extensive empirical evidence shows that ABDA is a suitable tool for
automatic exploratory analysis of mixed continuous and discrete tabular data.
Comment: In proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19).
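To make the likelihood-discovery idea concrete, here is a minimal sketch that makes no claim about ABDA's actual implementation: it scores a few candidate likelihood families for a single data column with BIC, whereas ABDA itself places a Bayesian mixture over likelihood models per dimension. The function name and candidate set are illustrative assumptions.

```python
# A minimal sketch of per-column likelihood discovery in the spirit of ABDA
# (the paper uses Bayesian inference over likelihood mixtures; here a simpler
# BIC comparison is used purely for illustration).
import numpy as np
from scipy import stats

def discover_likelihood(x):
    """Score candidate likelihood families for a 1-D data column via BIC."""
    n = len(x)
    candidates = {}

    # Gaussian: 2 parameters (mean, std).
    mu, sigma = x.mean(), x.std()
    ll = stats.norm.logpdf(x, mu, sigma).sum()
    candidates["gaussian"] = 2 * np.log(n) - 2 * ll

    if (x > 0).all():
        # Gamma: 2 parameters (shape, scale); only valid on positive data.
        a, _, scale = stats.gamma.fit(x, floc=0)
        ll = stats.gamma.logpdf(x, a, loc=0, scale=scale).sum()
        candidates["gamma"] = 2 * np.log(n) - 2 * ll

    if np.allclose(x, np.round(x)) and (x >= 0).all():
        # Poisson: 1 parameter (rate); only valid on non-negative integers.
        lam = x.mean()
        ll = stats.poisson.logpmf(x.astype(int), lam).sum()
        candidates["poisson"] = 1 * np.log(n) - 2 * ll

    return min(candidates, key=candidates.get)  # lowest BIC wins

rng = np.random.default_rng(0)
print(discover_likelihood(rng.gamma(2.0, 3.0, size=500)))             # likely "gamma"
print(discover_likelihood(rng.poisson(4.0, size=500).astype(float)))  # likely "poisson"
```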
Bayesian Learning of Sum-Product Networks
Sum-product networks (SPNs) are flexible density estimators and have received
significant attention due to their attractive inference properties. While
parameter learning in SPNs is well developed, structure learning leaves
something to be desired: Even though there is a plethora of SPN structure
learners, most of them are somewhat ad-hoc and based on intuition rather than a
clear learning principle. In this paper, we introduce a well-principled
Bayesian framework for SPN structure learning. First, we decompose the problem
into i) laying out a computational graph, and ii) learning the so-called scope
function over the graph. The first is rather unproblematic and akin to neural
network architecture validation. The second represents the effective structure
of the SPN and needs to respect the usual structural constraints of SPNs, i.e.
completeness and decomposability. While representing and learning the scope
function is somewhat involved in general, in this paper, we propose a natural
parametrisation for an important and widely used special case of SPNs. These
structural parameters are incorporated into a Bayesian model, such that
simultaneous structure and parameter learning is cast into monolithic Bayesian
posterior inference. In various experiments, our Bayesian SPNs often improve
test likelihoods over greedy SPN learners. Further, since the Bayesian
framework protects against overfitting, we can evaluate hyper-parameters
directly on the Bayesian model score, waiving the need for a separate
validation set, which is especially beneficial in low data regimes. Bayesian
SPNs can be applied to heterogeneous domains and can easily be extended to
nonparametric formulations. Moreover, our Bayesian approach is the first that
consistently and robustly learns SPN structures under missing data.
Comment: NeurIPS 2019; see conference page for supplementary material.
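The two structural constraints named above are easy to state in code. The following is a minimal sketch, not the paper's implementation: Sum nodes assert completeness (all children share the same scope), Product nodes assert decomposability (children have pairwise disjoint scopes), and a valid SPN then yields an exact log-density in a single bottom-up pass. The class names are invented for illustration.

```python
# A minimal sketch of the SPN structural constraints:
# completeness (sum-node children share a scope) and
# decomposability (product-node children have disjoint scopes).
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

class Leaf:
    def __init__(self, var, mu, sigma):
        self.scope = {var}
        self.mu, self.sigma = mu, sigma
    def logpdf(self, x):
        return norm.logpdf(x[next(iter(self.scope))], self.mu, self.sigma)

class Sum:
    def __init__(self, weights, children):
        assert all(c.scope == children[0].scope for c in children), "completeness"
        self.scope = children[0].scope
        self.logw = np.log(weights)
        self.children = children
    def logpdf(self, x):
        return logsumexp([w + c.logpdf(x) for w, c in zip(self.logw, self.children)])

class Product:
    def __init__(self, children):
        scopes = [c.scope for c in children]
        assert sum(len(s) for s in scopes) == len(set().union(*scopes)), "decomposability"
        self.scope = set().union(*scopes)
        self.children = children
    def logpdf(self, x):
        return sum(c.logpdf(x) for c in self.children)

# A valid two-variable SPN: a mixture of two factorised components.
spn = Sum([0.3, 0.7], [
    Product([Leaf(0, -1.0, 1.0), Leaf(1, 0.0, 2.0)]),
    Product([Leaf(0,  2.0, 0.5), Leaf(1, 1.0, 1.0)]),
])
print(spn.logpdf({0: 0.5, 1: 1.2}))  # tractable exact log-density
```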
Bayesian Structure and Parameter Learning of Sum-Product Networks
Sum-product networks (SPNs) are graphical models capable of handling large amounts of multi-dimensional data. Unlike many other graphical models, SPNs are tractable if certain structural requirements are fulfilled; a model is called tractable if probabilistic inference can be performed in polynomial time with respect to the size of the model. The learning of SPNs can be separated
into two modes, parameter and structure learning. Many earlier approaches to SPN learning have
treated the two modes as separate, but it has been found that by alternating between these two
modes, good results can be achieved. One example of this kind of algorithm was presented by
Trapp et al. in the article Bayesian Learning of Sum-Product Networks (NeurIPS, 2019).
This thesis discusses SPNs and a Bayesian learning algorithm developed on the basis of the aforementioned algorithm, differing in some of the methods used. The algorithm by Trapp et al. uses Gibbs sampling in the parameter learning phase, whereas here Metropolis-Hastings MCMC is used. The
algorithm developed for this thesis was used in two experiments, with a small and simple SPN and
with a larger and more complex SPN. The effect of the data set size and the complexity of the data was also explored. The results were compared to those obtained by running the original algorithm developed by Trapp et al.
The results show that having more data in the learning phase makes the results more accurate, as it is easier for the model to spot patterns in a larger data set. It was also shown that the model was able to learn the parameters in the experiments if the data were simple enough, in other words, if each dimension of the data contained only one distribution. In the case of more complex data, where there were multiple distributions per dimension, the results showed that the computation struggled.
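For readers unfamiliar with the substitution the thesis makes, here is a minimal random-walk Metropolis-Hastings sketch for a single leaf parameter (a Gaussian mean with known unit variance). The prior, proposal scale, and function names are assumptions for illustration, not the thesis code.

```python
# An illustrative random-walk Metropolis-Hastings update for one leaf
# parameter, the kind of step substituted for Gibbs sampling.
import numpy as np
from scipy.stats import norm

def mh_sample_mean(x, n_iter=5000, prop_scale=0.5, prior_mu=0.0, prior_sd=10.0):
    rng = np.random.default_rng(1)
    mu = 0.0
    def log_post(m):
        # Unnormalised log-posterior: Gaussian prior + Gaussian likelihood.
        return norm.logpdf(m, prior_mu, prior_sd) + norm.logpdf(x, m, 1.0).sum()
    samples = []
    for _ in range(n_iter):
        prop = mu + rng.normal(0.0, prop_scale)  # symmetric proposal
        # Accept with probability min(1, posterior ratio); no Hastings
        # correction is needed because the proposal is symmetric.
        if np.log(rng.uniform()) < log_post(prop) - log_post(mu):
            mu = prop
        samples.append(mu)
    return np.array(samples)

data = np.random.default_rng(2).normal(3.0, 1.0, size=200)
draws = mh_sample_mean(data)
print(draws[1000:].mean())  # posterior mean estimate, close to 3.0
```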
PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming
Data cleaning is naturally framed as probabilistic inference in a generative
model, combining a prior distribution over ground-truth databases with a
likelihood that models the noisy channel by which the data are filtered,
corrupted, and joined to yield incomplete, dirty, and denormalized datasets.
Based on this view, we present PClean, a unified generative modeling
architecture for cleaning and normalizing dirty data in diverse domains. Given
an unclean dataset and a probabilistic program encoding relevant domain
knowledge, PClean learns a structured representation of the data as a
relational database of interrelated objects, and uses this latent structure to
impute missing values, identify duplicates, detect errors, and propose
corrections in the original data table. PClean makes three modeling and
inference contributions: (i) a domain-general non-parametric generative model
of relational data, for inferring latent objects and their network of latent
connections; (ii) a domain-specific probabilistic programming language, for
encoding domain knowledge specific to each dataset being cleaned; and (iii) a
domain-general inference engine that adapts to each PClean program by
constructing data-driven proposals used in sequential Monte Carlo and particle
Gibbs. We show empirically that short (< 50-line) PClean programs deliver
higher accuracy than state-of-the-art data cleaning systems based on machine
learning and weighted logic; that PClean's inference algorithm is faster than
generic particle Gibbs inference for probabilistic programs; and that PClean
scales to large real-world datasets with millions of rows.
Comment: Added references; revised abstract.
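The framing in the abstract's first sentence can be illustrated without PClean's actual language. The toy sketch below is not PClean syntax; the prior, error rate, and names are invented for illustration. It combines a prior over ground-truth values with a noisy-channel likelihood and recovers a posterior over clean values by exact enumeration, where PClean instead uses sequential Monte Carlo and particle Gibbs over relational latent structure.

```python
# A toy version of cleaning as inference in a generative model:
# prior over true values + noisy-channel likelihood, solved by enumeration.
import numpy as np

# Prior over the true city name (e.g. estimated from reference data).
prior = {"New York": 0.6, "Newark": 0.3, "New Haven": 0.1}

def edit_distance(a, b):
    # Standard Levenshtein distance via dynamic programming.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def noisy_channel(observed, truth, err=0.1):
    # Likelihood decays geometrically in the number of typos.
    return (1 - err) * err ** edit_distance(observed, truth)

def clean(observed):
    # Posterior over true values: prior * likelihood, renormalised.
    scores = {t: p * noisy_channel(observed, t) for t, p in prior.items()}
    z = sum(scores.values())
    return {t: s / z for t, s in scores.items()}

print(clean("New Yrok"))  # posterior mass concentrates on "New York"
```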
Tractable Probabilistic Graph Representation Learning with Graph-Induced Sum-Product Networks
We introduce Graph-Induced Sum-Product Networks (GSPNs), a new probabilistic
framework for graph representation learning that can tractably answer
probabilistic queries. Inspired by the computational trees induced by vertices
in the context of message-passing neural networks, we build hierarchies of
sum-product networks (SPNs) where the parameters of a parent SPN are learnable
transformations of the a-posteriori mixing probabilities of its children's sum
units. Due to weight sharing and the tree-shaped computation graphs of GSPNs,
we obtain the efficiency and efficacy of deep graph networks with the
additional advantages of a probabilistic model. We show the model's
competitiveness on scarce supervision scenarios, under missing data, and for
graph classification in comparison to popular neural models. We complement the
experiments with qualitative analyses on hyper-parameters and the model's
ability to answer probabilistic queries.
Comment: The 12th International Conference on Learning Representations (ICLR 2024).
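The parent-child coupling the abstract describes can be sketched in a few lines of numpy, under assumptions of our own (a linear map and small fixed shapes); the paper's exact parametrisation may differ. Each child sum unit exposes its a-posteriori mixing probabilities given the evidence, and a learnable transformation of their concatenation produces the parent's mixing weights.

```python
# A minimal sketch of a parent sum unit whose mixing weights are a learnable
# transformation of its children's a-posteriori mixing probabilities.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def posterior_mixing(log_weights, child_loglikes):
    # Responsibilities of a sum unit's components given the evidence.
    return softmax(log_weights + child_loglikes)

rng = np.random.default_rng(0)

# Two child SPNs (e.g. neighbouring vertices), each a 3-component sum unit;
# the child log-likelihoods are random stand-ins for real evidence.
child_post = [posterior_mixing(np.log([0.2, 0.5, 0.3]), rng.normal(size=3)),
              posterior_mixing(np.log([0.4, 0.4, 0.2]), rng.normal(size=3))]

# Learnable map from the children's posteriors to the parent's mixing weights.
W = rng.normal(size=(3, 6)) * 0.1
parent_weights = softmax(W @ np.concatenate(child_post))
print(parent_weights)  # a valid distribution over the parent's 3 components
```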