108 research outputs found
Learning Logistic Circuits
This paper proposes a new classification model called logistic circuits. On
MNIST and Fashion datasets, our learning algorithm outperforms neural networks
that have an order of magnitude more parameters. Yet, logistic circuits have a
distinct origin in symbolic AI, forming a discriminative counterpart to
probabilistic-logical circuits such as ACs, SPNs, and PSDDs. We show that
parameter learning for logistic circuits is convex optimization, and that a
simple local search algorithm can induce strong model structures from data.
Comment: Published in the Proceedings of the Thirty-Third AAAI Conference on
Artificial Intelligence (AAAI-19)
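The convexity claim above can be illustrated with a small sketch: once the circuit structure is fixed, each example maps to a vector of circuit-flow features, and the parameters are fit by ordinary (convex) logistic regression over those features. The `circuit_flows` feature map below is a hypothetical stand-in for the structure-dependent computation, not the authors' implementation.

```python
# Minimal sketch, assuming a fixed circuit structure whose feed-forward pass
# yields one 0/1 "flow" feature per parameterised wire; parameter learning
# then reduces to plain logistic regression over those features.
import numpy as np
from sklearn.linear_model import LogisticRegression

def circuit_flows(X):
    """Hypothetical feature map: one feature per circuit wire.

    In a real logistic circuit these are flow indicators computed by a single
    pass over the circuit; here we fake them with a fixed random binary
    projection purely to keep the sketch runnable.
    """
    rng = np.random.default_rng(0)
    W = rng.integers(0, 2, size=(X.shape[1], 64))
    return (X @ W > 0).astype(float)

X = np.random.default_rng(1).integers(0, 2, size=(200, 28 * 28))
y = (X.sum(axis=1) > X.shape[1] // 2).astype(int)   # toy labels

clf = LogisticRegression(max_iter=1000).fit(circuit_flows(X), y)
print("training accuracy:", clf.score(circuit_flows(X), y))
```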
Joints in Random Forests
Decision Trees (DTs) and Random Forests (RFs) are powerful discriminative learners and tools of central importance to the everyday machine learning practitioner and data scientist. Due to their discriminative nature, however, they lack principled methods to process inputs with missing features or to detect outliers, which requires pairing them with imputation techniques or a separate generative model. In this paper, we demonstrate that DTs and RFs can naturally be interpreted as generative models, by drawing a connection to Probabilistic Circuits, a prominent class of tractable probabilistic models. This reinterpretation equips them with a full joint distribution over the feature space and leads to Generative Decision Trees (GeDTs) and Generative Forests (GeFs), a family of novel hybrid generative-discriminative models. This family of models retains the overall characteristics of DTs and RFs while additionally being able to handle missing features by means of marginalisation. Under certain assumptions, frequently made for Bayes consistency results, we show that consistency in GeDTs and GeFs extends to any pattern of missing input features, provided they are missing at random. Empirically, we show that our models often outperform common routines to treat missing data, such as K-nearest neighbour imputation, and moreover, that our models can naturally detect outliers by monitoring the marginal probability of input features.
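One way to picture the marginalisation described above is to read each decision node as a sum node whose weights are the fractions of training data routed left and right; a query with a missing split feature then mixes both subtrees instead of following a single branch. The sketch below is a simplification under that reading (real GeFs also place full joint densities at the leaves); the tree and numbers are invented for illustration.

```python
# Minimal sketch, not the authors' code: a decision tree read as a generative
# model, where a missing split feature is marginalised by mixing both children.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None      # split feature index, None for a leaf
    threshold: float = 0.0
    w_left: float = 0.5                # fraction of training data going left
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    class_probs: Optional[list] = None # class distribution at a leaf

def predict_proba(node, x):
    """Class distribution for x, where missing features are None."""
    if node.class_probs is not None:
        return node.class_probs
    v = x[node.feature]
    if v is None:  # missing: marginalise by mixing both children
        left = predict_proba(node.left, x)
        right = predict_proba(node.right, x)
        w = node.w_left
        return [w * l + (1 - w) * r for l, r in zip(left, right)]
    child = node.left if v <= node.threshold else node.right
    return predict_proba(child, x)

# Tiny hand-built tree: split on feature 0, then on feature 1.
tree = Node(feature=0, threshold=0.5, w_left=0.6,
            left=Node(class_probs=[0.9, 0.1]),
            right=Node(feature=1, threshold=0.0, w_left=0.3,
                       left=Node(class_probs=[0.2, 0.8]),
                       right=Node(class_probs=[0.5, 0.5])))
print(predict_proba(tree, [None, 1.0]))   # feature 0 missing
```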
Conditional Sum-Product Networks: Imposing Structure on Deep Probabilistic Architectures
Probabilistic graphical models are a central tool in AI; however, they are
generally not as expressive as deep neural models, and inference is notoriously
hard and slow. In contrast, deep probabilistic models such as sum-product
networks (SPNs) capture joint distributions in a tractable fashion, but still
lack the expressive power of intractable models based on deep neural networks.
Therefore, we introduce conditional SPNs (CSPNs), conditional density
estimators for multivariate and potentially hybrid domains which allow
harnessing the expressive power of neural networks while still maintaining
tractability guarantees. One way to implement CSPNs is to use an existing SPN
structure and condition its parameters on the input, e.g., via a deep neural
network. This approach, however, might misrepresent the conditional
independence structure present in data. Consequently, we also develop a
structure-learning approach that derives both the structure and parameters of
CSPNs from data. Our experimental evidence demonstrates that CSPNs are
competitive with other probabilistic models and yield superior performance on
multilabel image classification compared to mean field and mixture density
networks. Furthermore, they can successfully be employed as building blocks for
structured probabilistic models, such as autoregressive image models.
Comment: 13 pages, 6 figures
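A minimal sketch of the "condition the parameters on the input" variant mentioned above: a fixed sum-product structure over two target variables whose sum-node weights and Gaussian leaf parameters are produced from the input by a small (here untrained, randomly initialised) neural network. All names and shapes are illustrative assumptions, not the CSPN implementation.

```python
# Sketch, assuming a toy gating network that maps x to the parameters of a
# fixed sum-product structure over two target variables (y1, y2).
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 16)), rng.normal(size=(16, 10))

def gating_network(x):
    h = np.tanh(x @ W1)
    out = h @ W2
    # 2 mixture weights + 4 means + 4 log-stds (2 components x 2 leaves)
    weights = np.exp(out[:2]) / np.exp(out[:2]).sum()
    means, log_stds = out[2:6].reshape(2, 2), out[6:10].reshape(2, 2)
    return weights, means, log_stds

def log_gauss(y, mean, log_std):
    return -0.5 * ((y - mean) / np.exp(log_std)) ** 2 - log_std - 0.5 * np.log(2 * np.pi)

def cspn_log_density(y, x):
    """log p(y1, y2 | x): a sum node over two product nodes of Gaussian leaves."""
    weights, means, log_stds = gating_network(x)
    comp = [np.log(weights[k]) + log_gauss(y[0], means[k, 0], log_stds[k, 0])
                               + log_gauss(y[1], means[k, 1], log_stds[k, 1])
            for k in range(2)]
    return np.logaddexp(comp[0], comp[1])

print(cspn_log_density(np.array([0.1, -0.3]), np.array([1.0, 0.0, 0.5, -1.0])))
```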
Group Fairness by Probabilistic Modeling with Latent Fair Decisions
Machine learning systems are increasingly being used to make impactful
decisions such as loan applications and criminal justice risk assessments, and
as such, ensuring fairness of these systems is critical. This is often
challenging as the labels in the data are biased. This paper studies learning
fair probability distributions from biased data by explicitly modeling a latent
variable that represents a hidden, unbiased label. In particular, we aim to
achieve demographic parity by enforcing certain independencies in the learned
model. We also show that group fairness guarantees are meaningful only if the
distribution used to provide those guarantees indeed captures the real-world
data. In order to closely model the data distribution, we employ probabilistic
circuits, an expressive and tractable probabilistic model, and propose an
algorithm to learn them from incomplete data. We evaluate our approach on a
synthetic dataset in which observed labels indeed come from fair labels but
with added bias, and demonstrate that the fair labels are successfully
retrieved. Moreover, we show on real-world datasets that our approach is not
only a better model of how the data was generated than existing methods, but
also achieves competitive accuracy.
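The role of the enforced independencies can be seen in a toy joint distribution: if the latent fair decision Df is modelled as independent of the sensitive attribute S, then P(Df=1 | S) is identical across groups by construction, while the observed biased label D need not satisfy parity. The numbers below are invented purely for illustration.

```python
# Toy numeric sketch of the independence behind demographic parity.
import itertools

p_S = {0: 0.5, 1: 0.5}
p_Df = {0: 0.7, 1: 0.3}                       # fair decision, independent of S
p_D_given = {(s, df): ({df: 0.9, 1 - df: 0.1} if s == 0 else {df: 0.7, 1 - df: 0.3})
             for s in (0, 1) for df in (0, 1)}  # group-dependent observation bias

def joint(s, df, d):
    return p_S[s] * p_Df[df] * p_D_given[(s, df)][d]

def conditional(var, value, s):
    num = sum(joint(s, df, d) for df, d in itertools.product((0, 1), repeat=2)
              if (df if var == "Df" else d) == value)
    den = sum(joint(s, df, d) for df, d in itertools.product((0, 1), repeat=2))
    return num / den

for s in (0, 1):
    print(f"P(Df=1 | S={s}) = {conditional('Df', 1, s):.3f}   "
          f"P(D=1 | S={s}) = {conditional('D', 1, s):.3f}")
```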
The Circle of Meaning: From Translation to Paraphrasing and Back
The preservation of meaning between inputs and outputs is perhaps
the most ambitious and, often, the most elusive goal of systems
that attempt to process natural language. Nowhere is this goal of
more obvious importance than for the tasks of machine translation
and paraphrase generation. Preserving meaning between the input and
the output is paramount for both, the monolingual vs bilingual distinction
notwithstanding. In this thesis, I present a novel, symbiotic relationship
between these two tasks that I term the "circle of meaning".
Today's statistical machine translation (SMT) systems require high
quality human translations for parameter tuning, in addition to
large bi-texts for learning the translation units. This parameter
tuning usually involves generating translations at different points
in the parameter space and obtaining feedback against human-authored
reference translations as to how good the translations are. This feedback
then dictates what point in the parameter space should be explored
next. To measure this feedback, it is generally considered wise to have
multiple (usually 4) reference translations to avoid unfair penalization of
translation hypotheses, which could easily happen given the large number of
ways in which a sentence can be translated from one language to another.
However, this reliance on multiple reference translations creates a problem
since they are labor-intensive and expensive to obtain.
Therefore, most current MT datasets only contain a single reference.
This leads to the problem of reference sparsity---the primary open problem
that I address in this dissertation---one that has a serious effect on the
SMT parameter tuning process.
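The tuning loop just described can be sketched generically (this is a stand-in for the process, not for MERT or any specific tuner): candidate weight vectors are proposed, the decoder is run with each, the outputs are scored against the references, and the best-scoring point seeds the next round. `decode` and `bleu` below are hypothetical placeholders for a real decoder and a real metric.

```python
# Minimal sketch of metric-driven parameter tuning with placeholder components.
import random

def decode(sources, weights):
    """Stand-in decoder: returns one translation string per source sentence."""
    return [f"translation-of({s})" for s in sources]

def bleu(hypotheses, references):
    """Stand-in metric: fraction of hypotheses matching any of their references."""
    return sum(h in refs for h, refs in zip(hypotheses, references)) / len(hypotheses)

def tune(sources, references, dim=8, rounds=20, step=0.1):
    weights = [0.0] * dim
    best = bleu(decode(sources, weights), references)
    for _ in range(rounds):
        candidate = [w + random.uniform(-step, step) for w in weights]
        score = bleu(decode(sources, candidate), references)
        if score >= best:               # the feedback decides where to move next
            weights, best = candidate, score
    return weights, best

srcs = ["la maison bleue", "un petit chat"]
refs = [["the blue house", "translation-of(la maison bleue)"], ["a small cat"]]
print(tune(srcs, refs))
```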
Bannard and Callison-Burch (2005) were the first to provide a practical
connection between phrase-based statistical machine translation and paraphrase
generation. However, their technique is restricted to generating phrasal
paraphrases. I build upon their approach and augment a phrasal paraphrase
extractor into a sentential paraphraser with extremely broad coverage.
The novelty in this augmentation lies in the further strengthening of
the connection between statistical machine translation and paraphrase
generation; whereas Bannard and Callison-Burch only relied on SMT machinery
to extract phrasal paraphrase rules and stopped there, I take it a few
steps further and build a full English-to-English SMT system. This system
can, as expected, ``translate'' any English input sentence into a new English
sentence with the same degree of meaning preservation that exists in a bilingual
SMT system. In fact, being a state-of-the-art SMT system, it is able to generate
n-best "translations" for any given input sentence. This sentential
paraphraser, built almost entirely from existing SMT machinery, represents
the first 180 degrees of the circle of meaning.
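The pivoting idea underlying the phrasal paraphrase extractor can be written down compactly: the paraphrase probability of English phrase e2 given e1 marginalises over foreign pivot phrases f, p(e2 | e1) = sum_f p(e2 | f) p(f | e1), using the two phrase tables of an ordinary bilingual SMT system. The sketch below uses tiny invented tables just to show the computation.

```python
# Sketch of phrase-pivoting paraphrase probabilities over toy phrase tables.
from collections import defaultdict

p_f_given_e = {"under control": {"unter kontrolle": 0.8, "im griff": 0.2}}
p_e_given_f = {"unter kontrolle": {"under control": 0.7, "in check": 0.3},
               "im griff":        {"under control": 0.5, "in hand": 0.5}}

def paraphrase_probs(e1):
    probs = defaultdict(float)
    for f, p_fe in p_f_given_e.get(e1, {}).items():      # pivot through f
        for e2, p_ef in p_e_given_f.get(f, {}).items():
            if e2 != e1:
                probs[e2] += p_ef * p_fe
    return dict(probs)

print(paraphrase_probs("under control"))   # {'in check': 0.24, 'in hand': 0.1}
```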
To complete the circle, I describe a novel connection in the other direction.
I claim that the sentential paraphraser, once built in this fashion, can
provide a solution to the reference sparsity problem and, hence, be used
to improve the performance of a bilingual SMT system. I discuss two different
instantiations of the sentential paraphraser and show several results that
provide empirical validation for this connection.
Tractable probabilistic models for causal learning and reasoning
This thesis examines the application of tractable probabilistic modelling principles to causal learning and reasoning. Tractable probabilistic modelling is a promising paradigm that has emerged in recent years, which focuses on probabilistic models that enable exact and efficient probabilistic reasoning. In particular, the framework of probabilistic circuits provides a systematic language for characterizing the tractability of models for various inference queries based on their structural properties, with recent proposals pushing the boundaries of expressiveness and tractability.
However, not all information about a system can be captured through a probability distribution over observed variables; for example, the causal direction between two variables can be indistinguishable from data alone. Formalizing this, Pearl’s Causal Hierarchy (also known as the information hierarchy) delineates three levels of causal queries, namely associational, interventional, and counterfactual, that require increasingly greater knowledge of the underlying causal system, represented by a structural causal model and associated causal diagram.
Motivated by this, we investigate the possibility of tractable causal modelling; that is, exact and efficient reasoning with respect to classes of causal queries. In particular, we identify three scenarios, separated by the amount of knowledge available to the modeler: when the full causal diagram/model is available, when only the observational distribution and an identifiable causal estimand are available, and when there is additionally uncertainty over the causal diagram. In each scenario, we propose probabilistic circuit representations, structural properties, and algorithms that enable efficient and exact causal reasoning. These models are distinguished from tractable probabilistic models in that they can answer not only different probabilistic inference queries, but also causal queries involving different interventions and even different causal diagrams. However, we also identify key limitations that cast doubt on the existence of a fully general tractable causal model. Our contributions also extend the theory of probabilistic circuits by proposing new properties and circuit architectures, which enable the analysis of advanced inference queries including, but not limited to, causal inference estimands.
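To make the notion of tractability concrete, the sketch below evaluates a toy probabilistic circuit bottom-up: product nodes multiply child values, sum nodes take weighted sums, and marginalising a variable amounts to setting its leaf indicators to 1, so any marginal query costs one linear pass over the circuit. The circuit is a made-up example, not one from the thesis.

```python
# Minimal sketch of exact marginal inference in a toy probabilistic circuit.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PCNode:
    kind: str                       # "leaf", "sum", or "product"
    var: Optional[int] = None       # for leaves: variable index
    value: Optional[int] = None     # for leaves: value the indicator tests
    weights: list = field(default_factory=list)
    children: list = field(default_factory=list)

def evaluate(node, assignment):
    """assignment maps variable index -> value; missing vars are marginalised."""
    if node.kind == "leaf":
        if node.var not in assignment:
            return 1.0                                  # marginalised indicator
        return 1.0 if assignment[node.var] == node.value else 0.0
    child_vals = [evaluate(c, assignment) for c in node.children]
    if node.kind == "product":
        out = 1.0
        for v in child_vals:
            out *= v
        return out
    return sum(w * v for w, v in zip(node.weights, child_vals))   # sum node

def bernoulli(var, p):
    return PCNode("sum", weights=[p, 1 - p],
                  children=[PCNode("leaf", var=var, value=1),
                            PCNode("leaf", var=var, value=0)])

# p(X0, X1) = 0.6 * p0(X0) p0(X1) + 0.4 * p1(X0) p1(X1)
circuit = PCNode("sum", weights=[0.6, 0.4],
                 children=[PCNode("product", children=[bernoulli(0, 0.9), bernoulli(1, 0.2)]),
                           PCNode("product", children=[bernoulli(0, 0.1), bernoulli(1, 0.8)])])

print(evaluate(circuit, {0: 1}))        # p(X0 = 1), with X1 marginalised out
```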
The Syntactic Bits of Nouns: How Prior Syntactic Distributions Affect Comprehension, Production, and Acquisition
Usage-based linguistic theory argues that experience is the fundamental organizing principle of language. Linguistic representations are extracted from – and continuously tuned by – probabilistic features of language use. Much psycholinguistic evidence supports this argument, particularly in the domain of lexical processing. For example, how a word is distributed across its various lexical and morphological contexts influences how quickly it is recognized and produced in isolation. Fewer studies have explored how syntactic distributions affect lexical processing, and of these, even fewer have adopted comprehensive, abstract measurements of syntax. In this dissertation, I present several new information-theoretic tools for measuring the syntactic distributions of words based on the Dependency Grammar formalism. This formalism allows me to contrast two independent dimensions of syntactic structure: hierarchical status and word order. Further, I provide a new method for teasing apart information bound to syntactic and lexical contexts. I compute these measures for nouns based on two large corpora of English.
These measures are correlated with behavior in several contexts. First, I re-analyze the noun-based trials of two previously published databases of visual lexical decision response time data, one simple and the other primed. I then turn to production, reporting two picture-naming studies. In the first, participants produce nouns in isolation. This task constitutes a strong attack on the hypothesis that syntactic distributions affect noun production; at least on its face, it does not require participants to access syntactic information in order to successfully complete the task. In a follow-up, participants were asked to name the images using a syntactic frame (the + NAME). This task should promote syntactic access, increasing the likelihood that prior syntactic distributions should play a role. Finally, I test whether children are sensitive to these syntactic distributions (based on adult speech) as they begin to produce nouns in syntactic contexts for the first time, using a large, densely sampled longitudinal corpus of child speech.
Results show that isolated noun processing is affected by prior syntactic distributions in both comprehension and production. However, the specific nature of these effects differs across modalities, and in production, as a function of whether the nouns were produced in isolation or within a syntactic frame. The measures also predict the age at which nouns first emerge in the speech of children.
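As an illustration of the kind of information-theoretic measure involved (the dissertation's exact measures are not reproduced here), the sketch below computes the entropy of a noun's distribution over dependency relations from hypothetical corpus counts; higher entropy means the noun is spread more evenly across syntactic contexts.

```python
# Illustrative sketch: entropy of a noun's dependency-relation distribution.
import math
from collections import Counter

# Hypothetical counts of the dependency relations a noun appears in.
counts = Counter({"nsubj": 120, "dobj": 80, "pobj": 60, "nmod": 30, "conj": 10})

def syntactic_entropy(counts):
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(f"H(relation | 'dog') = {syntactic_entropy(counts):.3f} bits")
```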
Probabilistic Modelling of Morphologically Rich Languages
This thesis investigates how the sub-structure of words can be accounted for
in probabilistic models of language. Such models play an important role in
natural language processing tasks such as translation or speech recognition,
but often rely on the simplistic assumption that words are opaque symbols. This
assumption does not fit morphologically complex language well, where words can
have rich internal structure and sub-word elements are shared across distinct
word forms.
Our approach is to encode basic notions of morphology into the assumptions of
three different types of language models, with the intention that leveraging
shared sub-word structure can improve model performance and help overcome data
sparsity that arises from morphological processes.
In the context of n-gram language modelling, we formulate a new Bayesian
model that relies on the decomposition of compound words to attain better
smoothing, and we develop a new distributed language model that learns vector
representations of morphemes and leverages them to link together
morphologically related words. In both cases, we show that accounting for word
sub-structure improves the models' intrinsic performance and provides benefits
when applied to other tasks, including machine translation.
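A minimal sketch of the shared sub-word structure idea in the distributed model: if each word vector is composed additively from the vectors of its morphemes, morphologically related forms automatically share parameters and end up close in the vector space. The segmentations and vectors below are invented, and the additive composition is a simplification of the thesis's actual model and training procedure.

```python
# Sketch, assuming toy morpheme segmentations and random morpheme vectors.
import numpy as np

rng = np.random.default_rng(0)
morpheme_vecs = {m: rng.normal(size=16) for m in ["un", "friend", "ly", "ness", "book"]}
segmentations = {"friendly": ["friend", "ly"],
                 "unfriendly": ["un", "friend", "ly"],
                 "friendliness": ["friend", "ly", "ness"],
                 "book": ["book"]}

def word_vector(word):
    # Additive composition: the word vector is the sum of its morpheme vectors.
    return sum(morpheme_vecs[m] for m in segmentations[word])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(word_vector("friendly"), word_vector("unfriendly")))   # shared morphemes
print(cosine(word_vector("friendly"), word_vector("book")))          # unrelated
```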
We then shift the focus beyond the modelling of word sequences and consider
models that automatically learn what the sub-word elements of a given language
are, given an unannotated list of words. We formulate a novel model that can
learn discontiguous morphemes in addition to the more conventional contiguous
morphemes that most previous models are limited to. This approach is
demonstrated on Semitic languages, and we find that modelling discontiguous
sub-word structures leads to improvements in the task of segmenting words into
their contiguous morphemes.
Comment: DPhil thesis, University of Oxford, submitted and accepted 2014.
http://ora.ox.ac.uk/objects/uuid:8df7324f-d3b8-47a1-8b0b-3a6feb5f45c
- …