Advances in Bayesian asymptotics and Bayesian nonparametrics
Bayesian statistics is a powerful approach to learning about real-world phenomena, its strength lying in its ability to quantify uncertainty explicitly by treating unknown quantities of interest as random variables. In this thesis, we consider questions regarding three quite different aspects of Bayesian learning.
Firstly, we consider approximate Bayesian computation (ABC), a computational method suitable for computing approximate posterior distributions for highly complex models whose likelihood function is intractable but can be simulated from. Previous authors have proved consistency and provided rates of convergence in the case where all summary statistics converge at the same rate. We generalize to the case where summary statistics may converge at different rates, and provide an explicit representation of the shape of the ABC posterior distribution in our general setting. We also show, under this general setting, that local linear post-processing can lead to significantly faster contraction rates of the pseudo-posterior.
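To make the setup concrete, the following is a minimal sketch of rejection ABC with a local-linear regression adjustment of the accepted draws (in the style of Beaumont et al.); `prior_sample`, `simulate`, and `summarise` are hypothetical stand-ins for the model at hand, and this illustrates the general technique rather than the thesis' specific construction.

```python
import numpy as np

def abc_rejection_adjusted(s_obs, prior_sample, simulate, summarise,
                           n_sims=10_000, quantile=0.01):
    # Draw parameters from the prior and simulate matching summaries.
    thetas = np.array([prior_sample() for _ in range(n_sims)])
    summaries = np.array([summarise(simulate(t)) for t in thetas])
    # Rejection step: keep the draws whose summaries fall closest to s_obs.
    dists = np.linalg.norm(summaries - s_obs, axis=1)
    keep = dists <= np.quantile(dists, quantile)
    thetas, summaries = thetas[keep], summaries[keep]
    # Local-linear adjustment: regress theta on (s - s_obs) among accepted
    # draws and subtract the fitted trend, shrinking the approximation error.
    X = np.hstack([np.ones((summaries.shape[0], 1)), summaries - s_obs])
    beta, *_ = np.linalg.lstsq(X, thetas, rcond=None)
    return thetas - (summaries - s_obs) @ beta[1:]
```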
We then focus on the application of Bayesian statistics to natural language processing. The class of context-free grammars, which is standard in the modelling of natural language, has been shown to be too restrictive to fully describe all features of natural language. We propose a Bayesian non-parametric model for the class of 2-multiple context-free grammars, which generalise context-free grammars. Our model is inspired by previously proposed Bayesian models for context-free grammars and is based on the hierarchical Dirichlet process. We develop a sequential Monte Carlo algorithm to perform inference under this model and carry out simulation studies to assess our method.
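As background, a generic sequential Monte Carlo skeleton of the kind such an inference algorithm builds on is sketched below; `init`, `propose`, and `log_weight` are hypothetical placeholders for the grammar-specific particle operations and are not taken from the thesis.

```python
import numpy as np

def smc(n_particles, n_steps, init, propose, log_weight,
        rng=np.random.default_rng(0)):
    particles = [init() for _ in range(n_particles)]
    log_w = np.zeros(n_particles)
    for t in range(n_steps):
        # Extend each particle and accumulate its incremental log-weight.
        particles = [propose(p, t) for p in particles]
        log_w += np.array([log_weight(p, t) for p in particles])
        # Normalise weights and resample when the effective sample size drops.
        w = np.exp(log_w - log_w.max()); w /= w.sum()
        if 1.0 / np.sum(w ** 2) < n_particles / 2:
            idx = rng.choice(n_particles, size=n_particles, p=w)
            particles = [particles[i] for i in idx]
            log_w = np.zeros(n_particles)
    return particles, log_w
```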
Finally, we consider some consistency issues related to Bayesian nonparametric (BNP) mixture models. It has been shown that these models are inconsistent for the number of clusters. In the case of Dirichlet process (DP) mixture models, this problem can be mitigated when a prior is put on the model's concentration hyperparameter α, as is common practice. We prove that Pitman--Yor process (PYP) mixture models (which generalise DP mixture models) remain inconsistent for the number of clusters when a prior is put on α, in the special case where the true number of components in the data generating mechanism is equal to 1 and the discount parameter σ is a fixed constant. When considering the space of partitions induced by BNP mixture models, point estimators such as the maximum a posteriori (MAP) are commonly used to summarise the posterior clustering structure, which can otherwise be complex and difficult to interpret. We prove consistency of the MAP partition for DP mixture models when the concentration parameter α goes deterministically to zero and the true partition consists of a single cluster.
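To fix ideas, the predictive rule of the Pitman--Yor process (the generalised Chinese restaurant process) can be simulated in a few lines, making the roles of α and σ concrete; with σ = 0 it reduces to the DP case discussed above. This is a standard construction, not code from the thesis.

```python
import numpy as np

def sample_py_partition(n, alpha, sigma, rng=np.random.default_rng(0)):
    counts = []                                  # current cluster sizes
    for _ in range(n):
        k = len(counts)
        # P(join cluster j) ∝ counts[j] - sigma;  P(new cluster) ∝ alpha + sigma*k
        weights = np.array([c - sigma for c in counts] + [alpha + sigma * k])
        j = rng.choice(k + 1, p=weights / weights.sum())
        if j == k:
            counts.append(1)                     # open a new cluster
        else:
            counts[j] += 1
    return counts

print(sample_py_partition(100, alpha=1.0, sigma=0.5))
```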
Learning with Labeled and Unlabeled Data
In this paper we aim, on the one hand, to review the literature dealing with the problem of supervised learning aided by additional unlabeled data. On the other hand, being part of the author's first-year PhD report, the paper serves as a frame to bundle related work by the author as well as numerous suggestions for potential future work. This work therefore contains more speculative and partly subjective material than the reader might expect from a literature review. We give a rigorous definition of the problem and relate it to supervised and unsupervised learning. The crucial role of prior knowledge is put forward, and we discuss the important notion of input-dependent regularization. We postulate a number of baseline methods: algorithms or algorithmic schemes that can be applied to the problem more or less straightforwardly, without the need for genuinely new concepts. However, some of them might serve as the basis for a genuine method. In the literature review, we try to cover the wide variety of (recent) work and to classify it into meaningful categories. We also mention work done on related problems and suggest some ideas towards a synthesis. Finally, we discuss some caveats and tradeoffs of central importance to the problem.
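As one illustration of how simple such a baseline scheme can be (a generic example, not necessarily one of the paper's postulated baselines), a self-training loop fits a classifier on the labeled data and repeatedly adopts its most confident predictions on unlabeled data as pseudo-labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_l, y_l, X_u, threshold=0.95, max_rounds=10):
    clf = LogisticRegression(max_iter=1000)
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
    for _ in range(max_rounds):
        clf.fit(X_l, y_l)
        if len(X_u) == 0:
            break
        proba = clf.predict_proba(X_u)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break
        # Move confidently predicted points into the labeled set.
        pseudo = clf.classes_[proba[confident].argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[confident]])
        y_l = np.concatenate([y_l, pseudo])
        X_u = X_u[~confident]
    return clf
```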
Towards Deeper Understanding in Neuroimaging
Neuroimaging is a growing domain of research, and advances in machine learning have tremendous potential to expand understanding in neuroscience and improve public health. Deep neural networks have recently and rapidly achieved historic success in numerous domains, and as a consequence have redefined the landscape of automated learners, promising significant advances in many fields of research. Despite these advances and their advantages over traditional machine learning methods, deep neural networks have yet to permeate significantly into neuroscience studies, particularly as a tool for discovery. This dissertation presents well-established and novel tools for unsupervised learning which aid in feature discovery, with relevant applications to neuroimaging. Through the works within, this dissertation presents strong evidence that deep learning is a viable and important tool for neuroimaging studies.
Generative Modeling and Inference in Directed and Undirected Neural Networks
Generative modeling and inference are two broad categories in unsupervised learning whose goals are to answer the following questions, respectively: 1. Given a dataset, how do we (either implicitly or explicitly) model the underlying probability distribution from which the data came and draw samples from it? 2. How can we learn an underlying abstract representation of the data? In this dissertation we provide three studies that each, in a different way, improve upon specific generative modeling and inference techniques. First, we develop a state-of-the-art estimator of a generic probability distribution's partition function, or normalizing constant, during simulated tempering. We then apply our estimator to the specific case of training undirected probabilistic graphical models, and find that our method can track log-likelihoods during training at essentially no extra computational cost. We then shift our focus to variational inference in directed probabilistic graphical models (Bayesian networks) for generative modeling and inference. First, we generalize the aggregate prior distribution to decouple the variational and generative models, giving the model greater flexibility, and find improvements in the model's log-likelihood of test data as well as a better latent representation. Finally, we study the variational loss function and argue that, under a typical architecture, the data-dependent term of the gradient decays to zero as the latent space dimensionality increases. We use this result to propose a simple modification to random weight initialization and show that in certain models the modification substantially improves training convergence time. Together, these results improve the quantitative performance of popular generative modeling and inference models while furthering our understanding of them.
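For context, a standard way to estimate a partition function along a temperature ladder is annealed importance sampling; the sketch below shows the importance-weighting idea, though the dissertation's estimator operates during simulated tempering itself rather than as a separate AIS run. The function arguments are hypothetical and assumed vectorised over the chain dimension.

```python
import numpy as np

def ais_log_z(log_f, sample_beta0, transition, betas, n_chains=100):
    """Estimate log(Z_1 / Z_0) along inverse temperatures `betas`.
    log_f(x, beta): unnormalised log density at inverse temperature beta.
    sample_beta0(): exact sample at betas[0] (tractable base distribution).
    transition(x, beta): MCMC step leaving beta's distribution invariant."""
    x = np.array([sample_beta0() for _ in range(n_chains)])
    log_w = np.zeros(n_chains)
    for b_prev, b_next in zip(betas[:-1], betas[1:]):
        log_w += log_f(x, b_next) - log_f(x, b_prev)  # incremental weight
        x = transition(x, b_next)
    # log-mean-exp of accumulated weights gives the log ratio estimate.
    return np.logaddexp.reduce(log_w) - np.log(n_chains)
```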
Advances in scalable learning and sampling of unnormalised models
We study probabilistic models that are known incompletely, up to an intractable normalising constant. To reap the full benefit of such models, two tasks must be solved: learning and sampling. These two tasks have been subject to decades of research, and yet significant challenges still persist. Traditional approaches often suffer from poor scalability with respect to dimensionality and model complexity, generally rendering them inapplicable to models parameterised by deep neural networks. In this thesis, we contribute a new set of methods for addressing this scalability problem.
We first explore the problem of learning unnormalised models. Our investigation begins with a well-known learning principle, Noise-contrastive Estimation, whose underlying mechanism is that of density-ratio estimation. By examining why existing density-ratio estimators scale poorly, we identify a new framework, telescoping density-ratio estimation (TRE), that can learn ratios between highly dissimilar densities in high-dimensional spaces. Our experiments demonstrate that TRE not only yields substantial improvements for the learning of deep unnormalised models, but can do the same for a broader set of tasks, including mutual information estimation and representation learning.
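The telescoping identity at the heart of TRE can be sketched as follows: rather than estimating log p(x) - log q(x) with a single classifier, one chains estimates between consecutive "waymark" distributions q = p_0, p_1, ..., p_m = p and sums them. Here `fit_logistic_ratio` is a hypothetical helper for classifier-based density-ratio estimation, and waymark construction is elided.

```python
import numpy as np

def telescoped_log_ratio(x, waymark_samplers, fit_logistic_ratio):
    """waymark_samplers: list of callables returning samples from
    q = p_0, p_1, ..., p_m = p; fit_logistic_ratio returns a function
    estimating log(p_next / p_prev) pointwise via logistic regression."""
    log_r = np.zeros(len(x))
    for p_prev, p_next in zip(waymark_samplers[:-1], waymark_samplers[1:]):
        # Each ratio is between nearby densities, so each classification
        # problem stays well-conditioned even in high dimensions.
        ratio_fn = fit_logistic_ratio(numer=p_next(), denom=p_prev())
        log_r += ratio_fn(x)
    return log_r  # estimate of log p(x) - log q(x)
```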
Subsequently, we explore the problem of sampling unnormalised models. A large literature on Markov chain Monte Carlo (MCMC) can be leveraged here, and in continuous domains, gradient-based samplers such as the Metropolis-adjusted Langevin algorithm (MALA) and Hamiltonian Monte Carlo are excellent options. However, there has been substantially less progress in MCMC for discrete domains. To advance this subfield, we introduce several discrete Metropolis-Hastings samplers that are conceptually inspired by MALA, and demonstrate their strong empirical performance across a range of challenging sampling tasks.
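For reference, a compact implementation of one MALA step, the continuous gradient-based sampler that inspires these discrete proposals, looks like this (a textbook sketch, not the thesis' discrete samplers themselves):

```python
import numpy as np

def mala_step(x, log_p, grad_log_p, step, rng=np.random.default_rng(0)):
    # Langevin proposal: a gradient step plus Gaussian noise.
    noise = rng.normal(size=x.shape)
    y = x + step * grad_log_p(x) + np.sqrt(2 * step) * noise

    # Metropolis-Hastings correction for the asymmetric Gaussian proposal.
    def log_q(b, a):  # log density of proposing b from a (up to a constant)
        return -np.sum((b - a - step * grad_log_p(a)) ** 2) / (4 * step)

    log_alpha = log_p(y) + log_q(x, y) - log_p(x) - log_q(y, x)
    return y if np.log(rng.uniform()) < log_alpha else x
```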
Parallel MCMC with Generalized Elliptical Slice Sampling
Probabilistic models are conceptually powerful tools for finding structure in data, but their practical effectiveness is often limited by our ability to perform inference in them. Exact inference is frequently intractable, so approximate inference is often performed using Markov chain Monte Carlo (MCMC). To achieve the best possible results from MCMC, we want to efficiently simulate many steps of a rapidly mixing Markov chain which leaves the target distribution invariant. Of particular interest in this regard is how to take advantage of multi-core computing to speed up MCMC-based inference, both to improve mixing and to distribute the computational load. In this paper, we present a parallelizable Markov chain Monte Carlo algorithm for efficiently sampling from continuous probability distributions that can take advantage of hundreds of cores. This method shares information between parallel Markov chains to build a scale-mixture of Gaussians approximation to the density function of the target distribution. We combine this approximation with a recent method known as elliptical slice sampling to create a Markov chain with no step-size parameters that can mix rapidly without requiring gradient or curvature computations.
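A single elliptical slice sampling update (Murray, Adams and MacKay, 2010), the building block the paper combines with its Gaussian scale-mixture approximation, can be sketched as follows for a target proportional to L(x) N(x; 0, Σ); note the absence of any step-size parameter.

```python
import numpy as np

def ess_step(x, log_L, sample_prior, rng=np.random.default_rng(0)):
    nu = sample_prior()                        # auxiliary draw from N(0, Sigma)
    log_y = log_L(x) + np.log(rng.uniform())   # slice height under current state
    theta = rng.uniform(0.0, 2 * np.pi)        # initial angle on the ellipse
    lo, hi = theta - 2 * np.pi, theta
    while True:
        # Candidate on the ellipse through x and nu.
        x_new = x * np.cos(theta) + nu * np.sin(theta)
        if log_L(x_new) > log_y:
            return x_new
        # Shrink the angle bracket toward the current state and retry.
        if theta < 0:
            lo = theta
        else:
            hi = theta
        theta = rng.uniform(lo, hi)
```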
Representation Learning: A Review and New Perspectives
The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, auto-encoders, manifold learning, and deep networks. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation and manifold learning.
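As a toy illustration of the representation-learning idea (ours, and far simpler than anything surveyed in the review), a single-layer tied-weight autoencoder learns a code h = tanh(xW) by minimising reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))           # toy data
W = rng.normal(scale=0.1, size=(20, 5))  # encoder weights (decoder is W.T)

for _ in range(2000):
    H = np.tanh(X @ W)                   # representation (hidden code)
    X_hat = H @ W.T                      # reconstruction
    err = X_hat - X
    # Gradient of 0.5 * ||X_hat - X||^2 with tied weights:
    # encoder path plus decoder path.
    grad_W = X.T @ ((err @ W) * (1 - H ** 2)) + err.T @ H
    W -= 1e-3 * grad_W / len(X)

codes = np.tanh(X @ W)                   # learned features
```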
Connectionist multivariate density-estimation and its application to speech synthesis
Autoregressive models factorize a multivariate joint probability distribution into a product of one-dimensional conditional distributions. The variables are assigned an ordering, and the conditional distribution of each variable is modelled using all variables preceding it in that ordering as predictors.
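Written out for an ordering o = (o_1, ..., o_D) of the D variables, the factorisation reads:

```latex
p(\mathbf{x}) \;=\; \prod_{d=1}^{D} p\!\left(x_{o_d} \mid x_{o_1}, \ldots, x_{o_{d-1}}\right),
```

which is exact for any choice of ordering by the chain rule of probability; each factor is a one-dimensional conditional that can be modelled directly.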
Calculating normalized probabilities and sampling have polynomial computational complexity under autoregressive models. Moreover, binary autoregressive models based on neural networks obtain statistical performance similar to that of some intractable models, like restricted Boltzmann machines, on several datasets.
The use of autoregressive probability density estimators based on neural networks to model real-valued data, while proposed before, has never been properly investigated and reported. In this thesis we extend the formulation of neural autoregressive distribution estimators (NADE) to real-valued data; a model we call the real-valued neural autoregressive density estimator (RNADE). Its statistical performance on several datasets, including visual and auditory data, is reported and compared to that of other models. RNADE obtained higher test likelihoods than other tractable models, while retaining all the attractive computational properties of autoregressive models.
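Concretely, one RNADE-style conditional models the d-th variable with a mixture of Gaussians whose parameters are computed from a hidden state summarising the preceding dimensions; the sketch below uses illustrative names and shapes rather than the thesis' exact parameterisation.

```python
import numpy as np

def rnade_conditional_logpdf(x_d, h, V_pi, V_mu, V_s):
    """h: hidden activations computed from dimensions < d;
    V_pi, V_mu, V_s: per-dimension output weights of shape (K, len(h))."""
    log_pi = V_pi @ h                       # mixture logits, shape (K,)
    log_pi -= np.logaddexp.reduce(log_pi)   # normalise to log mixture weights
    mu, log_sigma = V_mu @ h, V_s @ h       # component means / log st. devs
    # Per-component Gaussian log densities at x_d.
    log_comp = (-0.5 * ((x_d - mu) / np.exp(log_sigma)) ** 2
                - log_sigma - 0.5 * np.log(2 * np.pi))
    return np.logaddexp.reduce(log_pi + log_comp)
```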
However, autoregressive models are limited by the ordering of the variables inherent to their formulation. Marginalization and imputation tasks can only be solved analytically if the missing variables are at the end of the ordering. We present a new training technique that obtains a set of parameters that can be used for any ordering of the variables. By choosing a model with a convenient ordering of the dimensions at test time, it is possible to solve any marginalization and imputation task analytically.
The same training procedure also makes it practical to train NADEs and RNADEs with several hidden layers. The resulting deep and tractable models display higher test likelihoods than the equivalent one-hidden-layer models for all the datasets tested.
Ensembles of NADEs or RNADEs can be created inexpensively by combining models that share their parameters but differ in the ordering of the variables. These ensembles of autoregressive models obtain state-of-the-art statistical performance on several datasets.
Finally, we demonstrate the application of RNADE to speech synthesis, and confirm that capturing the phone-conditional dependencies of acoustic features improves the quality of synthetic speech. Our model generates synthetic speech that was judged by naive listeners to be of higher quality than that generated by mixture density networks, which are considered a state-of-the-art synthesis technique.
Support Vector Regression for Non-Stationary Time Series
The difficulty of building forecasting models for non-stationary and volatile data has necessitated the development and application of sophisticated new techniques that can handle such data. There are many real-world phenomena that generate data which are "difficult to analyze" in this sense. One of these is the stock market, where the data series generated are often hard to forecast because of their peculiar characteristics. In particular, the stock market has been described as a complex environment, and financial time series forecasting is often tagged as the most challenging application of time series forecasting.
In this study, a novel approach known as Support Vector Regression (SVR) was adopted for forecasting non-stationary time series, and the feasibility of applying this method to five financial time series was examined. Prior to implementing the SVR algorithm, three different transformations, namely Relative Difference in Percentages (RDP), Z-score and natural-logarithm transformations, were applied to the data series, and the best prediction results, along with the associated transformation technique, were presented. Our study indicated that the Z-score transformation is the best scaling method for financial time series, outperforming the other two transformations on the basis of five different performance measures.
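For concreteness, the three transformations can be written compactly as below; the five-day RDP lag is our illustrative assumption, as the abstract does not restate the study's exact horizon.

```python
import numpy as np

def rdp(p, lag=5):
    # Relative Difference in Percentages over `lag` days.
    return 100 * (p[lag:] - p[:-lag]) / p[:-lag]

def zscore(x):
    # Z-score standardisation to zero mean and unit variance.
    return (x - x.mean()) / x.std()

def log_transform(p):
    # Natural-logarithm transformation of the price series.
    return np.log(p)
```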
To determine the optimal values of the SVR parameters, a cross-validation method was implemented. For this purpose, the values of C and ε were varied from 5 to 100 and from 0.001 to 0.1, respectively. The cross-validation method, though computationally expensive, is better than other proposed techniques for determining the values of these parameters.
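In scikit-learn terms, the described search might be sketched as follows; the grid endpoints match the ranges quoted above, while the number of grid points and the time-series splitting strategy are our assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.svm import SVR

# Cross-validated grid search over C in [5, 100] and epsilon in [0.001, 0.1].
grid = {"C": np.linspace(5, 100, 20), "epsilon": np.linspace(0.001, 0.1, 10)}
search = GridSearchCV(SVR(kernel="rbf"), grid,
                      cv=TimeSeriesSplit(n_splits=5),
                      scoring="neg_mean_squared_error")
# search.fit(X_train, y_train)   # X_train, y_train: the transformed series
# print(search.best_params_)
```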
Another highlight of this study is the comparison of the SVR results with those obtained using 5-day Simple Moving Averages (SMA). The SMA was selected as a comparative method because it has been identified as the most popular quantitative forecasting method used by US corporations. Discussions with financial analysts also suggest that the SMA is one of the most widely used methods in the financial industry. The popularity of the SMA can be explained by the fact that it is easy and cheap to use, and it produces forecasts that can be easily interpreted by econometricians and other interested practitioners.
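The 5-day SMA baseline itself is a one-liner: the forecast for day t is the mean of the five preceding observations.

```python
import numpy as np

def sma_forecast(p, window=5):
    # The mean of p[i:i+window] serves as the forecast of p[i+window],
    # so the returned array aligns with the targets p[window:].
    return np.convolve(p, np.ones(window) / window, mode="valid")[:-1]
```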