    Advances in Bayesian asymptotics and Bayesian nonparametrics

    Bayesian statistics is a powerful approach to learning about real-world phenomena, its strength lying in its ability to quantify uncertainty explicitly by treating unknown quantities of interest as random variables. In this thesis, we consider questions regarding three quite different aspects of Bayesian learning.

    Firstly, we consider approximate Bayesian computation (ABC), a computational method suitable for computing approximate posterior distributions for highly complex models whose likelihood function is intractable but can be simulated from. Previous authors have proved consistency and provided rates of convergence in the case where all summary statistics converge at the same rate. We generalise to the case where summary statistics may converge at different rates, and provide an explicit representation of the shape of the ABC posterior distribution in this general setting. We also show, under the same setting, that local linear post-processing can lead to significantly faster contraction rates of the pseudo-posterior.

    We then focus on the application of Bayesian statistics to natural language processing. The class of context-free grammars, which is standard in the modelling of natural language, has been shown to be too restrictive to fully describe all features of natural language. We propose a Bayesian non-parametric model for the class of 2-multiple context-free grammars, which generalise context-free grammars. Our model is inspired by previously proposed Bayesian models for context-tree grammars and is based on the hierarchical Dirichlet process. We develop a sequential Monte Carlo algorithm to perform inference under this model and carry out simulation studies to assess our method.

    Finally, we consider some consistency issues related to Bayesian nonparametric (BNP) mixture models. It has been shown that these models are inconsistent for the number of clusters. In the case of Dirichlet process (DP) mixture models, this problem can be mitigated by placing a prior on the model's concentration hyperparameter α, as is common practice. We prove that Pitman--Yor process (PYP) mixture models, which generalise DP mixture models, remain inconsistent for the number of clusters when a prior is put on α, in the special case where the true number of components in the data-generating mechanism is equal to 1 and the discount parameter σ is a fixed constant. On the space of partitions induced by BNP mixture models, point estimators such as the maximum a posteriori (MAP) are commonly used to summarise the posterior clustering structure, which alone can be complex and difficult to interpret. We prove consistency of the MAP partition for DP mixture models when the concentration parameter α goes deterministically to zero and the true partition consists of a single cluster.
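
    To make the ABC ingredients above concrete, here is a minimal sketch of rejection ABC followed by local linear (regression-adjustment) post-processing on a toy Gaussian model. The simulator, summary statistics, prior, acceptance quantile, and all names are illustrative assumptions rather than the constructions analysed in the thesis.

    import numpy as np

    rng = np.random.default_rng(0)

    def simulate(theta, n=100):
        """Toy simulator: data drawn from N(theta, 1); stands in for an intractable model."""
        return rng.normal(theta, 1.0, size=n)

    def summary(x):
        """Summary statistics: sample mean and variance."""
        return np.array([x.mean(), x.var()])

    x_obs = simulate(2.0)                                   # pretend these are the observed data
    s_obs = summary(x_obs)

    # Rejection ABC: draw parameters from the prior, simulate, and keep the draws
    # whose simulated summaries are closest to the observed ones.
    n_sim = 20000
    theta_prior = rng.uniform(-5, 5, size=n_sim)            # Uniform(-5, 5) prior
    s_sim = np.array([summary(simulate(t)) for t in theta_prior])
    dist = np.linalg.norm(s_sim - s_obs, axis=1)
    keep = dist <= np.quantile(dist, 0.01)                  # accept the closest 1%
    theta_acc, s_acc = theta_prior[keep], s_sim[keep]

    # Local linear post-processing: regress accepted parameters on the summary
    # discrepancies and shift each draw to what it would be at s_obs exactly.
    X = np.column_stack([np.ones(keep.sum()), s_acc - s_obs])
    beta, *_ = np.linalg.lstsq(X, theta_acc, rcond=None)
    theta_adj = theta_acc - (s_acc - s_obs) @ beta[1:]

    print("raw ABC posterior mean:", theta_acc.mean())
    print("adjusted posterior mean:", theta_adj.mean())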

    Learning with Labeled and Unlabeled Data

    In this paper we aim, on the one hand, to give a review of the literature dealing with the problem of supervised learning aided by additional unlabeled data. On the other hand, being part of the author's first-year PhD report, the paper serves as a frame to bundle related work by the author as well as numerous suggestions for potential future work. This work therefore contains more speculative and partly subjective material than the reader might expect from a literature review. We give a rigorous definition of the problem and relate it to supervised and unsupervised learning. The crucial role of prior knowledge is put forward, and we discuss the important notion of input-dependent regularization. We postulate a number of baseline methods: algorithms or algorithmic schemes which can more or less straightforwardly be applied to the problem without the need for genuinely new concepts, although some of them might serve as the basis for a genuine method. In the literature review, we try to cover the wide variety of (recent) work and to classify it into meaningful categories. We also mention work done on related problems and suggest some ideas towards a synthesis. Finally, we discuss some caveats and tradeoffs of central importance to the problem.
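
    One of the simplest baselines of this kind is self-training: fit a supervised learner on the labeled data, pseudo-label the unlabeled points on which it is most confident, and refit. The sketch below illustrates the idea with scikit-learn on synthetic data; the confidence threshold, number of rounds, and toy dataset are assumptions made for illustration, not choices from the paper.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    labeled = np.zeros(len(y), dtype=bool)
    labeled[:50] = True                            # pretend only 10% of the labels are observed

    X_lab, y_lab = X[labeled], y[labeled]
    X_unl = X[~labeled]

    for _ in range(5):                             # a few self-training rounds
        clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
        if len(X_unl) == 0:
            break
        proba = clf.predict_proba(X_unl)
        confident = proba.max(axis=1) > 0.95       # pseudo-label only high-confidence points
        if not confident.any():
            break
        X_lab = np.vstack([X_lab, X_unl[confident]])
        y_lab = np.concatenate([y_lab, proba[confident].argmax(axis=1)])
        X_unl = X_unl[~confident]

    print("labeled-plus-pseudo-labeled training set size:", len(y_lab))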

    Towards Deeper Understanding in Neuroimaging

    Neuroimaging is a growing domain of research, with advances in machine learning having tremendous potential to expand understanding in neuroscience and improve public health. Deep neural networks have recently and rapidly achieved historic success in numerous domains, and as a consequence have redefined the landscape of automated learners, promising significant advances in many areas of research. Despite these advances and their advantages over traditional machine learning methods, deep neural networks have yet to permeate neuroscience studies significantly, particularly as a tool for discovery. This dissertation presents well-established and novel tools for unsupervised learning which aid in feature discovery, with relevant applications to neuroimaging. Through the works within, this dissertation presents strong evidence that deep learning is a viable and important tool for neuroimaging studies.

    Advances in scalable learning and sampling of unnormalised models

    We study probabilistic models that are known only incompletely, up to an intractable normalising constant. To reap the full benefit of such models, two tasks must be solved: learning and sampling. These two tasks have been the subject of decades of research, and yet significant challenges persist. Traditional approaches often suffer from poor scalability with respect to dimensionality and model complexity, generally rendering them inapplicable to models parameterised by deep neural networks. In this thesis, we contribute a new set of methods for addressing this scalability problem. We first explore the problem of learning unnormalised models. Our investigation begins with a well-known learning principle, noise-contrastive estimation, whose underlying mechanism is density-ratio estimation. By examining why existing density-ratio estimators scale poorly, we identify a new framework, telescoping density-ratio estimation (TRE), that can learn ratios between highly dissimilar densities in high-dimensional spaces. Our experiments demonstrate that TRE not only yields substantial improvements for the learning of deep unnormalised models, but does the same for a broader set of tasks including mutual information estimation and representation learning. Subsequently, we explore the problem of sampling from unnormalised models. A large literature on Markov chain Monte Carlo (MCMC) can be leveraged here, and in continuous domains gradient-based samplers such as the Metropolis-adjusted Langevin algorithm (MALA) and Hamiltonian Monte Carlo are excellent options. However, there has been substantially less progress in MCMC for discrete domains. To advance this subfield, we introduce several discrete Metropolis--Hastings samplers that are conceptually inspired by MALA, and demonstrate their strong empirical performance across a range of challenging sampling tasks.
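
    The density-ratio mechanism behind both noise-contrastive estimation and TRE can be illustrated in a few lines: a probabilistic classifier trained to separate samples from two distributions recovers, through its logit, the log-ratio of their densities, and the telescoping idea chains such classifiers through intermediate "waymark" distributions so that no single classifier faces highly dissimilar densities. The sketch below does this for toy one-dimensional Gaussians; the waymark scheme, features, and sample sizes are assumptions for illustration, not the construction used in the thesis.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 20000

    # Target p = N(4, 1) and noise q = N(0, 1) are very dissimilar, so a single
    # classifier-based ratio estimate is poor; chain ratios through waymarks instead.
    means = np.linspace(0.0, 4.0, 5)               # waymark means interpolating q -> p
    x_test = np.array([[2.0]])                     # point at which to evaluate log p(x)/q(x)
    logit_sum = 0.0

    for m0, m1 in zip(means[:-1], means[1:]):
        a = rng.normal(m1, 1.0, n)                 # samples from the "numerator" waymark
        b = rng.normal(m0, 1.0, n)                 # samples from the "denominator" waymark
        X = np.concatenate([a, b]).reshape(-1, 1)
        y = np.concatenate([np.ones(n), np.zeros(n)])
        feats = np.hstack([X, X ** 2])             # quadratic features can represent Gaussian log-ratios
        clf = LogisticRegression().fit(feats, y)
        # With balanced classes, the logit approximates log[p_m1(x) / p_m0(x)].
        logit_sum += clf.decision_function(np.hstack([x_test, x_test ** 2]))[0]

    exact = -0.5 * ((2.0 - 4.0) ** 2 - (2.0 - 0.0) ** 2)   # true log p(x)/q(x) at x = 2
    print("telescoped estimate:", logit_sum, "exact:", exact)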

    Parallel MCMC with Generalized Elliptical Slice Sampling

    Probabilistic models are conceptually powerful tools for finding structure in data, but their practical effectiveness is often limited by our ability to perform inference in them. Exact inference is frequently intractable, so approximate inference is often performed using Markov chain Monte Carlo (MCMC). To achieve the best possible results from MCMC, we want to efficiently simulate many steps of a rapidly mixing Markov chain which leaves the target distribution invariant. Of particular interest in this regard is how to take advantage of multi-core computing to speed up MCMC-based inference, both to improve mixing and to distribute the computational load. In this paper, we present a parallelizable Markov chain Monte Carlo algorithm for efficiently sampling from continuous probability distributions that can take advantage of hundreds of cores. This method shares information between parallel Markov chains to build a scale-mixture of Gaussians approximation to the density function of the target distribution. We combine this approximation with a recent method known as elliptical slice sampling to create a Markov chain with no step-size parameters that can mix rapidly without requiring gradient or curvature computations.
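
    The elliptical slice sampling building block can be sketched in a few lines: for a target proportional to a Gaussian factor times a likelihood, points are proposed on an ellipse through the current state and an auxiliary Gaussian draw, and the proposal bracket is shrunk until a point above a random slice height is found, with no step-size parameter to tune. Below is the standard update of Murray, Adams and MacKay (2010) on a toy Gaussian target; in the generalized, parallel method described above, the Gaussian factor would instead be the scale-mixture approximation built from the parallel chains. The toy prior and likelihood are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def elliptical_slice_step(x, log_lik, chol_sigma):
        """One transition leaving N(0, Sigma) * exp(log_lik(x)) invariant."""
        nu = chol_sigma @ rng.standard_normal(x.shape)       # auxiliary draw ~ N(0, Sigma)
        log_y = log_lik(x) + np.log(rng.uniform())           # random slice height
        theta = rng.uniform(0.0, 2.0 * np.pi)
        theta_min, theta_max = theta - 2.0 * np.pi, theta
        while True:
            x_prop = x * np.cos(theta) + nu * np.sin(theta)  # point on the ellipse
            if log_lik(x_prop) > log_y:
                return x_prop                                # accepted; no step size was tuned
            if theta < 0.0:                                  # shrink the bracket towards 0
                theta_min = theta
            else:
                theta_max = theta
            theta = rng.uniform(theta_min, theta_max)

    # Toy target: N(0, I) Gaussian factor times a Gaussian "likelihood" centred at 2,
    # whose posterior mean is 1 in each coordinate.
    log_lik = lambda x: -0.5 * np.sum((x - 2.0) ** 2)
    x = np.zeros(2)
    samples = []
    for _ in range(5000):
        x = elliptical_slice_step(x, log_lik, np.eye(2))
        samples.append(x)
    print("estimated posterior mean:", np.mean(samples, axis=0))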

    Representation Learning: A Review and New Perspectives

    The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, auto-encoders, manifold learning, and deep networks. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation and manifold learning.

    Connectionist multivariate density-estimation and its application to speech synthesis

    Autoregressive models factorize a multivariate joint probability distribution into a product of one-dimensional conditional distributions. The variables are assigned an ordering, and the conditional distribution of each variable is modelled using all variables preceding it in that ordering as predictors. Calculating normalized probabilities and sampling both have polynomial computational complexity under autoregressive models. Moreover, binary autoregressive models based on neural networks obtain statistical performance similar to that of some intractable models, like restricted Boltzmann machines, on several datasets. The use of autoregressive probability density estimators based on neural networks to model real-valued data, while proposed before, has never been properly investigated and reported. In this thesis we extend the formulation of neural autoregressive distribution estimators (NADE) to real-valued data; we call the resulting model the real-valued neural autoregressive density estimator (RNADE). Its statistical performance on several datasets, including visual and auditory data, is reported and compared to that of other models. RNADE obtained higher test likelihoods than other tractable models, while retaining all the attractive computational properties of autoregressive models.

    However, autoregressive models are limited by the ordering of the variables inherent to their formulation. Marginalization and imputation tasks can only be solved analytically if the missing variables are at the end of the ordering. We present a new training technique that obtains a set of parameters that can be used for any ordering of the variables. By choosing a model with a convenient ordering of the dimensions at test time, it is possible to solve any marginalization and imputation task analytically. The same training procedure also makes it practical to train NADEs and RNADEs with several hidden layers. The resulting deep and tractable models display higher test likelihoods than the equivalent one-hidden-layer models on all the datasets tested. Ensembles of NADEs or RNADEs can be created inexpensively by combining models that share their parameters but differ in the ordering of the variables. These ensembles of autoregressive models obtain state-of-the-art statistical performance on several datasets.

    Finally, we demonstrate the application of RNADE to speech synthesis, and confirm that capturing the phone-conditional dependencies of acoustic features improves the quality of synthetic speech. Our model generates synthetic speech that was judged by naive listeners as being of higher quality than that generated by mixture density networks, which are considered a state-of-the-art synthesis technique.
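
    As a concrete illustration of the factorisation described above, here is a minimal, RNADE-flavoured sketch that evaluates log p(x) = sum_d log p(x_d | x_<d) with a shared, incrementally updated hidden layer and a single Gaussian conditional per dimension. The untrained random parameters, the single-Gaussian conditionals (the model described above uses mixtures), and the chosen shapes are simplifying assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    D, H = 5, 16                                   # data dimensionality, hidden units

    # Shared parameters: W couples each input dimension to the hidden layer.
    W = rng.normal(0, 0.1, (H, D))
    c = np.zeros(H)
    V_mu, b_mu = rng.normal(0, 0.1, (D, H)), np.zeros(D)
    V_ls, b_ls = rng.normal(0, 0.1, (D, H)), np.zeros(D)

    def log_density(x):
        """Evaluate log p(x) = sum_d log p(x_d | x_<d) in one sweep over the ordering."""
        a = c.copy()                               # running pre-activation; depends only on x_<d
        logp = 0.0
        for d in range(D):
            h = 1.0 / (1.0 + np.exp(-a))           # hidden layer used for p(x_d | x_<d)
            mu = V_mu[d] @ h + b_mu[d]             # conditional mean
            log_sigma = V_ls[d] @ h + b_ls[d]      # conditional log-std
            logp += (-0.5 * np.log(2 * np.pi) - log_sigma
                     - 0.5 * ((x[d] - mu) / np.exp(log_sigma)) ** 2)
            a = a + W[:, d] * x[d]                 # condition the next dimension on x_d
        return logp

    x = rng.normal(size=D)
    print("log p(x) =", log_density(x))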

    Support Vector Regression for Non-Stationary Time Series

    The difficulty associated with building forecasting models for non-stationary and volatile data has necessitated the development and application of new, sophisticated techniques that can handle such data. Interestingly, there are many real-world phenomena that generate data which are "difficult to analyze". One of these is the stock market, where the data series generated are often hard to forecast because of their peculiar characteristics. In particular, the stock market has been referred to as a complex environment, and financial time series forecasting is often tagged as the most challenging application of time series forecasting. In this study, a novel approach known as Support Vector Regression (SVR) for forecasting non-stationary time series was adopted, and the feasibility of applying this method to five financial time series was examined. Prior to implementing the SVR algorithm, three different methods of transformation, namely Relative Difference in Percentages (RDP), Z-score and natural logarithm transformations, were applied to the data series, and the best prediction results obtained, along with the associated transformation technique, are presented. Our study indicated that the Z-score transformation is the best scaling method for financial time series, outperforming the other two transformations on the basis of five different performance measures. To determine the optimum values of the SVR parameters, a cross-validation method was implemented: the values of C and ε were varied from 5 to 100 and from 0.001 to 0.1, respectively. The cross-validation method, though computationally expensive, is better than other proposed techniques for determining the values of these parameters. Another highlight of this study is the comparison of the SVR results with those obtained using 5-day Simple Moving Averages (SMA). The SMA was selected as a comparative method because it has been identified as the most popular quantitative forecasting method used by US corporations. Discussions with financial analysts also suggest that the SMA is one of the most widely used methods in the financial industry. The popularity of the SMA can be explained by the fact that it is easy and cheap to use and produces forecasts that can be easily interpreted by econometricians and other interested practitioners.
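
    A minimal sketch of this kind of pipeline is given below: Z-score the series, build lagged features, select C and ε by cross-validation over roughly the stated ranges, and compare the SVR forecasts against a 5-day simple moving average. The synthetic series, lag structure, grid values, and scikit-learn setup are assumptions made for illustration, not the study's actual data or configuration.

    import numpy as np
    from sklearn.svm import SVR
    from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

    rng = np.random.default_rng(0)
    price = 100 + np.cumsum(rng.normal(0, 1, 400))           # synthetic non-stationary series

    z = (price - price.mean()) / price.std()                  # Z-score transformation

    # Lagged features: predict z[t] from the previous 5 values.
    lags = 5
    X = np.column_stack([z[i:len(z) - lags + i] for i in range(lags)])
    y = z[lags:]
    X_train, X_test, y_train, y_test = X[:-50], X[-50:], y[:-50], y[-50:]

    grid = GridSearchCV(
        SVR(kernel="rbf"),
        {"C": [5, 10, 25, 50, 100], "epsilon": [0.001, 0.01, 0.05, 0.1]},
        cv=TimeSeriesSplit(n_splits=5),
        scoring="neg_mean_squared_error",
    )
    grid.fit(X_train, y_train)
    svr_mse = np.mean((grid.predict(X_test) - y_test) ** 2)

    # Baseline: a 5-day simple moving average forecasts the next value.
    sma_pred = X_test.mean(axis=1)
    sma_mse = np.mean((sma_pred - y_test) ** 2)

    print("best parameters:", grid.best_params_)
    print("SVR test MSE:", svr_mse, "| 5-day SMA test MSE:", sma_mse)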