38,635 research outputs found

    Bayesian methods for non-gaussian data modeling and applications

    Get PDF
    Finite mixture models are among the most useful machine learning techniques and are receiving considerable attention in various applications. The use of finite mixture models in image and signal processing has proved to be of considerable interest in terms of both theoretical development and in their usefulness in several applications. In most of the applications, the Gaussian density is used in the mixture modeling of data. Although a Gaussian mixture may provide a reasonable approximation to many real-world distributions, it is certainly not always the best approximation especially in image and signal processing applications where we often deal with non-Gaussian data. In this thesis, we propose two novel approaches that may be used in modeling non-Gaussian data. These approaches use two highly flexible distributions, the generalized Gaussian distribution (GGD) and the general Beta distribution, in order to model the data. We are motivated by the fact that these distributions are able to fit many distributional shapes and then can be considered as a useful class of flexible models to address several problems and applications involving measurements and features having well-known marked deviation from the Gaussian shape. For the mixture estimation and selection problem, researchers have demonstrated that Bayesian approaches are fully optimal. The Bayesian learning allows the incorporation of prior knowledge in a formal coherent way that avoids overfitting problems. For this reason, we adopt different Bayesian approaches in order to learn our models parameters. First, we present a fully Bayesian approach to analyze finite generalized Gaussian mixture models which incorporate several standard mixtures, such as Laplace and Gaussian. This approach evaluates the posterior distribution and Bayes estimators using a Gibbs sampling algorithm, and selects the number of components in the mixture using the integrated likelihood. We also propose a fully Bayesian approach for finite Beta mixtures learning using a Reversible Jump Markov Chain Monte Carlo (RJMCMC) technique which simultaneously allows cluster assignments, parameters estimation, and the selection of the optimal number of clusters. We then validate the proposed methods by applying them to different image processing applications

    Modeling and visualizing uncertainty in gene expression clusters using Dirichlet process mixtures

    Get PDF
    Although the use of clustering methods has rapidly become one of the standard computational approaches in the literature of microarray gene expression data, little attention has been paid to uncertainty in the results obtained. Dirichlet process mixture (DPM) models provide a nonparametric Bayesian alternative to the bootstrap approach to modeling uncertainty in gene expression clustering. Most previously published applications of Bayesian model-based clustering methods have been to short time series data. In this paper, we present a case study of the application of nonparametric Bayesian clustering methods to the clustering of high-dimensional nontime series gene expression data using full Gaussian covariances. We use the probability that two genes belong to the same cluster in a DPM model as a measure of the similarity of these gene expression profiles. Conversely, this probability can be used to define a dissimilarity measure, which, for the purposes of visualization, can be input to one of the standard linkage algorithms used for hierarchical clustering. Biologically plausible results are obtained from the Rosetta compendium of expression profiles which extend previously published cluster analyses of this data

    Sequential Gaussian Processes for Online Learning of Nonstationary Functions

    Full text link
    Many machine learning problems can be framed in the context of estimating functions, and often these are time-dependent functions that are estimated in real-time as observations arrive. Gaussian processes (GPs) are an attractive choice for modeling real-valued nonlinear functions due to their flexibility and uncertainty quantification. However, the typical GP regression model suffers from several drawbacks: i) Conventional GP inference scales O(N3)O(N^{3}) with respect to the number of observations; ii) updating a GP model sequentially is not trivial; and iii) covariance kernels often enforce stationarity constraints on the function, while GPs with non-stationary covariance kernels are often intractable to use in practice. To overcome these issues, we propose an online sequential Monte Carlo algorithm to fit mixtures of GPs that capture non-stationary behavior while allowing for fast, distributed inference. By formulating hyperparameter optimization as a multi-armed bandit problem, we accelerate mixing for real time inference. Our approach empirically improves performance over state-of-the-art methods for online GP estimation in the context of prediction for simulated non-stationary data and hospital time series data

    The variational Bayesian approach to fitting mixture models to circular wave direction data

    Get PDF
    The emerging variational Bayesian (VB) technique for approximate Bayesian statistical inference is a nonsimulation- based and time-efficient approach. It provides a useful, practical alternative to other Bayesian statistical approaches such as Markov chain Monte Carlo–based techniques, particularly for applications involving large datasets. This article reviews the increasingly popular VB statistical approach and illustrates how it can be used to fit Gaussian mixture models to circular wave direction data. This is done by taking the straightforward approach of padding the data; this method involves adding a repeat of a complete cycle of the data to the existing dataset to obtain a dataset on the real line. The padded dataset can then be analyzed using the standard VB technique. This results in a practical, efficient approach that is also appropriate for modeling other types of circular, or directional, data such as wind direction

    Clustering South African households based on their asset status using latent variable models

    Full text link
    The Agincourt Health and Demographic Surveillance System has since 2001 conducted a biannual household asset survey in order to quantify household socio-economic status (SES) in a rural population living in northeast South Africa. The survey contains binary, ordinal and nominal items. In the absence of income or expenditure data, the SES landscape in the study population is explored and described by clustering the households into homogeneous groups based on their asset status. A model-based approach to clustering the Agincourt households, based on latent variable models, is proposed. In the case of modeling binary or ordinal items, item response theory models are employed. For nominal survey items, a factor analysis model, similar in nature to a multinomial probit model, is used. Both model types have an underlying latent variable structure - this similarity is exploited and the models are combined to produce a hybrid model capable of handling mixed data types. Further, a mixture of the hybrid models is considered to provide clustering capabilities within the context of mixed binary, ordinal and nominal response data. The proposed model is termed a mixture of factor analyzers for mixed data (MFA-MD). The MFA-MD model is applied to the survey data to cluster the Agincourt households into homogeneous groups. The model is estimated within the Bayesian paradigm, using a Markov chain Monte Carlo algorithm. Intuitive groupings result, providing insight to the different socio-economic strata within the Agincourt region.Comment: Published in at http://dx.doi.org/10.1214/14-AOAS726 the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org
    • …
    corecore