Bayesian methods for non-Gaussian data modeling and applications
Finite mixture models are among the most useful machine learning techniques and are receiving considerable attention in various applications. Their use in image and signal processing has proved valuable, both for theoretical development and in several applications. In most applications, the Gaussian density is used for the mixture modeling of data. Although a Gaussian mixture may provide a reasonable approximation to many real-world distributions, it is certainly not always the best approximation, especially in image and signal processing applications, where we often deal with non-Gaussian data. In this thesis, we propose two novel approaches that may be used in modeling non-Gaussian data. These approaches use two highly flexible distributions, the generalized Gaussian distribution (GGD) and the general Beta distribution, to model the data. We are motivated by the fact that these distributions can fit many distributional shapes and can therefore be considered a useful class of flexible models for problems and applications involving measurements and features with well-known, marked deviations from the Gaussian shape. For the mixture estimation and selection problem, researchers have demonstrated that Bayesian approaches are fully optimal. Bayesian learning allows the incorporation of prior knowledge in a formal, coherent way that avoids overfitting problems. For this reason, we adopt different Bayesian approaches to learn our models' parameters. First, we present a fully Bayesian approach to analyzing finite generalized Gaussian mixture models, which subsume several standard mixtures, such as the Laplace and Gaussian. This approach evaluates the posterior distribution and Bayes estimators using a Gibbs sampling algorithm, and selects the number of components in the mixture using the integrated likelihood.
We also propose a fully Bayesian approach to learning finite Beta mixtures using a Reversible Jump Markov Chain Monte Carlo (RJMCMC) technique that simultaneously handles cluster assignment, parameter estimation, and selection of the optimal number of clusters. We then validate the proposed methods by applying them to different image processing applications.
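The GGD family the thesis builds on can be sketched numerically. The parameterization below is the standard one (shape beta, scale alpha), not code from the thesis; beta = 2 recovers the Gaussian and beta = 1 the Laplace, which is why the model subsumes those standard mixtures:

```python
import numpy as np
from scipy.special import gamma

def ggd_pdf(x, mu, alpha, beta):
    """Generalized Gaussian density with location mu, scale alpha > 0, shape beta > 0.

    beta = 2 gives a Gaussian (sigma = alpha / sqrt(2)); beta = 1 gives a Laplace.
    """
    coef = beta / (2.0 * alpha * gamma(1.0 / beta))
    return coef * np.exp(-((np.abs(x - mu) / alpha) ** beta))

def ggd_mixture_pdf(x, weights, mus, alphas, betas):
    # Weighted sum of GGD components; weights must sum to 1
    return sum(w * ggd_pdf(x, m, a, b)
               for w, m, a, b in zip(weights, mus, alphas, betas))
```

Smaller beta produces heavier tails and a sharper peak than the Gaussian, which is the flexibility exploited for non-Gaussian image and signal data.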
Modeling and visualizing uncertainty in gene expression clusters using Dirichlet process mixtures
Although clustering methods have rapidly become one of the standard computational approaches in the microarray gene expression literature, little attention has been paid to uncertainty in the results obtained. Dirichlet process mixture (DPM) models provide a nonparametric Bayesian alternative to the bootstrap approach for modeling uncertainty in gene expression clustering. Most previously published applications of Bayesian model-based clustering methods have been to short time-series data. In this paper, we present a case study applying nonparametric Bayesian clustering methods, with full Gaussian covariances, to high-dimensional, non-time-series gene expression data. We use the probability that two genes belong to the same cluster in a DPM model as a measure of the similarity of their expression profiles. This probability can, in turn, be used to define a dissimilarity measure, which, for the purposes of visualization, can be input to one of the standard linkage algorithms used for hierarchical clustering. Biologically plausible results are obtained from the Rosetta compendium of expression profiles, extending previously published cluster analyses of these data.
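The similarity-to-linkage pipeline described above can be sketched generically: given posterior samples of cluster labels (here a tiny hypothetical array, not the paper's Rosetta output), the pairwise co-clustering frequency becomes a similarity, its complement a dissimilarity, and SciPy's standard linkage consumes it:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def posterior_similarity(label_samples):
    """Fraction of posterior samples in which each pair of items shares a cluster.

    label_samples: (n_samples, n_items) integer array of cluster assignments.
    """
    n_samples, n_items = label_samples.shape
    sim = np.zeros((n_items, n_items))
    for labels in label_samples:
        sim += labels[:, None] == labels[None, :]
    return sim / n_samples

# hypothetical MCMC draws of cluster labels for 4 genes over 3 sweeps
draws = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 2],
                  [1, 1, 0, 0]])

sim = posterior_similarity(draws)
dissim = 1.0 - sim                 # dissimilarity for visualization
np.fill_diagonal(dissim, 0.0)
Z = linkage(squareform(dissim), method="average")  # standard linkage algorithm
```

The resulting dendrogram then displays clustering uncertainty directly: pairs that co-cluster in every sample merge at height 0, ambiguous pairs merge later.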
Sequential Gaussian Processes for Online Learning of Nonstationary Functions
Many machine learning problems can be framed in the context of estimating
functions, and often these are time-dependent functions that are estimated in
real-time as observations arrive. Gaussian processes (GPs) are an attractive
choice for modeling real-valued nonlinear functions due to their flexibility
and uncertainty quantification. However, the typical GP regression model
suffers from several drawbacks: i) conventional GP inference scales cubically
with the number of observations; ii) updating a GP model
sequentially is not trivial; and iii) covariance kernels often enforce
stationarity constraints on the function, while GPs with non-stationary
covariance kernels are often intractable to use in practice. To overcome these
issues, we propose an online sequential Monte Carlo algorithm to fit mixtures
of GPs that capture non-stationary behavior while allowing for fast,
distributed inference. By formulating hyperparameter optimization as a
multi-armed bandit problem, we accelerate mixing for real time inference. Our
approach empirically improves performance over state-of-the-art methods for
online GP estimation in the context of prediction for simulated non-stationary
data and hospital time series data.
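Drawbacks (i) and (ii) can be seen in a naive baseline (this is not the paper's SMC mixture-of-GPs algorithm, just standard exact GP regression with an assumed RBF kernel, refit from scratch as each point streams in):

```python
import numpy as np

def rbf(a, b, lengthscale=1.0, variance=1.0):
    # Squared-exponential (RBF) kernel between two 1-D input arrays
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise_var=1e-2):
    """Exact GP posterior mean/variance; the Cholesky factorization is O(n^3)."""
    K = rbf(x_train, x_train) + noise_var * np.eye(len(x_train))
    Ks = rbf(x_train, x_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(rbf(x_test, x_test) - v.T @ v)
    return mean, var

# stream observations one at a time, naively refitting at every step --
# the cubic cost per refit is what sequential/online methods avoid
xs, ys = [], []
for t in range(20):
    x_new = t / 5.0
    xs.append(x_new)
    ys.append(np.sin(x_new))
    mean, var = gp_posterior(np.array(xs), np.array(ys), np.array([2.0]))
```

A single stationary kernel like this one also cannot adapt its lengthscale over time, which is the limitation (iii) the mixture-of-GPs construction targets.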
The variational Bayesian approach to fitting mixture models to circular wave direction data
The emerging variational Bayesian (VB) technique for approximate Bayesian statistical inference is a non-simulation-based, time-efficient approach. It provides a useful, practical alternative to other Bayesian approaches such as Markov chain Monte Carlo-based techniques, particularly for applications involving large datasets. This article reviews the increasingly popular VB approach and illustrates how it can be used to fit Gaussian mixture models to circular wave direction data. This is done by the straightforward approach of padding the data: a repeat of one complete cycle of the data is appended to the existing dataset to obtain a dataset on the real line. The padded dataset can then be analyzed using the standard VB technique. The result is a practical, efficient approach that is also appropriate for modeling other types of circular, or directional, data such as wind direction.
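The padding step is simple enough to sketch. Below, synthetic wave directions (not the article's data) straddling the 0/360 wrap-around are padded with one full-cycle repeat, and scikit-learn's variational `BayesianGaussianMixture` stands in for the article's VB fit:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

# hypothetical wave directions in degrees, concentrated near the 0/360 boundary
theta = rng.normal(350.0, 10.0, size=300) % 360.0

# pad with a repeat of one complete cycle so mass near the boundary
# becomes a single contiguous cluster on the real line
padded = np.concatenate([theta, theta + 360.0]).reshape(-1, 1)

# standard VB Gaussian mixture fit on the padded, real-line data
vb = BayesianGaussianMixture(n_components=4, max_iter=500, random_state=0)
vb.fit(padded)
```

Components recovered in the padded range can then be mapped back to the circle modulo 360; without the padding, a single directional mode at the boundary would be split into two spurious Gaussian components.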
Clustering South African households based on their asset status using latent variable models
The Agincourt Health and Demographic Surveillance System has since 2001
conducted a biannual household asset survey in order to quantify household
socio-economic status (SES) in a rural population living in northeast South
Africa. The survey contains binary, ordinal and nominal items. In the absence
of income or expenditure data, the SES landscape in the study population is
explored and described by clustering the households into homogeneous groups
based on their asset status. A model-based approach to clustering the Agincourt
households, based on latent variable models, is proposed. In the case of
modeling binary or ordinal items, item response theory models are employed. For
nominal survey items, a factor analysis model, similar in nature to a
multinomial probit model, is used. Both model types have an underlying latent
variable structure - this similarity is exploited and the models are combined
to produce a hybrid model capable of handling mixed data types. Further, a
mixture of the hybrid models is considered to provide clustering capabilities
within the context of mixed binary, ordinal and nominal response data. The
proposed model is termed a mixture of factor analyzers for mixed data (MFA-MD).
The MFA-MD model is applied to the survey data to cluster the Agincourt
households into homogeneous groups. The model is estimated within the Bayesian
paradigm, using a Markov chain Monte Carlo algorithm. Intuitive groupings
result, providing insight into the different socio-economic strata within the
Agincourt region.

Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/14-AOAS726.
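The shared underlying-latent-variable structure that lets the IRT and factor analysis parts combine can be illustrated generatively for the binary case. Everything here is hypothetical (dimensions, loadings, intercepts are invented, not the fitted Agincourt values): a latent factor score drives a continuous underlying response, which is thresholded into the observed asset indicator, probit-style:

```python
import numpy as np

rng = np.random.default_rng(1)
n_households, n_items, n_factors = 500, 6, 2

# hypothetical item loadings and intercepts
loadings = rng.normal(0.0, 1.0, size=(n_items, n_factors))
intercepts = rng.normal(0.0, 0.5, size=n_items)

# latent household-level factor scores (the shared latent variable)
z = rng.normal(size=(n_households, n_factors))

# underlying continuous responses; thresholding at 0 yields binary asset
# indicators, the probit-style link used by IRT-type models
ystar = z @ loadings.T + intercepts + rng.normal(size=(n_households, n_items))
y = (ystar > 0).astype(int)
```

Ordinal items extend this with multiple thresholds on the same underlying response, and nominal items use one underlying response per category; because all three share the latent z, a single mixture over z delivers the clustering.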