Finite Bivariate and Multivariate Beta Mixture Models Learning and Applications
Finite mixture models have proven to provide flexibility for data clustering, with a strong capacity to capture hidden structure in data. Modern technological progress, together with the growing volume and variety of generated data and ever more powerful computers, is producing data at very large scale. This fact heightens the need for reliable, adaptable models that can analyze bigger and more complex data to identify latent patterns, deliver faster and more accurate results, and support decisions with minimal human interaction.
Choosing an accurate distribution that appropriately represents the mixture components is critical. The most widely adopted generative model has been the Gaussian mixture. In numerous real-world applications, however, this modelling fails when the nature and structure of the data are non-Gaussian. Another crucial issue when using mixtures is determining the model complexity, i.e., the number of mixture components. Minimum message length (MML) is one of the main information-theoretic techniques for tackling this challenging issue.
In this work, we have designed and implemented a finite mixture model, using the bivariate and multivariate Beta
distributions for cluster analysis and demonstrated its flexibility in describing the intrinsic characteristics of the observed data.
In addition, we have applied our estimation and model selection algorithms to synthetic and real datasets. Most importantly, we considered interesting applications such as image segmentation, software module defect prediction, spam detection, and occupancy estimation in smart buildings.
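As a rough illustration of the approach, a univariate Beta mixture can be fitted with EM; since the exact M-step for the Beta shapes has no closed form, the sketch below uses a moment-matching surrogate. All data and parameter values are hypothetical, not taken from the thesis:

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
# Synthetic 1-D data from two Beta components (hypothetical parameters).
x = np.concatenate([beta.rvs(2, 8, size=300, random_state=rng),
                    beta.rvs(8, 2, size=300, random_state=rng)])

K = 2
w = np.full(K, 1 / K)          # mixing weights
a = np.array([1.0, 5.0])       # Beta shape parameters per component
b = np.array([5.0, 1.0])

for _ in range(100):
    # E-step: responsibilities of each component for each point.
    dens = np.stack([w[k] * beta.pdf(x, a[k], b[k]) for k in range(K)])
    r = dens / dens.sum(axis=0)
    # M-step: update weights; match weighted mean/variance to Beta moments
    # (a closed-form surrogate for the exact maximum-likelihood update).
    w = r.mean(axis=1)
    for k in range(K):
        m = np.average(x, weights=r[k])
        v = np.average((x - m) ** 2, weights=r[k])
        common = m * (1 - m) / v - 1
        a[k], b[k] = m * common, (1 - m) * common

print(sorted(np.round(w, 2)))  # mixing proportions near [0.5, 0.5]
```

The same E/M alternation carries over to the bivariate and multivariate Beta mixtures in the thesis, with correspondingly more involved shape updates.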
Robust mixtures in the presence of measurement errors
We develop a mixture-based approach to robust density modeling and outlier
detection for experimental multivariate data that includes measurement error
information. Our model is designed to infer atypical measurements that are not
due to errors, aiming to retrieve potentially interesting peculiar objects.
Since exact inference is not possible in this model, we develop a
tree-structured variational EM solution. This compares favorably against a
fully factorial approximation scheme, approaching the accuracy of a
Markov-Chain-EM, while maintaining computational simplicity. We demonstrate the
benefits of including measurement errors in the model, in terms of improved
outlier detection rates in varying measurement uncertainty conditions. We then
use this approach in detecting peculiar quasars from an astrophysical survey,
given photometric measurements with errors.

Comment: (Refereed) Proceedings of the 24th Annual International Conference on Machine Learning 2007 (ICML07), (Ed.) Z. Ghahramani. June 20-24, 2007, Oregon State University, Corvallis, OR, USA, pp. 847-854; Omnipress. ISBN 978-1-59593-793-3; 8 pages, 6 figures.
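The error-convolution idea underlying this kind of model can be sketched in one dimension: convolving a Gaussian component with each point's known error variance gives a per-point density N(x_i | mu, sigma^2 + s_i^2), under which points that the error bars cannot explain stand out. The setup below is an illustrative toy, not the paper's tree-structured variational EM:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
# Latent "true" values from one population component, observed with
# heteroscedastic, per-point measurement error of known magnitude.
true_x = rng.normal(0.0, 1.0, size=500)
err_sd = rng.uniform(0.1, 0.5, size=500)     # known per-point error s_i
obs = true_x + rng.normal(0.0, err_sd)       # observed = true + noise
obs[:5] += 10.0                              # inject genuinely atypical points

mu, sigma = 0.0, 1.0                         # population parameters (assumed fit)
# Error-convolved density: N(x_i | mu, sigma^2 + s_i^2). Low values flag
# measurements that are atypical even after accounting for their errors.
logp = norm.logpdf(obs, mu, np.sqrt(sigma**2 + err_sd**2))
outliers = np.argsort(logp)[:5]
print(sorted(outliers.tolist()))  # → [0, 1, 2, 3, 4]
```

Ranking by this error-aware density, rather than by raw distance from the mean, is what lets noisy-but-ordinary measurements survive while true anomalies are flagged.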
High-dimensional Sparse Count Data Clustering Using Finite Mixture Models
Due to the massive amount of available digital data, automating its analysis and modeling for
different purposes and applications has become an urgent need. One of the most challenging tasks
in machine learning is clustering, which is defined as the process of assigning observations sharing
similar characteristics to subgroups. Such a task is significant, especially in implementing complex
algorithms to deal with high-dimensional data. Thus, the advancement of computational power in
statistical-based approaches is increasingly becoming an interesting and attractive research domain.
Among the successful methods, mixture models have been widely acknowledged and successfully
applied in numerous fields as they have been providing a convenient yet flexible formal setting for
unsupervised and semi-supervised learning. An essential problem with these approaches is to develop
a probabilistic model that represents the data well by taking into account its nature. Count
data are widely used in machine learning and computer vision applications where an object, e.g.,
a text document or an image, can be represented by a vector corresponding to the appearance frequencies
of words or visual words, respectively. Thus, they usually suffer from the well-known curse of dimensionality, as objects are represented with high-dimensional and sparse vectors, i.e., a few thousand dimensions with a sparsity of 95 to 99%, which dramatically degrades the performance of clustering algorithms. Moreover, count data systematically exhibit the burstiness and overdispersion phenomena, neither of which can be handled by the generic multinomial distribution typically used to model count data, because it assumes that successive draws are independent.
This thesis is constructed around six related manuscripts, in which we propose several approaches
for high-dimensional sparse count data clustering via various mixture models based on hierarchical Bayesian modeling frameworks that have the ability to model the dependency of repetitive
word occurrences. In such frameworks, a suitable distribution is used to introduce the prior
information into the construction of the statistical model, based on a conjugate distribution to the
multinomial, e.g., the Dirichlet, generalized Dirichlet, and the Beta-Liouville, which has numerous
computational advantages. Thus, we proposed a novel model that we call the Multinomial Scaled Dirichlet (MSD), based on using the scaled Dirichlet as a prior to the multinomial to allow more modeling flexibility. Although these frameworks can model burstiness and overdispersion well, they share similar disadvantages that make their estimation procedures very inefficient when the collection size is large. To handle high dimensionality, we considered two approaches. First,
we derived close approximations to the distributions in a hierarchical structure to bring them to
the exponential-family form aiming to combine the flexibility and efficiency of these models with
the desirable statistical and computational properties of the exponential family of distributions, including sufficiency, which reduce the complexity and computational effort, especially for sparse
and high-dimensional data. Second, we proposed a model-based unsupervised feature selection approach
for count data to overcome several issues that may be caused by the high dimensionality of
the feature space, such as over-fitting, low efficiency, and poor performance.
Furthermore, we handled two significant aspects of mixture based clustering methods, namely,
parameters estimation and performing model selection. We considered the Expectation-Maximization
(EM) algorithm, which is a broadly applicable iterative algorithm for estimating the mixture model
parameters, incorporating several techniques to avoid its initialization dependency and poor
local maxima. For model selection, we investigated different approaches to find the optimal number
of components based on the Minimum Message Length (MML) philosophy. The effectiveness of
our approaches is evaluated using challenging real-life applications, such as sentiment analysis, hate
speech detection on Twitter, topic novelty detection, human interaction recognition in films and TV
shows, facial expression recognition, face identification, and age estimation.
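The building block these frameworks extend, a mixture of multinomials fitted by EM on sparse count vectors, can be sketched as follows. The synthetic "documents" and topic parameters are hypothetical; working in the log domain keeps the E-step stable in high dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
V, K, N = 50, 2, 400                       # vocabulary, clusters, documents
# Two hypothetical topics concentrating mass on disjoint vocabulary halves.
theta = np.full((K, V), 0.2 / V)
theta[0, :V // 2] += 0.8 / (V // 2)
theta[1, V // 2:] += 0.8 / (V // 2)
z = rng.integers(K, size=N)
X = np.stack([rng.multinomial(30, theta[k]) for k in z])  # sparse counts

# EM for a mixture of multinomials.
w = np.full(K, 1 / K)
phi = rng.dirichlet(np.ones(V), size=K)
for _ in range(50):
    logr = np.log(w) + X @ np.log(phi).T       # E-step, up to a constant
    logr -= logr.max(axis=1, keepdims=True)    # log-domain stabilization
    r = np.exp(logr)
    r /= r.sum(axis=1, keepdims=True)
    w = r.mean(axis=0)                         # M-step: weights
    phi = r.T @ X + 1e-8                       # smoothed word probabilities
    phi /= phi.sum(axis=1, keepdims=True)

pred = r.argmax(axis=1)
acc = max((pred == z).mean(), (pred != z).mean())  # label-switching safe
print(round(float(acc), 2))
```

Placing a Dirichlet-family prior over phi, as the thesis does, replaces the point estimate in the M-step and is what allows repeated word occurrences to reinforce each other.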
Unsupervised Learning with Feature Selection Based on Multivariate McDonald’s Beta Mixture Model for Medical Data Analysis
This thesis proposes innovative clustering approaches using finite and infinite mixture models for medical data analysis and human activity recognition.
These models leverage the flexibility of a novel distribution, the multivariate McDonald’s Beta distribution, offering superior capability to model data of varying shapes. We introduce a finite McDonald’s Beta Mixture Model (McDBMM), demonstrating its superior performance in handling bounded and asymmetric data distributions compared to traditional Gaussian mixture models.
Further, we employ deterministic learning methods such as maximum likelihood via the expectation maximization approach and also a Bayesian framework, in which we integrate feature selection. This integration enhances the efficiency and accuracy of our models, offering a compelling solution for real-world applications where manual annotation of large data volumes is not feasible.
To address the prevalent challenge in clustering regarding the determination of mixture components number, we extend our finite mixture model to an infinite model. By adopting a nonparametric Bayesian technique, we can effectively capture the underlying data distribution with an unknown number of mixture components.
Across all stages, our models are evaluated on various medical applications, consistently demonstrating superior performance over traditional alternatives.
The results of this research underline the potential of the McDonald's Beta distribution and the proposed mixture models in transforming medical data into actionable knowledge, aiding clinicians in making more precise decisions and improving the health care industry.
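For reference, the univariate McDonald's Beta density (the generalized Beta of the first kind) illustrates the extra flexibility the distribution offers over the standard Beta; the multivariate construction used in the thesis is not reproduced in this sketch. The shape values below are arbitrary:

```python
from scipy.integrate import quad
from scipy.special import beta as beta_fn

def mcdonald_beta_pdf(y, a, p, q):
    """Univariate McDonald's (generalized) Beta density on (0, 1):
    f(y) = a * y**(a*p - 1) * (1 - y**a)**(q - 1) / B(p, q).
    With a = 1 it reduces to the standard Beta(p, q)."""
    return a * y ** (a * p - 1) * (1 - y ** a) ** (q - 1) / beta_fn(p, q)

# The extra shape parameter a bends the density into asymmetric, bounded
# shapes a standard Beta cannot take, yet it integrates to one for any
# a, p, q > 0 (numerical check below).
masses = [quad(mcdonald_beta_pdf, 0, 1, args=shapes)[0]
          for shapes in [(1.0, 2.0, 3.0), (2.5, 1.5, 4.0), (0.7, 3.0, 2.0)]]
print([round(m, 6) for m in masses])  # → [1.0, 1.0, 1.0]
```

The added parameter a is precisely what gives the mixture components their "superior capability to model data of varying shapes" on bounded supports.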
Mixture-Based Clustering for High-Dimensional Count Data Using Minorization-Maximization Approaches
The multinomial distribution has been widely used to model count data. To increase clustering efficiency, we use an approximation of Fisher scoring as a learning algorithm, which is more robust to the choice of initial parameter values. Moreover,
we consider the generalization of the multinomial model obtained by introducing the
Dirichlet as prior, which is called the Dirichlet Compound Multinomial (DCM). Even
though DCM can address the burstiness phenomenon of count data, the presence of
Gamma function in its density function usually leads to undesired complications. In
this thesis, we use two alternative representations of DCM distribution to perform
clustering based on finite mixture models, where the mixture parameters are estimated using a minorization-maximization algorithm. Moreover, we propose an online learning technique for unsupervised clustering based on a mixture of Neerchal-Morel distributions. While the novel mixture model is able to capture overdispersion thanks to a weight parameter assigned to each feature in each cluster, online learning overcomes the drawbacks of batch learning by updating the mixture's parameters instantly for new data instances. Finally, by implementing
a minimum message length model selection criterion, the weights of irrelevant mixture components are driven towards zero, which removes the need to know the number of clusters beforehand. To evaluate and compare the performance of our
proposed models, we have considered five challenging real-world applications that
involve high-dimensional count vectors, namely, sentiment analysis, topic detection,
facial expression recognition, human action recognition and medical diagnosis. The
results show that the proposed algorithms increase the clustering efficiency remarkably as compared to other benchmarks, and the best results are achieved by the
models able to accommodate over-dispersed count data.
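For concreteness, the Dirichlet Compound Multinomial density mentioned above can be evaluated directly via log-Gamma functions; a small sanity check confirms the pmf normalizes. The vocabulary size and alpha values below are arbitrary:

```python
import numpy as np
from scipy.special import gammaln

def dcm_logpmf(x, alpha):
    """Log-pmf of the Dirichlet Compound Multinomial (DCM): a multinomial
    whose probability vector is integrated out under a Dirichlet(alpha)
    prior, which lets repeated (bursty) word occurrences reinforce
    each other."""
    x, alpha = np.asarray(x, float), np.asarray(alpha, float)
    n, A = x.sum(), alpha.sum()
    return (gammaln(n + 1) - gammaln(x + 1).sum()     # multinomial coefficient
            + gammaln(A) - gammaln(n + A)             # Dirichlet normalizers
            + gammaln(x + alpha).sum() - gammaln(alpha).sum())

# Sanity check: for a 2-word vocabulary, the pmf over all ways of
# splitting n = 10 draws must sum to one.
alpha = np.array([0.5, 1.5])
total = sum(np.exp(dcm_logpmf([k, 10 - k], alpha)) for k in range(11))
print(round(float(total), 6))  # → 1.0
```

The ratio-of-Gamma structure visible here is the source of the "undesired complications" the thesis works around with its two alternative DCM representations.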
Unsupervised Selection and Estimation of Non-Gaussian Mixtures for High Dimensional Data Analysis
Lately, the enormous generation of databases in almost every aspect of life has created a great demand for new, powerful tools for turning data into useful information. Therefore, researchers have been encouraged to explore and develop new machine learning ideas and methods. Mixture models are one of the machine learning techniques receiving considerable attention due to their ability to handle multidimensional data efficiently and effectively. Generally, four critical issues have to be addressed when adopting mixture models in high dimensional spaces: (1) choice of the probability density functions, (2) estimation of the mixture parameters, (3) automatic determination of the number of components M in the mixture, and (4) determination of which features best discriminate among the different components. The main goal of this thesis is to address all these challenging
interrelated problems in one unified model.
In most of the applications, the Gaussian density is used in mixture modeling of data. Although a Gaussian mixture may provide a reasonable approximation to many real-world distributions, it is certainly not always the best approximation especially in computer vision and image processing applications where we often deal with non-Gaussian data. Therefore, we propose to use three
highly flexible distributions: the generalized Gaussian distribution (GGD), the asymmetric Gaussian distribution (AGD), and the asymmetric generalized Gaussian distribution (AGGD). We are motivated by the fact that these distributions are able to fit many distributional shapes and then can be considered as a useful class of flexible models to address several problems and applications involving measurements and features having well-known marked deviation from the Gaussian shape.
Recently, research has shown that model selection and parameter learning are highly dependent and should be performed simultaneously. For this purpose, many approaches have been suggested. The vast majority of these approaches can be classified, from a computational point of view, into two classes: deterministic and stochastic methods. Deterministic methods estimate
the model parameters for a set of candidate models using the Expectation-Maximization (EM) framework, then choose the model that maximizes a model selection criterion. Stochastic methods such as Markov chain Monte Carlo (MCMC) can be used in order to sample from the full a posteriori distribution with M considered unknown. Hence, in this thesis, we propose three learning techniques capable of automatically determining model complexity while learning its parameters. First, we incorporate a Minimum Message Length (MML) penalty in the model learning step performed using the EM algorithm. Our second approach employs the Rival Penalized EM (RPEM)
algorithm which is able to select an appropriate number of densities by fading out the redundant densities from a density mixture. Last but not least, we incorporate the nonparametric aspect of mixture models by assuming a countably infinite number of components and using Markov Chain Monte Carlo (MCMC) simulations for the estimation of the posterior distributions. Hence, the difficulty of choosing the appropriate number of clusters is sidestepped by assuming that there are an infinite number of mixture components.
Another essential issue in the case of statistical modeling in general and finite mixtures in particular is feature selection (i.e. identification of the relevant or discriminative features describing the data) especially in the case of high-dimensional data. Indeed, feature selection has been shown to be a crucial step in several image processing, computer vision and pattern recognition
applications, not only because it speeds up learning but also because it improves model accuracy and generalization. Moreover, the learning of the mixture parameters (i.e., both model selection and parameter estimation) is greatly affected by the quality of the features used. Hence, in this thesis, we address the feature selection problem in unsupervised learning by casting it as an estimation problem, thus avoiding any combinatorial search. Finally, the effectiveness of our approaches is evaluated by applying them to different computer vision and image processing
applications.
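The deterministic recipe described above (fit each candidate model with EM, then apply a model selection criterion) can be sketched with an off-the-shelf Gaussian mixture. BIC stands in here for the MML penalty used in the thesis, and the synthetic data are illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Synthetic data with three well-separated Gaussian clusters.
X = np.vstack([rng.normal(c, 0.4, size=(150, 2))
               for c in ([0, 0], [4, 0], [0, 4])])

# Fit each candidate number of components M with EM, then keep the model
# minimising an information criterion (BIC as a stand-in for MML).
scores = {M: GaussianMixture(n_components=M, n_init=3,
                             random_state=0).fit(X).bic(X)
          for M in range(1, 7)}
best_M = min(scores, key=scores.get)
print(best_M)  # → 3
```

The stochastic alternative discussed in the abstract replaces this candidate sweep with MCMC sampling in which M itself is treated as unknown.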