668 research outputs found
Finite Bivariate and Multivariate Beta Mixture Models Learning and Applications
Finite mixture models provide a flexible framework for data clustering and have demonstrated a strong ability to capture hidden structure in data. Modern technological progress, the growing volume and variety of generated data, increasingly powerful computers and other related factors all contribute to the production of large-scale data. This fact heightens the need for reliable and adaptable models that can analyze larger, more complex data to identify latent patterns, deliver faster and more accurate results, and make decisions with minimal human interaction.
Adopting an accurate distribution that appropriately represents the mixture components is critical. The most widely adopted generative model has been the Gaussian mixture. In numerous real-world applications, however, this modelling fails when the nature and structure of the data are non-Gaussian. Another crucial issue when using mixtures is determining the model complexity, i.e., the number of mixture components. Minimum message length (MML) is one of the main frequentist techniques for tackling this challenging issue.
In this work, we have designed and implemented a finite mixture model, using the bivariate and multivariate Beta
distributions for cluster analysis and demonstrated its flexibility in describing the intrinsic characteristics of the observed data.
In addition, we have applied our estimation and model selection algorithms to synthetic and real datasets. Most importantly, we considered interesting applications such as image segmentation, software module defect prediction, spam detection and occupancy estimation in smart buildings
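The thesis targets bivariate and multivariate Beta mixtures with MML-driven model selection; as a rough illustration of the underlying EM machinery only, the sketch below fits a univariate Beta mixture with a method-of-moments M-step (a simplification, not the authors' estimator; all data and parameter values are illustrative):

```python
import math
import numpy as np

def beta_logpdf(x, a, b):
    """Log-density of Beta(a, b), evaluated elementwise on x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * np.log(x) + (b - 1) * np.log(1 - x)

def fit_beta_mixture(x, k=2, iters=200):
    """EM for a k-component univariate Beta mixture.

    The M-step uses a weighted method-of-moments update for (a, b),
    a common simplification instead of a full numerical MLE.
    """
    x = np.clip(np.asarray(x, float), 1e-12, 1 - 1e-12)
    # Deterministic init: hard-assign points to quantile bins.
    edges = np.quantile(x, np.linspace(0, 1, k + 1))
    labels = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, k - 1)
    resp = np.eye(k)[labels]
    a = np.ones(k); b = np.ones(k); pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # M-step: mixing weights and moment-matched shape parameters.
        nk = resp.sum(axis=0) + 1e-12
        pi = nk / len(x)
        for j in range(k):
            w = resp[:, j] / nk[j]
            m = np.sum(w * x)                     # weighted mean
            v = np.sum(w * (x - m) ** 2) + 1e-10  # weighted variance
            common = m * (1 - m) / v - 1
            a[j] = max(m * common, 1e-3)
            b[j] = max((1 - m) * common, 1e-3)
        # E-step: posterior responsibilities via log-sum-exp.
        log_p = np.stack([np.log(pi[j]) + beta_logpdf(x, a[j], b[j])
                          for j in range(k)], axis=1)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
    return pi, a, b

# Two well-separated Beta components on (0, 1).
rng = np.random.default_rng(1)
data = np.concatenate([rng.beta(2, 8, 500), rng.beta(9, 2, 500)])
pi, a, b = fit_beta_mixture(data, k=2)
means = a / (a + b)   # component means should land near 0.2 and 0.82
```

The same E/M alternation carries over to the bivariate and multivariate Beta cases; only the component density and its parameter updates change.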
High-dimensional Sparse Count Data Clustering Using Finite Mixture Models
Due to the massive amount of available digital data, automating its analysis and modeling for
different purposes and applications has become an urgent need. One of the most challenging tasks
in machine learning is clustering, which is defined as the process of assigning observations sharing
similar characteristics to subgroups. Such a task is significant, especially in implementing complex
algorithms to deal with high-dimensional data. Thus, the advancement of computational power in
statistical-based approaches is increasingly becoming an interesting and attractive research domain.
Among the successful methods, mixture models have been widely acknowledged and successfully
applied in numerous fields, as they provide a convenient yet flexible formal setting for
unsupervised and semi-supervised learning. An essential problem with these approaches is to develop
a probabilistic model that represents the data well by taking into account its nature. Count
data are widely used in machine learning and computer vision applications where an object, e.g.,
a text document or an image, can be represented by a vector corresponding to the appearance frequencies
of words or visual words, respectively. Thus, they usually suffer from the well-known curse of dimensionality, as objects are represented by high-dimensional and sparse vectors, i.e., a few thousand dimensions with a sparsity of 95 to 99%, which dramatically degrades the performance of clustering algorithms. Moreover, count data systematically exhibit the burstiness and overdispersion phenomena, neither of which can be handled by the generic multinomial distribution typically used to model count data, because it assumes word occurrences are independent.
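The burstiness point can be made concrete: with the mean word probabilities held fixed, a Dirichlet-compound multinomial (one of the hierarchical constructions discussed in this line of work) produces far more variable counts than a plain multinomial. The vocabulary, document length and concentration value below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, doc_len = 5000, 100
p = np.array([0.2, 0.3, 0.5])          # mean word probabilities

# Plain multinomial: every word drawn independently with fixed p.
mult_counts = rng.multinomial(doc_len, p, size=n_docs)

# Dirichlet-compound multinomial: each document first draws its own
# theta ~ Dirichlet(s * p); a small concentration s yields burstiness
# (a word seen once tends to reappear in the same document).
s = 2.0
thetas = rng.dirichlet(s * p, size=n_docs)
dcm_counts = np.array([rng.multinomial(doc_len, t) for t in thetas])

# Same mean count for word 0, but far larger variance under the DCM.
var_mult = mult_counts[:, 0].var()   # theory: n*p*(1-p) = 16
var_dcm = dcm_counts[:, 0].var()     # theory: 16 * (n+s)/(1+s) = 544
```

The overdispersion factor (n+s)/(1+s) is exactly what a single multinomial, with its fixed per-word probabilities, cannot reproduce.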
This thesis is constructed around six related manuscripts, in which we propose several approaches
for high-dimensional sparse count data clustering via various mixture models based on hierarchical Bayesian modeling frameworks that have the ability to model the dependency of repetitive
word occurrences. In such frameworks, a suitable distribution is used to introduce the prior
information into the construction of the statistical model, based on a conjugate distribution to the
multinomial, e.g. the Dirichlet, generalized Dirichlet, and the Beta-Liouville, which has numerous
computational advantages. Thus, we proposed a novel model, which we call the Multinomial Scaled Dirichlet (MSD), based on using the scaled Dirichlet as a prior to the multinomial to allow more modeling flexibility. Although these frameworks can model burstiness and overdispersion well, they share similar disadvantages that make their estimation procedures very inefficient when the collection size is large. To handle high dimensionality, we considered two approaches. First,
we derived close approximations to the distributions in a hierarchical structure to bring them to
the exponential-family form aiming to combine the flexibility and efficiency of these models with
the desirable statistical and computational properties of the exponential family of distributions, including sufficiency, which reduces the complexity and computational effort, especially for sparse
and high-dimensional data. Second, we proposed a model-based unsupervised feature selection approach
for count data to overcome several issues that may be caused by the high dimensionality of
the feature space, such as over-fitting, low efficiency, and poor performance.
Furthermore, we handled two significant aspects of mixture-based clustering methods, namely, parameter estimation and model selection. We considered the Expectation-Maximization (EM) algorithm, a broadly applicable iterative algorithm for estimating mixture model parameters, and incorporated several techniques to avoid its initialization dependency and convergence to poor local maxima. For model selection, we investigated different approaches to find the optimal number
of components based on the Minimum Message Length (MML) philosophy. The effectiveness of
our approaches is evaluated using challenging real-life applications, such as sentiment analysis, hate
speech detection on Twitter, topic novelty detection, human interaction recognition in films and TV
shows, facial expression recognition, face identification, and age estimation
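The EM-plus-model-selection loop described above can be sketched compactly. The stand-in below uses a 1-D Gaussian mixture and the BIC penalty for readability; the thesis instead derives MML-based criteria for its count-data models, and the data and settings here are hypothetical:

```python
import numpy as np

def em_gmm_1d(x, k, iters=150):
    """Basic EM for a 1-D Gaussian mixture; returns the final log-likelihood."""
    # Deterministic init: means at quantiles, shared variance.
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)
    var = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: per-point responsibilities under current parameters.
        log_p = (np.log(pi) - 0.5 * np.log(2 * np.pi * var)
                 - (x[:, None] - mu) ** 2 / (2 * var))
        m = log_p.max(axis=1, keepdims=True)
        ll = m.squeeze() + np.log(np.exp(log_p - m).sum(axis=1))
        resp = np.exp(log_p - ll[:, None])
        # M-step: weighted parameter updates.
        nk = resp.sum(axis=0) + 1e-12
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = np.maximum((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk,
                         1e-3)
    return ll.sum()

def select_k(x, k_max=5):
    """Pick the number of components by penalized likelihood (BIC here;
    an MML criterion would replace the penalty term)."""
    best_k, best_score = 1, np.inf
    for k in range(1, k_max + 1):
        ll = em_gmm_1d(x, k)
        n_params = 3 * k - 1                  # means, variances, free weights
        bic = -2 * ll + n_params * np.log(len(x))
        if bic < best_score:
            best_k, best_score = k, bic
    return best_k

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-6, 1, 200),
                       rng.normal(0, 1, 200),
                       rng.normal(6, 1, 200)])
best_k = select_k(data)   # should recover the three generating components
```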
Unsupervised Learning with Feature Selection Based on Multivariate McDonald’s Beta Mixture Model for Medical Data Analysis
This thesis proposes innovative clustering approaches using finite and infinite mixture models for medical data analysis and human activity recognition.
These models leverage the flexibility of a novel distribution, the multivariate McDonald’s Beta distribution, offering superior capability to model data of varying shapes. We introduce a finite McDonald’s Beta Mixture Model (McDBMM), demonstrating its superior performance in handling bounded and asymmetric data distributions compared to traditional Gaussian mixture models.
Further, we employ deterministic learning methods such as maximum likelihood via the expectation maximization approach and also a Bayesian framework, in which we integrate feature selection. This integration enhances the efficiency and accuracy of our models, offering a compelling solution for real-world applications where manual annotation of large data volumes is not feasible.
To address the prevalent challenge in clustering regarding the determination of mixture components number, we extend our finite mixture model to an infinite model. By adopting a nonparametric Bayesian technique, we can effectively capture the underlying data distribution with an unknown number of mixture components.
Across all stages, our models are evaluated on various medical applications, consistently demonstrating superior performance over traditional alternatives.
The results of this research underline the potential of the McDonald’s Beta distribution and the proposed mixture models in transforming medical data into actionable knowledge, aiding clinicians in making more precise decisions and improving the health care industry
Positive Data Clustering based on Generalized Inverted Dirichlet Mixture Model
Recent advances in processing and networking capabilities of computers have caused an accumulation
of immense amounts of multimodal multimedia data (image, text, video). These data
are generally presented as high-dimensional vectors of features. The availability of these high-dimensional
data sets has provided the input to a large variety of statistical learning applications
including clustering, classification, feature selection, outlier detection and density estimation. In
this context, a finite mixture offers a formal approach to clustering and a powerful tool to tackle
the problem of data modeling. A mixture model assumes that the data is generated by a set of
parametric probability distributions. The main learning process of a mixture model consists of the
following two parts: parameter estimation and model selection (estimating the number of components).
In addition, other issues may be considered during the learning process of mixture models
such as: a) feature selection and b) outlier detection. The main objective of this thesis is to
work with different kinds of estimation criteria and to incorporate those challenges into a single
framework.
The first contribution of this thesis is to propose a statistical framework which can tackle the problem
of parameter estimation, model selection, feature selection, and outlier rejection in a unified
model. We propose to use feature saliency and introduce an expectation-maximization (EM) algorithm
for the estimation of the Generalized Inverted Dirichlet (GID) mixture model. By using
the Minimum Message Length (MML), we can identify how much each feature contributes to
our model as well as determine the number of components. The presence of outliers is an added
challenge and is handled by incorporating an auxiliary outlier component, to which we associate a uniform density. Experimental results on synthetic data, as well as real-world applications involving visual scenes and object classification, indicate that the proposed approach is promising, even when a low-dimensional representation of the data is used. In addition, they show the importance of embedding an outlier component in the proposed model. EM learning suffers
from significant drawbacks. In order to overcome those drawbacks, a learning approach using a
Bayesian framework is proposed as our second contribution. This learning is based on the estimation
of the parameters posteriors and by considering the prior knowledge about these parameters.
Calculation of the posterior distribution of each parameter in the model is done by using Markov
chain Monte Carlo (MCMC) simulation methods, namely, the Gibbs sampling and Metropolis-Hastings methods. The Bayesian Information Criterion (BIC) was used for model selection. The
proposed model was validated on object classification and forgery detection applications. For the
first two contributions, we developed a finite GID mixture. However, in the third contribution,
we propose an infinite GID mixture model. The proposed model simultaneously tackles the clustering
and feature selection problems. The proposed learning model is based on Gibbs sampling.
The effectiveness of the proposed method is shown using an image categorization application. Our
last contribution in this thesis is another fully Bayesian approach for a finite GID mixture learning
model using the Reversible Jump Markov Chain Monte Carlo (RJMCMC) technique. The
proposed algorithm allows for the simultaneous handling of model selection and parameter estimation for high-dimensional data. The merits of this approach are investigated using synthetic data, as well as data from a challenging application, namely object detection
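As a minimal illustration of the Metropolis-Hastings machinery mentioned above (not the GID-specific samplers of the thesis), the sketch below simulates the posterior of a Gaussian mean under a conjugate prior, so the chain's answer can be checked against the closed form; all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(3.0, 1.0, size=50)   # observations with unknown mean

# Model: x_i ~ N(mu, 1), prior mu ~ N(0, 10^2).
def log_posterior(mu):
    return -0.5 * np.sum((data - mu) ** 2) - 0.5 * mu ** 2 / 100.0

# Random-walk Metropolis-Hastings.
samples = []
mu, step = 0.0, 0.5
lp = log_posterior(mu)
for i in range(20000):
    prop = mu + step * rng.normal()        # symmetric proposal
    lp_prop = log_posterior(prop)
    if np.log(rng.uniform()) < lp_prop - lp:   # accept/reject
        mu, lp = prop, lp_prop
    if i >= 5000:                              # discard burn-in
        samples.append(mu)

# Conjugacy gives the exact posterior mean for comparison.
post_mean_exact = data.sum() / (len(data) + 1 / 100.0)
mh_mean = np.mean(samples)
```

Gibbs sampling replaces the accept/reject step by drawing each parameter exactly from its full conditional, which is what makes conjugate priors attractive in the Bayesian frameworks above.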
Decorrelation of Neutral Vector Variables: Theory and Applications
In this paper, we propose novel strategies for neutral vector variable
decorrelation. Two fundamental invertible transformations, namely serial
nonlinear transformation and parallel nonlinear transformation, are proposed to
carry out the decorrelation. For a neutral vector variable, which is not
multivariate Gaussian distributed, the conventional principal component
analysis (PCA) cannot yield mutually independent scalar variables. With the two
proposed transformations, a highly negatively correlated neutral vector can be
transformed to a set of mutually independent scalar variables with the same
degrees of freedom. We also evaluate the decorrelation performances for the
vectors generated from a single Dirichlet distribution and a mixture of
Dirichlet distributions. The mutual independence is verified with the distance
correlation measurement. The advantages of the proposed decorrelation
strategies are intensively studied and demonstrated with synthesized data and
practical application evaluations
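For a Dirichlet-distributed vector, the serial nonlinear transformation amounts to stick-breaking style ratios, which for a neutral vector are mutually independent Beta variables. A quick numerical check, with an arbitrary parameter choice:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 2.0, 2.0])
x = rng.dirichlet(alpha, size=20000)   # samples of a neutral vector

# Serial nonlinear transformation: divide each coordinate by the
# remaining "stick". For a Dirichlet vector, u1 ~ Beta(a1, a2+a3)
# and u2 ~ Beta(a2, a3) are independent.
u1 = x[:, 0]
u2 = x[:, 1] / (1.0 - x[:, 0])

# Raw coordinates are strongly negatively correlated
# (here corr(x1, x2) = -0.5 in theory); transformed ones are not.
corr_x = np.corrcoef(x[:, 0], x[:, 1])[0, 1]
corr_u = np.corrcoef(u1, u2)[0, 1]
```

A vanishing linear correlation is of course weaker than the distance-correlation check used in the paper, but it already shows why PCA, a linear method, cannot achieve this decorrelation.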
Multidimensional Proportional Data Clustering Using Shifted-Scaled Dirichlet Model
We have designed and implemented an unsupervised learning algorithm for a finite
mixture model of shifted-scaled Dirichlet distributions for the cluster analysis of multivariate
proportional data. The cluster analysis task involves model selection using Minimum
Message Length to discover the number of natural groupings a dataset is composed of.
Also, it involves an estimation step for the model parameters using the expectation maximization
framework. This thesis aims to improve the flexibility of the widely used Dirichlet
model by adding another set of parameters for the location (besides the scale parameter). We have applied our estimation and model selection algorithm to synthetically generated data, real data and software module defect prediction. The experimental results show the merits of the shifted-scaled Dirichlet mixture model in comparison to previously used generative models
A Study on Variational Component Splitting approach for Mixture Models
The increased use of mobile devices and the introduction of cloud-based services have resulted in the generation of enormous amounts of data every day. This calls for the need to group these data appropriately into proper categories. Various clustering techniques have been introduced over the years to learn the patterns in data that might better facilitate the classification process. The finite mixture model is one of the crucial methods used for this task. The basic idea of mixture models is to fit the data at hand to an appropriate distribution. The design of mixture models hence involves finding the appropriate parameters of the distribution and estimating the number of clusters in the data. We use a variational component splitting framework to do this, which can simultaneously
learn the parameters of the model and estimate the number of components in the model. The variational algorithm helps to overcome the computational complexity of purely Bayesian approaches and the overfitting problems experienced with maximum likelihood approaches, while guaranteeing convergence. The choice of distribution remains the core concern of mixture models in recent research. The efficiency of the Dirichlet family of distributions for this purpose has been proven in recent studies, especially for non-Gaussian data. This led us to study the impact of the variational component splitting approach on mixture models based on several distributions. Hence, our contribution is the application of variational component splitting to design finite mixture models based on the inverted Dirichlet, generalized inverted Dirichlet and inverted Beta-Liouville distributions. In addition, we incorporate a simultaneous feature selection approach for the generalized inverted Dirichlet mixture model along with component splitting as another experimental contribution. We evaluate the performance of our models on various real-life applications such as object, scene, texture, speech and video categorization
Model Selection for Gaussian Mixture Models
This paper is concerned with an important issue in finite mixture modelling,
the selection of the number of mixing components. We propose a new penalized
likelihood method for model selection of finite multivariate Gaussian mixture
models. The proposed method is shown to be statistically consistent in
determining the number of components. A modified EM algorithm is developed
to simultaneously select the number of components and to estimate the mixing
weights, i.e. the mixing probabilities, and unknown parameters of Gaussian
distributions. Simulations and a real data analysis are presented to illustrate
the performance of the proposed method
Cluster Analysis of Multivariate Data Using Scaled Dirichlet Finite Mixture Model
We have designed and implemented a finite mixture model, using the scaled Dirichlet distribution for the cluster analysis of multivariate proportional data. In this thesis, the task of cluster analysis first involves model selection which helps to discover the number of natural groupings underlying a dataset. This activity is then followed by that of estimating the model parameters for those natural groupings using the expectation maximization framework.
This work aims to address the flexibility challenge of the Dirichlet distribution by introducing a distribution with an extra model parameter. This is important because scientists and researchers are constantly searching for the best models to fully describe the intrinsic characteristics of the observed data, and flexible models are increasingly used for such purposes.
In addition, we have applied our estimation and model selection algorithm to both synthetic and real datasets. Most importantly, we considered two areas of application: software module defect prediction and customer segmentation. Today, there is a growing challenge in detecting defective modules early in complex software development projects, which makes these sorts of machine learning algorithms crucial for driving key quality improvements that impact the bottom line and customer satisfaction
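The scaled Dirichlet density has a compact closed form, f(x; α, β) ∝ ∏ β_i^{α_i} x_i^{α_i−1} / (Σ_j β_j x_j)^{Σα_i}, which reduces to the ordinary Dirichlet when all β_i are equal. The sketch below implements the log-density and verifies that reduction; the evaluation point and parameters are arbitrary:

```python
import math
import numpy as np

def scaled_dirichlet_logpdf(x, alpha, beta):
    """Log-density of the scaled Dirichlet on the simplex.

    p(x; alpha, beta) = Gamma(a+)/prod Gamma(a_i)
                        * prod(beta_i^a_i * x_i^(a_i - 1))
                        / (sum_j beta_j * x_j)^(a+)
    With all beta_i equal, the beta terms cancel and the ordinary
    Dirichlet density is recovered.
    """
    a_sum = alpha.sum()
    log_norm = math.lgamma(a_sum) - sum(math.lgamma(a) for a in alpha)
    return (log_norm
            + np.sum(alpha * np.log(beta))
            + np.sum((alpha - 1) * np.log(x))
            - a_sum * math.log(np.dot(beta, x)))

def dirichlet_logpdf(x, alpha):
    """Ordinary Dirichlet log-density, for the sanity check below."""
    log_norm = math.lgamma(alpha.sum()) - sum(math.lgamma(a) for a in alpha)
    return log_norm + np.sum((alpha - 1) * np.log(x))

x = np.array([0.2, 0.3, 0.5])
alpha = np.array([2.0, 3.0, 4.0])
same = scaled_dirichlet_logpdf(x, alpha, np.ones(3))       # = Dirichlet
diff = scaled_dirichlet_logpdf(x, alpha, np.array([1.0, 2.0, 0.5]))
```

The extra β vector reshapes the density along each coordinate, which is exactly the added flexibility the thesis exploits for proportional data.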