BClass: A Bayesian Approach Based on Mixture Models for Clustering and Classification of Heterogeneous Biological Data
Based on mixture models, we present a Bayesian method (called BClass) to classify biological entities (e.g. genes) when variables of quite heterogeneous nature are analyzed. Various statistical distributions are used to model the continuous/categorical data commonly produced by genetic experiments and large-scale genomic projects. We calculate the posterior probability that each entry belongs to each element (group) in the mixture. In this way, an original set of heterogeneous variables is transformed into a set of purely homogeneous characteristics, represented by the probabilities that each entry belongs to the groups. The number of groups in the analysis is controlled dynamically by rendering groups 'alive' or 'dormant' depending on the number of entities classified within them. Using standard Metropolis-Hastings and Gibbs sampling algorithms, we constructed a sampler to approximate posterior moments and grouping probabilities. Since this method does not require the definition of similarity measures, it is especially suitable for data mining and knowledge discovery in biological databases. We applied BClass to classify genes in RegulonDB, a database specialized in information about the transcriptional regulation of gene expression in the bacterium Escherichia coli. The classification obtained is consistent with current knowledge and allowed prediction of missing values for a number of genes. BClass is object-oriented and fully programmed in Lisp-Stat. The output grouping probabilities are analyzed and interpreted using graphical (dynamically linked plots) and query-based approaches. We discuss the advantages of using Lisp-Stat as a programming language, as well as the problems we faced when the data volume increased exponentially due to the ever-growing number of genomic projects.
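As a minimal sketch of the core computation this abstract describes, the snippet below evaluates the posterior probability that an observation belongs to each component of a Gaussian mixture with known parameters. This is an illustrative Python reconstruction, not the authors' Lisp-Stat implementation; it covers only the continuous (Gaussian) case, and the function names are ours.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def membership_probabilities(x, weights, mus, sigmas):
    """Posterior probability that observation x belongs to each mixture component,
    obtained by normalizing weight * density over all components (Bayes' rule)."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
    total = sum(joint)
    return [j / total for j in joint]

# An observation at 0.2 is far more plausible under the N(0, 1) component.
probs = membership_probabilities(0.2, [0.5, 0.5], [0.0, 5.0], [1.0, 1.0])
```

In BClass these membership probabilities are the "homogeneous characteristics" that replace the original heterogeneous variables.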
On Similarities between Inference in Game Theory and Machine Learning
In this paper, we elucidate the equivalence between inference in game theory and machine learning. Our aim in so doing is to establish an equivalent vocabulary between the two domains so as to facilitate developments at the intersection of both fields, and, as proof of the usefulness of this approach, we use recent developments in each field to make useful improvements to the other. More specifically, we consider the analogies between smooth best responses in fictitious play and Bayesian inference methods. Initially, we use these insights to develop and demonstrate an improved algorithm for learning in games based on probabilistic moderation. That is, by integrating over the distribution of opponent strategies (a Bayesian approach within machine learning) rather than taking a simple empirical average (the approach used in standard fictitious play), we derive a novel moderated fictitious play algorithm and show that it is more likely than standard fictitious play to converge to a payoff-dominant but risk-dominated Nash equilibrium in a simple coordination game. Furthermore, we consider the converse case, and show how insights from game theory can be used to derive two improved mean field variational learning algorithms. We first show that the standard update rule of mean field variational learning is analogous to a Cournot adjustment within game theory. By analogy with fictitious play, we then suggest an improved update rule, and show that this results in fictitious variational play, an improved mean field variational learning algorithm that exhibits better convergence in highly or strongly connected graphical models. Second, we use a recent advance in fictitious play, namely dynamic fictitious play, to derive a derivative action variational learning algorithm that exhibits superior convergence properties on a canonical machine learning problem (clustering a mixture distribution).
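To make the fictitious-play side of the analogy concrete, here is a hedged sketch of *standard* smooth fictitious play with a logit (softmax) best response in a symmetric coordination game, played against one's own empirical frequencies. It is not the paper's moderated algorithm, which would replace the empirical belief average with an integration over a posterior on opponent strategies; the payoff matrix and temperature are illustrative choices of ours.

```python
import math

def logit_best_response(payoffs, beliefs, temperature=0.1):
    """Smooth best response: softmax over expected payoffs against current beliefs."""
    expected = [sum(p * b for p, b in zip(row, beliefs)) for row in payoffs]
    exps = [math.exp(e / temperature) for e in expected]
    z = sum(exps)
    return [e / z for e in exps]

# Stag-hunt style coordination game: action 0 is payoff-dominant,
# action 1 is risk-dominant (safe payoff of 3 regardless of the opponent).
payoffs = [[4.0, 0.0],
           [3.0, 3.0]]

beliefs = [0.5, 0.5]  # empirical frequencies of the opponent's actions
for t in range(1, 200):
    play = logit_best_response(payoffs, beliefs)
    # Standard fictitious play: beliefs are a running empirical average.
    beliefs = [b + (p - b) / (t + 1) for b, p in zip(beliefs, play)]
```

With these payoffs the empirical beliefs settle on the risk-dominant action, which is exactly the behaviour the moderated variant described in the abstract is designed to improve on.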
Automated extraction of mutual independence patterns using Bayesian comparison of partition models
Mutual independence is a key concept in statistics that characterizes the
structural relationships between variables. Existing methods to investigate
mutual independence rely on the definition of two competing models, one being
nested into the other and used to generate a null distribution for a statistic
of interest, usually under the asymptotic assumption of large sample size. As
such, these methods have a very restricted scope of application. In the present
manuscript, we propose to change the investigation of mutual independence from
a hypothesis-driven task that can only be applied in very specific cases to a
blind and automated search within patterns of mutual independence. To this end,
we treat the issue as one of model comparison that we solve in a Bayesian
framework. We show the relationship between such an approach and existing
methods in the case of multivariate normal distributions as well as
cross-classified multinomial distributions. We propose a general Markov chain
Monte Carlo (MCMC) algorithm to numerically approximate the posterior
distribution on the space of all patterns of mutual independence. The relevance
of the method is demonstrated on synthetic data as well as two real datasets,
showing the unique insight provided by this approach. (IEEE Transactions on Pattern Analysis and Machine Intelligence, in press.)
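The MCMC search over patterns of mutual independence can be sketched as a Metropolis random walk on set partitions of the variable indices. The sketch below uses a toy log-score in place of the paper's Bayesian model-comparison marginal likelihood (which depends on the chosen distribution family); the move type, function names, and score are our assumptions.

```python
import math
import random

def canonical(labels):
    """Relabel blocks in order of first appearance so equal partitions compare equal."""
    mapping, out = {}, []
    for l in labels:
        mapping.setdefault(l, len(mapping))
        out.append(mapping[l])
    return tuple(out)

def mcmc_partitions(log_score, n_vars, n_iter=5000, seed=0):
    """Metropolis sampler over set partitions (patterns of mutual independence)."""
    rng = random.Random(seed)
    state = canonical([0] * n_vars)       # start with all variables in one block
    counts = {}
    for _ in range(n_iter):
        i = rng.randrange(n_vars)
        prop = list(state)
        prop[i] = rng.randrange(max(state) + 2)   # an existing block or a new one
        prop = canonical(prop)
        # Metropolis acceptance on the (toy) log posterior score
        if rng.random() < math.exp(min(0.0, log_score(prop) - log_score(state))):
            state = prop
        counts[state] = counts.get(state, 0) + 1
    return counts

# Toy score (hypothetical): favour the pattern where vars 0,1 are dependent
# and var 2 is mutually independent of them.
def toy_log_score(p):
    return 2.0 if p == (0, 0, 1) else 0.0

counts = mcmc_partitions(toy_log_score, 3)
```

In the paper the score would be the log marginal likelihood of the partition model, so the visit counts approximate the posterior over all patterns of mutual independence.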
Sampling-based optimization with mixtures
Sampling-based Evolutionary Algorithms (EA) are of great use when dealing with a highly non-convex and/or noisy optimization task, which is the kind of task we often have to solve in Machine Learning. Two derivative-free examples of such methods are Estimation of Distribution Algorithms (EDA) and techniques based on the Cross-Entropy Method (CEM). One of the main problems these algorithms have to solve is finding a good surrogate model for the normalized target function, that is, a model which has sufficient complexity to fit this target function, but which keeps the computations simple enough. Gaussian mixture models have been applied in practice with great success, but most of these approaches lacked a solid theoretical foundation. In this paper we describe a sound mathematical justification for Gaussian mixture surrogate models; more precisely, we propose a proper derivation of an EDA/CEM algorithm with mixture updates using Expectation Maximization techniques. It will appear that this algorithm resembles the recent Population MCMC schemes, thus reinforcing the link between Monte Carlo integration methods and sampling-based optimization. We concentrate throughout this paper on continuous optimization.
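As an illustration of the family of algorithms discussed, here is a minimal Cross-Entropy Method loop with a single Gaussian surrogate, i.e. the one-component special case of the mixture surrogates the paper analyses. The EM-based mixture update itself is not reproduced; sample sizes and the elite fraction are arbitrary choices.

```python
import math
import random

def cem_maximize(f, mu=0.0, sigma=5.0, n_samples=100, elite_frac=0.2,
                 n_iter=40, seed=1):
    """Cross-Entropy Method with a single Gaussian surrogate model:
    sample, keep the elite fraction, refit the Gaussian to the elite."""
    rng = random.Random(seed)
    n_elite = max(2, int(elite_frac * n_samples))
    for _ in range(n_iter):
        xs = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        elite = sorted(xs, key=f, reverse=True)[:n_elite]
        mu = sum(elite) / n_elite
        sigma = max(1e-6, math.sqrt(sum((x - mu) ** 2 for x in elite) / n_elite))
    return mu

# Maximize a smooth unimodal target with optimum at x = 3.
best = cem_maximize(lambda x: -(x - 3.0) ** 2)
```

Replacing the single Gaussian with a mixture, and the elite refit with an EM step, gives the EDA/CEM-with-mixtures scheme whose derivation the paper formalizes.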
Clickstream Data Analysis: A Clustering Approach Based on Mixture Hidden Markov Models
Nowadays, the availability of devices such as laptops and cell phones enables one to
browse the web at any time and place. As a consequence, a company needs to have a
website so as to maintain or increase customer loyalty and reach potential new customers.
Besides, acting as a virtual point-of-sale, the company portal allows it to obtain insights on
potential customers through clickstream data, web-generated data that track users' accesses
and activities in websites. However, these data are not easy to handle as they are complex,
unstructured and limited by lack of clear information about user intentions and goals.
Clickstream data analysis is a suitable tool for managing the complexity of these datasets,
obtaining a cleaned and processed sequential dataframe ready to identify and analyse
patterns.
Analysing clickstream data is important for companies as it enables them to understand differences in web user behaviour while users explore websites, how they move
from one page to another and what they select, in order to define business strategies targeting specific types of potential customers. To obtain this level of insight it is pivotal to
understand how to exploit hidden information related to clickstream data.
This work presents the cleaning and pre-processing procedures for clickstream data
which are needed to get a structured sequential dataset and analyses these sequences by
the application of Mixture of discrete-time Hidden Markov Models (MHMMs), a statistical tool suitable for clickstream data analysis and profile identification that has not been
widely used in this context. Specifically, the hidden Markov process accounts for a time-varying latent variable to handle uncertainty and groups together observed states based
on unknown similarity, and entails identifying both the number of mixture components relating to the subpopulations and the number of latent states for each latent Markov
chain.
However, the application of MHMMs requires the identification of both the number
of components and states. Information Criteria (IC) are generally used for model selection in mixture hidden Markov models and, although their performance has been widely
studied for mixture models and hidden Markov models, they have received little attention
in the MHMM context. The most widely used criterion is BIC, even though its performance for
these models depends on factors such as the number of components and sequence length.
Another class of model selection criteria is the Classification Criteria (CC). They were
defined specifically for clustering purposes and rely on an entropy measure to account for
separability between groups. These criteria are clearly the best option for our purpose, but
their application as model selection tools for MHMMs requires the definition of a suitable
entropy measure.
In the light of these considerations, this work proposes a classification criterion based
on an integrated classification likelihood approach for MHMMs that accounts for the two
latent classes in the model: the subpopulations and the hidden states. This criterion is
a modified ICL BIC, a classification criterion that was originally defined in the mixture
model context and used in hidden Markov models. ICL BIC is a suitable score to identify
the number of classes (components or states) and, thus, to extend it to MHMMs we defined a joint entropy accounting for both a component-related entropy and a state-related
conditional entropy.
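A hedged reading of the proposed criterion can be sketched as follows: BIC plus twice a joint entropy term, combining a component-level entropy with a state-level entropy conditioned on the component assignment (lower is better). The exact weighting and the data structures below are our assumptions for illustration, not the thesis's definition.

```python
import math

def entropy(probs):
    """Shannon entropy of a posterior probability vector (0 log 0 := 0)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def icl_bic_mhmm(log_lik, n_params, n_seq, comp_post, state_post):
    """Sketch of an ICL-BIC-style criterion for a mixture HMM (to minimise).

    comp_post: per-sequence posterior over mixture components
    state_post: per-sequence, per-component list of per-time posteriors
                over hidden states
    """
    bic = -2.0 * log_lik + n_params * math.log(n_seq)
    comp_entropy = sum(entropy(tau) for tau in comp_post)
    # State entropy weighted by the posterior probability of each component.
    state_entropy = sum(
        tau_k * entropy(gamma_t)
        for tau, per_comp in zip(comp_post, state_post)
        for tau_k, gammas in zip(tau, per_comp)
        for gamma_t in gammas
    )
    return bic + 2.0 * (comp_entropy + state_entropy)

# Deterministic posteriors (zero entropy): the criterion reduces to plain BIC.
val = icl_bic_mhmm(
    log_lik=-100.0, n_params=5, n_seq=2,
    comp_post=[[1.0, 0.0], [0.0, 1.0]],
    state_post=[[[[1.0]], [[1.0]]], [[[1.0]], [[1.0]]]],
)
```

The entropy terms penalise fuzzy classifications, which is what makes the criterion favour well-separated components and states rather than just a high likelihood.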
The thesis presents a Monte Carlo simulation study to compare selection criteria performance, the results of which point out the limitations of the most commonly used information criteria and demonstrate that the proposed criterion outperforms them in identifying components and states, especially in short-length sequences, which are quite common
in website accesses. The proposed selection criterion was applied to real clickstream data
collected from the website of a Sicilian company operating in the hospitality sector. Data
was modelled by an MHMM identifying clusters related to the browsing behaviour of
web users which provided essential indications for developing new business strategies.
This thesis is structured as follows: after an introduction on the main topics in Chapter
1, we present the clickstream data and their cleaning and pre-processing steps in Chapter
2; Chapter 3 illustrates the structure and estimation algorithms of mixture hidden Markov
models; Chapter 4 presents a review of model selection criteria and the definition of the
proposed ICL BIC for MHMMs; the real clickstream data analysis follows in Chapter 5.
On a loss-based prior for the number of components in mixture models
We introduce a prior distribution for the number of components of a mixture model. The prior considers the worth of each possible mixture, measured by a loss function with two components: one measures the loss in information in choosing the wrong mixture, and one the loss due to complexity.
- …