
    BClass: A Bayesian Approach Based on Mixture Models for Clustering and Classification of Heterogeneous Biological Data

    Based on mixture models, we present a Bayesian method (called BClass) to classify biological entities (e.g. genes) when variables of quite heterogeneous nature are analyzed. Various statistical distributions are used to model the continuous/categorical data commonly produced by genetic experiments and large-scale genomic projects. We calculate the posterior probability that each entry belongs to each element (group) in the mixture. In this way, an original set of heterogeneous variables is transformed into a set of purely homogeneous characteristics, represented by the probabilities of each entry belonging to the groups. The number of groups in the analysis is controlled dynamically by rendering the groups as 'alive' and 'dormant' depending upon the number of entities classified within them. Using standard Metropolis-Hastings and Gibbs sampling algorithms, we constructed a sampler to approximate posterior moments and grouping probabilities. Since this method does not require the definition of similarity measures, it is especially suitable for data mining and knowledge discovery in biological databases. We applied BClass to classify genes in RegulonDB, a database specialized in information about the transcriptional regulation of gene expression in the bacterium Escherichia coli. The classification obtained is consistent with current knowledge and allowed prediction of missing values for a number of genes. BClass is object-oriented and fully programmed in Lisp-Stat. The output grouping probabilities are analyzed and interpreted using graphical (dynamically linked plots) and query-based approaches. We discuss the advantages of using Lisp-Stat as a programming language, as well as the problems we faced when the data volume increased exponentially due to the ever-growing number of genomic projects.
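    The membership-probability idea is easy to sketch outside Lisp-Stat. Below is a minimal Python illustration (not the authors' BClass code) of how a heterogeneous entry, one continuous and one categorical variable here, is mapped to homogeneous grouping probabilities under fixed mixture parameters; all parameter values are assumptions chosen for illustration.

```python
# Minimal sketch (not the authors' Lisp-Stat implementation): posterior
# grouping probabilities for an entity described by one continuous and one
# categorical variable, given fixed mixture parameters.
import numpy as np
from scipy.stats import norm

# Assumed mixture with K = 3 groups: weights, Gaussian parameters for the
# continuous variable, and a categorical table for the discrete variable.
weights = np.array([0.5, 0.3, 0.2])
means, sds = np.array([-1.0, 0.0, 2.0]), np.array([0.5, 1.0, 0.8])
cat_probs = np.array([[0.7, 0.2, 0.1],    # group 0
                      [0.1, 0.6, 0.3],    # group 1
                      [0.2, 0.2, 0.6]])   # group 2

def grouping_probabilities(x_cont, x_cat):
    """Posterior probability that an entry belongs to each group:
    p(k | x) proportional to weight_k * N(x_cont) * Cat(x_cat)."""
    lik = weights * norm.pdf(x_cont, means, sds) * cat_probs[:, x_cat]
    return lik / lik.sum()

# A heterogeneous entry (continuous value, categorical level) becomes a
# homogeneous vector of membership probabilities.
print(grouping_probabilities(0.3, 1))
```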

    On Similarities between Inference in Game Theory and Machine Learning

    In this paper, we elucidate the equivalence between inference in game theory and machine learning. Our aim in so doing is to establish an equivalent vocabulary between the two domains so as to facilitate developments at the intersection of both fields, and, as proof of the usefulness of this approach, we use recent developments in each field to make useful improvements to the other. More specifically, we consider the analogies between smooth best responses in fictitious play and Bayesian inference methods. Initially, we use these insights to develop and demonstrate an improved algorithm for learning in games based on probabilistic moderation. That is, by integrating over the distribution of opponent strategies (a Bayesian approach within machine learning) rather than taking a simple empirical average (the approach used in standard fictitious play), we derive a novel moderated fictitious play algorithm and show that it is more likely than standard fictitious play to converge to a payoff-dominant but risk-dominated Nash equilibrium in a simple coordination game. Furthermore, we consider the converse case, and show how insights from game theory can be used to derive two improved mean field variational learning algorithms. We first show that the standard update rule of mean field variational learning is analogous to a Cournot adjustment within game theory. By analogy with fictitious play, we then suggest an improved update rule, and show that this results in fictitious variational play, an improved mean field variational learning algorithm that exhibits better convergence in highly or strongly connected graphical models. Second, we use a recent advance in fictitious play, namely dynamic fictitious play, to derive a derivative action variational learning algorithm that exhibits superior convergence properties on a canonical machine learning problem (clustering a mixture distribution).
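    As a point of reference for the moderated variant described above, the following Python sketch implements standard smooth (logistic) fictitious play in a two-action coordination game; the stag-hunt payoffs, temperature and iteration count are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch of standard smooth fictitious play in a symmetric 2x2
# stag-hunt style coordination game (the baseline the paper improves on).
import numpy as np

payoff = np.array([[4.0, 0.0],   # action 0 = "stag": payoff-dominant but risky
                   [3.0, 3.0]])  # action 1 = "hare": safe, risk-dominant

def smooth_best_response(opponent_freq, temperature=0.5):
    """Logistic (softmax) best response to the empirical opponent mix."""
    expected = payoff @ opponent_freq            # expected payoff of each action
    z = np.exp(expected / temperature)
    return z / z.sum()

rng = np.random.default_rng(0)
counts = np.ones((2, 2))                         # action pseudo-counts per player
for t in range(500):
    for i in range(2):                           # each player responds to the other's history
        freq = counts[1 - i] / counts[1 - i].sum()
        action = rng.choice(2, p=smooth_best_response(freq))
        counts[i, action] += 1

print("empirical action frequencies per player:")
print(counts / counts.sum(axis=1, keepdims=True))
```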

    Automated extraction of mutual independence patterns using Bayesian comparison of partition models

    Mutual independence is a key concept in statistics that characterizes the structural relationships between variables. Existing methods to investigate mutual independence rely on the definition of two competing models, one being nested into the other and used to generate a null distribution for a statistic of interest, usually under the asymptotic assumption of large sample size. As such, these methods have a very restricted scope of application. In the present manuscript, we propose to change the investigation of mutual independence from a hypothesis-driven task that can only be applied in very specific cases to a blind and automated search within patterns of mutual independence. To this end, we treat the issue as one of model comparison that we solve in a Bayesian framework. We show the relationship between such an approach and existing methods in the case of multivariate normal distributions as well as cross-classified multinomial distributions. We propose a general Markov chain Monte Carlo (MCMC) algorithm to numerically approximate the posterior distribution on the space of all patterns of mutual independence. The relevance of the method is demonstrated on synthetic data as well as two real datasets, showing the unique insight provided by this approach. Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence (in press).
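    The flavour of the model comparison can be sketched for the multivariate normal case: each pattern of mutual independence corresponds to a partition of the variables into independent blocks, and each partition is scored by an approximation to its evidence. The Python sketch below uses a BIC-style score rather than the paper's exact Bayesian computation, and the example partitions and data-generating covariance are assumptions.

```python
# Minimal sketch: score competing patterns of mutual independence for
# Gaussian data, each pattern being a partition into independent blocks.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
true_cov = np.array([[1.0, 0.8, 0.0],
                     [0.8, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])      # variables 0 and 1 dependent; 2 independent
X = rng.multivariate_normal(np.zeros(3), true_cov, size=500)

def bic_score(X, partition):
    """Log-likelihood of a block-diagonal Gaussian minus a BIC penalty."""
    n = X.shape[0]
    loglik, n_params = 0.0, 0
    for block in partition:
        Xb = X[:, block]
        mu = Xb.mean(axis=0)
        cov = np.atleast_2d(np.cov(Xb, rowvar=False, bias=True))
        loglik += multivariate_normal(mu, cov).logpdf(Xb).sum()
        d = len(block)
        n_params += d + d * (d + 1) // 2    # mean entries + covariance entries
    return loglik - 0.5 * n_params * np.log(n)

partitions = [[[0], [1], [2]], [[0, 1], [2]], [[0, 1, 2]]]
for p in partitions:
    print(p, round(bic_score(X, p), 1))     # the true pattern should score highest
```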

    Sampling-based optimization with mixtures

    Sampling-based Evolutionary Algorithms (EA) are of great use when dealing with a highly non-convex and/or noisy optimization task, which is the kind of task we often have to solve in Machine Learning. Two derivative-free examples of such methods are Estimation of Distribution Algorithms (EDA) and techniques based on the Cross-Entropy Method (CEM). One of the main problems these algorithms have to solve is finding a good surrogate model for the normalized target function, that is, a model which has sufficient complexity to fit this target function, but which keeps the computations simple enough. Gaussian mixture models have been applied in practice with great success, but most of these approaches lacked a solid theoretical foundation. In this paper we describe a sound mathematical justification for Gaussian mixture surrogate models; more precisely, we propose a proper derivation of an EDA/CEM algorithm with mixture updates using Expectation Maximization techniques. It will appear that this algorithm resembles the recent Population MCMC schemes, thus reinforcing the link between Monte Carlo integration methods and sampling-based optimization. We will concentrate throughout this paper on continuous optimization.
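    A minimal Python sketch of the kind of loop discussed above: a CEM/EDA-style optimizer whose sampling distribution is a Gaussian mixture refitted by EM (via scikit-learn) on the elite samples at each iteration. The multimodal test function, elite fraction and other hyperparameters are assumptions, not the derivation given in the paper.

```python
# Minimal sketch of a CEM/EDA loop with a Gaussian mixture surrogate
# refitted on the elite samples at every iteration.
import numpy as np
from sklearn.mixture import GaussianMixture

def target(x):
    """Multimodal objective in 2-D (to be maximized): two Gaussian bumps."""
    return (np.exp(-np.sum((x - 2.0) ** 2, axis=1))
            + 0.8 * np.exp(-np.sum((x + 2.0) ** 2, axis=1)))

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 4.0, size=(400, 2))     # broad initial population
for it in range(20):
    scores = target(samples)
    elite = samples[np.argsort(scores)[-40:]]     # keep the top 10%
    gmm = GaussianMixture(n_components=2, reg_covar=1e-3,
                          random_state=0).fit(elite)   # EM refit of the surrogate
    samples, _ = gmm.sample(400)                  # resample from the mixture

print("best point found:", samples[np.argmax(target(samples))])
```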

    Clickstream Data Analysis: A Clustering Approach Based on Mixture Hidden Markov Models

    Nowadays, the availability of devices such as laptops and cell phones enables one to browse the web at any time and place. As a consequence, a company needs a website in order to maintain or increase customer loyalty and reach potential new customers. Besides acting as a virtual point of sale, the company portal allows it to obtain insights into potential customers through clickstream data, web-generated data that track users' accesses and activities on websites. However, these data are not easy to handle, as they are complex, unstructured, and limited by a lack of clear information about user intentions and goals. Clickstream data analysis is a suitable tool for managing the complexity of these datasets, yielding a cleaned and processed sequential data frame ready for pattern identification and analysis. Analysing clickstream data is important for companies because it enables them to understand differences in how web users behave as they explore websites, how they move from one page to another, and what they select, in order to define business strategies targeting specific types of potential customers. To obtain this level of insight, it is pivotal to understand how to exploit the hidden information contained in clickstream data. This work presents the cleaning and pre-processing procedures needed to obtain a structured sequential dataset and analyses these sequences by applying Mixtures of discrete-time Hidden Markov Models (MHMMs), a statistical tool suitable for clickstream data analysis and profile identification that has not been widely used in this context. Specifically, the hidden Markov process accounts for a time-varying latent variable to handle uncertainty and groups observed states based on unknown similarity; applying an MHMM therefore entails identifying both the number of mixture components, which relate to the subpopulations, and the number of latent states for each latent Markov chain. Information Criteria (IC) are generally used for model selection in mixture hidden Markov models and, although their performance has been widely studied for mixture models and for hidden Markov models, they have received little attention in the MHMM context. The most widely used criterion is BIC, even though its performance for these models depends on factors such as the number of components and the sequence length. Another class of model selection criteria is the Classification Criteria (CC). These were defined specifically for clustering purposes and rely on an entropy measure to account for the separability between groups. They are clearly the best option for our purpose, but their application as model selection tools for MHMMs requires the definition of a suitable entropy measure. In the light of these considerations, this work proposes a classification criterion based on an integrated classification likelihood approach for MHMMs that accounts for the two latent classes in the model: the subpopulations and the hidden states. The criterion is a modified ICL BIC, a classification criterion originally defined in the mixture model context and later used in hidden Markov models. ICL BIC is a suitable score for identifying the number of classes (components or states); to extend it to MHMMs, we defined a joint entropy accounting for both a component-related entropy and a state-related conditional entropy.
The thesis presents a Monte Carlo simulation study to compare selection criteria performance, the results of which point out the limitations of the most commonly used information criteria and demonstrate that the proposed criterion outperforms them in identifying components and states, especially in short sequences, which are quite common in website accesses. The proposed selection criterion was applied to real clickstream data collected from the website of a Sicilian company operating in the hospitality sector. The data were modelled by an MHMM, identifying clusters related to the browsing behaviour of web users, which provided essential indications for developing new business strategies. This thesis is structured as follows: after an introduction to the main topics in Chapter 1, we present the clickstream data and their cleaning and pre-processing steps in Chapter 2; Chapter 3 illustrates the structure and estimation algorithms of mixture hidden Markov models; Chapter 4 presents a review of model selection criteria and the definition of the proposed ICL BIC for MHMMs; the real clickstream data analysis follows in Chapter 5.
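    The structure of the proposed criterion can be illustrated with a short Python sketch: an ordinary BIC term corrected by a joint entropy that combines a component-related entropy with a state-related entropy conditional on the component. The toy posteriors, log-likelihood and parameter count below are placeholders, and the exact definition used in the thesis may differ.

```python
# Minimal sketch of an ICL-BIC-style criterion for a mixture of hidden
# Markov models: BIC penalized by component entropy + conditional state entropy.
import numpy as np

def entropy(p, axis=-1):
    """Shannon entropy (in nats) along the last axis of a probability array."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=axis)

def icl_bic(loglik, n_params, n_obs, comp_post, state_post):
    """BIC corrected by the joint entropy of the two latent layers.

    comp_post:  (n_sequences, K) posterior component memberships
    state_post: (n_sequences, K, T, S) posterior state memberships per component
    """
    bic = loglik - 0.5 * n_params * np.log(n_obs)
    comp_entropy = entropy(comp_post).sum()
    # state entropy of each component, weighted by its membership probability
    state_entropy = np.sum(comp_post[:, :, None] * entropy(state_post))
    return bic - comp_entropy - state_entropy

# Toy posteriors: 3 sequences, K = 2 components, T = 4 steps, S = 2 states.
rng = np.random.default_rng(0)
comp_post = rng.dirichlet(np.ones(2), size=3)
state_post = rng.dirichlet(np.ones(2), size=(3, 2, 4))
print(icl_bic(loglik=-150.0, n_params=11, n_obs=12,
              comp_post=comp_post, state_post=state_post))
```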

    On a loss-based prior for the number of components in mixture models

    We introduce a prior distribution for the number of components of a mixture model. The prior considers the worth of each possible mixture, measured by a loss function with two components: one measures the loss of information incurred in choosing the wrong mixture, and the other the loss due to complexity.
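    The shape of such a prior can be sketched in a few lines of Python: each candidate number of components k receives a weight that is exponential in its worth, with the worth combining an information-loss term and a complexity-loss term. Both loss functions below are placeholders, not the ones derived in the paper.

```python
# Heavily simplified sketch of a loss-based prior on the number of
# mixture components k; the two loss terms are illustrative placeholders.
import numpy as np

def loss_based_prior(k_max, info_loss, complexity_loss):
    """Normalized prior over k = 1..k_max built from the two loss terms."""
    k = np.arange(1, k_max + 1)
    worth = -info_loss(k) - complexity_loss(k)
    w = np.exp(worth - worth.max())          # stabilize before normalizing
    return w / w.sum()

# Placeholder losses: information loss shrinking in k, complexity growing in k.
prior = loss_based_prior(
    k_max=10,
    info_loss=lambda k: 1.0 / k,
    complexity_loss=lambda k: 0.3 * k,
)
print(np.round(prior, 3))
```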