
    Extensions to the Latent Dirichlet Allocation Topic Model Using Flexible Priors

    Intrinsically, topic models have their likelihood functions fixed to multinomial distributions, as they operate on count data rather than Gaussian data. As a result, under the Bayesian paradigm their performance ultimately depends on the flexibility of the chosen prior distributions, in contrast to classical approaches such as PLSA (probabilistic latent semantic analysis), unigrams, and mixtures of unigrams that do not use prior information. The standard LDA (latent Dirichlet allocation) topic model operates with a symmetric Dirichlet distribution (as a conjugate prior), which carries limitations due to its independence structure: it tends to hinder performance in, for instance, modeling topic correlation, including positively correlated data. Compared to classical maximum likelihood estimators, priors present the additional advantage of smoothing the multinomials, which enhances predictive topic models. In this thesis, we propose a series of flexible priors, such as the generalized Dirichlet (GD) and the Beta-Liouville (BL), for our topic models within the collapsed representation, leading to much improved CVB (collapsed variational Bayes) update equations compared to those of the standard LDA, because the flexibility of these priors significantly improves the lower bounds in the corresponding CVB algorithms. We also show the robustness of our proposed CVB inferences when using the BL and GD simultaneously in hybrid generative-discriminative models, where the generative stage produces good, heterogeneous topic features that are used in the discriminative stage by powerful classifiers such as SVMs (support vector machines); we propose efficient probabilistic kernels to facilitate the classification of documents based on their topic signatures. In doing so, we implicitly cast topic modeling, an unsupervised learning method, into a supervised learning technique. Furthermore, because the CVB algorithm is complex in general (it requires second-order Taylor expansions) despite its flexibility, we propose a much simpler and more tractable update equation using a MAP (maximum a posteriori) framework with the standard EM (expectation-maximization) algorithm. As most Bayesian posteriors are intractable for complex models, we ultimately propose the MAP-LBLA (latent BL allocation), where we characterize the contributions of asymmetric BL priors over the symmetric Dirichlet (Dir). The proposed MAP technique offers a point estimate (mode) with a far more tractable solution, and we show that this point estimate is easier to implement than a full Bayesian analysis that integrates over the entire parameter space. The MAP also exhibits an implicit equivalence with the CVB, especially its zero-order approximation CVB0 and the stochastic version SCVB0. The proposed method enhances information retrieval performance in text document analysis. We show that parametric topic models, being finite-dimensional methods, have a much smaller hypothesis space and generally suffer from model selection issues. We therefore propose a Bayesian nonparametric (BNP) technique that uses the hierarchical Dirichlet process (HDP) as a conjugate prior to the document multinomial distributions, where the asymmetric BL serves as a diffuse (probability) base measure providing the global atoms (topics) shared among documents. The heterogeneity in the topic structure helps provide an alternative to model selection, because the nonparametric topic model (which is infinite-dimensional, with a much bigger hypothesis space) can prune out irrelevant topics based on their associated probability masses and retain only the most relevant ones. We also show that, for large-scale applications, stochastic optimization using natural gradients of the objective function performs well when data are processed and parameters learned rapidly in an online (streaming) fashion. We use both predictive likelihood and perplexity as evaluation methods to assess the robustness of our proposed topic models, as we ultimately refer to probability as a way to quantify uncertainty in our Bayesian framework. We improve object categorization through the flexibility of our prior distributions in the collapsed space, and we improve information retrieval with the MAP and HDP-LBLA topic models while extending the standard LDA. These two applications demonstrate the capability of enhancing a search engine based on topic models.
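
    To make the collapsed inference concrete, the following is a minimal sketch of the zero-order collapsed variational update (CVB0) for the standard LDA baseline with symmetric Dirichlet hyperparameters alpha and beta, which the thesis generalizes with GD and BL priors; the variable names and dense-array layout are illustrative assumptions, not the thesis's implementation.

        import numpy as np

        def cvb0_sweep(docs, gamma, K, V, alpha=0.1, beta=0.01):
            # docs: list of word-id lists; gamma[d][n]: K-dim responsibility
            # vector of token n in document d (each sums to one).
            D = len(docs)
            N_dk = np.zeros((D, K)); N_kw = np.zeros((K, V)); N_k = np.zeros(K)
            for d, doc in enumerate(docs):          # expected topic counts
                for n, w in enumerate(doc):
                    g = gamma[d][n]
                    N_dk[d] += g; N_kw[:, w] += g; N_k += g
            for d, doc in enumerate(docs):          # CVB0 token updates
                for n, w in enumerate(doc):
                    g = gamma[d][n]
                    N_dk[d] -= g; N_kw[:, w] -= g; N_k -= g   # remove own count
                    g = (N_dk[d] + alpha) * (N_kw[:, w] + beta) / (N_k + V * beta)
                    g /= g.sum()
                    gamma[d][n] = g
                    N_dk[d] += g; N_kw[:, w] += g; N_k += g   # add back
            return gamma

    With GD or BL priors in place of the symmetric Dirichlet, the count-plus-hyperparameter ratios above are replaced by the corresponding prior-specific terms, which is where the improved lower bounds arise.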

    Recursive Parameter Estimation of Non-Gaussian Hidden Markov Models for Occupancy Estimation in Smart Buildings

    A significant volume of data is produced in this era; accurately modeling these data for further analysis and extraction of meaningful patterns is therefore a major concern in a wide variety of real-life applications. Smart buildings are one of the areas urgently demanding such analysis: managing the intelligent systems in smart homes can reduce energy consumption as well as enhance users’ comfort. In this context, the Hidden Markov Model (HMM), a learnable finite stochastic model, has consistently been a powerful tool for data modeling. We have thus been motivated to propose occupancy estimation frameworks for smart buildings based on HMMs, given the importance of indoor occupancy estimation in automating environmental settings. One of the key factors in modeling data with an HMM is the choice of the emission probability. In this thesis, we propose novel HMM extensions that use the Generalized Dirichlet (GD), Beta-Liouville (BL), Inverted Dirichlet (ID), Generalized Inverted Dirichlet (GID), and Inverted Beta-Liouville (IBL) distributions as emission probability distributions. These distributions are investigated for their capability to model a variety of non-Gaussian data, overcoming the limited covariance structure of other distributions such as the Dirichlet. The next step after determining the emission probability is estimating optimal parameters for the distribution. We therefore develop a recursive parameter estimation approach based on maximum likelihood estimation (MLE). Owing to the linear complexity of the proposed recursive algorithm, the developed models can successfully model real-time data, which allows them to be used in an extensive range of practical applications.
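
    To illustrate the filtering recursion that such recursive estimation builds on, below is a hedged sketch of one forward-filter step with a pluggable emission density, together with the Inverted Dirichlet log-density, one of the emissions investigated; function names and signatures are assumptions for illustration only.

        import numpy as np
        from scipy.special import gammaln

        def log_inverted_dirichlet(x, alpha):
            # log pdf of the Inverted Dirichlet for a positive vector x of
            # length D, with parameter vector alpha of length D + 1
            s = alpha.sum()
            return (gammaln(s) - gammaln(alpha).sum()
                    + np.sum((alpha[:-1] - 1.0) * np.log(x))
                    - s * np.log1p(x.sum()))

        def forward_step(filt_prev, A, log_emis):
            # one forward-filter recursion: predict through the transition
            # matrix A, weight by per-state emission likelihoods, renormalize
            post = (filt_prev @ A) * np.exp(log_emis - log_emis.max())
            return post / post.sum()

    The recursive MLE then refreshes each state's emission parameters after every observation from the filtered responsibilities, which is what yields the linear per-observation complexity noted above.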

    Unsupervised Learning with Feature Selection Based on Multivariate McDonald’s Beta Mixture Model for Medical Data Analysis

    This thesis proposes innovative clustering approaches that use finite and infinite mixture models to analyze medical data and recognize human activities. These models leverage the flexibility of a novel distribution, the multivariate McDonald’s Beta distribution, offering a superior capability to model data of varying shapes. We introduce a finite McDonald’s Beta Mixture Model (McDBMM) and demonstrate its superior performance in handling bounded and asymmetric data distributions compared to traditional Gaussian mixture models. Further, we employ deterministic learning methods, such as maximum likelihood via the expectation-maximization approach, as well as a Bayesian framework into which we integrate feature selection. This integration enhances the efficiency and accuracy of our models, offering a compelling solution for real-world applications where manual annotation of large data volumes is not feasible. To address the prevalent challenge of determining the number of mixture components, we extend our finite mixture model to an infinite one: by adopting a nonparametric Bayesian technique, we can effectively capture the underlying data distribution with an unknown number of mixture components. Across all stages, our models are evaluated on various medical applications, consistently demonstrating superior performance over traditional alternatives. The results of this research underline the potential of the McDonald’s Beta distribution and the proposed mixture models to transform medical data into actionable knowledge, aiding clinicians in making more precise decisions and improving the health care industry.
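
    Because the exact form of the multivariate McDonald’s Beta density is specific to this work, the sketch below shows only the generic EM scaffolding that any such finite mixture shares, with the component log-density left as a pluggable callable; everything here is an illustrative assumption rather than the thesis's implementation.

        import numpy as np

        def e_step(X, weights, log_pdf, params):
            # responsibilities r[i, k] proportional to pi_k * p(x_i | theta_k),
            # computed in the log domain for numerical stability
            log_r = np.log(weights)[None, :] + np.stack(
                [log_pdf(X, p) for p in params], axis=1)
            log_r -= log_r.max(axis=1, keepdims=True)
            r = np.exp(log_r)
            return r / r.sum(axis=1, keepdims=True)

        def m_step_weights(r):
            # mixing proportions: average responsibility per component
            return r.mean(axis=0)

    The component-specific M-step, which re-estimates each theta_k from the weighted data, is where the McDonald’s Beta parameter updates (maximum likelihood or Bayesian) would plug in.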

    Statistical Models for Short Text Clustering

    Recent years have witnessed a notable rise in the amount of data collected and made available to the public. This has enabled the emergence of many research problems, among them extracting knowledge from short texts and its various related challenges. In this thesis, we elaborate new approaches to enhance short text clustering results obtained through the use of mixture models. We deploy the collapsed Gibbs sampling algorithm, previously used with the Dirichlet Multinomial mixture model, on our proposed statistical models. In particular, we propose the collapsed Gibbs sampling generalized Dirichlet Multinomial (CGSGDM) and the collapsed Gibbs sampling Beta-Liouville Multinomial (CGSBLM) mixture models to cope with the challenges that come with short texts. We demonstrate the efficiency of our proposed approaches on the Google News corpus and compare the experimental results with related works that use the Dirichlet distribution as a prior. Finally, we scale our work to infinite mixture models, namely the collapsed Gibbs sampling infinite generalized Dirichlet Multinomial mixture model (CGSIGDMM) and the collapsed Gibbs sampling infinite Beta-Liouville Multinomial mixture model (CGSIBLMM), and evaluate the proposed approaches on the Tweet dataset in addition to the previously used Google News dataset. We also propose an improvement of the work through an online clustering process, demonstrating good performance on the same datasets. A final application is presented to assess the robustness of the proposed framework in the presence of outliers.
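
    For reference, here is a hedged sketch of the collapsed conditional used by the Dirichlet Multinomial mixture baseline, computed in the log domain; the proposed CGSGDM and CGSBLM models replace the symmetric Dirichlet prior, which changes these count-ratio terms. Names and signatures are illustrative.

        import numpy as np

        def log_cond_cluster(doc_counts, k, m, n_kw, n_k, alpha, beta, V):
            # log p(z_d = k | rest): cluster-popularity term (m[k] documents
            # currently in cluster k) times ratios of word counts to totals
            lp = np.log(m[k] + alpha)
            for w, c in doc_counts.items():        # word w occurs c times in d
                for j in range(c):
                    lp += np.log(n_kw[k, w] + beta + j)
            for i in range(sum(doc_counts.values())):
                lp -= np.log(n_k[k] + V * beta + i)
            return lp

    A document is reassigned by evaluating this quantity for every cluster, exponentiating after subtracting the maximum, and sampling from the normalized probabilities.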

    Novel Mixture Allocation Models for Topic Learning

    Unsupervised learning has been an interesting area of research in recent years, and novel algorithms are being built on unsupervised learning methodologies to solve many real-world problems. Topic modelling is one such fascinating methodology, identifying patterns in data as topics. The introduction of latent Dirichlet allocation (LDA) has bolstered research on topic modelling approaches, with modifications specific to the application. However, LDA's basic assumption of a Dirichlet prior for topic proportions might not be applicable in certain real-world scenarios. Hence, in this thesis we explore the use of the generalized Dirichlet (GD) and Beta-Liouville (BL) distributions as alternative priors for topic proportions. In addition, we assume a mixture of distributions over topic proportions, which provides a better fit to the data. To accommodate the application of the resulting models to real-time streaming data, we also provide an online learning solution for the models, sketched below. A supervised version of the learning framework is provided as well and is shown to be advantageous when labelled data are available. Since the topics thus derived may not always be accurate, we integrate an interactive approach that uses inputs from the user to improve the quality of the identified topics. We have also adapted our models to interesting applications, such as parallel topic extraction from multilingual texts and content-based recommendation systems, demonstrating their adaptability. For multilingual topic extraction, we use global topic proportions sampled from a Dirichlet process (DP), and for recommendation systems we exploit the co-occurrences of words. For inference, we use a variational approach that makes the computation of solutions easier. The applications with which we validated our models show their efficiency.
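
    For the online learning solution mentioned above, a common choice is the stochastic variational update, which blends the global variational parameters with a noisy estimate computed from the current mini-batch; this generic sketch is an assumption and not the exact update of the proposed models.

        def svi_update(lam, lam_hat, t, tau=1.0, kappa=0.7):
            # Robbins-Monro step size rho_t = (t + tau)^(-kappa); kappa in
            # (0.5, 1] keeps the step sizes within the convergence conditions
            rho = (t + tau) ** (-kappa)
            return (1.0 - rho) * lam + rho * lam_hat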

    Modeling Semi-Bounded Support Data using Non-Gaussian Hidden Markov Models with Applications

    With the exponential growth of data in all formats, and with data categorization rapidly becoming one of the most essential components of data analysis, it is crucial to research and identify hidden patterns in order to extract valuable information that promotes accurate and solid decision making. Because data modeling is the first stage in accomplishing any of these tasks, its accuracy and consistency are critical for the later development of a complete data processing framework, and the selection of a distribution that corresponds to the nature of the data is a particularly interesting subject of research. Hidden Markov Models (HMMs) are among the most powerful probabilistic models; although recognized for decades, they have recently made a strong resurgence in machine learning, and their ever-increasing application in a variety of critical practical settings to model varied and heterogeneous data (image, video, audio, time series, etc.) has motivated countless extensions. Equally prevalent, finite mixture models are a potent tool for modeling heterogeneous data of various natures. The over-use of Gaussian mixture models for data modeling in the literature is one of the main driving forces for this thesis. This work focuses on modeling positive vectors, which occur naturally in a variety of real-life applications, by proposing novel HMM extensions that use the Inverted Dirichlet, the Generalized Inverted Dirichlet, and the Beta-Liouville mixture models as emission probabilities. These extensions are motivated by the proven capacity of these mixtures to deal with positive vectors and by the inability of plain mixture models to account for ordering or temporal dependencies in the data. We utilize the aforementioned distributions to derive several theoretical approaches for learning and deploying Hidden Markov Models in real-world settings. Further, we study online learning of parameters and explore the integration of a feature selection methodology. Extensive experimentation on highly challenging applications, ranging from image categorization, video categorization, and indoor occupancy estimation to natural language processing, reveals scenarios in which such models are appropriate to apply and proves their effectiveness compared to the extensively used Gaussian-based models.
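
    As a concrete reading of "mixture models as emission probabilities", the sketch below evaluates one state's emission log-likelihood as a log-sum-exp over mixture components with a pluggable component log-density (e.g., Inverted Dirichlet or Generalized Inverted Dirichlet); the names and signatures are illustrative assumptions.

        import numpy as np
        from scipy.special import logsumexp

        def state_emission_loglik(x, mix_weights, log_pdf, params):
            # log b_j(x) = log sum_m c_jm * p(x | theta_jm) for one HMM state
            comp = np.array([log_pdf(x, p) for p in params])
            return logsumexp(np.log(mix_weights) + comp)

    These per-state values plug directly into the standard forward-backward recursions, so the mixture emissions change only the likelihood evaluation, not the HMM machinery itself.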

    Approximate Bayesian Inference for Count Data Modeling

    Bayesian inference allows conclusions to be drawn from antecedents that depend on prior knowledge. It additionally allows uncertainty to be quantified, which is important in machine learning for making better predictions and for model interpretability. In real applications, however, we often deal with complicated models for which it is unfeasible to perform full Bayesian inference. This thesis explores the use of approximate Bayesian inference for count data modeling using Expectation Propagation and Stochastic Expectation Propagation. In Chapter 2, we develop an expectation propagation approach to learn an EDCM finite mixture model. The EDCM distribution is an exponential-family approximation to the widely used Dirichlet Compound Multinomial distribution and has been shown to offer excellent modeling capabilities in the case of sparse count data. Chapter 3 develops an efficient generative mixture model of EMSD distributions, using Stochastic Expectation Propagation, which reduces memory consumption, an important characteristic when making inference on large datasets. Finally, Chapter 4 develops a probabilistic topic model using the generalized Dirichlet distribution (LGDA) in order to capture topic correlation while maintaining conjugacy. We make use of Expectation Propagation to approximate the posterior, resulting in a model that achieves more accurate inference than variational inference. We show that the latent topics can be used as a proxy for improving supervised tasks.
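
    For orientation, here is a sketch of the EDCM log-likelihood in the exponential-family form popularized by Elkan, written up to additive terms in the counts that do not affect comparisons across parameter settings; treat the exact parameterization as an assumption, since the thesis may use a different one.

        import numpy as np
        from scipy.special import gammaln

        def edcm_loglik(x, beta):
            # x: count vector for one document; beta: EDCM parameter vector;
            # s = sum(beta) plays the role of the DCM concentration
            s, n = beta.sum(), x.sum()
            nz = x > 0
            return (gammaln(s) - gammaln(s + n)
                    + np.sum(np.log(beta[nz]) - np.log(x[nz])))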

    Distributional Feature Mapping in Data Classification

    The performance of a machine learning algorithm depends on the representation of the input data. In computer vision problems, histogram-based feature representations have significantly improved classification tasks. L1-normalized histograms can be modelled by Dirichlet and related distributions to transform the input space into a feature space. We propose a mapping technique that incorporates prior knowledge about the distribution of the data and increases the discriminative power of classifiers in supervised learning, such as the Support Vector Machine (SVM). The mapping technique for proportional data, based on the Dirichlet, Generalized Dirichlet, Beta-Liouville, scaled Dirichlet, and shifted scaled Dirichlet distributions, can be incorporated with traditional kernels to improve the base kernels' accuracy. Experimental results show that the proposed technique for proportional data increases accuracy in machine vision tasks such as natural scene recognition, satellite image classification, gender classification, facial expression recognition, and human action recognition in videos. In addition, in object tracking, learning parametric features of the target object using Dirichlet and related distributions may help to capture representations invariant to noise, which further motivated our study of such distributions in this setting. We propose a framework for feature representation on the probability simplex for proportional data, utilizing the histogram representation of the target object at the initial frame; a set of parameter vectors then determines the appearance features of the target object in subsequent frames. Motivated by the success of distribution-based feature mapping for proportional data, we extend the technique to semi-bounded data using the inverted Dirichlet, generalized inverted Dirichlet, and inverted Beta-Liouville distributions. A similar approach is taken for count data, where the Dirichlet Multinomial and generalized Dirichlet Multinomial distributions are used to map input features to density-based features.
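
    One plausible instantiation of such a distribution-based mapping is a Fisher-score map under a fitted Dirichlet, whose gradient with respect to each alpha_j has the closed form psi(sum(alpha)) - psi(alpha_j) + log x_j; this sketch is an illustrative assumption and need not match the thesis's exact mapping. The resulting features can then be combined with a traditional kernel (e.g., RBF) inside an SVM.

        import numpy as np
        from scipy.special import digamma

        def dirichlet_fisher_map(X, alpha):
            # Fisher-score features: gradient wrt alpha of log Dir(x | alpha),
            # evaluated row-wise on L1-normalized histograms X (shape N x D)
            return digamma(alpha.sum()) - digamma(alpha)[None, :] + np.log(X)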