
    Statistical Models for Short Text Clustering

    Recent years have witnessed a notable rise in the amount of data collected and made publicly available. This has given rise to many research problems, among them extracting knowledge from short texts and the challenges that come with it. In this thesis, we elaborate new approaches to enhance the short text clustering results obtained with mixture models. We deploy the collapsed Gibbs sampling algorithm, previously used with the Dirichlet Multinomial mixture model, on our proposed statistical models. In particular, we propose the collapsed Gibbs sampling generalized Dirichlet Multinomial (CGSGDM) and the collapsed Gibbs sampling Beta-Liouville Multinomial (CGSBLM) mixture models to cope with the challenges of short texts. We demonstrate the efficiency of our proposed approaches on the Google News corpora and compare the experimental results with related works that use the Dirichlet distribution as a prior. Finally, we scale our work to infinite mixture models, namely the collapsed Gibbs sampling infinite generalized Dirichlet Multinomial mixture model (CGSIGDMM) and the collapsed Gibbs sampling infinite Beta-Liouville Multinomial mixture model (CGSIBLMM), and evaluate the proposed approaches on the Tweet dataset in addition to the previously used Google News dataset. We also propose an improvement of the work through an online clustering process, which demonstrates good performance on the same datasets. A final application is presented to assess the robustness of the proposed framework in the presence of outliers.
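    For context only (this is not the thesis's CGSGDM/CGSBLM implementation), the sketch below shows collapsed Gibbs sampling for the baseline Dirichlet Multinomial mixture that these models generalize. It assumes documents are given as lists of integer word ids, and the hyperparameters alpha and beta as well as the cluster count K are illustrative values.

```python
import numpy as np
from collections import Counter

def collapsed_gibbs_dmm(docs, vocab_size, K=8, alpha=0.1, beta=0.1, iters=20, seed=0):
    """Collapsed Gibbs sampling for a Dirichlet Multinomial mixture (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    z = rng.integers(K, size=D)            # initial cluster assignments
    m = np.zeros(K)                        # number of documents per cluster
    n = np.zeros(K)                        # number of word tokens per cluster
    nw = np.zeros((K, vocab_size))         # word counts per cluster
    for d, doc in enumerate(docs):
        k = z[d]
        m[k] += 1; n[k] += len(doc)
        for w in doc:
            nw[k, w] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            k = z[d]
            # remove document d from its current cluster before resampling
            m[k] -= 1; n[k] -= len(doc)
            for w in doc:
                nw[k, w] -= 1
            # conditional probability of each cluster given all other assignments
            logp = np.log(m + alpha)       # prior term: larger clusters attract documents
            for w, c in Counter(doc).items():
                logp += np.sum(np.log(nw[:, w][:, None] + beta + np.arange(c)), axis=1)
            logp -= np.sum(np.log(n[:, None] + vocab_size * beta + np.arange(len(doc))), axis=1)
            p = np.exp(logp - logp.max()); p /= p.sum()
            k = rng.choice(K, p=p)         # sample a new cluster for document d
            z[d] = k
            m[k] += 1; n[k] += len(doc)
            for w in doc:
                nw[k, w] += 1
    return z
```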

    Mixture-Based Clustering for High-Dimensional Count Data Using Minorization-Maximization Approaches

    The Multinomial distribution has been widely used to model count data. To increase clustering efficiency, we use an approximation of Fisher Scoring as a learning algorithm, which is more robust to the choice of initial parameter values. Moreover, we consider the generalization of the multinomial model obtained by introducing the Dirichlet as a prior, called the Dirichlet Compound Multinomial (DCM). Even though the DCM can address the burstiness phenomenon of count data, the presence of the Gamma function in its density usually leads to undesired complications. In this thesis, we use two alternative representations of the DCM distribution to perform clustering based on finite mixture models, where the mixture parameters are estimated with a minorization-maximization algorithm. Moreover, we propose an online learning technique for unsupervised clustering based on a mixture of Neerchal-Morel distributions. While the novel mixture model is able to capture overdispersion thanks to a weight parameter assigned to each feature in each cluster, online learning overcomes the drawbacks of batch learning in that the mixture's parameters can be updated instantly for any new data instance. Finally, by implementing a minimum message length model selection criterion, the weights of irrelevant mixture components are driven towards zero, which resolves the problem of having to know the number of clusters beforehand. To evaluate and compare the performance of our proposed models, we consider five challenging real-world applications that involve high-dimensional count vectors, namely sentiment analysis, topic detection, facial expression recognition, human action recognition, and medical diagnosis. The results show that the proposed algorithms increase clustering efficiency remarkably compared to other benchmarks, and the best results are achieved by the models able to accommodate over-dispersed count data.
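    To give a flavour of this kind of bound-based update (not the thesis's mixture or online algorithm), the sketch below fits a single DCM component with Minka-style fixed-point iterations, which follow from a minorization of the Gamma terms in the DCM likelihood. X is assumed to be a document-by-word count matrix.

```python
import numpy as np
from scipy.special import digamma, gammaln

def dcm_loglik(X, alpha):
    """Log-likelihood of count vectors X (D x V) under a Dirichlet Compound Multinomial."""
    a0 = alpha.sum()
    n = X.sum(axis=1)
    return np.sum(gammaln(a0) - gammaln(n + a0)
                  + np.sum(gammaln(X + alpha) - gammaln(alpha), axis=1))

def dcm_fit_fixed_point(X, iters=100, tol=1e-6):
    """Minorization-based fixed-point updates for the DCM parameters (Minka-style sketch)."""
    alpha = X.mean(axis=0) + 1e-3                                    # rough initialization
    n = X.sum(axis=1)
    for _ in range(iters):
        a0 = alpha.sum()
        num = np.sum(digamma(X + alpha) - digamma(alpha), axis=0)    # per-word numerator
        den = np.sum(digamma(n + a0) - digamma(a0))                  # shared denominator
        new_alpha = np.maximum(alpha * num / den, 1e-10)             # keep parameters positive
        if np.max(np.abs(new_alpha - alpha)) < tol:
            alpha = new_alpha
            break
        alpha = new_alpha
    return alpha
```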

    Scaling-invariant maximum margin preference learning

    One natural way to express preferences over items is to represent them in the form of pairwise comparisons, from which a model is learned in order to predict further preferences. In this setting, if an item a is preferred to an item b, it is natural to consider that the preference still holds after multiplying both vectors by a positive scalar (e.g., 2a is still preferred to 2b). Such invariance to scaling is satisfied in maximum margin learning approaches for pairs of test vectors, but not for the preference input pairs, i.e., scaling the inputs in different ways can result in a different preference relation being learned. In addition to the scaling of preference inputs, maximum margin methods are also sensitive to the way the features are normalized (scaled), which is an essential pre-processing step for these methods. In this paper, we define and analyse more cautious preference relations that are invariant to the scaling of features, or of preference inputs, or of both simultaneously; this leads to computational methods for testing dominance with respect to the induced relations, and for generating optimal solutions (i.e., best items) among a set of alternatives. In our experiments, we compare the relations and their associated optimality sets based on their decisiveness, computation time, and the cardinality of the optimal set.
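    To make the scaling issue concrete, here is a minimal sketch (not the paper's cautious, scaling-invariant relations) of the standard max-margin reduction of pairwise preferences to binary classification on difference vectors, using scikit-learn's LinearSVC. The `pairs` argument is an assumed list of (a, b) feature-vector tuples in which a is preferred to b.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_preference_model(pairs, C=1.0):
    """Max-margin preference learning via the usual reduction to binary classification:
    each comparison (a preferred to b) yields difference vectors a-b (+1) and b-a (-1)."""
    X, y = [], []
    for a, b in pairs:                      # a is preferred to b
        X.append(a - b); y.append(+1)
        X.append(b - a); y.append(-1)
    clf = LinearSVC(C=C, fit_intercept=False).fit(np.array(X), np.array(y))
    return clf.coef_.ravel()                # weight vector w: a preferred to b iff w.(a-b) > 0

def prefers(w, a, b):
    return float(np.dot(w, a - b)) > 0.0

# Scaling a *test* pair by a positive scalar never flips the prediction, since
# sign(w.(lam*a - lam*b)) = sign(w.(a - b)) for lam > 0. Scaling *training* pairs,
# however, rescales the margin constraints and can change the learned w, which is
# the sensitivity the paper addresses with its invariant relations.
```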