
    Efficient Computation of Log-likelihood Function in Clustering Overdispersed Count Data

    In this work, we present an overdispersed count data clustering algorithm that uses the mesh method to compute the log-likelihood functions of the multinomial Dirichlet, multinomial generalized Dirichlet, and multinomial Beta-Liouville distributions. Count data arise in many areas such as information retrieval, data mining, and computer vision. The multinomial Dirichlet distribution (MDD) is one of the most widely used models for multi-categorical count data with overdispersion. Recent work has proposed the mesh algorithm, which approximates the MDD log-likelihood function using Bernoulli polynomials, in place of the traditional numerical computation of the log-likelihood, which either suffers from instability or leads to run times too long to be feasible for large-scale data. We therefore extend the mesh approach to computing the log-likelihood functions of more flexible distributions, namely the multinomial generalized Dirichlet (MGD) and the multinomial Beta-Liouville (MBL). A finite mixture model based on these distributions is optimized by expectation maximization and aims to achieve high accuracy for count data clustering. Through a set of experiments, the proposed approach shows its merits on two real-world clustering problems: natural scene categorization and facial expression recognition.
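    For reference, the traditional numerical evaluation that the mesh method replaces can be written directly with log-gamma functions. The following is a minimal sketch (not the thesis's mesh algorithm) of the MDD log-likelihood for a single count vector; the function name is hypothetical:

        # Direct log-gamma evaluation of the multinomial Dirichlet
        # (Dirichlet compound multinomial) log-likelihood -- the slow or
        # unstable path the mesh approximation is designed to replace.
        import numpy as np
        from scipy.special import gammaln

        def mdd_log_likelihood(x, alpha):
            """log p(x | alpha) for a count vector x under an MDD."""
            x = np.asarray(x, dtype=float)
            alpha = np.asarray(alpha, dtype=float)
            n, a = x.sum(), alpha.sum()
            return (gammaln(n + 1.0) - gammaln(x + 1.0).sum()      # multinomial coefficient
                    + gammaln(a) - gammaln(n + a)                  # normalizing terms
                    + (gammaln(x + alpha) - gammaln(alpha)).sum()) # per-dimension terms

        print(mdd_log_likelihood([3, 0, 2], [0.5, 1.0, 2.0]))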

    Fully Bayesian Inference for Finite and Infinite Discrete Exponential Mixture Models

    Count data appear frequently in natural language processing and computer vision applications. For example, when clustering images or textual documents, each image or text can be described by a histogram of visual words or text words. In real applications, these frequency vectors are often high-dimensional and sparse. In this case, hierarchical Bayesian modeling frameworks are able to model the 'burstiness' of repetitive word occurrences. Moreover, approximating these models with exponential-family forms improves computational efficiency, especially when facing high-dimensional count data and large data sets. However, classical deterministic approaches such as expectation-maximization (EM) do not achieve good results in complex real-life applications. This thesis explores fully Bayesian inference for finite discrete exponential mixture models of Multinomial Generalized Dirichlet (EMGD), Multinomial Beta-Liouville (EMBL), Multinomial Scaled Dirichlet (EMSD), and Multinomial Shifted Scaled Dirichlet (EMSSD) distributions. Finite mixtures have already shown superior performance in clustering real data sets with the EM approach. The approaches proposed in this thesis are based on the Monte Carlo simulation technique of Gibbs sampling mixed with a Metropolis-Hastings step, and we use exponential-family conjugate priors to construct the required posteriors according to Bayesian theory. Furthermore, we also present infinite models based on Dirichlet processes, which yield clustering algorithms that do not require the number of mixture components to be specified in advance. The performance of our Bayesian approaches is tested on challenging real-world applications concerning text sentiment analysis, fake news detection, and human face gender recognition.
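    To make the inference scheme concrete, here is a generic sketch of Metropolis-Hastings-within-Gibbs for a simple multinomial mixture. It is illustrative only: the thesis's EMGD/EMBL/EMSD/EMSSD likelihoods and their exponential-family conjugate priors would replace the multinomial log-likelihood and the Dirichlet random-walk proposal used here, and all names are hypothetical.

        import numpy as np
        from scipy.special import gammaln

        rng = np.random.default_rng(0)

        def log_lik(x, theta):
            # Multinomial log-likelihood up to the count-dependent constant.
            return (x * np.log(theta)).sum(axis=-1)

        def log_dir_pdf(x, a):
            return gammaln(a.sum()) - gammaln(a).sum() + ((a - 1) * np.log(x)).sum()

        def gibbs_mh(X, K, iters=200, conc=50.0):
            N, D = X.shape
            theta = rng.dirichlet(np.ones(D), size=K)   # component parameters
            pi = np.full(K, 1.0 / K)                    # mixing weights
            for _ in range(iters):
                # Gibbs step: sample allocations from their full conditional.
                logp = np.log(pi)[None, :] + np.array([log_lik(X, t) for t in theta]).T
                p = np.exp(logp - logp.max(axis=1, keepdims=True))
                p /= p.sum(axis=1, keepdims=True)
                z = np.array([rng.choice(K, p=row) for row in p])
                # Gibbs step: weights from their conjugate Dirichlet conditional.
                pi = rng.dirichlet(1.0 + np.bincount(z, minlength=K))
                # MH step: Dirichlet random-walk proposal on each theta_k.
                for k in range(K):
                    Xk = X[z == k]
                    if Xk.shape[0] == 0:
                        theta[k] = rng.dirichlet(np.ones(D))  # resample empty component
                        continue
                    cur, prop = theta[k], rng.dirichlet(conc * theta[k])
                    log_acc = (log_lik(Xk, prop).sum() - log_lik(Xk, cur).sum()
                               + log_dir_pdf(cur, conc * prop) - log_dir_pdf(prop, conc * cur))
                    if np.log(rng.uniform()) < log_acc:
                        theta[k] = prop
            return z, pi, theta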

    High-dimensional Sparse Count Data Clustering Using Finite Mixture Models

    Due to the massive amount of available digital data, automating its analysis and modeling for different purposes and applications has become an urgent need. One of the most challenging tasks in machine learning is clustering, defined as the process of assigning observations sharing similar characteristics to subgroups. Such a task is significant, especially when implementing complex algorithms to deal with high-dimensional data. Thus, the advancement of computational power in statistical-based approaches has become an interesting and attractive research domain. Among the successful methods, mixture models have been widely acknowledged and successfully applied in numerous fields, as they provide a convenient yet flexible formal setting for unsupervised and semi-supervised learning. An essential problem with these approaches is to develop a probabilistic model that represents the data well by taking its nature into account. Count data are widely used in machine learning and computer vision applications, where an object, e.g., a text document or an image, can be represented by a vector of the appearance frequencies of words or visual words, respectively. They therefore usually suffer from the well-known curse of dimensionality, as objects are represented with high-dimensional and sparse vectors, i.e., a few thousand dimensions with a sparsity of 95 to 99%, which dramatically degrades the performance of clustering algorithms. Moreover, count data systematically exhibit the burstiness and overdispersion phenomena, neither of which can be handled by the generic multinomial distribution typically used to model count data, because it assumes that word occurrences are independent. This thesis is constructed around six related manuscripts, in which we propose several approaches for high-dimensional sparse count data clustering via various mixture models based on hierarchical Bayesian modeling frameworks that can model the dependency of repetitive word occurrences. In such frameworks, a suitable distribution is used to introduce prior information into the construction of the statistical model, based on a distribution conjugate to the multinomial, e.g. the Dirichlet, generalized Dirichlet, and Beta-Liouville, which has numerous computational advantages. Thus, we proposed a novel model that we call the Multinomial Scaled Dirichlet (MSD), which uses the scaled Dirichlet as a prior to the multinomial to allow more modeling flexibility. Although these frameworks model burstiness and overdispersion well, they share similar disadvantages that make their estimation procedures very inefficient when the collection size is large. To handle high dimensionality, we considered two approaches. First, we derived close approximations to the distributions in the hierarchical structure to bring them into exponential-family form, aiming to combine the flexibility and efficiency of these models with the desirable statistical and computational properties of the exponential family of distributions, including sufficiency, which reduces complexity and computational effort, especially for sparse and high-dimensional data. Second, we proposed a model-based unsupervised feature selection approach for count data to overcome several issues that may be caused by the high dimensionality of the feature space, such as over-fitting, low efficiency, and poor performance.
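    The hierarchical construction can be made explicit for the Dirichlet case. In LaTeX notation, marginalizing the multinomial parameters over a conjugate Dirichlet prior yields the closed-form compound distribution below; the generalized Dirichlet, Beta-Liouville, and proposed scaled Dirichlet priors enter the same integral in place of the Dirichlet:

        % Marginal (compound) likelihood of a count vector x under a
        % multinomial with a Dirichlet prior on its parameters:
        p(\mathbf{x} \mid \boldsymbol{\alpha})
          = \int_{\Delta} \mathrm{Mult}(\mathbf{x} \mid \boldsymbol{\theta})\,
            \mathrm{Dir}(\boldsymbol{\theta} \mid \boldsymbol{\alpha})\, d\boldsymbol{\theta}
          = \frac{n!}{\prod_{d} x_d!}\,
            \frac{\Gamma\!\big(\sum_{d} \alpha_d\big)}{\Gamma\!\big(n + \sum_{d} \alpha_d\big)}\,
            \prod_{d} \frac{\Gamma(x_d + \alpha_d)}{\Gamma(\alpha_d)},
          \qquad n = \sum_{d} x_d .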
    Furthermore, we addressed two significant aspects of mixture-based clustering methods, namely parameter estimation and model selection. We considered the Expectation-Maximization (EM) algorithm, a broadly applicable iterative algorithm for estimating mixture model parameters, incorporating several techniques to avoid its initialization dependency and poor local maxima. For model selection, we investigated different approaches to finding the optimal number of components based on the Minimum Message Length (MML) philosophy. The effectiveness of our approaches is evaluated on challenging real-life applications, such as sentiment analysis, hate speech detection on Twitter, topic novelty detection, human interaction recognition in films and TV shows, facial expression recognition, face identification, and age estimation.
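    As an illustration of the estimation-and-selection loop, the sketch below runs EM for a plain multinomial mixture and selects the number of components with a BIC-like penalized likelihood as a stand-in for the MML criterion (the exact message-length expression is derived in the thesis); all names are hypothetical:

        import numpy as np

        def em_multinomial_mixture(X, K, iters=100, eps=1e-9):
            N, D = X.shape
            rng = np.random.default_rng(1)
            pi = np.full(K, 1.0 / K)
            theta = rng.dirichlet(np.ones(D), size=K)
            for _ in range(iters):
                # E-step: responsibilities r[n,k] proportional to pi_k * Mult(x_n | theta_k)
                # (the multinomial coefficient is constant across k and omitted).
                logr = np.log(pi)[None, :] + X @ np.log(theta.T + eps)
                logr -= logr.max(axis=1, keepdims=True)
                r = np.exp(logr); r /= r.sum(axis=1, keepdims=True)
                # M-step: closed-form updates for weights and component parameters.
                pi = r.mean(axis=0)
                theta = (r.T @ X) + eps
                theta /= theta.sum(axis=1, keepdims=True)
            ll = np.logaddexp.reduce(np.log(pi)[None, :] + X @ np.log(theta.T + eps), axis=1).sum()
            return pi, theta, ll

        def select_K(X, K_max=6):
            # Penalized-likelihood stand-in for the MML criterion: smaller is better.
            N, D = X.shape
            scores = {}
            for K in range(1, K_max + 1):
                _, _, ll = em_multinomial_mixture(X, K)
                n_params = (K - 1) + K * (D - 1)
                scores[K] = -ll + 0.5 * n_params * np.log(N)
            return min(scores, key=scores.get)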

    Distributional Feature Mapping in Data Classification

    The performance of a machine learning algorithm depends on the representation of the input data. In computer vision problems, histogram-based feature representations have significantly improved classification tasks. L1-normalized histograms can be modeled by the Dirichlet and related distributions to transform the input space into a feature space. We propose a mapping technique that incorporates prior knowledge about the distribution of the data and increases the discriminative power of classifiers in supervised learning, such as the Support Vector Machine (SVM). The mapping technique for proportional data, based on the Dirichlet, generalized Dirichlet, Beta-Liouville, scaled Dirichlet, and shifted scaled Dirichlet distributions, can be combined with traditional kernels to improve the base kernels' accuracy. Experimental results show that the proposed technique for proportional data increases accuracy on machine vision tasks such as natural scene recognition, satellite image classification, gender classification, facial expression recognition, and human action recognition in videos. In addition, in object tracking, learning parametric features of the target object using the Dirichlet and related distributions may help capture representations invariant to noise; this further motivated our study of such distributions in object tracking. We propose a framework for feature representation on the probability simplex for proportional data that uses the histogram representation of the target object in the initial frame; a set of parameter vectors then determines the appearance features of the target object in subsequent frames. Motivated by the success of distribution-based feature mapping for proportional data, we extend this technique to semi-bounded data using the inverted Dirichlet, generalized inverted Dirichlet, and inverted Beta-Liouville distributions. A similar approach is taken for count data, where the Dirichlet multinomial and generalized Dirichlet multinomial distributions are used to map density features to input features.
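    A rough sketch of such a pipeline, under stated assumptions: the Dirichlet is fitted with a simple moment-matching heuristic, and each histogram is mapped to its Fisher score (the gradient of the Dirichlet log-density with respect to the parameters) before being passed to a standard SVM kernel. The thesis's exact mappings for the GD, BL, and (shifted) scaled Dirichlet cases are not reproduced; all names here are hypothetical.

        import numpy as np
        from scipy.special import digamma
        from sklearn.svm import SVC

        def fit_dirichlet_moments(P, eps=1e-6):
            # Crude moment matching: mean m and a single precision s from variances.
            m = P.mean(axis=0)
            v = P.var(axis=0) + eps
            s = np.median(m * (1 - m) / v - 1)        # Dirichlet precision heuristic
            return np.maximum(m * max(s, 1.0), eps)   # alpha = s * m

        def fisher_map(P, alpha, eps=1e-10):
            # Gradient of log Dir(p | alpha) w.r.t. alpha, one score per dimension.
            return digamma(alpha.sum()) - digamma(alpha) + np.log(P + eps)

        # Hypothetical usage on L1-normalized histograms P_train / P_test:
        # alpha = fit_dirichlet_moments(P_train)
        # clf = SVC(kernel="rbf").fit(fisher_map(P_train, alpha), y_train)
        # y_pred = clf.predict(fisher_map(P_test, alpha))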

    Modern considerations for the use of naive Bayes in the supervised classification of genetic sequence data

    Genetic sequence classification is the task of assigning a known genetic label to an unknown genetic sequence. Often, this is the first step in genetic sequence analysis, and it is critical to understanding data produced by molecular techniques like high-throughput sequencing. Here, we explore an algorithm called naive Bayes that was historically successful in classifying 16S ribosomal gene sequences for microbiome analysis. We extend the naive Bayes classifier to perform general sequence classification by leveraging advancements in computational parallelism and the statistical distributions that underlie naive Bayes. In Chapter 2, we show that our implementation of naive Bayes, called WarpNL, performs within a margin of error of modern classifiers like Kraken2 and local alignment. We discuss five crucial aspects of genetic sequence classification and show how each affects classifier performance: the query data, the reference sequence database, the feature encoding method, the classification algorithm, and access to computational resources. In Chapter 3, we cover the critical computational advancements introduced in WarpNL that make it efficient in a modern computing framework. These include efficient feature encoding, the introduction of a log-odds ratio for comparing naive Bayes posterior estimates, schemas for parallel and distributed naive Bayes architectures, and the use of machine learning classifiers to perform outgroup sequence classification. Finally, in Chapter 4, we explore a variant of the Dirichlet multinomial distribution that underlies the naive Bayes likelihood, called the beta-Liouville multinomial. We show that the beta-Liouville multinomial can be used to enhance classifier performance, and we provide mathematical proofs regarding its convergence during maximum likelihood estimation. Overall, this work explores the naive Bayes algorithm in a modern context and shows that it is competitive for genetic sequence classification.
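    As a self-contained illustration of the general idea (not WarpNL's actual implementation), the sketch below classifies sequences with a multinomial naive Bayes over 3-mer counts with Dirichlet (add-alpha) smoothing; all names are hypothetical:

        import numpy as np
        from itertools import product

        KMERS = {"".join(k): i for i, k in enumerate(product("ACGT", repeat=3))}

        def kmer_counts(seq, k=3):
            # Encode a DNA string as a vector of overlapping k-mer counts.
            x = np.zeros(len(KMERS))
            for i in range(len(seq) - k + 1):
                j = KMERS.get(seq[i:i + k])
                if j is not None:
                    x[j] += 1
            return x

        def train_nb(seqs, labels, alpha=1.0):
            # Per-class log word probabilities with Dirichlet smoothing.
            X = np.array([kmer_counts(s) for s in seqs])
            labels = np.array(labels)
            logp = {}
            for c in sorted(set(labels)):
                counts = X[labels == c].sum(axis=0) + alpha
                logp[c] = np.log(counts / counts.sum())
            return logp

        def classify(seq, logp):
            # Pick the class with the highest log-likelihood score.
            x = kmer_counts(seq)
            scores = {c: float(x @ w) for c, w in logp.items()}
            return max(scores, key=scores.get), scores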

    Distribution-based Regression for Count and Semi-Bounded Data

    Data mining techniques have been successfully utilized in different applications of significant fields, including pattern recognition, computer vision, and medical research. With the wealth of data generated every day, there is a lack of practical analysis tools for discovering hidden relationships and trends. Among statistical frameworks, regression has proven to be one of the strongest tools for prediction. Data whose complexity is unfavorable to most models pose a considerable challenge in prediction, and the ability of a model to perform accurately and efficiently is extremely important. Thus, a model must be selected that fits the data well, so that learning from previous data is efficient and highly accurate. This work is motivated by the limited number of regression analysis tools for multivariate count data in the literature. We propose two regression models for count data based on flexible distributions, namely the multinomial Beta-Liouville and the multinomial scaled Dirichlet, and evaluate them on the problem of disease diagnosis. Performance is measured by the accuracy of the prediction, which depends on the nature and complexity of the dataset. Our results show the efficiency of the two proposed regression models: the prediction performance of both is competitive with other regression approaches previously used for count data and with the best results in the literature. We then propose three regression models for positive vectors based on flexible distributions for semi-bounded data, namely the inverted Dirichlet, inverted generalized Dirichlet, and inverted Beta-Liouville. The efficiency of these models is tested on real-world applications, including software defect prediction, spam filtering, and disease diagnosis. Our results show that the three proposed regression models perform better than other commonly used regression models.
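    The general recipe behind such regression models can be sketched with a Dirichlet-multinomial whose parameters are linked to covariates through an exponential link and fitted by maximum likelihood; the multinomial Beta-Liouville and scaled Dirichlet regressions follow the same pattern with different densities. This is an illustrative stand-in, not the thesis's model, and the names are hypothetical.

        import numpy as np
        from scipy.special import gammaln
        from scipy.optimize import minimize

        def neg_log_lik(W_flat, X, Y):
            # alpha_nd = exp(x_n . w_d) keeps parameters positive; the
            # multinomial coefficient is omitted (it does not depend on W).
            D_out = Y.shape[1]
            W = W_flat.reshape(X.shape[1], D_out)
            A = np.exp(X @ W)
            n, a = Y.sum(axis=1), A.sum(axis=1)
            ll = (gammaln(a) - gammaln(n + a)
                  + (gammaln(Y + A) - gammaln(A)).sum(axis=1))
            return -ll.sum()

        def fit(X, Y):
            # Maximum-likelihood fit of the link weights by L-BFGS-B.
            W0 = np.zeros(X.shape[1] * Y.shape[1])
            res = minimize(neg_log_lik, W0, args=(X, Y), method="L-BFGS-B")
            return res.x.reshape(X.shape[1], Y.shape[1])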

    Approximate Bayesian Inference for Count Data Modeling

    Bayesian inference allows us to draw conclusions from antecedents that depend on prior knowledge. It additionally allows us to quantify uncertainty, which is important in machine learning for making better predictions and for model interpretability. However, in real applications we often deal with complicated models for which full Bayesian inference is infeasible. This thesis explores approximate Bayesian inference for count data modeling using Expectation Propagation and Stochastic Expectation Propagation. In Chapter 2, we develop an expectation propagation approach to learning an EDCM finite mixture model. The EDCM distribution is an exponential-family approximation to the widely used Dirichlet Compound Multinomial distribution and has been shown to offer excellent modeling capabilities for sparse count data. Chapter 3 develops an efficient generative mixture model of EMSD distributions. We use Stochastic Expectation Propagation, which reduces memory consumption, an important characteristic when making inference on large datasets. Finally, Chapter 4 develops a probabilistic topic model using the generalized Dirichlet distribution (LGDA) in order to capture topic correlation while maintaining conjugacy. We make use of Expectation Propagation to approximate the posterior, resulting in a model that achieves more accurate inference than variational inference. We show that latent topics can be used as a proxy for improving supervised tasks.
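    For reference, the generic EP cycle for a posterior p(θ) ∝ p₀(θ) ∏ᵢ fᵢ(θ), approximated by q(θ) ∝ p₀(θ) ∏ᵢ f̃ᵢ(θ) with exponential-family sites f̃ᵢ, can be written (in LaTeX) as below; Stochastic EP keeps the same cycle but replaces the N individual sites with a single shared site raised to the N-th power, which is, roughly, the source of the memory saving mentioned above:

        \begin{align*}
        q^{\setminus i}(\theta) &\propto \frac{q(\theta)}{\tilde{f}_i(\theta)}
            && \text{(cavity distribution)} \\
        q^{\mathrm{new}}(\theta) &= \operatorname{proj}\!\big[\, q^{\setminus i}(\theta)\, f_i(\theta) \,\big]
            && \text{(moment matching onto the exponential family)} \\
        \tilde{f}_i^{\mathrm{new}}(\theta) &\propto \frac{q^{\mathrm{new}}(\theta)}{q^{\setminus i}(\theta)}
            && \text{(site update)}
        \end{align*}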

    Extensions to the Latent Dirichlet Allocation Topic Model Using Flexible Priors

    Intrinsically, topic models have their likelihood functions fixed to multinomial distributions, as they operate on count data rather than Gaussian data. As a result, when following the Bayesian paradigm, their performance ultimately depends on the flexibility of the chosen prior distributions, in contrast to classical approaches such as PLSA (probabilistic latent semantic analysis), unigrams, and mixtures of unigrams that do not use prior information. The standard LDA (latent Dirichlet allocation) topic model operates with a symmetric Dirichlet distribution (as a conjugate prior), which has been found to carry some limitations: its independence structure tends to hinder performance, for instance in modeling topic correlation, including positively correlated data. Compared to classical ML estimators, the use of priors presents another unique advantage: smoothing the multinomials while enhancing predictive topic models. In this thesis, we propose a series of flexible priors, such as the generalized Dirichlet (GD) and Beta-Liouville (BL), for our topic models within the collapsed representation, leading to much improved CVB (collapsed variational Bayes) update equations compared to those of standard LDA, because the flexibility of these priors significantly improves the lower bounds in the corresponding CVB algorithms. We also show the robustness of our proposed CVB inferences when using the BL and GD simultaneously in hybrid generative-discriminative models, where the generative stage produces good and heterogeneous topic features that are used in the discriminative stage by powerful classifiers such as SVMs (support vector machines), for which we propose efficient probabilistic kernels to facilitate the processing (classification) of documents based on topic signatures. In doing so, we implicitly cast topic modeling, an unsupervised learning method, into a supervised learning technique. Furthermore, because the CVB algorithm is complex in general (it requires second-order Taylor expansions) despite its flexibility, we propose a much simpler and more tractable update equation using a MAP (maximum a posteriori) framework with the standard EM (expectation-maximization) algorithm. As most Bayesian posteriors are not tractable for complex models, we ultimately propose the MAP-LBLA (latent BL allocation), where we characterize the contributions of asymmetric BL priors over the symmetric Dirichlet (Dir). The proposed MAP technique offers a point estimate (mode) with a much more tractable solution; we show that a point estimate can be easier to implement than a full Bayesian analysis that integrates over the entire parameter space. The MAP implicitly exhibits an equivalence relationship with the CVB, especially its zero-order approximation CVB0 and the stochastic version SCVB0. The proposed method enhances performance in information retrieval for text document analysis. We show that parametric topic models (as finite-dimensional methods) have a much smaller hypothesis space and generally suffer from model selection. We therefore propose a Bayesian nonparametric (BNP) technique that uses the hierarchical Dirichlet process (HDP) as a conjugate prior to the document multinomial distributions, where the asymmetric BL serves as a diffuse (probability) base measure providing the global atoms (topics) shared among documents.
    The heterogeneity in the topic structure helps provide an alternative to model selection, because the nonparametric topic model (which is infinite-dimensional, with a much bigger hypothesis space) can prune out irrelevant topics based on their associated probability masses and retain only the most relevant ones. We also show that for large-scale applications, stochastic optimization using natural gradients of the objective function performs well when both data and parameters are learned rapidly in an online (streaming) fashion. We use both predictive likelihood and perplexity as evaluation methods to assess the robustness of our proposed topic models, as we ultimately refer to probability as a way to quantify uncertainty in our Bayesian framework. We improve object categorization in terms of inference through the flexibility of our prior distributions in the collapsed space. We also improve information retrieval with the MAP and HDP-LBLA topic models while extending standard LDA. These two applications demonstrate the potential of enhancing a search engine based on topic models.
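    For reference, the zero-order collapsed variational update (CVB0) for standard LDA with symmetric Dirichlet priors, mentioned above, is (in LaTeX):

        \gamma_{djk} \;\propto\;
        \big( n_{dk}^{\neg dj} + \alpha_k \big)\,
        \frac{ n_{k w_{dj}}^{\neg dj} + \beta }{ n_{k\cdot}^{\neg dj} + V\beta }

    where γ_{djk} is the variational probability that token j of document d is assigned to topic k, the n counts are expected counts excluding that token, and V is the vocabulary size; the GD and BL priors proposed in the thesis modify these prior terms in the update.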

    Count Data Modeling and Classification Using Statistical Hierarchical Approaches and Multi-topic Models

    In this thesis, we propose and develop various statistical models to enhance and improve the efficiency of statistical modeling of count data in various applications. The major emphasis of the work is on developing hierarchical models. Various schemes of hierarchical structures are developed and analyzed, ranging from purely static hierarchies to dynamic models. The second part of the work concerns the development of multi-topic statistical models, which have been shown to provide more realistic modeling than mono-topic models. We proceed to develop several multi-topic models and analyze their performance against benchmark models. We show that our proposed models, in the majority of instances, improve modeling efficiency compared to the benchmark models without drastically increasing computational demands. In the last part of the work, we extend our proposed multi-topic models with online learning capability and again show their relative superiority over the benchmark models. Various real-world applications, such as object recognition, scene classification, text classification, and action recognition, are used to analyze the strengths and weaknesses of our proposed models.

    Trust and Reputation Management: a Probabilistic Approach

    Software architectures of large-scale systems are perceptibly shifting towards open and distributed computing. Web services have emerged as autonomous and self-contained business applications that are published, found, and used over the web. These web services exist in an environment in which they interact with each other to achieve their goals. Two challenging tasks that govern the agents' interactions have gained the attention of a large research community: web service selection and composition. The explosion of the number of published web services has contributed to the growth of large pools of similarly functional services. While this is vital for a competitive and healthy marketplace, it complicates the aforementioned tasks. Service consumers resort to the non-functional characteristics of available service providers to decide which service to interact with. Therefore, to optimize both tasks and maximize the gain of all involved agents, it is essential to build the capability of modeling and predicting the quality of these agents. In this thesis, we propose various trust and reputation models based on probabilistic approaches to address the web service selection and composition problems. These approaches consider the trustworthiness of a web service to be strongly tied to the outcomes of various quality-of-service metrics such as response time, throughput, and reliability. We represent these outcomes by a multinomial distribution whose parameters are learned using Bayesian inference, which, given a likelihood function and a prior probability, derives the posterior probability. Since the likelihood in this case is multinomial, a commonly used prior is the Dirichlet distribution. To overcome several limitations of the Dirichlet, we apply two alternative priors: the generalized Dirichlet and the Beta-Liouville. Using these distributions, the learned parameters represent the probabilities that a web service belongs to each of the considered quality classes. These probabilities are consequently used to compute the trustworthiness of the evaluated web services, thus assisting consumers in the service selection process. Furthermore, after exploring the correlations among various quality metrics on real data sets, we introduce a hybrid trust model that captures these correlations using both the Dirichlet and generalized Dirichlet distributions: given their covariance structures, the former performs better when modeling negative correlations, while the latter yields better modeling of positive correlations. To handle composite services, we propose various trust approaches using Bayesian networks and mixture models of three different distributions: the multinomial Dirichlet, the multinomial generalized Dirichlet, and the multinomial Beta-Liouville. Specifically, we employ a Bayesian network classifier with a Beta-Liouville prior to enable the classification of the QoS of composite services given the QoS of their constituents. In addition, we extend the previous models to function in online settings. We therefore present a generalized-Dirichlet power steady model that predicts compositional time series, and we similarly extend the Bayesian network model using the Voting EM algorithm, which enables the estimation of the network's parameters after each interaction with a composite web service. Furthermore, we propose an algorithm to estimate the reputation of web services.
    We extend this algorithm by leveraging various clustering and outlier detection techniques to deal with malicious feedback and the various strategic behaviors commonly exhibited by web services. Alternatively, we suggest two data fusion methods for reputation feedback aggregation, namely covariance intersection and ellipsoidal intersection. These methods handle the dependency between the information that propagates through networks of interacting agents, and they avoid overconfident estimates caused by redundant information. Finally, we present a reputation model for agent-based web services grouped into communities of homogeneous functionality. We exploit various clustering and anomaly detection techniques to analyze and identify the quality trends provided by each service. This model enables the master of each community to allocate incoming requests to the web service that best fulfills the quality requirements of the service consumers. We evaluate the effectiveness of the proposed approaches using both simulated and real data.
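    A minimal sketch of the conjugate multinomial-Dirichlet building block these trust models start from, with hypothetical names: observed QoS outcomes are binned into quality classes, the Dirichlet posterior accumulates their counts, and trust is the posterior mean probability of the satisfactory classes (the generalized Dirichlet and Beta-Liouville priors replace the Dirichlet to address its covariance limitations):

        import numpy as np

        class DirichletTrust:
            def __init__(self, n_classes, prior=1.0):
                self.alpha = np.full(n_classes, prior)   # Dirichlet prior pseudo-counts

            def observe(self, outcome):
                self.alpha[outcome] += 1                 # conjugate posterior update

            def class_probs(self):
                return self.alpha / self.alpha.sum()     # posterior mean per quality class

            def trust(self, good_classes):
                return self.class_probs()[list(good_classes)].sum()

        # Hypothetical usage: classes 0..2 = {poor, acceptable, excellent} response time.
        t = DirichletTrust(3)
        for outcome in [2, 2, 1, 2, 0, 2]:
            t.observe(outcome)
        print(t.trust(good_classes=[1, 2]))   # probability the service meets QoS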