99 research outputs found

    On the smoothing of multinomial estimates using Liouville mixture models and applications

    There has been major progress in recent years in statistical model-based pattern recognition, data mining, and knowledge discovery. In particular, generative models are widely used and are very reliable in terms of overall performance. The success of these models hinges on their ability to construct a representation that captures the underlying statistical distribution of the data. In this article, we focus on count data modeling. Indeed, this kind of data is naturally generated in many contexts and application domains. Models based on the multinomial assumption are usually adopted in this case, but they may have several shortcomings, especially in the case of high-dimensional sparse data. We therefore propose a principled approach to smoothing multinomials using a mixture of Beta-Liouville distributions, which is learned to reflect and model prior beliefs about the multinomial parameters. Via both theoretical interpretations and experimental validations, we argue that the proposed smoothing model is general and flexible enough to allow accurate representation of count data.
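
    The Beta-Liouville mixture prior itself is not reproduced here; as a point of reference, the sketch below shows the classical conjugate-Dirichlet smoothing that such a model generalizes. The posterior-mean estimate adds pseudo-counts to the raw frequencies, so unseen events in sparse count vectors no longer receive zero probability (the counts and pseudo-count value are illustrative).

```python
import numpy as np

def smoothed_multinomial(counts, alpha):
    """Posterior-mean estimate of multinomial parameters under a
    conjugate Dirichlet(alpha) prior: (x_i + a_i) / (n + sum(a))."""
    counts = np.asarray(counts, dtype=float)
    alpha = np.broadcast_to(np.asarray(alpha, dtype=float), counts.shape)
    return (counts + alpha) / (counts.sum() + alpha.sum())

# Sparse counts: the raw MLE assigns zero mass to unseen symbols,
# while the smoothed estimate does not.
x = np.array([5, 0, 1, 0])
print(x / x.sum())                   # MLE: [0.833 0.    0.167 0.   ]
print(smoothed_multinomial(x, 0.5))  # smoothed: all entries positive
```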

    Modern considerations for the use of naive Bayes in the supervised classification of genetic sequence data

    Genetic sequence classification is the task of assigning a known genetic label to an unknown genetic sequence. Often, this is the first step in genetic sequence analysis and is critical to understanding data produced by molecular techniques like high-throughput sequencing. Here, we explore an algorithm called naive Bayes that was historically successful in classifying 16S ribosomal gene sequences for microbiome analysis. We extend the naive Bayes classifier to perform the task of general sequence classification by leveraging advancements in computational parallelism and the statistical distributions that underlie naive Bayes. In Chapter 2, we show that our implementation of naive Bayes, called WarpNL, performs within a margin of error of modern classifiers like Kraken2 and local alignment. We discuss five crucial aspects of genetic sequence classification and show how these areas affect classifier performance: the query data, the reference sequence database, the feature encoding method, the classification algorithm, and access to computational resources. In Chapter 3, we cover the critical computational advancements introduced in WarpNL that make it efficient in a modern computing framework. These include efficient feature encoding, the introduction of a log-odds ratio for comparing naive Bayes posterior estimates, a description of schemas for parallel and distributed naive Bayes architectures, and the use of machine learning classifiers to perform outgroup sequence classification. Finally, in Chapter 4, we explore a variant of the Dirichlet multinomial distribution that underlies the naive Bayes likelihood, called the beta-Liouville multinomial. We show that the beta-Liouville multinomial can be used to enhance classifier performance, and we provide mathematical proofs regarding its convergence during maximum likelihood estimation. Overall, this work explores the naive Bayes algorithm in a modern context and shows that it is competitive for genetic sequence classification.
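
    WarpNL's implementation is not shown here; the following minimal sketch (hypothetical k-mer count inputs, not the thesis code) illustrates the standard multinomial naive Bayes scoring in log space that it builds on, together with a log-odds ratio between the two best classes in the spirit of Chapter 3.

```python
import numpy as np

def nb_log_scores(x, log_priors, log_theta):
    """Log joint score log P(c) + sum_i x_i * log theta[c, i] per class.

    x          : (V,) k-mer count vector for one query sequence
    log_priors : (C,) log class priors
    log_theta  : (C, V) log multinomial parameters per class
    """
    return log_priors + log_theta @ x

def classify_with_log_odds(x, log_priors, log_theta):
    """Return the best class and the log-odds gap to the runner-up,
    a simple confidence measure for the prediction."""
    scores = nb_log_scores(x, log_priors, log_theta)
    order = np.argsort(scores)[::-1]
    return order[0], scores[order[0]] - scores[order[1]]
```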

    Extensions to the Latent Dirichlet Allocation Topic Model Using Flexible Priors

    Intrinsically, topic models have their likelihood functions fixed to multinomial distributions, as they operate on count data instead of Gaussian data. As a result, their performance ultimately depends on the flexibility of the chosen prior distributions when following the Bayesian paradigm, compared to classical approaches such as PLSA (probabilistic latent semantic analysis), unigrams, and mixtures of unigrams that do not use prior information. The standard LDA (latent Dirichlet allocation) topic model operates with a symmetric Dirichlet distribution (as a conjugate prior), which has been found to carry some limitations: its independence structure tends to hinder performance, for instance in modeling topic correlation, including positively correlated data. Compared to classical ML estimators, the use of priors presents the additional advantage of smoothing the multinomials while enhancing predictive topic models. In this thesis, we propose a series of flexible priors, such as the generalized Dirichlet (GD) and Beta-Liouville (BL), for our topic models within the collapsed representation, leading to much improved CVB (collapsed variational Bayes) update equations compared to those of the standard LDA. This is because the flexibility of these priors significantly improves the lower bounds in the corresponding CVB algorithms. We also show the robustness of our proposed CVB inferences when simultaneously using the BL and GD in hybrid generative-discriminative models, where the generative stage produces good and heterogeneous topic features that are used in the discriminative stage by powerful classifiers such as SVMs (support vector machines); we propose efficient probabilistic kernels to facilitate the processing (classification) of documents based on topic signatures. In doing so, we implicitly cast topic modeling, an unsupervised learning method, into a supervised learning technique. Furthermore, despite its flexibility, the CVB algorithm is complex in general (it requires second-order Taylor expansions), so we propose a much simpler and tractable update equation using a MAP (maximum a posteriori) framework with the standard EM (expectation-maximization) algorithm. As most Bayesian posteriors are not tractable for complex models, we ultimately propose the MAP-LBLA (latent BL allocation), where we characterize the contributions of asymmetric BL priors over the symmetric Dirichlet (Dir). The proposed MAP technique importantly offers a point estimate (mode) with a much more tractable solution; we show that this point estimate is easier to implement than a full Bayesian analysis that integrates over the entire parameter space. The MAP implicitly exhibits an equivalence relationship with the CVB, especially its zero-order approximation CVB0 and the stochastic version SCVB0. The proposed method enhances performance in information retrieval for text document analysis. We show that parametric topic models (as finite-dimensional methods) have a much smaller hypothesis space and generally suffer from model selection issues. We therefore propose a Bayesian nonparametric (BNP) technique that uses the hierarchical Dirichlet process (HDP) as a conjugate prior to the document multinomial distributions, where the asymmetric BL serves as a diffuse (probability) base measure that provides the global atoms (topics) shared among documents.
The heterogeneity in the topic structure helps provide an alternative to model selection, because the nonparametric topic model (which is infinite-dimensional, with a much bigger hypothesis space) can prune out irrelevant topics based on the associated probability masses and retain only the most relevant ones. We also show that for large-scale applications, stochastic optimization using natural gradients of the objective functions performs well when both data and parameters are learned rapidly in an online (streaming) fashion. We use both predictive likelihood and perplexity as evaluation methods to assess the robustness of our proposed topic models, as we ultimately refer to probability as a way to quantify uncertainty in our Bayesian framework. We improve object categorization, in terms of inference, through the flexibility of our prior distributions in the collapsed space. We also improve information retrieval with the MAP and HDP-LBLA topic models while extending the standard LDA. These two applications demonstrate the potential of enhancing a search engine based on topic models.
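
    The GD- and BL-based collapsed updates derived in the thesis are not reproduced here; for orientation, below is a minimal sketch of the standard Dirichlet-prior CVB0 token update that they extend (variable names are illustrative, and the excluded-token bookkeeping is left to the caller).

```python
import numpy as np

def cvb0_token_update(N_dk, N_kw, N_k, alpha, beta, V):
    """One CVB0 update for a single token with word id w under LDA
    with a symmetric topic-word prior beta:

        gamma_k ∝ (N_dk[k] + alpha[k]) * (N_kw[k] + beta) / (N_k[k] + V*beta)

    N_dk : (K,) topic counts in the token's document (token excluded)
    N_kw : (K,) counts of word w per topic (token excluded)
    N_k  : (K,) total counts per topic (token excluded)
    """
    weights = (N_dk + alpha) * (N_kw + beta) / (N_k + V * beta)
    return weights / weights.sum()
```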

    Distribution-based Regression for Count and Semi-Bounded Data

    Data mining techniques have been successfully utilized in different applications in significant fields, including pattern recognition, computer vision, and medical research. With the wealth of data generated every day, there is a lack of practical analysis tools to discover hidden relationships and trends. Among all statistical frameworks, regression has proven to be one of the strongest tools for prediction. Data complexity that is unfavorable for most models poses a considerable challenge in prediction. The ability of a model to perform accurately and efficiently is extremely important; thus, a model must be selected that fits the data well, such that learning from previous data is efficient and highly accurate. This work is motivated by the limited number of regression analysis tools for multivariate count data in the literature. We propose two regression models for count data based on flexible distributions, namely the multinomial Beta-Liouville and the multinomial scaled Dirichlet, and evaluate them on the problem of disease diagnosis. Performance is measured by the accuracy of the prediction, which depends on the nature and complexity of the dataset. Our results show the efficiency of the two proposed regression models: the prediction performance of both is competitive with other regression approaches previously used for count data and with the best results in the literature. We then propose three regression models for positive vectors based on flexible distributions for semi-bounded data, namely the inverted Dirichlet, inverted generalized Dirichlet, and inverted Beta-Liouville. The efficiency of these models is tested via real-world applications, including software defect prediction, spam filtering, and disease diagnosis. Our results show that the performance of the three proposed regression models is better than that of other commonly used regression models.
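
    The multinomial Beta-Liouville and multinomial scaled Dirichlet regressions themselves are not reproduced here; as a baseline for what they extend, below is a minimal sketch of plain multinomial (softmax) regression for count responses, fit by gradient ascent on the multinomial log-likelihood (data shapes and the learning rate are illustrative).

```python
import numpy as np

def fit_multinomial_regression(X, Y, lr=0.1, iters=500):
    """Baseline softmax regression for count responses: each row Y[i]
    is a count vector assumed multinomial with probabilities
    softmax(X[i] @ W). Fit by gradient ascent on the log-likelihood.

    X : (n, d) covariates; Y : (n, K) count vectors.
    """
    n, d = X.shape
    W = np.zeros((d, Y.shape[1]))
    totals = Y.sum(axis=1, keepdims=True)  # trial counts per row
    for _ in range(iters):
        Z = X @ W
        P = np.exp(Z - Z.max(axis=1, keepdims=True))  # stable softmax
        P /= P.sum(axis=1, keepdims=True)
        W += lr * (X.T @ (Y - totals * P)) / n  # d log-lik / dW
    return W
```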

    Multi-source change-point detection over local observation models

    In this work, we address the problem of change-point detection (CPD) on high-dimensional, multi-source, and heterogeneous sequential data with missing values. We present a new CPD methodology based on local latent variable models and adaptive factorizations that enhances the fusion of multi-source observations with different statistical data types and addresses the problem of high dimensionality. Our motivation comes from behavioral change detection in healthcare, measured through smartphone monitoring data and Electronic Health Records. Due to the high dimension of the observations and the differences in the relevance of each source's information, other works fail to obtain reliable estimates of the change-point locations. This leads to methods that are not sensitive enough when dealing with interspersed changes of different intensities within the same sequence, or with partially missing components. Through the definition of local observation models (LOMs), we transfer the local CP information to homogeneous latent spaces and propose several factorizations that weight the contribution of each source to the global CPD. With the presented methods we demonstrate, on a synthetic dataset, a reduction in both the detection delay and the number of undetected CPs, together with robustness against the presence of missing values. We illustrate the application on real-world data from a smartphone-based monitoring study and add explainability regarding the degree to which each source contributes to the detection. This work has been partly supported by the Spanish government (AEI/MCI) under grants RTI2018-099655-B-100, PID2021-123182OB-I00, PID2021-125159NB-I00, and TED2021-131823B-I00, by Comunidad de Madrid under grants IND2018/TIC-9649 and IND2022/TIC-23550, by the European Union (FEDER) and the European Research Council (ERC) through the European Union's Horizon 2020 research and innovation program under Grant 714161, and by Comunidad de Madrid and FEDER through IntCARE-CM.
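
    The LOM-based factorizations are not reproduced here; purely to illustrate the weighted-fusion idea, the sketch below combines per-source CUSUM-style change statistics into a global detection score (the Gaussian mean-shift model, weights, and threshold are all illustrative assumptions, not the paper's method).

```python
import numpy as np

def cusum_scores(x, mu0, mu1, sigma=1.0):
    """Running CUSUM statistic for a mean shift mu0 -> mu1 in a
    Gaussian stream; large values suggest a change has occurred."""
    x = np.asarray(x, dtype=float)
    llr = ((mu1 - mu0) / sigma**2) * (x - (mu0 + mu1) / 2.0)
    out, s = np.empty(len(x)), 0.0
    for t, step in enumerate(llr):
        s = max(0.0, s + step)  # reset at zero, accumulate evidence
        out[t] = s
    return out

def fused_detection(scores_per_source, weights, threshold):
    """Weight and sum per-source change statistics into a global
    score, then flag the time steps exceeding the threshold."""
    fused = sum(w * s for w, s in zip(weights, scores_per_source))
    return fused > threshold
```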

    Trust and Reputation Management: a Probabilistic Approach

    Software architectures of large-scale systems are perceptibly shifting towards open and distributed computing. Web services have emerged as autonomous and self-contained business applications that are published, found, and used over the web. These web services thus exist in an environment in which they interact with each other to achieve their goals. Two challenging tasks that govern the agents' interactions have gained the attention of a large research community: web service selection and composition. The explosion in the number of published web services has contributed to the growth of large pools of similarly functional services. While this is vital for a competitive and healthy marketplace, it complicates the aforementioned tasks. Service consumers resort to the non-functional characteristics of available service providers to decide which service to interact with. Therefore, to optimize both tasks and maximize the gain of all involved agents, it is essential to build the capability of modeling and predicting the quality of these agents. In this thesis, we propose various trust and reputation models based on probabilistic approaches to address the web service selection and composition problems. These approaches consider the trustworthiness of a web service to be strongly tied to the outcomes of various quality-of-service metrics such as response time, throughput, and reliability. We represent these outcomes by a multinomial distribution whose parameters are learned using Bayesian inference, which, given a likelihood function and a prior probability, derives the posterior probability. Since the likelihood in this case is multinomial, a commonly used prior is the Dirichlet distribution. To overcome several limitations of the Dirichlet, we apply two alternative priors: the generalized Dirichlet and the Beta-Liouville. Using these distributions, the learned parameters represent the probabilities that a web service belongs to each of the considered quality classes. These probabilities are consequently used to compute the trustworthiness of the evaluated web services, thus assisting consumers in the service selection process. Furthermore, after exploring the correlations among various quality metrics using real data sets, we introduce a hybrid trust model that captures these correlations using both the Dirichlet and generalized Dirichlet distributions. Given their covariance structures, the former performs better when modeling negative correlations, while the latter yields better modeling of positive correlations. To handle composite services, we propose various trust approaches using Bayesian networks and mixture models of three different distributions: the multinomial Dirichlet, the multinomial generalized Dirichlet, and the multinomial Beta-Liouville. Specifically, we employ a Bayesian network classifier with a Beta-Liouville prior to enable classification of the QoS of composite services given the QoS of their constituents. In addition, we extend the previous models to function in online settings. We therefore present a generalized-Dirichlet power steady model that predicts compositional time series. We similarly extend the Bayesian network model using the Voting EM algorithm; this extension enables the estimation of the network's parameters after each interaction with a composite web service. Furthermore, we propose an algorithm to estimate the reputation of web services.
We extend this algorithm by leveraging the capabilities of various clustering and outlier detection techniques to deal with malicious feedback and various strategic behaviors commonly performed by web services. Alternatively, we suggest two data fusion methods for reputation feedback aggregation, namely covariance intersection and ellipsoidal intersection. These methods handle the dependency between the pieces of information that propagate through networks of interacting agents; they also avoid overconfident estimates caused by redundant information. Finally, we present a reputation model for agent-based web services grouped into communities of homogeneous functionality. We exploit various clustering and anomaly detection techniques to analyze and identify the quality trends provided by each service. This model enables the master of each community to allocate incoming requests to the web service that best fulfills the quality requirements of the service consumers. We evaluate the effectiveness of the proposed approaches using both simulated and real data.
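
    As a concrete illustration of the conjugate update described above (a Dirichlet prior over QoS classes combined with multinomial outcome counts), here is a minimal sketch; the generalized Dirichlet and Beta-Liouville variants proposed in the thesis follow the same pattern with different priors, and the class utilities below are hypothetical.

```python
import numpy as np

def posterior_quality(alpha_prior, outcome_counts):
    """Conjugate update: Dirichlet(alpha) prior + multinomial counts
    -> Dirichlet(alpha + counts). Returns the posterior mean, i.e.
    the estimated probability of each QoS class."""
    alpha_post = np.asarray(alpha_prior, float) + np.asarray(outcome_counts, float)
    return alpha_post / alpha_post.sum()

def trust_score(class_probs, class_utilities):
    """Trustworthiness as expected utility over QoS classes."""
    return float(np.dot(class_probs, class_utilities))

# Three QoS classes (good / acceptable / poor), uniform prior,
# and 20 observed interactions with the service.
probs = posterior_quality([1.0, 1.0, 1.0], [14, 4, 2])
print(trust_score(probs, [1.0, 0.5, 0.0]))  # higher = more trustworthy
```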

    Approximate Bayesian Inference for Count Data Modeling

    Bayesian inference allows us to draw conclusions based on antecedents that depend on prior knowledge. It additionally allows us to quantify uncertainty, which is important in machine learning for making better predictions and for model interpretability. However, in real applications we often deal with complicated models for which full Bayesian inference is infeasible. This thesis explores the use of approximate Bayesian inference for count data modeling using Expectation Propagation and Stochastic Expectation Propagation. In Chapter 2, we develop an expectation propagation approach to learn an EDCM finite mixture model. The EDCM distribution is an exponential-family approximation to the widely used Dirichlet Compound Multinomial distribution and has been shown to offer excellent modeling capabilities in the case of sparse count data. Chapter 3 develops an efficient generative mixture model of EMSD distributions. We use Stochastic Expectation Propagation, which reduces memory consumption, an important characteristic when performing inference on large datasets. Finally, Chapter 4 develops a probabilistic topic model using the generalized Dirichlet distribution (LGDA) in order to capture topic correlation while maintaining conjugacy. We make use of Expectation Propagation to approximate the posterior, resulting in a model that achieves more accurate inference than variational inference. We show that latent topics can be used as a proxy for improving supervised tasks.
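
    The EP updates themselves are not sketched here; instead, since EDCM approximates it, below is the log-pmf of the Dirichlet Compound Multinomial computed stably with log-gamma functions (a standard identity, not the thesis code).

```python
import numpy as np
from scipy.special import gammaln

def dcm_log_pmf(x, alpha):
    """Log-pmf of the Dirichlet Compound Multinomial (DCM):

    log P(x|a) = log n! - sum_i log x_i!
               + log G(A) - log G(n + A)
               + sum_i [log G(x_i + a_i) - log G(a_i)]

    with G the gamma function, n = sum_i x_i and A = sum_i a_i.
    """
    x = np.asarray(x, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    n, A = x.sum(), alpha.sum()
    coef = gammaln(n + 1.0) - gammaln(x + 1.0).sum()
    return coef + gammaln(A) - gammaln(n + A) \
           + (gammaln(x + alpha) - gammaln(alpha)).sum()
```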

    Novel Mixture Allocation Models for Topic Learning

    Unsupervised learning has been an interesting area of research in recent years. Novel algorithms are being built on the basis of unsupervised learning methodologies to solve many real-world problems. Topic modelling is one such fascinating methodology that identifies patterns as topics within data. The introduction of latent Dirichlet allocation (LDA) has bolstered research on topic modelling approaches, with modifications specific to each application. However, the basic assumption in LDA of a Dirichlet prior for topic proportions might not be applicable in certain real-world scenarios. Hence, in this thesis we explore the use of the generalized Dirichlet (GD) and Beta-Liouville (BL) distributions as alternative priors for topic proportions. In addition, we assume a mixture of distributions over topic proportions, which provides a better fit to the data. In order to accommodate the application of the resulting models to real-time streaming data, we also provide an online learning solution for the models. A supervised version of the learning framework is also provided and is shown to be advantageous when labelled data are available. The topics thus derived may not always be accurate; to alleviate this problem, we integrate an interactive approach that uses input from the user to improve the quality of the identified topics. We have also adapted our models to interesting applications such as parallel topic extraction from multilingual texts and content-based recommendation systems, demonstrating the adaptability of the proposed models. In the case of multilingual topic extraction, we use global topic proportions sampled from a Dirichlet process (DP) to tackle the problem, and in the case of recommendation systems, we use word co-occurrences to our advantage. For inference, we use a variational approach, which makes the computation of the variational solutions easier. The applications with which we validated our models show the efficiency of the proposed models.
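
    The full mixture allocation models are not reproduced here; to make the flexible-prior point concrete, below is a minimal sketch of sampling topic proportions from a generalized Dirichlet via its stick-breaking construction (parameter values are illustrative). Unlike the Dirichlet, the GD has a richer covariance structure, which is what makes it attractive as a prior over topic proportions.

```python
import numpy as np

def sample_generalized_dirichlet(a, b, size=1, rng=None):
    """Draw (size, K) topic-proportion vectors from a generalized
    Dirichlet GD(a_1..a_{K-1}, b_1..b_{K-1}):

        V_k ~ Beta(a_k, b_k)
        theta_k = V_k * prod_{j<k} (1 - V_j),  k = 1..K-1
        theta_K = prod_{j<K} (1 - V_j)         (remaining mass)
    """
    rng = np.random.default_rng(rng)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    V = rng.beta(a, b, size=(size, len(a)))
    stick = np.cumprod(1.0 - V, axis=1)  # prod_{j<=k} (1 - V_j)
    theta = np.empty((size, len(a) + 1))
    theta[:, 0] = V[:, 0]
    theta[:, 1:-1] = V[:, 1:] * stick[:, :-1]
    theta[:, -1] = stick[:, -1]
    return theta  # rows sum to 1

# Example: 5 topics, asymmetric GD parameters (illustrative values)
print(sample_generalized_dirichlet([2.0, 1.0, 3.0, 0.5],
                                   [1.0, 2.0, 1.0, 0.5], size=2))
```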