96 research outputs found

    Mixed membership stochastic blockmodels

    Observations consisting of measurements on relationships for pairs of objects arise in many settings, such as protein interaction and gene regulatory networks, collections of author-recipient email, and social networks. Analyzing such data with probabilistic models can be delicate because the simple exchangeability assumptions underlying many boilerplate models no longer hold. In this paper, we describe a latent variable model of such data called the mixed membership stochastic blockmodel. This model extends blockmodels for relational data to ones that capture mixed membership latent relational structure, thus providing an object-specific low-dimensional representation. We develop a general variational inference algorithm for fast approximate posterior inference. We explore applications to social and protein interaction networks. Comment: 46 pages, 14 figures, 3 tables
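The abstract above does not spell out the generative process, so the following is only a minimal sketch of how a mixed membership stochastic blockmodel is usually described: each node draws a membership vector from a Dirichlet prior, and for every ordered pair the sender and receiver each pick a block from their own membership vector before the edge is drawn from a Bernoulli whose rate is looked up in a block-interaction matrix. The function names, the `alpha` and `block_matrix` parameters, and the use of the standard library `random` module are assumptions made for illustration, not the authors' implementation. It only simulates data and does not perform the variational inference mentioned in the abstract.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_mmsb(n_nodes, alpha, block_matrix, seed=0):
    """Simulate a directed graph from an (assumed) MMSB generative process.

    alpha        -- Dirichlet concentration, one positive value per block
    block_matrix -- block_matrix[g][h] is the edge probability from a sender
                    acting in block g to a receiver acting in block h
    """
    rng = random.Random(seed)
    n_blocks = len(alpha)
    # One mixed-membership vector per node.
    memberships = [sample_dirichlet(alpha, rng) for _ in range(n_nodes)]
    adjacency = [[0] * n_nodes for _ in range(n_nodes)]
    for p in range(n_nodes):
        for q in range(n_nodes):
            if p == q:
                continue  # no self-loops in this sketch
            # Each node picks a block for this interaction only.
            z_send = rng.choices(range(n_blocks), weights=memberships[p])[0]
            z_recv = rng.choices(range(n_blocks), weights=memberships[q])[0]
            if rng.random() < block_matrix[z_send][z_recv]:
                adjacency[p][q] = 1
    return memberships, adjacency


if __name__ == "__main__":
    # Hypothetical setting: three blocks with dense within-block interaction.
    B = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
    pi, Y = sample_mmsb(6, [0.5, 0.5, 0.5], B, seed=1)
    for row in Y:
        print(row)
```

Because a node picks a fresh block for every pair, the same node can behave as a member of different blocks in different interactions, which is what "mixed membership" refers to here.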

    A survey on Bayesian nonparametric learning

    © 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM. Bayesian (machine) learning has been playing a significant role in machine learning for a long time due to its particular ability to embrace uncertainty, encode prior knowledge, and endow interpretability. On the back of Bayesian learning's great success, Bayesian nonparametric learning (BNL) has emerged as a force for further advances in this field due to its greater modelling flexibility and representation power. Instead of playing with the fixed-dimensional probabilistic distributions of Bayesian learning, BNL creates a new “game” with infinite-dimensional stochastic processes. BNL has long been recognised as a research subject in statistics, and, to date, several state-of-the-art pilot studies have demonstrated that BNL has a great deal of potential to solve real-world machine-learning tasks. However, despite these promising results, BNL has not created a huge wave in the machine-learning community. Esotericism may account for this. The books and surveys on BNL written by statisticians are overcomplicated and filled with tedious theories and proofs. Each is certainly meaningful but may scare away new researchers, especially those with computer science backgrounds. Hence, the aim of this article is to provide a plain-spoken, yet comprehensive, theoretical survey of BNL in terms that researchers in the machine-learning community can understand. It is hoped this survey will serve as a starting point for understanding and exploiting the benefits of BNL in our current scholarly endeavours. To achieve this goal, we have collated the extant studies in this field and aligned them with the steps of a standard BNL procedure, from selecting the appropriate stochastic processes, through manipulation, to executing the model inference algorithms. At each step, past efforts have been thoroughly summarised and discussed. In addition, we have reviewed the common methods for implementing BNL in various machine-learning tasks, along with its diverse applications in the real world, as examples to motivate future studies.
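The abstract does not name a specific stochastic process, so as an illustration only, here is a minimal sketch of the Chinese restaurant process, a standard representation of the partition induced by a Dirichlet process. It shows what "infinite-dimensional" means in practice: the number of clusters is not fixed in advance but grows with the data. The function name and the `concentration` parameter are assumptions chosen for this example.

```python
import random


def chinese_restaurant_process(n_customers, concentration, seed=0):
    """Sample a random partition of n_customers items.

    Customer i joins an existing table with probability proportional to the
    number of customers already seated there, or opens a new table with
    probability proportional to `concentration` (must be positive).
    """
    rng = random.Random(seed)
    table_sizes = []   # table_sizes[k] = customers currently at table k
    assignments = []   # assignments[i] = table index of customer i
    for _ in range(n_customers):
        weights = table_sizes + [concentration]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(table_sizes):
            table_sizes.append(1)      # open a new table
        else:
            table_sizes[table] += 1    # join an existing table
        assignments.append(table)
    return assignments, table_sizes


if __name__ == "__main__":
    assignments, sizes = chinese_restaurant_process(50, concentration=1.0, seed=3)
    print("number of clusters:", len(sizes))
    print("cluster sizes:", sizes)
```

Larger values of `concentration` tend to produce more, smaller clusters; the number of occupied tables is a random quantity, not a parameter chosen up front.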

    Bayesian and machine learning approaches in metagenomics

    In this doctoral thesis, we present a novel set of bioinformatics tools to address key problems in the field of metagenomics. This set includes a fully probabilistic framework for estimating the number of genomes present at the species level in a metagenomic sample, the use of variational autoencoders as an alternative method for dimensionality reduction of the coverage and tetramer composition of metagenomic samples, and a natural language processing method for compressing the gene frequencies in metagenomes for better prediction of their phenotypic traits. The first tool tackles the problem of metagenomic binning. A Bayesian non-parametric method is used in conjunction with a Gaussian mixture model to estimate the number of genomes present more accurately and to cluster the contigs correctly into the appropriate bins. We call this method the DP (Dirichlet Process) algorithm. An attempt was made to improve the accuracy of the algorithm by incorporating extra information from the edges of the assembly graph, but this addition was not included in the final model because the signal from the data used was too weak. This method is validated on a 20-genome simulated mock community and is compared against state-of-the-art binners on a 100-genome simulated community in different scenarios using different numbers of samples. The results show that this method performs at least as well as the state-of-the-art methods, while outperforming them in some scenarios. This method is also applied to a real 11-sample infant gut dataset. The second tool concerns the prediction of phenotypic traits in metagenomes. In this part, we build on the idea of using the frequencies of annotated genes, based on the Kyoto Encyclopedia of Genes and Genomes (KEGG), to predict the presence or absence of 83 functional and metabolic traits. We apply the doc2vec algorithm as a dimensionality reduction method on 9407 prokaryotic genomes, experimenting with different compression dimensions and training various machine learning algorithms for the trait prediction part. We conclude that the dimensionality reduction improves the performance of the classifiers, and that it achieves the best results when combined with L1-regularized logistic regression on 100 dimensions. In addition, we train the classifiers on the uncompressed KO frequencies and identify the traits for which the compression offers no improvement, comparing the number of KOs present in each case. The third tool concerns the use of variational autoencoders for compressing the coverage and tetramer composition before binning in metagenomic samples. We combine the variational autoencoder architecture used in the VAMB binner for dimensionality reduction with the Bayesian non-parametric binning approach presented above. We tested this novel combination on the same 20-genome simulated mock community used previously and concluded that it clusters the contigs correctly more often than the DP algorithm at the species level. We also concluded that this combination does not perform well on real datasets, being unable to identify any 'good' bins, as assessed by the percentage of single-copy core genes present. The last part of this work is a case study of the oral microbiome. It is estimated that the oral cavity hosts over 700 species of bacteria. In this study, we analyze 131 oral metagenomic samples from 68 individuals. We follow an assembly-based approach and then split the analysis in two directions. In the first approach, the contigs are binned and the abundance of each bin in each sample is calculated. In the second approach, the contigs are not binned; open reading frames are called and mapped to KEGG genes, and the coverage of each gene in every sample is calculated. We associate these coverages with various metadata and attribute their variation to the presence of different species or KOs.
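The abstract mentions "tetramer composition" as one of the features used for binning without defining it. As a minimal sketch, assuming the usual meaning (the relative frequency of each length-4 nucleotide word in a contig), it could be computed as follows. The function name is hypothetical, and this is not the thesis's or VAMB's implementation; in particular, practical binners often merge reverse-complement tetramers, which this sketch does not do.

```python
from collections import Counter
from itertools import product

NUCLEOTIDES = "ACGT"
# All 4^4 = 256 possible tetramers in a fixed order.
ALL_TETRAMERS = ["".join(p) for p in product(NUCLEOTIDES, repeat=4)]


def tetramer_frequencies(sequence):
    """Return the relative frequency of every tetramer in a DNA sequence.

    Windows containing any character other than A, C, G, T (for example N)
    are skipped. If no valid window exists, all frequencies are 0.0.
    """
    sequence = sequence.upper()
    valid = set(NUCLEOTIDES)
    counts = Counter()
    for i in range(len(sequence) - 3):
        kmer = sequence[i:i + 4]
        if set(kmer) <= valid:
            counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {kmer: 0.0 for kmer in ALL_TETRAMERS}
    return {kmer: counts[kmer] / total for kmer in ALL_TETRAMERS}


if __name__ == "__main__":
    freqs = tetramer_frequencies("ACGTACGTAC")
    nonzero = {k: round(v, 3) for k, v in freqs.items() if v > 0}
    print(nonzero)
```

Each contig then becomes a fixed-length vector of 256 frequencies, which is the kind of input a dimensionality-reduction or clustering step can work with alongside coverage.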

    A survey of statistical network models

    Networks are ubiquitous in science and have become a focal point for discussion in everyday life. Formal statistical models for the analysis of network data have emerged as a major topic of interest in diverse areas of study, and most of these involve a form of graphical representation. Probability models on graphs date back to 1959. Along with empirical studies in social psychology and sociology from the 1960s, these early works generated an active network community and a substantial literature in the 1970s. This effort moved into the statistical literature in the late 1970s and 1980s, and the past decade has seen a burgeoning network literature in statistical physics and computer science. The growth of the World Wide Web and the emergence of online networking communities such as Facebook, MySpace, and LinkedIn, and a host of more specialized professional network communities has intensified interest in the study of networks and network data. Our goal in this review is to provide the reader with an entry point to this burgeoning literature. We begin with an overview of the historical development of statistical network modeling and then we introduce a number of examples that have been studied in the network literature. Our subsequent discussion focuses on a number of prominent static and dynamic network models and their interconnections. We emphasize formal model descriptions, and pay special attention to the interpretation of parameters and their estimation. We end with a description of some open problems and challenges for machine learning and statistics. Comment: 96 pages, 14 figures, 333 references

    Generative Probabilistic Models of Biological and Social Network Data

    Many complex systems can be represented as networks in which nodes are connected with edges. In cells, interactions between molecules, such as proteins, form a network, and social systems can consist of relationships between individual actors. Network analysis has developed from early studies of relationships between a small group of people to the analysis of huge complex networks, such as communication networks like Facebook and MySpace, or cell-wide biomolecular networks. In addition to being very large, the networks arising from real-world systems are typically sparse and contain missing and incomplete data. Successful analysis of such networks thus requires advanced computational methods. The topic of this thesis is a new generative probabilistic modeling framework, interaction component models, which is designed to detect densely connected subnetworks from noisy network data. Such subnetworks have many interpretations in practical applications, such as functional gene modules in protein interaction networks or communities in social networks. The model family is designed to be as simple as possible, to keep it understandable and computationally feasible. In this thesis, the model is applied to a new problem domain, namely protein interaction networks, in order to detect biologically relevant functional modules. Extensions to include additional data, such as attributes of the nodes, into the analysis are proposed and tested. Improvements to model inference are also introduced and their effect studied. In the experiments, models are able to find meaningful cluster structures from networks in several problem domains. The proposed modifications improve model performance.