9 research outputs found

    Clustering with shallow trees

    We propose a new method for hierarchical clustering based on the optimisation of a cost function over trees of limited depth, and we derive a message-passing method that solves it efficiently. The method and algorithm can be interpreted as a natural interpolation between two well-known approaches, namely single linkage and the recently presented Affinity Propagation. Using this general scheme, we analyze three structured biological/medical datasets (human population based on genetic information, proteins based on sequences, and verbal autopsies) and show that the interpolation technique provides new insight. Comment: 11 pages, 7 figures

    SteinerNet: a web server for integrating ‘omic’ data to discover hidden components of response pathways

    High-throughput technologies including transcriptional profiling, proteomics and reverse genetics screens provide detailed molecular descriptions of cellular responses to perturbations. However, it is difficult to integrate these diverse data to reconstruct biologically meaningful signaling networks. Previously, we have established a framework for integrating transcriptional, proteomic and interactome data by searching for the solution to the prize-collecting Steiner tree problem. Here, we present a web server, SteinerNet, to make this method available in a user-friendly format for a broad range of users with data from any species. At a minimum, a user only needs to provide a set of experimentally detected proteins and/or genes and the server will search for connections among these data from the provided interactomes for yeast, human, mouse, Drosophila melanogaster and Caenorhabditis elegans. More advanced users can upload their own interactome data as well. The server provides interactive visualization of the resulting optimal network and downloadable files detailing the analysis and results. We believe that SteinerNet will be useful for researchers who would like to integrate their high-throughput data for a specific condition or cellular response and to find biologically meaningful pathways. SteinerNet is accessible at http://fraenkel.mit.edu/steinernet. Funding: National Institutes of Health (U.S.) (U54-CA112967, R01-GM089903); National Science Foundation (Award Number DB1-0821391).

    Using affinity propagation for identifying subspecies among clonal organisms: lessons from M. tuberculosis

    Background: Classification and naming is a key step in the analysis, understanding and adequate management of living organisms. However, where to set limits between groups can be puzzling, especially in clonal organisms. Within the Mycobacterium tuberculosis complex (MTC), the etiological agent of tuberculosis (TB), experts first identified several groups according to their pattern at repetitive sequences, especially at the CRISPR locus (spoligotyping), and to their epidemiological relevance. Most groups, such as "Beijing", found good support when tested with other loci. However, other groups, such as the T family and T1 subfamily (belonging to the "Euro-American" lineage), correspond to non-monophyletic groups and still need to be refined. Here, we propose to use a method called Affinity Propagation, which has been successfully used in image categorization, to identify relevant patterns at the CRISPR locus in MTC. Results: To adequately infer the relative divergence time between strains, we used a distance method inspired by the recent evolutionary model by Reyes et al. We first confirm that this method performs better than the Jaccard index commonly used to compare spoligotype patterns. Second, we document the support of each spoligotype family in the previous classification using affinity propagation on the international spoligotyping database SpolDB4. This allowed us to propose a consensus assignment for all SpolDB4 spoligotypes. Third, we propose new signatures to subclassify the T family. Conclusion: Altogether, this study shows how the new clustering algorithm Affinity Propagation can help build or refine classifications of clonal organisms. It also describes well-supported families and subfamilies within the M. tuberculosis complex, especially inside the modern "Euro-American" lineage.
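For readers unfamiliar with the baseline mentioned in this abstract, the following is a minimal sketch of the Jaccard index applied to two presence/absence patterns. It is not the distance method the authors propose. The function name `jaccard_index`, the string encoding of patterns ("1" for a spacer that is present, "0" for absent), and the short 8-position toy patterns are all assumptions made for illustration.

```python
def jaccard_index(pattern_a: str, pattern_b: str) -> float:
    """Jaccard index between two binary presence/absence patterns.

    Each pattern is a string of '1' (present) and '0' (absent).
    This encoding is a hypothetical choice for the example.
    """
    if len(pattern_a) != len(pattern_b):
        raise ValueError("patterns must have the same length")
    # Positions present in both patterns (intersection).
    both = sum(1 for a, b in zip(pattern_a, pattern_b) if a == "1" and b == "1")
    # Positions present in at least one pattern (union).
    either = sum(1 for a, b in zip(pattern_a, pattern_b) if a == "1" or b == "1")
    # Two all-absent patterns are treated as identical by convention here.
    return both / either if either else 1.0


# Hypothetical toy patterns; real spoligotype patterns are longer.
toy_a = "11110011"
toy_b = "11100011"
similarity = jaccard_index(toy_a, toy_b)
```

The index only counts shared and unshared positions, so it ignores any model of how patterns evolve, which may be one reason a model-based distance can perform better.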

    Shape similarity, better than semantic membership, accounts for the structure of visual object representations in a population of monkey inferotemporal neurons

    The anterior inferotemporal cortex (IT) is the highest stage along the hierarchy of visual areas that, in primates, processes visual objects. Although several lines of evidence suggest that IT primarily represents visual shape information, some recent studies have argued that neuronal ensembles in IT code the semantic membership of visual objects (i.e., represent conceptual classes such as animate and inanimate objects). In this study, we investigated to what extent semantic, rather than purely visual, information is represented in IT by performing a multivariate analysis of IT responses to a set of visual objects. By relying on a variety of machine-learning approaches (including a cutting-edge clustering algorithm that has been recently developed in the domain of statistical physics), we found that, in most instances, IT representation of visual objects is accounted for by their similarity at the level of shape or, more surprisingly, low-level visual properties. Only in a few cases did we observe IT representations of semantic classes that were not explainable by the visual similarity of their members. Overall, these findings reassert the primary function of IT as a conveyor of explicit visual shape information, and reveal that low-level visual properties are represented in IT to a greater extent than previously appreciated. In addition, our work demonstrates how combining a variety of state-of-the-art multivariate approaches, and carefully estimating the contribution of shape similarity to the representation of object categories, can substantially advance our understanding of neuronal coding of visual objects in cortex.

    Statistical mechanics for biological applications: focusing on the immune system

    The emergence in recent decades of huge amounts of data in many fields of biology has also increased the interest of quantitative disciplines in the life sciences. Mathematics, physics and informatics have been providing quantitative models and advanced statistical tools to help the understanding of many biological problems. Statistical mechanics has contributed particularly to quantitative biology because it is intrinsically suited to dealing with systems of many strongly interacting agents, noise, information processing and statistical inference. This Thesis presents a collection of works at the interface between statistical mechanics and biology. In particular, they address biological problems that can mainly be traced back to the biology of the immune system. Beyond the unifying themes of the statistical mechanics of discrete systems and the quantitative modeling and analysis of the immune system, the works presented here are quite diverse. This heterogeneity stems from the intent of using and learning many different techniques during the time needed to prepare the work reviewed in this Thesis. The work presented in Chapter 3 mainly deals with statistical mechanics, network theory, and numerical simulation and analysis of networks; Chapter 4 presents a work oriented towards mathematical physics; Chapters 5 and 6 deal with data analysis, in particular with clinical data and amino acid sequence data sets, requiring both analytical and numerical techniques. The Thesis is conceptually organized in two main parts. The first part (Chapters 1 and 2) reviews known results in both statistical mechanics and biology, while the second part (Chapters 3, 4 and 6) presents the original works together with brief insights into the research fields in which they are embedded.
In particular, Chapter 1 reviews some of the most relevant models and techniques in the statistical mechanics of mean-field spin systems, starting with the Ising model and then moving to the Sherrington-Kirkpatrick model for spin glasses and the Hopfield model for attractor neural networks. The replica method is presented together with the stochastic stability method as a mathematically rigorous alternative to replicas. Chapter 2 gives a very schematic overview of the biology of the immune system. In Chapter 3, Section 3.1 presents a phenomenological mathematical model for the study of the idiotypic network, while Section 3.2 reviews the statistical-mechanics-based models proposed by Elena Agliari and Adriano Barra as toy models meant to underline the possible role of complex networks within the immune system. Chapter 4 studies the mathematical model of an analogue neural network on a diluted graph. It is shown how the problem can be mapped onto a bipartite diluted spin glass. The model is rigorously solved at the replica-symmetric level using the stochastic stability technique, and fluctuation analysis is used to study the spin-glass transition of the system. A topological analysis of the network is also performed, and different topological regimes are proven to emerge through the tuning of the model parameters. Chapter 5 presents a model for the analysis of clinical records of testing sets of patients. The model is based on a Markov chain over the space of clinical states. The machinery is applied to data concerning the onset of Tuberculosis and Non-Tuberculous Infections as side effects in patients treated with Tumor Necrosis Factor inhibitors. The analysis procedure is capable of capturing clinical details of the behaviors of different drugs.
Lastly, Chapter 6 is dedicated to a statistical inference analysis of deep sequencing data from an antibody repertoire, with the purpose of studying the problem of antibody affinity maturation. A partial antibody repertoire from an HIV-1 infected donor presenting broadly neutralizing serum is used to infer a probability distribution in the space of sequences, which is compared with neutralization power measurements and with the deposited crystallographic structure of a deeply matured antibody. The work is still in progress, but preliminary results are encouraging and are presented here.

    Statistical mechanics models for biological systems: cooperativity in biochemistry and affinity maturation of antibodies

    Statistical Mechanics provides useful tools and concepts to deal with the collective behavior of many strongly interacting agents. Setting aside the detailed and specific description of the interactions to focus on the key features allows one to ask different questions concerning the global systemic properties of biological systems. The information processing and statistical inference approach has become more urgent in recent decades due to the large amount of data coming from the use of new experimental techniques. Concepts such as entropy, phase transition and criticality have entered the essential terminology for describing the nature of biological systems at very different levels of complexity: from animal collective behaviour and physiological apparatuses such as the nervous system and the immune system to the biochemical processes in cells. The studies presented in this thesis are placed in this interdisciplinary border context. The thesis is divided into three main parts. The first is devoted to the more formal aspects of statistical mechanics models of spin systems. In the first chapter we briefly review three milestone models of spin systems: the Curie-Weiss, the Sherrington-Kirkpatrick and the Hopfield model. These models are the paradigmatic examples of mean-field Statistical Mechanics and form the ground for the studies in biochemical kinetics and immunology presented in the following parts. In the second chapter we report a detailed study of a generalization of the Hopfield model with diluted and correlated patterns. We investigate the topology of the emergent interaction network. We find an exact expression for the coupling distribution that allows different regimes to be distinguished as the dilution parameter varies. Moreover, we study the thermodynamic properties of the model, explicitly obtaining the replica-symmetric free energy together with its self-consistency equations.
From the small-overlap expansion of these self-consistency equations we obtain the critical surface separating the ergodic phase from the spin-glass phase. The second part of the thesis focuses on the investigation of cooperative behavior in biochemical kinetics through mean-field statistical mechanics. Cooperativity is one of the most important properties of molecular interactions in biological systems, as it is often invoked to account for collective features in binding phenomena. It constitutes a fundamental tool that nature developed to modulate the chemical response of biological systems to varying stimuli. Statistical mechanics offers a valuable approach as, from its first principles, it aims to explain collective phenomena, allowing a unified and broader theory for complex chemical kinetics. In this way different cooperative behaviors, described by the related binding curves, can be analysed in a unified framework. We compare the theoretical curves predicted by the model with experimental data found in the literature, finding overall good agreement and extracting the values of the effective interactions between the binding sites, which can be put in direct correspondence with the standard coefficient that measures cooperativity (Hill number). Moreover, an extension of the model makes it possible to take into account heterogeneity that can affect both the couplings between the multiple active sites (allosteric regulation) and the chemical potentials in the binding of the ligands. The last part is dedicated to a statistical inference analysis of deep sequencing data from an antibody repertoire, with the purpose of studying the process of antibody affinity maturation. A partial antibody repertoire from an HIV-1 infected donor presenting broadly neutralizing serum is used to infer a probability distribution in the space of sequences. The idea is to use the model to study the structure of the affinity with an antigen as a function of the antibody sequence.
We test this strategy using neutralization power measurements and the deposited crystallographic structure of a deeply matured antibody. The work is still in progress, but preliminary results are encouraging and are presented here.
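As background for the Hill number mentioned in this abstract, here is a minimal sketch of the standard Hill equation for a binding curve, theta = L^n / (K^n + L^n). It is the textbook phenomenological form, not the mean-field statistical mechanics model developed in the thesis. The function name `hill_fraction` and the parameter names are assumptions made for illustration.

```python
def hill_fraction(ligand: float, k_half: float, n_hill: float) -> float:
    """Fraction of occupied binding sites from the standard Hill equation.

    ligand: free ligand concentration L (same units as k_half)
    k_half: concentration at which half of the sites are occupied (K)
    n_hill: Hill coefficient; n > 1 indicates positive cooperativity,
            n = 1 no cooperativity, n < 1 negative cooperativity
    """
    if ligand < 0 or k_half <= 0:
        raise ValueError("ligand must be >= 0 and k_half must be > 0")
    return ligand**n_hill / (k_half**n_hill + ligand**n_hill)


# Compare a non-cooperative and a cooperative curve at the same points.
concentrations = [0.5, 1.0, 2.0]
non_cooperative = [hill_fraction(c, 1.0, 1.0) for c in concentrations]
cooperative = [hill_fraction(c, 1.0, 2.0) for c in concentrations]
```

Both curves pass through one half at L = K, but the cooperative curve is steeper around that point, which is the feature the Hill coefficient summarizes.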

    Text Analytics to Predict Time and Cause of Death from Verbal Autopsies

    This thesis describes the first Text Analytics approach to predicting Causes of Death (CoD) from Verbal Autopsies (VA). VA is an alternative technique recommended by the World Health Organisation for ascertaining CoD in low and middle-income countries (LMIC). CoD information is vitally important in the provision of healthcare. CoD information from VA can be obtained via two main approaches: manual, also referred to as physician review, and automatic. The automatic approach is an active research area due to its efficiency and cost-effectiveness compared with the manual approach. VA contains both closed responses and open narrative text. However, the open narrative text has been ignored by state-of-the-art automatic approaches, and this remains a challenge and an important research issue. We hypothesise that it is feasible to predict CoD from the narratives of VA. We further contend that an automatic approach that could utilise the information contained in both the narrative and the closed-response text of VA could lead to improved prediction accuracy of CoD. This research has been formulated as a Text Classification problem, which employs Corpus and Computational Linguistics, Natural Language Processing and Machine Learning techniques to automatically classify VA documents according to CoD. Firstly, the research uses a VA corpus built from a sample collection of over 11,400 VA documents collected during a 10-year period in Ghana, West Africa. About 80 per cent of these documents have been annotated with CoD by medical experts. Secondly, we design experiments to identify Machine Learning techniques (algorithm, feature representation scheme, and feature reduction strategy) suitable for classifying VA open narratives (VAModel1). Thirdly, we propose novel methods of extracting features to build a model that predicts CoD from VA narratives using the annotated VA corpus as training and testing set.
Furthermore, we develop two additional models: one based only on closed responses (VAModel2), and a hybrid model based on both closed responses and open narratives (VAModel3). Our VAModel1 performs better than our baseline model, suggesting the feasibility of predicting CoD from the VA open narratives. Overall, VAModel3 performed better than VAModel1 but not significantly better than VAModel2. Also, in terms of reliability, VAModel1 obtained a moderate agreement (kappa score = 0.4) when compared with the gold standard of medical experts (average annotation agreement between medical experts, kappa score = 0.64). Furthermore, an acceptable agreement was obtained for VAModel2 (kappa score = 0.71) and VAModel3 (kappa score = 0.75), suggesting that the reliability of these two models is better than that of medical experts. Also, a detailed analysis suggested that combining information from narratives and closed responses leads to an increase in performance for some CoD categories, whereas information obtained from the closed responses is enough for other CoD categories. Our research provides an alternative automatic approach to predicting CoD from VA, which is essential for LMIC. Further research into various aspects of the modelling process could improve the current performance of automatically predicting CoD from VAs.
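To make the reported agreement figures easier to interpret, the following is a minimal sketch of how a kappa score between two annotators can be computed. It assumes the abstract refers to Cohen's kappa for two raters, which the source does not state explicitly. The function name `cohen_kappa` and the toy cause-of-death labels are hypothetical and are not taken from the thesis data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two label lists."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labelled independently,
    # each following their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy annotations for four documents.
expert = ["malaria", "malaria", "tb", "tb"]
model = ["malaria", "tb", "tb", "tb"]
kappa = cohen_kappa(expert, model)
```

In this toy case the two lists agree on three of four items (0.75), while chance agreement is 0.5, so kappa is 0.5. A raw agreement rate therefore overstates reliability when a few categories dominate.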