84 research outputs found

    Validating Gene Clusterings by Selecting Informative Gene Ontology Terms with Mutual Information

    We propose a method for global validation of gene clusterings. The method selects a set of informative and non-redundant GO terms through an exploration of the Gene Ontology structure guided by mutual information. Our approach yields a global assessment of the clustering quality and a higher-level interpretation of the clusters, as it relates GO terms to specific clusters. We show that on two gene expression data sets our method offers an improvement over previous approaches.
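The quantity that guides the term selection above can be illustrated with a minimal sketch. It only scores a single GO term against a clustering by mutual information; it does not reproduce the authors' exploration of the ontology structure or their redundancy filtering. The function name `mutual_information` and the toy inputs are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(clusters, annotated):
    """Mutual information (in bits) between cluster labels and a binary
    indicator saying whether each gene is annotated with one GO term.

    Both arguments are equal-length sequences, one entry per gene.
    """
    n = len(clusters)
    joint = Counter(zip(clusters, annotated))  # counts of (cluster, flag) pairs
    n_cluster = Counter(clusters)              # marginal counts per cluster
    n_term = Counter(annotated)                # marginal counts per flag value
    mi = 0.0
    for (c, a), count in joint.items():
        p_joint = count / n
        # p(c, a) / (p(c) * p(a)), written with raw counts
        ratio = count * n / (n_cluster[c] * n_term[a])
        mi += p_joint * log2(ratio)
    return mi


# Hypothetical toy data: a term that marks exactly one of two equal clusters
# is maximally informative; a term spread evenly over both carries nothing.
informative = mutual_information([0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0])
uninformative = mutual_information([0, 0, 1, 1], [1, 0, 1, 0])
```

A term with high mutual information separates the clusters well, which is why such a score can be used to rank candidate terms.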

    Algorithms to Explore the Structure and Evolution of Biological Networks

    High-throughput experimental protocols have revealed thousands of relationships amongst genes and proteins under various conditions. These putative associations are being aggressively mined to decipher the structural and functional architecture of the cell. One useful tool for exploring this data has been computational network analysis. In this thesis, we propose a collection of novel algorithms to explore the structure and evolution of large, noisy, and sparsely annotated biological networks. We first introduce two information-theoretic algorithms to extract interesting patterns and modules embedded in large graphs. The first, graph summarization, uses the minimum description length principle to find compressible parts of the graph. The second, VI-Cut, uses the variation of information to non-parametrically find groups of topologically cohesive and similarly annotated nodes in the network. We show that both algorithms find structure in biological data that is consistent with known biological processes, protein complexes, genetic diseases, and operational taxonomic units. We also propose several algorithms to systematically generate an ensemble of near-optimal network clusterings and show how these multiple views can be used together to identify clustering dynamics that any single solution approach would miss. To facilitate the study of ancient networks, we introduce a framework called "network archaeology" for reconstructing the node-by-node and edge-by-edge arrival history of a network. Starting with a present-day network, we apply a probabilistic growth model backwards in time to find high-likelihood previous states of the graph. This allows us to explore how interactions and modules may have evolved over time. In experiments with real-world social and biological networks, we find that our algorithms can recover significant features of ancestral networks that have long since disappeared. Our work is motivated by the need to understand large and complex biological systems that are being revealed to us by imperfect data. As data continues to pour in, we believe that computational network analysis will continue to be an essential tool towards this end.
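The variation of information mentioned in the abstract above is a standard distance between two partitions of the same node set. The following is a minimal sketch of that distance only, not of the VI-Cut algorithm itself; the function names and the toy partitions are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def variation_of_information(a, b):
    """VI(A, B) = H(A|B) + H(B|A) = 2 H(A, B) - H(A) - H(B).

    `a` and `b` assign a cluster label to each node, in the same node order.
    """
    joint_entropy = entropy(list(zip(a, b)))
    return 2 * joint_entropy - entropy(a) - entropy(b)


# Hypothetical toy partitions of four nodes.
same_up_to_relabelling = variation_of_information([0, 0, 1, 1], [1, 1, 0, 0])
independent = variation_of_information([0, 0, 1, 1], [0, 1, 0, 1])
```

The distance is zero when two partitions agree up to a renaming of the clusters, and it grows as they share less information, which makes it suitable for comparing a topological clustering with an annotation-based grouping.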

    Automatically exploiting genomic and metabolic contexts to aid the functional annotation of prokaryote genomes

    This thesis concerns the development of bioinformatic strategies that exploit genomic and metabolic contextual information to generate functional annotations for prokaryote genes. The work comprises two main projects. The first focuses on sequence-orphan enzymatic activities. Today, roughly 27% of the activities defined by the International Union of Biochemistry and Molecular Biology are sequence orphans. For these, traditional bioinformatic approaches cannot propose candidate genes, so it is imperative to use alternative, context-based approaches. The CanOE strategy (fishing Candidate genes for Orphan Enzymes) was developed and added to the MicroScope bioinformatics platform for this purpose. It integrates genomic and metabolic information across thousands of prokaryote genomes in order to locate promising gene candidates for orphan activities. The mirror project focuses on protein families of unknown function. A collaborative project has been set up at the Genoscope to formalise functional exploration strategies for prokaryote protein families. A pilot version was created on the DUF849 Pfam family, for which a single activity had recently been elucidated. Strategies for proposing novel functions and activities and for creating isofunctional sub-families were researched, so as to guide biochemical experiments and to analyse their results.

    Deregulation upon DNA damage revealed by joint analysis of context-specific perturbation data

    Background: Deregulation between two different cell populations manifests itself in changing gene expression patterns and changing regulatory interactions. Accumulating knowledge about biological networks creates an opportunity to study these changes in their cellular context. Results: We analyze re-wiring of regulatory networks based on cell population-specific perturbation data and knowledge about signaling pathways and their target genes. We quantify deregulation by merging the regulatory signal from the two cell populations into one score. This joint approach, called JODA, proves advantageous over separate analysis of the cell populations and over analysis without incorporation of knowledge. JODA is implemented and freely available in the Bioconductor package 'joda'. Conclusions: Using JODA, we show widespread re-wiring of gene regulatory networks upon neocarzinostatin-induced DNA damage in human cells. We recover 645 deregulated genes in thirteen functional clusters performing the rich program of response to damage. We find that the clusters contain many previously characterized neocarzinostatin target genes. We investigate connectivity between those genes, explaining their cooperation in performing the common functions. We review genes with the most extreme deregulation scores, reporting their involvement in response to DNA damage. Finally, we investigate the indirect impact of the ATM pathway on the deregulated genes, and build a hypothetical hierarchy of direct regulation. These results show that JODA is a step forward to a systems-level, mechanistic understanding of changes in gene regulation between different cell populations.

    Computational Methods for Knowledge Integration in the Analysis of Large-scale Biological Networks

    As we migrate into an era of personalized medicine, understanding how biomolecules interact with one another to form cellular systems is one of the key focus areas of systems biology. Several challenges, such as the dynamic nature of cellular systems, uncertainty due to environmental influences, and the heterogeneity between individual patients, render this a difficult task. In the last decade, several algorithms have been proposed to elucidate cellular systems from data, resulting in numerous data-driven hypotheses. However, due to the large number of variables involved in the process, many of which are unknown or not measurable, such computational approaches often lead to a high proportion of false positives. This renders interpretation of the data-driven hypotheses extremely difficult. Consequently, only a small proportion of these hypotheses are subjected to further experimental validation, which limits their potential to augment existing biological knowledge. This dissertation develops a framework of computational methods for the analysis of such data-driven hypotheses, leveraging existing biological knowledge. Specifically, I show how biological knowledge can be mapped onto these hypotheses and subsequently augmented through novel hypotheses. Biological hypotheses are learnt at three levels of abstraction: individual interactions, functional modules, and relationships between pathways, corresponding to three complementary aspects of biological systems. The computational methods developed in this dissertation are applied to high-throughput cancer data, resulting in novel hypotheses with potentially significant biological impact. Dissertation/Thesis, Ph.D., Computer Science 201

    Computational approaches to understanding infectious disease

    Infectious diseases derive from organisms such as viruses, bacteria, fungi and parasites that can be passed from person to person, transmitted via bites from insects or animals, or acquired through ingestion of contaminated food or water or through environmental exposure. Infectious diseases cause roughly 20% of annual deaths worldwide, including many children under the age of five. In developing countries, these diseases remain a major public health problem. They can also cause societal and economic burdens through life-long disability. We need a better understanding of these diseases with a view towards the goals of prevention and cure. The advent of whole-genome transcriptional profiling technology and powerful computational resources has made it possible to study infectious diseases on a genome-wide scale. Such studies can lead to improvements in diagnostic tools as well as in preventive measures such as vaccines. The work of this thesis focuses on a number of projects with the common thread of developing and applying computational methods to extract biological information from high-throughput transcriptional data related to infectious diseases. These include (1) the identification of gene signatures related to B-cell proliferation that predict an influenza vaccine-induced antibody response; (2) the study of the physiological state of the Plasmodium falciparum malaria parasite when sequestered in human tissue; and (3) the identification of similarities and differences in the response to five anti-viral vaccines. To achieve the scientific goals of these projects I developed two new computational methods that can be used more broadly for the downstream interpretation of results from enrichment analyses of whole-transcriptome profiles. These are a combined visualization and annotation approach called the Constellation Map, and the Leading Edge Metagene Detector, which systematically consolidates functionally related genes from multiple sets representing highly enriched biological pathways and processes in the comparison of expression data from two biological phenotypes. The application of these computational approaches and tools in this dissertation enabled a better understanding of the biological mechanisms related to human vaccine response. The software packages developed are freely available for use by biological investigators across many fields.

    Higher-order interactions in single-cell gene expression: towards a cybergenetic semantics of cell state

    Finding and understanding patterns in gene expression guides our understanding of living organisms, their development, and diseases, but is a challenging and high-dimensional problem as there are many molecules involved. One way to learn about the structure of a gene regulatory network is by studying the interdependencies among its constituents in transcriptomic data sets. These interdependencies could be arbitrarily complex, but almost all current models of gene regulation contain pairwise interactions only, despite experimental evidence existing for higher-order regulation that cannot be decomposed into pairwise mechanisms. I set out to capture these higher-order dependencies in single-cell RNA-seq data using two different approaches. First, I fitted maximum entropy (or Ising) models to expression data by training restricted Boltzmann machines (RBMs). On simulated data, RBMs faithfully reproduced both pairwise and third-order interactions. I then trained RBMs on 37 genes from a scRNA-seq data set of 70k astrocytes from an embryonic mouse. While pairwise and third-order interactions were revealed, the estimates contained a strong omitted variable bias, and there was no statistically sound and tractable way to quantify the uncertainty in the estimates. As a result I next adopted a model-free approach. Estimating model-free interactions (MFIs) in single-cell gene expression data required a quasi-causal graph of conditional dependencies among the genes, which I inferred with an MCMC graph-optimisation algorithm on an initial estimate found by the Peter-Clark algorithm. As the estimates are model-free, MFIs can be interpreted either as mechanistic relationships between the genes, or as substructures in the cell population. On simulated data, MFIs revealed synergy and higher-order mechanisms in various logical and causal dynamics more accurately than any correlation- or information-based quantities. 
    I then estimated MFIs among 1,000 genes, at up to seventh order, in 20k neurons and 20k astrocytes from two different mouse brain scRNA-seq data sets: one developmental, and one adolescent. I found strong evidence for up to fifth-order interactions, and the MFIs mostly disambiguated direct from indirect regulation by preferentially coupling causally connected genes, whereas correlations persisted across causal chains. Validating the predicted interactions against the Pathway Commons database, gene ontology annotations, and semantic similarity, I found that pairwise MFIs contained different mechanistic information from correlation-based networks, but a similar amount of it. Furthermore, third-order interactions provided evidence of combinatorial regulation by transcription factors and immediate early genes. I then switched focus from mechanism to population structure. Each significant MFI can be assigned a set of single cells that most influence its value. Hierarchical clustering of the MFIs by cell assignment revealed substructures in the cell population corresponding to diverse cell states. This offered a new, purely data-driven view on cell states because the inferred states are not required to localise in gene expression space. Across the four data sets, I found 69 significant and biologically interpretable cell states, of which only 9 could be obtained by standard approaches. I identified immature neurons among developing astrocytes and radial glial cells, D1 and D2 medium spiny neurons, D1 MSN subtypes, and cell-cycle related states present across four data sets. I further found evidence for states defined by genes associated with neuropeptide signalling, neuronal activity, myelin metabolism, and genomic imprinting. MFIs thus provide a new, statistically sound method to detect substructure in single-cell gene expression data, identifying cell types, subtypes, or states that can be delocalised in gene expression space and whose hierarchical structure provides a new view on the semantics of cell state. The estimation of the quasi-causal graph, the MFIs, and inference of the associated states is implemented as a publicly available Nextflow pipeline called Stator.

    Modeling signal transduction pathways and their transcriptional response

    This thesis is concerned with revealing regulation of gene expression. The basic motivation behind our work is that gene regulation can be better resolved when analyzed in a cellular context of the upstream signaling pathway and known regulatory targets. Our source of data is perturbation experiments, which are performed on pathway components and induce changes in gene expression. In this way, they connect the signaling pathway to its downstream target genes. This chapter starts with an introduction to the cellular context considered in the thesis (section 1.1) and the principles of perturbation experiments (section 1.2). We end with a concise summary of the three approaches that comprise this thesis. The approaches tackle various problems in the process of revealing context-specific regulatory networks (section 1.3). We deal with differential expression analysis of the perturbation data, enhanced with known transcription factor targets serving as examples of differential genes (chapter 2), pathway model-based planning of informative perturbation experiments (chapter 3), and finally, with deregulation analysis, i.e., comparing changes in gene regulation between two different cell populations (chapter 4).