
    Algorithms to Explore the Structure and Evolution of Biological Networks

    High-throughput experimental protocols have revealed thousands of relationships amongst genes and proteins under various conditions. These putative associations are being aggressively mined to decipher the structural and functional architecture of the cell. One useful tool for exploring this data has been computational network analysis. In this thesis, we propose a collection of novel algorithms to explore the structure and evolution of large, noisy, and sparsely annotated biological networks. We first introduce two information-theoretic algorithms to extract interesting patterns and modules embedded in large graphs. The first, graph summarization, uses the minimum description length principle to find compressible parts of the graph. The second, VI-Cut, uses the variation of information to non-parametrically find groups of topologically cohesive and similarly annotated nodes in the network. We show that both algorithms find structure in biological data that is consistent with known biological processes, protein complexes, genetic diseases, and operational taxonomic units. We also propose several algorithms to systematically generate an ensemble of near-optimal network clusterings and show how these multiple views can be used together to identify clustering dynamics that any single-solution approach would miss. To facilitate the study of ancient networks, we introduce a framework called "network archaeology" for reconstructing the node-by-node and edge-by-edge arrival history of a network. Starting with a present-day network, we apply a probabilistic growth model backwards in time to find high-likelihood previous states of the graph. This allows us to explore how interactions and modules may have evolved over time. In experiments with real-world social and biological networks, we find that our algorithms can recover significant features of ancestral networks that have long since disappeared.
Our work is motivated by the need to understand large and complex biological systems that are being revealed to us by imperfect data. As data continues to pour in, we believe that computational network analysis will continue to be an essential tool towards this end.
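
The abstract names the variation of information (VI) as the measure behind VI-Cut but does not describe the algorithm itself. As a minimal sketch of the measure only, assuming two clusterings given as equal-length label lists over the same nodes, VI can be computed as H(A|B) + H(B|A). The function name `variation_of_information` is hypothetical and is not taken from the thesis.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Return VI(A, B) = H(A|B) + H(B|A) in nats.

    labels_a[i] and labels_b[i] are the cluster labels of node i under
    two clusterings of the same node set (illustrative input format).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same nodes")
    n = len(labels_a)
    if n == 0:
        return 0.0

    count_a = Counter(labels_a)               # cluster sizes in A
    count_b = Counter(labels_b)               # cluster sizes in B
    joint = Counter(zip(labels_a, labels_b))  # sizes of pairwise overlaps

    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        # Each overlap contributes to both conditional entropies.
        vi -= p_ab * (log(p_ab / p_a) + log(p_ab / p_b))
    return vi
```

VI is zero when the two clusterings are identical up to relabeling and grows as they disagree, which is what makes it usable for comparing a topological clustering against node annotations.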

    Retention and integration of gene duplicates in eukaryotes


    Proceedings of the 97th Annual Virginia Academy of Science Meeting, 2019

    Proceedings of the 97th Annual Virginia Academy of Science Meeting, May 22-24, 2019, at Old Dominion University, Norfolk, Virginia

    Reconstrução e classificação de sequências de ADN desconhecidas (Reconstruction and classification of unknown DNA sequences)

    Continuous advances in DNA sequencing technologies and metagenomics techniques call for reliable reconstruction and accurate classification methods, both to expand the diversity of the natural repository and to support the description and organization of organisms. However, after sequencing and de novo assembly, one of the most complex challenges comes from DNA sequences that do not match or resemble any biological sequence in the literature. Three main reasons account for this: the organism's sequence diverges strongly from the known organisms in the literature, an irregularity was introduced during reconstruction, or a new organism has been sequenced. The inability to classify these unknown sequences efficiently increases uncertainty about the sample's composition and wastes an opportunity to discover new species, since such sequences are often discarded. In this context, the main objective of this thesis is to develop and validate a tool that provides an efficient computational solution to these three challenges, based on an ensemble of experts: compression-based predictors, the distribution of sequence content, and normalized sequence lengths. The method uses both DNA and amino acid sequences and provides efficient classification beyond standard referential comparisons. Unusually, it classifies DNA sequences without resorting directly to reference genomes, relying instead on features that the biological sequences of a species share. Specifically, it uses only features extracted individually from each genome, without sequence comparisons. In addition, the pipeline is fully automatic and allows reference-free reconstruction of genomes from FASTQ reads, with the additional guarantee of secure storage of sensitive information. RFSC was thus created as a machine learning classification pipeline that relies on an ensemble of experts to provide efficient classification in metagenomic contexts.
This pipeline was tested on synthetic and real data, achieving in both cases precise and accurate results that, at the time this thesis was developed, had not been reported in the state of the art. Specifically, it achieved an accuracy of approximately 97% in domain/type classification.
Mestrado em Engenharia de Computadores e Telemátic
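
The abstract lists three kinds of per-sequence features (compression-based predictors, sequence content distribution, normalized length) without giving their definitions. The sketch below only illustrates the general idea of reference-free features computed from a single sequence. It is not the RFSC implementation: `zlib` stands in for the thesis's compression-based predictors, GC content stands in for the content distribution, and the function name `sequence_features` and the `reference_length` normalizer are assumptions made for illustration.

```python
import zlib
from collections import Counter


def sequence_features(seq, reference_length=1_000_000):
    """Compute simple reference-free features for one DNA sequence.

    All three features are illustrative stand-ins, not the features
    used by RFSC; reference_length is an arbitrary normalizer.
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")

    raw = seq.encode("ascii")
    compressed = zlib.compress(raw, 9)  # general-purpose compressor as a proxy
    counts = Counter(seq)
    n = len(seq)

    return {
        # Lower ratio means more repetitive, more compressible content.
        "compression_ratio": len(compressed) / len(raw),
        # Fraction of G and C bases in the sequence.
        "gc_content": (counts["G"] + counts["C"]) / n,
        # Length scaled by a fixed constant.
        "normalized_length": n / reference_length,
    }
```

Because every feature is derived from the sequence alone, a vector like this could be fed to any standard classifier without aligning the sequence against a reference database, which is the property the abstract emphasizes.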

    Computationally Comparing Biological Networks and Reconstructing Their Evolution

    Biological networks, such as protein-protein interaction, regulatory, or metabolic networks, provide information about biological function, beyond what can be gleaned from sequence alone. Unfortunately, most computational problems associated with these networks are NP-hard. In this dissertation, we develop algorithms to tackle numerous fundamental problems in the study of biological networks. First, we present a system for classifying the binding affinity of peptides to a diverse array of immunoglobulin antibodies. Computational approaches to this problem are integral to virtual screening and modern drug discovery. Our system is based on an ensemble of support vector machines and exhibits state-of-the-art performance. It placed 1st in the 2010 DREAM5 competition. Second, we investigate the problem of biological network alignment. Aligning the biological networks of different species allows for the discovery of shared structures and conserved pathways. We introduce an original procedure for network alignment based on a novel topological node signature. The pairwise global alignments of biological networks produced by our procedure, when evaluated under multiple metrics, are both more accurate and more robust to noise than those of previous work. Next, we explore the problem of ancestral network reconstruction. Knowing the state of ancestral networks allows us to examine how biological pathways have evolved, and how pathways in extant species have diverged from those of their common ancestor. We describe a novel framework for representing the evolutionary histories of biological networks and present efficient algorithms for reconstructing either a single parsimonious evolutionary history or an ensemble of near-optimal histories. Under multiple models of network evolution, our approaches are effective at inferring the ancestral network interactions.
Additionally, the ensemble approach is robust to noisy input and can be used to impute missing interactions in experimental data. Finally, we introduce a framework, GrowCode, for learning network growth models. While previous work focuses on developing growth models manually, or on procedures for learning parameters for existing models, GrowCode learns fundamentally new growth models that match target networks in a flexible and user-defined way. We show that models learned by GrowCode produce networks whose target properties match those of real-world networks more closely than existing models.

    Program and Proceedings: The Nebraska Academy of Sciences 1880-2013

    PROGRAM, FRIDAY, APRIL 19, 2013: REGISTRATION FOR ACADEMY, Lobby of Lecture wing, Olin Hall; Aeronautics and Space Science, Session A, Olin 249; Aeronautics and Space Science, Session B, Olin 224; Collegiate Academy, Biology Session A, Olin B; Biological and Medical Sciences, Session A, Olin 112; Biological and Medical Sciences, Session B, Smith Callen Conference Center; NE Chapter, Nat'l Council For Geographic Education, Olin 325; Junior Academy, Judges Check-In, Olin 219; Junior Academy, Senior High REGISTRATION, Olin Hall Lobby; Chemistry and Physics, Section A, Chemistry, Olin A; Chemistry and Physics, Section B, Physics, Planetarium; Collegiate Academy, Chemistry and Physics, Session A, Olin 324; Junior Academy, Senior High Competition, Olin 124, Olin 131; Aeronautics and Space Science, Poster Session, Olin 249; Anthropology, Olin 111; NWU Health and Sciences Graduate School Fair, Olin and Smith Curtiss Halls; Aeronautics and Space Science, Poster Session, Olin 249; MAIBEN MEMORIAL LECTURE, OLIN B, Bob Feurer, North Bend High School, Making People Smarter Using Habits of Mind; LUNCH, PATIO ROOM, STORY STUDENT CENTER (pay and carry tray through cafeteria line, or pay at NAS registration desk); Aeronautics Group, Sunflower Room; Biological and Medical Sciences, Session C, Olin 112; Biological and Medical Sciences, Session D, Smith Callen Conference Center; Chemistry and Physics, Section A, Chemistry, Olin A; Collegiate Academy, Biology Session A, Olin B; Collegiate Academy, Biology Session B, Olin 249; Collegiate Academy, Chemistry and Physics, Session B, Olin 324; Junior Academy, Judges Check-In, Olin 219; Junior Academy, Junior High REGISTRATION, Olin Hall Lobby; Junior Academy, Senior High Competition, (Final), Olin 110; Anthropology, Olin 111; Teaching of Science and Math, Olin 224; Applied Science and Technology, Olin 325; Junior Academy, Junior High Competition, Olin 124, Olin 131; NJAS Board/Teacher Meeting, Olin 219; BUSINESS MEETING, OLIN B; AWARDS RECEPTION for NJAS, Scholarships, Members, Spouses, and Guests, First United Methodist Church, 2723 N 50th Street, Lincoln, N

    Integration of protein binding interfaces and abundance data reveals evolutionary pressures in protein networks

    Networks of protein-protein interactions have received considerable interest in the past two decades for their insights about protein function and evolution. Traditionally, these networks only map the functional partners of proteins; they lack further levels of data such as binding affinity, allosteric regulation, competitive vs noncompetitive binding, and protein abundance. Recent experiments have made such data on a network-wide scale available, and in this thesis I integrate two extra layers of data in particular: the binding sites that proteins use to interact with their partners, and the abundance or “copy numbers” of the proteins. By analyzing the networks for the clathrin-mediated endocytosis (CME) system in yeast and the ErbB signaling pathway in humans, I find that this extra data reveals new insights about the evolution of protein networks. The structure of the binding site or interface interaction network (IIN) is optimized to allow higher binding specificity; that is, a high gap in strength between functional binding and nonfunctional mis-binding. This strongly implies that mis-binding is an evolutionary error-load constraint shaping protein network structure. Another method to limit mis-binding is to balance protein copy numbers so that there are no “leftover” proteins available for mis-binding. By developing a new method to quantify balance in IINs, I show that the CME network is significantly balanced when compared to randomly sampled sets of copy numbers. Furthermore, IINs with a biologically realistic structure produce less mis-binding under balanced concentrations, when compared to random networks, but more mis-binding under unbalanced concentrations. This implies strong pressure for copy number balance and that any imbalance should occur for functional reasons. I thus explore some functional consequences of imbalance by constructing dynamic models of two poorly balanced subnetworks of the larger CME network. 
In general, I find that balanced copy numbers provide higher protein complex yield (number of complete complexes), but imbalance may allow cells to “bottleneck” a functional process, effectively turning complex formation on or off via spatial localization of subunits. Finally, I find that strongly binding proteins are more likely to be balanced, as these “sticky” proteins would otherwise be more likely to engage in mis-binding.