243 research outputs found

    Robust Algorithms for Detecting Hidden Structure in Biological Data

    Biological data, such as molecular abundance measurements and protein sequences, harbor complex hidden structure that reflects the underlying biological mechanisms. For example, high-throughput abundance measurements provide a snapshot of the global state of a living cell, while homologous protein sequences encode the residue-level logic of the proteins' function and provide a snapshot of the evolutionary trajectory of the protein family. In this work I describe algorithmic approaches and analysis software I developed for uncovering hidden structure in both kinds of data. Clustering is an unsupervised machine learning technique commonly used to map the structure of data collected in high-throughput experiments, such as quantification of gene expression by DNA microarrays or short-read sequencing. Clustering algorithms always yield a partitioning of the data, but relying on a single partitioning solution can lead to spurious conclusions. In particular, noise in the data can cause objects to fall into the same cluster by chance rather than due to meaningful association. In the first part of this thesis I demonstrate approaches to clustering data robustly in the presence of noise and apply robust clustering to analyze the transcriptional response to injury in a neuron cell. In the second part of this thesis I describe an approach for identifying hidden specificity-determining residues (SDPs) from alignments of protein sequences descended through gene duplication from a common ancestor (paralogs) and apply the approach to identify numerous putative SDPs in bacterial transcription factors in the LacI family. Finally, I describe and demonstrate a new algorithm for reconstructing the history of duplications by which paralogs descended from their common ancestor. This algorithm addresses the complexity of such reconstruction due to indeterminate or erroneous homology assignments made by sequence alignment algorithms and to the vast prevalence of divergence through speciation over divergence through gene duplication in protein evolution.
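
The abstract does not say how the thesis makes clustering robust to noise. As a minimal sketch of one generic idea, the snippet below assumes that several partitionings of the same objects are already available (for example from repeated runs on perturbed data) and counts how often each pair of objects lands in the same cluster. The function names `co_association` and `stable_pairs` and the `threshold` parameter are hypothetical and are not taken from the thesis.

```python
from itertools import combinations


def co_association(partitions):
    """Return a matrix whose entry [i][j] is the fraction of partitionings
    in which objects i and j carry the same cluster label.

    `partitions` is a list of label lists, one per clustering run.
    This is an illustrative assumption, not the thesis's own method.
    """
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        if len(labels) != n:
            raise ValueError("every partitioning must label the same objects")
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                counts[i][j] += 1
                counts[j][i] += 1
    runs = len(partitions)
    return [
        [1.0 if i == j else counts[i][j] / runs for j in range(n)]
        for i in range(n)
    ]


def stable_pairs(partitions, threshold=0.8):
    """List the pairs (i, j) that co-cluster in at least `threshold` of the runs."""
    matrix = co_association(partitions)
    n = len(matrix)
    return [(i, j) for i, j in combinations(range(n), 2) if matrix[i][j] >= threshold]
```

Pairs that co-cluster in only a small fraction of runs are the ones most likely to have been grouped by chance, which is the failure mode the abstract warns about when a single partitioning is trusted.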

    A new method for identifying site-specific evolutionary rates and its applications.

    In this thesis, I discuss each stage in the development of a new method for identifying site-specific evolutionary rates, from conception of the idea, through the implementation, to its application to data. TIGER, or tree independent generation of evolutionary rates, is based largely around the works of LeQuesne (1989), Wilkinson (1998) and Pisani (2004) and the premise that each site in a multi-state character matrix can be scored based on the level of agreement it displays with the other sites. In these earlier studies, however, agreement was measured in a binary manner: sites were either compatible with each other or they were not. TIGER allows various degrees of agreement to occur between two sites, allowing it to pick up more subtle signals in the data. Once the method was implemented in a software program, it could be applied to data. Using a combination of simulated and empirical datasets, TIGER was shown to produce desirable results. In particular, removal of sites identified by TIGER was shown to improve phylogenetic reconstruction of deeply diverging lineages and of taxa displaying compositional attraction. Additionally, TIGER was applied to a gene content matrix in order to identify HGT signals and integrated into the analysis of a current phylogenetic problem, the origin of the mitochondria. Although it is widely accepted that eukaryotes have a chimeric genome, the specific “parent” of the mitochondria is, as of yet, unclear. Previous studies have failed to reach agreement regarding this issue for a number of reasons. Exploration of the signals using TIGER and heterogeneous modelling reveals that multiple signals and compositional heterogeneity are among the biggest problems with datasets containing both mitochondrial and α-proteobacterial sequences.
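
The abstract does not spell out how graded agreement between two sites is scored. A minimal sketch of one possible scoring rule is given below. It assumes that each site splits the taxa into groups by character state, and that the agreement of site j with site i is the fraction of j's groups that fit entirely inside some group of i. The function names are hypothetical, gaps and missing data are ignored, and the rule itself is an assumption made for illustration.

```python
def site_partition(column):
    """Group taxon indices by the character state they show at one site."""
    groups = {}
    for taxon, state in enumerate(column):
        groups.setdefault(state, set()).add(taxon)
    return list(groups.values())


def partition_agreement(part_i, part_j):
    """Fraction of groups in part_j that are contained in some group of part_i.

    Returns a value between 0 and 1, so agreement is graded rather than
    a yes/no compatibility test. (Assumed scoring rule, for illustration.)
    """
    nested = sum(1 for group in part_j if any(group <= other for other in part_i))
    return nested / len(part_j)


def site_scores(columns):
    """Score each site by its mean agreement with every other site."""
    parts = [site_partition(column) for column in columns]
    n = len(parts)
    if n < 2:
        raise ValueError("need at least two sites to compare")
    return [
        sum(partition_agreement(parts[i], parts[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
```

Here each element of `columns` is one alignment column written as a string with one character per taxon. Under this rule the score is asymmetric, so a site with few groups can agree fully with a more finely split site without the reverse being true.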

    AutoCoEv-A High-Throughput In Silico Pipeline for Predicting Inter-Protein Coevolution

    Protein-protein interactions govern cellular processes via complex regulatory networks, which are still far from being understood. Thus, identifying and understanding connections between proteins can significantly facilitate our comprehension of the mechanistic principles of protein functions. Coevolution between proteins is a sign of functional communication and, as such, provides a powerful approach to search for novel direct or indirect molecular partners. However, an evolutionary analysis of large arrays of proteins in silico is a highly time-consuming effort that has limited the usage of this method to protein pairs or small protein groups. Here, we developed AutoCoEv, a user-friendly, open source, computational pipeline for the search of coevolution between a large number of proteins. By driving 15 individual programs, culminating in CAPS2 as the software for detecting coevolution, AutoCoEv achieves a seamless automation and parallelization of the workflow. Importantly, we provide a patch to the CAPS2 source code to strengthen its statistical output, allowing for multiple comparison corrections and an enhanced analysis of the results. We apply the pipeline to inspect coevolution among 324 proteins identified to be located in the vicinity of the lipid rafts of B lymphocytes. We successfully detected multiple coevolutionary relations between the proteins, predicting many novel partners and previously unidentified clusters of functionally related molecules. We conclude that AutoCoEv can be used to predict functional interactions from large datasets in a time- and cost-efficient manner.
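
The abstract does not name the multiple comparison correction applied to the coevolution results. As an illustration only, the sketch below assumes the Benjamini-Hochberg false discovery rate procedure, a common choice when many protein pairs are tested at once. The function name is hypothetical, and this is neither AutoCoEv nor CAPS2 code.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Each p-value is scaled by m / rank (rank 1 = smallest p-value), then
    the values are made monotone by taking a running minimum from the
    largest rank downwards. Adjusted values are capped at 1.0.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda k: p_values[k])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A pair would then be reported as coevolving only if its adjusted value falls below the chosen false discovery rate, rather than its raw p-value falling below a fixed cutoff.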

    Bioinformatics and Next Generation Sequencing: Applications of Arthropod Genomes

    Over the past decade, Next Generation Sequencing (NGS) technology has been broadly applied in many areas such as genomics, medical diagnosis, biotechnology, virology, biological systematics, forensic biology, and anthropology. Taken together, these applications have offered us brilliant insights into the life sciences. Most of the work presented in this thesis describes NGS applications in genome assembly, genome annotation, and comparative genomics, using arthropods as case studies: (1) by sequencing and analyzing the genomes of three Tetranychus spider mites with three completely different feeding behaviors, we uncovered genomic signature variations indicative of pest adaptations; (2) we sequenced, assembled and annotated five Brevipalpus flat mite genomes and their corresponding endosymbiont Cardinium genomes, and comparative genomics reveals herbivorous pest adaptations and parthenogenesis; (3) the complete genomic analysis of the parasitoid wasp Copidosoma floridanum indicates the mechanism of polyembryony in this primary parasite of moths. Through bioinformatics and genomics approaches, my study provides the genomic basis and establishes hypotheses for future biological research on pests and arthropods. These NGS applications of arthropod genomes will offer new insights into arthropod evolution and plant-herbivore interactions, open unique opportunities to develop novel plant protection strategies, and provide arthropod genomic resources.

    Probabilistic Protein Design, Comparative Modeling, and the Structure of a Multidomain P53 Oligomer Bound to DNA

    Proteins are the main functional components of all cellular processes, and most of them fold into unique three-dimensional shapes guided by their amino-acid sequence. Discovering the structure of a protein, or of a protein complex, can provide important clues about how it performs its function. However, the chemical, physical or architectural properties of many proteins impede traditional approaches to structure determination. Two such proteins, the tumor suppressor p53 and the cholesterol-processing enzyme endothelial lipase, are prime examples of problematic proteins that defy structural investigation via crystallographic methods. Therefore, new techniques must be developed to gain valuable structural insights, such as computationally assisted protein design strategies, more efficient crystal screening, or a combination of both. We applied a statistical computationally assisted design strategy to stabilize a p53 variant consisting of two independently folding domains. The re-engineered variant retained normal DNA-binding activities, and allowed us to experimentally determine the first structure of a physiologically active multi-domain p53 tetramer bound to a full-length DNA response element. We then demonstrated how computational methodology can be used to gain functional detail of proteins in the absence of experimentally determined structures. By creating comparative models of endothelial lipase, we discovered structural features that describe function and regulation, and gained a better understanding of the mechanisms conferring substrate specificity. Additionally, traditional methods for protein structure determination, such as X-ray crystallography, require relatively large amounts of purified sample in order to screen a sufficient variety of conditions. To improve this process, we developed a novel method for protein crystal screening using a microfluidics platform. We show how it is possible to use smaller quantities of protein to screen larger varieties of conditions, in turn increasing the probability of success in obtaining crystals. Furthermore, in contrast to current crystallographic approaches, all steps from screening to crystal growth to data collection were performed within the same reaction chamber, without any manipulation of the crystal, dramatically reducing both the time and the sample required to determine the structure. Collectively, these results demonstrate how advances in computational and experimental approaches can provide structural detail for proteins in circumstances where traditional methodology fails.