Robust Algorithms for Detecting Hidden Structure in Biological Data
Biological data, such as molecular abundance measurements and protein
sequences, harbor complex hidden structure that reflects its underlying
biological mechanisms. For example, high-throughput abundance measurements
provide a snapshot of the global state of a living cell, while homologous
protein sequences encode the residue-level logic of the proteins' function
and provide a snapshot of the evolutionary trajectory of the protein family.
In this work I describe algorithmic approaches and analysis software I
developed for uncovering hidden structure in both kinds of data.
Clustering is an unsupervised machine learning technique commonly used
to map the structure of data collected in high-throughput experiments,
such as quantification of gene expression by DNA microarrays or
short-read sequencing. Clustering algorithms always yield a partitioning
of the data, but relying on a single partitioning solution can lead to
spurious conclusions. In particular, noise in the data can cause objects
to fall into the same cluster by chance rather than due to meaningful
association. In the first part of this thesis I demonstrate approaches to
clustering data robustly in the presence of noise and apply robust clustering
to analyze the transcriptional response to injury in a neuron cell.
In the second part of this thesis I describe the identification of hidden
specificity-determining residues (SDPs) from alignments of protein sequences descended
through gene duplication from a common ancestor (paralogs) and apply the
approach to identify numerous putative SDPs in bacterial transcription
factors in the LacI family. Finally, I describe and demonstrate a new
algorithm for reconstructing the history of duplications by which paralogs
descended from their common ancestor. This algorithm addresses the
complexity of such reconstruction due to indeterminate or erroneous
homology assignments made by sequence alignment algorithms and to the
vast prevalence of divergence through speciation over divergence through
gene duplication in protein evolution.
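
The robust-clustering idea in the first part of the abstract above can be illustrated with a co-association (consensus) matrix. This is a minimal sketch under assumptions, not the thesis's method: it supposes that a clustering algorithm has already been run several times (for example on resampled or perturbed data) and that each run is available as a list of cluster labels. The function names `co_association` and `stable_pairs` and the 0.8 threshold are hypothetical.

```python
from itertools import combinations


def co_association(labelings):
    """Fraction of clustering runs in which each pair of objects shares a cluster."""
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        if len(labels) != n:
            raise ValueError("every run must label the same objects")
        for i in range(n):
            for j in range(n):
                # Only co-membership matters, so label permutations are harmless.
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    runs = len(labelings)
    return [[c / runs for c in row] for row in counts]


def stable_pairs(matrix, threshold=0.8):
    """Pairs of objects that co-cluster in at least `threshold` of the runs."""
    n = len(matrix)
    return [(i, j) for i, j in combinations(range(n), 2)
            if matrix[i][j] >= threshold]


# Three hypothetical runs over four objects.
runs = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],  # same partition as the first run, labels swapped
    [0, 0, 0, 1],  # object 2 drifts into the other cluster
]
matrix = co_association(runs)
pairs = stable_pairs(matrix)
print(pairs)
```

Pairs that co-cluster in only some runs, such as objects 2 and 3 here, fall below the threshold, which is one way to separate chance co-membership from consistent association.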
A new method for identifying site-specific evolutionary rates and its applications.
In this thesis, I discuss each stage in the development of a new method for identifying
site-specific evolutionary rates, from conception of the idea, through the
implementation to its application to data. TIGER, or tree independent generation of
evolutionary rates, is based largely on the work of LeQuesne (1989), Wilkinson
(1998) and Pisani (2004) and on the premise that each site in a multi-state character
matrix can be scored by the level of agreement it displays with the other sites. In
these earlier studies, however, agreement was measured in a binary manner: sites were
either compatible with each other or they were not. TIGER allows various degrees of
agreement to occur between two sites, allowing it to pick up more subtle signals in the
data.
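
To make the idea of graded agreement between sites concrete, here is a minimal sketch. The abstract does not state how agreement is computed, so the measure below is an assumption: each site is turned into a partition of the taxa by character state, and the agreement of site j with site i is the fraction of groups in j's partition that fit entirely inside a group of i's partition. All function names are hypothetical.

```python
def site_partition(column):
    """Group taxon indices by the character state they show at one site."""
    groups = {}
    for taxon, state in enumerate(column):
        groups.setdefault(state, set()).add(taxon)
    return list(groups.values())


def agreement(part_i, part_j):
    """Fraction of groups in part_j nested inside some group of part_i."""
    nested = sum(1 for g in part_j if any(g <= h for h in part_i))
    return nested / len(part_j)


def site_scores(columns):
    """Mean agreement of each site with every other site (needs >= 2 sites)."""
    parts = [site_partition(c) for c in columns]
    scores = []
    for i, p in enumerate(parts):
        others = [agreement(p, q) for j, q in enumerate(parts) if j != i]
        scores.append(sum(others) / len(others))
    return scores


# Four hypothetical sites scored across four taxa.
columns = ["AAGG", "CCTT", "ACAC", "AAAG"]
scores = site_scores(columns)
for column, score in zip(columns, scores):
    print(column, round(score, 3))
```

Under this assumed measure a pair of sites can agree partially (a value strictly between 0 and 1) rather than only being compatible or incompatible. The third site, which conflicts with the first two, receives the lowest score.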
Once the method had been implemented in a software program, it could be applied to data.
Using a combination of simulated and empirical datasets, TIGER was shown to
produce desirable results. In particular, removal of sites identified by TIGER was
shown to improve phylogenetic reconstruction of deeply diverging lineages and of
taxa displaying compositional attraction. Additionally, TIGER was applied to a gene
content matrix in order to identify HGT signals and integrated into the analysis of a
current phylogenetic problem, the origin of the mitochondria.
Although it is widely accepted that eukaryotes have a chimeric genome, the specific
“parent” of the mitochondria is, as of yet, unclear. Previous studies have failed to
reach agreement regarding this issue for a number of reasons. Exploration of the
signals using TIGER and heterogeneous modelling reveals that multiple signals and
compositional heterogeneity are among the biggest problems with datasets containing
both mitochondrial and α-proteobacterial sequences.
AutoCoEv-A High-Throughput In Silico Pipeline for Predicting Inter-Protein Coevolution
Protein-protein interactions govern cellular processes via complex regulatory networks, which are still far from being understood. Thus, identifying and understanding connections between proteins can significantly facilitate our comprehension of the mechanistic principles of protein functions. Coevolution between proteins is a sign of functional communication and, as such, provides a powerful approach to search for novel direct or indirect molecular partners. However, an evolutionary analysis of large arrays of proteins in silico is a highly time-consuming effort that has limited the usage of this method to protein pairs or small protein groups. Here, we developed AutoCoEv, a user-friendly, open source, computational pipeline for the search of coevolution between a large number of proteins. By driving 15 individual programs, culminating in CAPS2 as the software for detecting coevolution, AutoCoEv achieves a seamless automation and parallelization of the workflow. Importantly, we provide a patch to the CAPS2 source code to strengthen its statistical output, allowing for multiple comparison corrections and an enhanced analysis of the results. We apply the pipeline to inspect coevolution among 324 proteins identified to be located in the vicinity of the lipid rafts of B lymphocytes. We successfully detected multiple coevolutionary relations between the proteins, predicting many novel partners and previously unidentified clusters of functionally related molecules. We conclude that AutoCoEv can be used to predict functional interactions from large datasets in a time- and cost-efficient manner.
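
The abstract mentions multiple comparison corrections without naming a specific procedure. As an illustration only, the sketch below applies the Benjamini-Hochberg false discovery rate adjustment to a list of p-values; choosing this procedure is an assumption, and the function name and example p-values are hypothetical rather than taken from AutoCoEv or CAPS2.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda k: p_values[k])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for four tested protein pairs.
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)
print([round(a, 4) for a in adj])
```

When many protein pairs are tested at once, comparing the adjusted values rather than the raw p-values to a significance level limits the expected share of false positives among the reported pairs.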
Fusion genes in breast cancer
Fusion genes caused by chromosomal rearrangements are a common and important feature in
haematological malignancies, but have until recently been seen as unimportant in epithelial
cancers. The discovery of recurrent fusion genes in prostate and lung cancer suggests that
fusion genes may play an important role in epithelial carcinogenesis, and that they have been
previously under-reported due to the difficulties of cytogenetic analysis of solid tumours. In
particular, breast cancers often have complex, highly rearranged karyotypes which have proved
difficult to analyse using classical cytogenetic techniques.
The aim of this project was to search for fusion genes in breast cancer by using high-resolution
mapping of chromosome rearrangements in breast cancer cell lines. Mapping the chromosome
rearrangements was initially done using high-resolution DNA microarrays and fluorescence in situ
hybridisation, but moved to high-throughput sequencing as it became available. Interesting
candidate genes identified from the mapped chromosome rearrangements were investigated
on a larger set of cell lines and primary tumours.
The complete karyotypes of two breast cancer cell lines were constructed using a combination
of microarrays, fluorescence microscopy, and high-throughput sequencing. A number of
potential fusion genes were identified in these two cell lines. Although no expressed fusion
genes were found, the complete karyotypes gave insight into the number and mechanisms of
chromosome rearrangement in breast cancer, and identified interesting candidate genes which
may be of importance in tumourigenesis. Two genes which were fused in other breast cancer
cell lines, BCAS3 and ODZ4, were disrupted by chromosome rearrangements and identified as
interesting candidate genes in tumourigenesis.
A bioinformatic pipeline to process high-throughput sequencing data was set up and validated.
It was shown to predict fusion genes more accurately than other methods and can be used to
investigate further cell lines and tumours for recurrent fusion genes. The pipeline was used to
analyse data from 3 other breast cancer cell lines and predict chromosomal rearrangements
and fusion genes, several of which were found to be expressed. Of the fusions predicted in the
cell line ZR-75-30, 7 were found to be expressed and may have functional
significance in breast cancer. This work was supported by a grant from Breast Cancer Campaign.
Bioinformatics and Next Generation Sequencing: Applications of Arthropod Genomes
Over the past decade, Next Generation Sequencing (NGS) technology has been broadly applied in many areas such as genomics, medical diagnosis, biotechnology, virology, biological systematics, forensic biology, and anthropology. Taken together, it has offered us brilliant insights into life sciences. Most of the work presented in this thesis describes NGS applications on genome assembly, genome annotation, and comparative genomics, using arthropods as case studies: (1) by sequencing and analyzing the genomes of three Tetranychus spider mites with three completely different feeding behaviors, we uncovered genomic signature variations indicative of pest adaptations; (2) we sequenced, assembled and annotated five Brevipalpus flat mite genomes and their corresponding endosymbiont Cardinium genomes, and comparative genomics reveals herbivorous pest adaptations and parthenogenesis; (3) the complete genomic analysis of the parasitoid wasp Copidosoma floridanum indicates the mechanism of polyembryony in this primary parasite of moths. Through bioinformatics and genomics approaches, my study provides a genomic basis and establishes hypotheses for future biological research on pests and arthropods. These NGS applications of arthropod genomes will offer new insights into arthropod evolution and plant-herbivore interactions, open unique opportunities to develop novel plant protection strategies, and, additionally, provide arthropod genomic resources.
Probabilistic Protein Design, Comparative Modeling, and the Structure of a Multidomain P53 Oligomer Bound to DNA
Proteins are the main functional components of all cellular processes, and most of them fold into unique three-dimensional shapes guided by their amino-acid sequence. Discovering the structure of a protein, or protein complexes, can provide important clues about how they perform their function. However, the chemical, physical or architectural properties of many proteins impede traditional approaches to structure determination. Two such proteins, the tumor suppressor p53 and the cholesterol processing enzyme endothelial lipase, are prime examples of problematic proteins that defy structural investigation via crystallographic methods. Therefore, new techniques must be developed to gain valuable structural insights, such as: computationally assisted protein design strategies, more efficient crystal screening, or a combination of both.
We applied a statistical computationally assisted design strategy to stabilize a p53 variant consisting of two independently folding domains. The re-engineered variant retained normal DNA-binding activities, and allowed us to experimentally determine the first structure of a physiologically active multi-domain p53 tetramer bound to a full-length DNA response element. We then demonstrated how computational methodology can be used to gain functional detail of proteins in the absence of experimentally determined structures. By creating comparative models of endothelial lipase, we discovered structural features that describe function and regulation, and gained a better understanding of the mechanisms conferring substrate specificity.
Additionally, traditional methods for protein structure determination, such as X-ray crystallography, require relatively large amounts of purified sample in order to screen a sufficient variety of conditions. To improve this process, we developed a novel method for protein crystal screening using a microfluidics platform. We show how it is possible to use smaller quantities of protein to screen larger varieties of conditions, in turn increasing the probability of success in obtaining crystals. Furthermore, in contrast to current crystallographic approaches, all steps from screening to crystal growth to data collection were performed within the same reaction chamber, without any manipulation of the crystal, dramatically reducing both the time and the sample required to determine the structure. Collectively, these results demonstrate how advances in computational and experimental approaches can provide structural detail for proteins in circumstances where traditional methodology fails.
- …