35 research outputs found

    From Sequence to Structure And Back Again: An Alignment Tale

    Heringa, J. [Promotor

    Comparative and molecular characterisation of a schizophrenia susceptibility locus

    A substantial genetic contribution to the aetiology of schizophrenia and other major mental illnesses has been convincingly and repeatedly established by family, twin and adoption studies. However, phenotypic and genetic heterogeneity have severely hampered linkage and association studies, and consequently the molecular basis of the genetic contribution remains undefined. The use of cytogenetic abnormalities to identify disease loci is a well-established technique that overcomes many of the problems of linkage and association studies. A balanced t(1;11)(q42;q14) translocation segregates in a large Scottish family (LOD = 7.1) with schizophrenia and related psychiatric disorders. At least three independent studies have also identified the 1q42 region of the genome as a susceptibility locus for major mental illness. The chromosome 1 breakpoint region now represents one of the best-supported loci for susceptibility to major mental illness. Two novel genes are directly disrupted by the chromosome 1 breakpoint, Disrupted-In-Schizophrenia 1 and 2 (DISC1 and DISC2). The central hypothesis of this work is that genes directly disrupted by, or near to, the chromosome 1 breakpoint contribute a significant susceptibility to major mental illness. This thesis set out to characterise DISC1, DISC2 and neighboring genes through comparative sequence analysis. Specifically, the research aimed to better define the locus, the genes, their functions and regulatory sequences, and to evaluate the functional consequences of the translocation and how these may relate to the t(1;11) phenotype. Human genomic sequence over the breakpoint region was assembled. The DISC1 region of the Fugu rubripes genome was cloned and 45 kb of contiguous genomic sequence generated. The orthologous region of the mouse and chicken genomes was identified and characterised.
A pipeline for preliminary genomic annotation and subsequent comparative genomic analysis was developed using the cystic fibrosis locus as a model, and subsequently applied to the DISC1 locus. The method of "annotation anchored global sequence alignment" substantially increased the sensitivity in detection of biologically relevant conserved sequence motifs. Comparative genomic analysis, RT-PCR and cDNA clone identification were used to construct a transcriptional map of the Fugu genomic region and refine the human transcription map. Conservation of synteny between 0.7 Mb of the human genome and 45 kb of the Fugu genome was demonstrated, with one boundary of synteny being clearly defined. The region of conserved synteny contained the genes Egg Laying Nine-1 (EGLN1), Translin Associated factor X (TRAX) and DISC1 in both species. EGLN1 was found to be a member of a previously undescribed gene family. The mouse and human members were identified and characterised. In addition, evolutionary evidence for a novel mechanism of host-pathogen interactions was discovered. TRAX and its homologue Translin were tentatively identified as members of a nucleic acid helicase family of proteins, providing a mechanistic basis for their known biological roles, and suggesting previously undescribed functional aspects of these proteins. DISC1 was found to be rapidly evolving in both genomic structure and protein sequence, although three N-terminal motifs and blocks of coiled-coil-forming potential in the C-terminal half of the protein are conserved features, suggesting a general structure and function for the protein. Neither the antisense transcript DISC2 nor the intergenic splicing of TRAX to DISC1 is conserved in Fugu. The work presented in this thesis has substantially enhanced understanding of the chromosome 1 breakpoint locus both at the genomic and encoded protein level.
Two novel gene families have been defined and characterised, allowing a more complete evaluation of their functional candidacy in the aetiology of major mental illness. The sequence and clone resources resulting from this work also form the basis for protein functional studies and future characterisation of the locus in animal models.

    Tracing the molecular and evolutionary determinants of novel functions in protein families

    This thesis explores the limits of homology-based inference of protein function and evolution, where overall similarity between sequences can be a poor indicator of functional similarity or evolutionary relationships. Each case presented has undergone different patterns of evolutionary change due to differing selective pressures. Surface adaptations and regulatory (e.g., gene expression) divergence are examined as molecular determinants of novel functions whose patterns are easily missed by assessments of overall sequence similarity. Following this, internal repeats and mosaic sequences are investigated as cases in which key evolutionary events involving fragments of protein sequences are masked by overall comparison. Lastly, virulence factors, which cannot be unified based on sequence, are predicted by analysis of elevated host-mimicry patterns in pathogenic versus non-pathogenic bacterial genomes. These patterns have resulted from unique co-evolutionary pressures that apply to bacterial pathogens, but may be lacking in their close relatives. A recurring theme in the proteins, genes and genomes analyzed is an involvement in microbial pathogenesis or pathogen defense. Due to the ongoing "evolutionary arms race" between hosts and pathogens, virulence and defense proteins have generated, and will likely continue to generate, evolutionary novelties. Thus, they demonstrate the necessity to look beyond overall sequence comparison and to assess multiple dimensions of functional innovation in proteins.

    Comparative Genomics of Microbial Chemoreceptor Sequence, Structure, and Function

    Microbial chemotaxis receptors (chemoreceptors) are complex proteins that sense the external environment and signal for flagella-mediated motility, serving as the GPS of the cell. In order to sense a myriad of physicochemical signals and adapt to diverse environmental niches, sensory regions of chemoreceptors are frenetically duplicated, mutated, or lost. Conversely, the chemoreceptor signaling region is a highly conserved protein domain. Extreme conservation of this domain is necessary because it determines very specific helical secondary, tertiary, and quaternary structures of the protein while simultaneously choreographing a network of interactions with the adaptor protein CheW and the histidine kinase CheA. This dichotomous nature has split the chemoreceptor community into two major camps, studying either an organism's sensory capabilities and physiology or the molecular signal transduction mechanism. Fortunately, the current vast wealth of sequencing data has enabled comparative study of chemoreceptors. Comparative genomics can serve as a bridge between these communities, connecting sequence, structure, and function through comprehensive studies on scales ranging from minute and molecular to global and ecological. Herein are four works in which comparative genomics illuminates unanswered questions across the broad chemoreceptor landscape. First, we used evolutionary histories to refine chemoreceptor interactions in Thermotoga maritima, pairing phylogenetics with X-ray crystallography. Next, we uncovered the origin of a unique chemoreceptor, isolated only from hypervirulent strains of Campylobacter jejuni, by comparing chemoreceptor signaling and sensory regions from Campylobacter and Helicobacter. We then selected the opportunistic human pathogen Pseudomonas aeruginosa to address the question of assigning multiple chemoreceptors to multiple chemotaxis pathways within the same organism. We assigned all P. aeruginosa receptors to pathways using a novel in silico approach by incorporating sequence information spanning the entire taxonomic order Pseudomonadales and beyond. Finally, we surveyed the chemotaxis systems of all environmental, commensal, laboratory, and pathogenic strains of the ubiquitous Escherichia coli, where we discovered an ancestral chemoreceptor gene loss event that may have predisposed a well-studied subpopulation to adopt extra-intestinal pathogenic lifestyles. Overall, comparative genomics is a cutting-edge method for comprehensive chemoreceptor study that is poised to promote synergy within, and expand the significance of, the chemoreceptor field.

    Computational characterization of tandem repeat and non-globular proteins

    The first protein structure to be determined was hemoglobin, a globe-like, water-soluble protein with enzymatic activity. Since then, protein science has been biased towards this type, termed globular. However, over the last decades accumulating experimental evidence has suggested the functional importance of their counterpart, non-globular proteins (NGPs). The definition includes tandem repetitions, intrinsically disordered regions, aggregating domains and transmembrane domains. NGP recognition and classification is essential to shed light on the so-called "dark proteome", i.e. the large fraction that we know almost nothing about. I contributed to this goal through the development of new resources dedicated to NGPs. My main focus is tandem repeat proteins (TRPs). TRPs are characterized by a repeated sequence which folds into a modular architecture, where modules are called "units". The unit represents not only the structural but also the evolutionary module, and forms the basis of TRP classification. TRPs are widespread in all types of organisms, where they carry out fundamental functions. The sequences of TRP units diverge quickly while maintaining their fold, hampering detection by traditional methods for sequence analysis. Conversely, the challenges of structure-based repeat detection lie in the multidimensional nature of the data. Specialized methods have been developed for TRP identification; however, few of them annotate single repeat units. RepeatsDB is a database of TRP structures annotated with the position of repeat units and insertions. I contributed to the new version of the RepeatsDB database, which was populated taking advantage of ReUPred, a predictor of tandem repeat units. The quality of RepeatsDB data is guaranteed by manual validation, a time-consuming task which requires community annotation efforts. To facilitate this process I developed RepeatsDB-lite, a web server for the prediction and refinement of tandem repeats in protein structure.
Analysing RepeatsDB data, I compared the sequence- and structure-based classification of TRPs. Moreover, I provided insights into the role of TRPs in the human proteome by characterizing them in terms of function, protein-protein interaction networks and impact on diseases. As a case study, I characterized Collagen V, a repeat protein associated with Ehlers-Danlos syndrome, identifying genotype-phenotype correlations in relation to its interaction network model. Another category of NGPs is intrinsically disordered proteins (IDPs), devoid of order in their native state. Intrinsic disorder was shown to be prevalent in the human proteome, to play important signaling and regulatory roles and to be frequently involved in disease. I contributed to MobiDB, a database of protein disorder and mobility annotations that describes several aspects of NGP structure and mechanism of function. MobiDB provides consensus predictions and functional annotations for all known protein sequences. A common feature of TRPs, IDPs and other NGPs is that they are characterized by low-complexity regions, where the distribution of amino acids deviates from the common amino acid usage. The functional importance of low-complexity regions is strictly related to their non-globular arrangement. I contributed to the field with a critical review focusing on the definition of sequence features of low-complexity regions and their relationship to structural features. Finally, I exploited the knowledge acquired on NGPs in the previous studies to design one of the first sequence-based methods for the prediction of protein solubility, SODA. SODA uses the aggregation propensity, intrinsic disorder, hydrophobicity and secondary structure preferences from a sequence to evaluate solubility changes introduced by a mutation. The main envisaged applications of SODA are in protein engineering and in the study of the impact of protein mutations on disease onset.

    On the evolution of genetic diversity in RNA virus species : uncovering barriers to genetic divergence and gene length in picorna- and nidoviruses

    This thesis combines the use of standard bioinformatics analyses with the development of new computational techniques to study the evolution and genetic diversity of picornaviruses and nidoviruses. It integrates two lines of research (genetics-based virus classification and evolutionary dynamics of gene length) and aims at unveiling commonalities in the biology of these and other RNA viruses as well as assisting applied research in virology. NBIC, European Union. UBL - phd migration 201

    Annotation and comparative analysis of fungal genomes: a hitchhiker's guide to genomics

    This thesis describes several genome-sequencing projects, including those of the fungi Laccaria bicolor S238N-H82, Glomus intraradices DAOM 197198, Melampsora larici-populina 98AG31, Puccinia graminis, Pichia pastoris GS115 and Candida bombicola, as well as that of the haptophyte Emiliania huxleyi CCMP1516. These species are important organisms in many respects. For instance, L. bicolor and G. intraradices are symbiotic fungi that grow in association with trees and occupy an important ecological niche in promoting tree growth; M. larici-populina and P. graminis are two devastating fungi threatening plants; the tiny yeast P. pastoris is the major protein production platform in the pharmaceutical industry; the biosurfactant-producing yeast C. bombicola is likely to provide a detergent of low ecotoxicity; and E. huxleyi occupies a unique phylogenetic position among the chromalveolates and contributes to the global carbon cycle. The completion of the genome sequence and the subsequent functional studies broaden our understanding of these complex biological systems and promote the species as possible model organisms. However, it is commonly observed that genome sequencing projects are launched with a lot of enthusiasm but are often frustratingly difficult to finish. Part of the reason is the ever-increasing expectations regarding quality delivery (both with respect to data and analyses). The Introductory Chapter aims to provide an overview of how best to conduct a genome sequencing project. It explains the importance of understanding the basic biology and genetics of the target organism. It also discusses the latest developments in new (next) generation high throughput sequencing (HTS) technologies, how to handle the data and their applications. The emergence of the new HTS technologies brings the whole of biological research to a new frontier.
For instance, with the help of the new sequencing technologies, we were able to sequence the genome of interest to us, namely that of Pichia pastoris. This tiny yeast, the analysis of which forms the bulk of this thesis, is an important heterologous production platform because its methanol assimilation properties make it ideally suited to large-scale industrial production. The unique protein assembly pathway of P. pastoris also attracts much basic research interest. We used the new HTS method to sequence and assemble the GS115 genome into four chromosomes and made it publicly available to the research community (Chapter 2 and Chapter 3). The public release of the GS115 genome brought broader interest in the comparison of GS115 and its parental strains. By sequencing the parental strain of GS115 with different new sequencing platforms, we identified several point mutations in the coding genes that likely contribute to the higher protein translocation efficiency in GS115. The sequence divergence and copy number variation of rDNA between strains also explain the difference in protein production efficiency (Chapter 4). Before 2008, the Sanger sequencing method was the only technology to obtain high quality complete genomes of eukaryotes. Because of the high cost of the Sanger method, for the other genome projects discussed in this thesis it was necessary to team up with many other partners and to rely on the U.S. Department of Energy Joint Genome Institute (DOE-JGI) and the Broad Institute to generate the genome sequence. The M. larici-populina strain 98AG31 and the Puccinia graminis f. sp. tritici strain CRL 75-36-700-3 are two devastating basidiomycete 'rusts' that infect poplar and wheat. Lineage-specific gene family expansions in these two rusts highlight their possible role in the obligate biotrophic life-style. Two large sets of effector-like small-secreted proteins with different primary sequence structures were identified in each organism.
The in planta-induced transcriptomic data showed upregulation of these lineage-specific genes, which are likely involved in the establishment of the rust-host interaction. An additional immunolocalization study on M. larici-populina, described in Chapter 5, confirmed the accumulation of some candidate effectors in the haustoria and infection hyphae.

    Exploring functional annotation through genomic and metagenomic data mining

    Functional profiling of genomes and metagenomes, as well as data mining for novel proteins, all rely on computational methods for functional annotation of protein sequences. Standard methods assign protein function based on detected homology to reference sequences, but often leave behind a significant fraction of hypothetical sequences ("dark matter") that cannot be annotated. To maximize our ability to extract new biological insights from newly sequenced genomes, it is critical to understand the advantages and limitations of homology-based annotation, and explore alternative methods for inferring function. In this thesis, I performed a comprehensive exploration of computational protein annotation, with a focus on bacterial genomes and metagenomes. First, I applied homology-based methods to functionally annotate and analyze original datasets including newly sequenced Streptomyces strains, a wastewater metagenome, and microbial communities involved in vertebrate decomposition. These studies identified genes and functions of interest including cellulases, antibiotic resistance genes, and virulence factors. I then explored the limits of homology-based annotation by measuring annotation coverage, the fraction of annotated proteins in a proteome, across ~27,000 organisms in the microbial tree of life. This study demonstrated a wide range in annotation coverage across bacteria, from 2% to 86%. In addition, it revealed multiple factors, including taxonomy, genome size, and research bias, as strong influences on the degree to which proteomes could be annotated. To gain biological insights into hypothetical proteins of unknown function, I analyzed 4,049 domains of unknown function (DUFs) from Pfam. Using phylogenomic, taxonomic and metagenomic information, I detected statistical associations between domains and biological traits.
Association-based methods uncovered environment, lineage, and/or pathogen associations in just under half of all DUFs and highlighted new families such as DUF4765 as intriguing virulence factor candidates. Finally, I constructed a database of "ORFan" metagenomic sequences that cannot be annotated using standard approaches, and inferred functions for tens of thousands of these sequences using profile-profile comparison approaches. Motif analysis and genomic context validated these predictions, enabling the discovery of hundreds of novel candidate metalloproteases. Protein "dark matter", which includes a large pool of unannotated coding sequences, is a rich resource for finding new proteins and functions of interest, and suggestions are included on how to prioritize these sequences for future study. A combination of homology-based and alternative annotation methods will be most effective for broad functional profiling of genomes and metagenomes, and can push the boundaries for functional interpretation of sequence data.
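The abstract above defines annotation coverage as the fraction of annotated proteins in a proteome. A minimal Python sketch of that ratio follows. It is an illustration only, not the thesis's pipeline: the function name, the dictionary layout (protein ID mapped to an annotation string, or None for a hypothetical protein) and the toy data are all assumptions.

```python
def annotation_coverage(annotations):
    """Return the fraction of proteins that carry a functional annotation.

    `annotations` (hypothetical layout) maps a protein ID to its annotation
    string, or to None when no function could be assigned.
    """
    if not annotations:
        raise ValueError("proteome is empty")
    # Count entries with a non-empty annotation.
    annotated = sum(1 for label in annotations.values() if label)
    return annotated / len(annotations)


# Toy proteome with made-up IDs and labels, for illustration only.
toy_proteome = {
    "prot_1": "cellulase",
    "prot_2": None,
    "prot_3": "metalloprotease",
    "prot_4": None,
}

coverage = annotation_coverage(toy_proteome)
print(f"{coverage:.0%}")  # prints "50%"
```

In practice, what counts as "annotated" depends on the annotation source and thresholds used, which the abstract does not specify.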

    PROTEIN FUNCTION, DIVERSITY AND FUNCTIONAL INTERPLAY

    Functional annotation of novel or unknown proteins is one of the central problems in post-genomics bioinformatics research. With the vast expansion of genomic and proteomic data and technologies over the last decade, development of automated function prediction (AFP) methods for large-scale identification of protein function has become imperative in many respects. In this research, we address two important divergences from the “one protein – one function” concept on which all existing AFP methods are developed.