284 research outputs found

    Structural, evolutionary and functional analysis of the NAC domain protein family in Eucalyptus

    NAC domain transcription factors regulate many developmental processes and stress responses in plants and vary widely in number and family structure. We analysed the characteristics and evolution of the NAC gene family of Eucalyptus grandis, a fast-growing forest tree in the rosid order Myrtales. NAC domain genes identified in the E. grandis genome were subjected to amino acid sequence, phylogenetic and motif analyses. Transcript abundance in developing tissues and abiotic stress conditions in E. grandis and E. globulus was quantified using RNA-seq and RT-qPCR. The 189 E. grandis NAC (EgrNAC) proteins, arranged into 22 subfamilies, are extensively duplicated in subfamilies associated with stress response. Most EgrNAC genes form tandem duplicate arrays that frequently carry signatures of purifying selection. Sixteen amino acid motifs were identified in EgrNAC proteins, eight of which are enriched in, or unique to, Eucalyptus. New candidates for the regulation of normal and tension wood development and cold responses were identified. This first description of a Myrtales NAC domain family reveals a unique history of tandem duplication in stress-related subfamilies that has likely contributed to the adaptation of eucalypts to the challenging Australian environment. Several new candidates for the regulation of stress, wood formation and tree-specific development are reported. Funding: ANR (Project “Tree For Joules” ANR-2010-KBBE-007-01; Labex Tulip ANR-10-LABX-41), the CNRS, and the University Toulouse III (UPS); Bioinformatics and Functional Genomics Programme of the National Research Foundation, South Africa (UID 71255). http://onlinelibrary.wiley.com/journal/10.1111/(ISSN)1469-8137 2016-06-30

    Variant Surface Antigens Of Malaria Parasites: Functional And Evolutionary Insights From Comparative Gene Family Classification And Analysis

    Background: Plasmodium parasites, the causative agents of malaria, express many variant antigens on cell surfaces. Variant surface antigens (VSAs) are typically organized into large subtelomeric gene families that play critical roles in virulence and immune evasion. Many important aspects of VSA function and evolution remain obscure, impeding our understanding of virulence mechanisms and vaccine development. To gain further insights into VSA function and evolution, we comparatively classified and examined known VSA gene families across seven Plasmodium species. Results: We identified a set of ultra-conserved orthologs within the largest Plasmodium gene family pir, which should be considered as high-priority targets for experimental functional characterization and vaccine development. Furthermore, we predict a lipid-binding domain in erythrocyte surface-expressed PYST-A proteins, suggesting a role of this second largest rodent parasite gene family in host cholesterol salvage. Additionally, it was found that PfMC-2TM proteins carry a novel and putative functional domain named MC-TYR, which is conserved in other P. falciparum gene families and rodent parasites. Finally, we present new conclusive evidence that the major Plasmodium VSAs PfEMP1, SICAvar, and SURFIN are evolutionarily linked through a modular and structurally conserved intracellular domain. Conclusion: Our comparative analysis of Plasmodium VSA gene families revealed important functional and evolutionary insights, which can now serve as starting points for further experimental studies.

    Gene regulatory element prediction with Bayesian networks

    Ph.D., Doctor of Philosophy

    New approaches to facilitate genome analysis

    In this era of concerted genome sequencing efforts, biological sequence information is abundant. With many prokaryotic and simple eukaryotic genomes completed, and with the genomes of more complex organisms nearing completion, the bioinformatics community, those charged with the interpretation of these data, are becoming concerned with the efficacy of current analysis tools. One step towards a more complete understanding of biology at the molecular level is the unambiguous functional assignment of every newly sequenced protein. The sheer scale of this problem precludes the conventional process of biochemically determining function for every example. Rather we must rely on demonstrating similarity to previously characterised proteins via computational methods, which can then be used to infer homology and hence structural and functional relationships. Our ability to do this with any measure of reliability unfortunately diminishes as the pools of experimentally determined sequence data become muddied with sequences that are themselves characterised with "in silico" annotation. Part of the problem stems from the complexity of modelling biology in general, and of evolution in particular. For example, once similarity has been identified between sequences, in order to assign a common function it is important to identify whether the inferred homologous relationship has an orthologous or paralogous origin, which currently cannot be done computationally. The modularity of proteins also poses problems for automatic annotation, as similar domains may occur in proteins with very different functions. Once accepted into the sequence databases, incorrect functional assignments become available for mass propagation and the consequences of incorporating those errors in further "in silico" experiments are potentially catastrophic.
One solution to this problem is to collate families of proteins with demonstrable homologous relationships, derive a pattern that represents the essence of those relationships, and use this as a signature to trawl for similarity in the sequence databases. This approach not only provides a more sensitive model of evolution, but also allows annotation from all members of the family to contribute to any assignments made. This thesis describes the development of a new search method (FingerPRINTScan) that exploits the familial models in the PRINTS database to provide more powerful diagnosis of evolutionary relationships. FingerPRINTScan is both selective and sensitive, allowing both precise identification of super-family, family and sub-family relationships, and the detection of more distant ones. Illustrations of the diagnostic performance of the method are given with respect to the haemoglobin and transfer RNA synthetase families, and whole genome data. FingerPRINTScan has become widely used in the biological community, e.g. as the primary search interface to PRINTS via a dedicated web site at the University of Manchester, and as one of the search components of InterPro at the European Bioinformatics Institute (EBI). Furthermore, it is currently responsible for facilitating the use of PRINTS in a number of significant annotation roles, such as the automatic annotation of TrEMBL at the EBI, and as part of the computational suite used to annotate the Drosophila melanogaster genome at Celera Genomics.

    Unsupervised and semi-supervised training methods for eukaryotic gene prediction

    This thesis describes new gene finding methods for eukaryotic gene prediction. The current methods for deriving model parameters for gene prediction algorithms are based on curated or experimentally validated sets of genes or gene elements. These training sets often require time and additional expert effort, especially for species that are in the initial stages of genome sequencing. Unsupervised training allows determination of model parameters from anonymous genomic sequence. Its practical applicability is critical given the ever-growing rate of eukaryotic genome sequencing. Three distinct training procedures are developed for diverse groups of eukaryotic species. GeneMark-ES is developed for species with strong donor and acceptor site signals, such as Arabidopsis thaliana, Caenorhabditis elegans and Drosophila melanogaster. The second version of the algorithm, GeneMark-ES-2, introduces an enhanced intron model to better describe the gene structure of fungal species that possess relatively weak donor and acceptor splice sites and a well-conserved branch point signal. GeneMark-LE, a semi-supervised training approach, is designed for eukaryotic species with a small number of introns. The results indicate that the developed unsupervised training methods perform well compared to other training methods, as estimated from the set of genes supported by EST-to-genome alignments. Analysis of novel genomes reveals interesting biological findings and shows that the current set of annotated fungal genomes contains several candidate under-annotated and over-annotated species. Ph.D. Committee Chair: Mark Borodovsky; Committee Member: Jung H. Choi; Committee Member: King Jordan; Committee Member: Leonid Bunimovich; Committee Member: Yury Chernof

    Comprehensive phylogenetic study of ECF sigma factors

    Extracytoplasmic function (ECF) σ factors are the most minimalistic members of the σ70 family. ECFs and their activity regulators are one of the main signal transduction mechanisms that allow bacteria to respond to extracellular changes. Aside from their natural role in bacterial homeostasis, ECFs are generally host independent and functionally orthogonal, which makes them especially attractive for constructing bacterial synthetic circuits. In silico identification of sets of ECFs, their target promoters and their regulators is particularly simple since ECFs and their regulators are typically encoded in the same genetic neighborhood and usually in the same operon, and ECFs usually target their own promoter. Earlier work on the phylogenetic classification of ECFs revealed that ECF groups, which harbor proteins with a similar sequence, correlate with regulator type and target promoter motif elements. This showed that the phylogenetic classification of ECFs is essential to understand their modes of regulation. The large number of sequenced bacterial genomes currently deposited in databases suggests that an ECF reclassification would expand our knowledge of ECF regulation. This thesis addresses the analysis of the main modes of regulation found in a comprehensive classification of the ECF σ factor subfamily. For this study, I first extracted ECFs from all bacterial genomes deposited in NCBI. I identified more than 170,000 unique protein sequences that are likely to function as ECFs. This resulted in a 50-fold expansion over the original ECF library. Then, I classified the conserved σ domains of these proteins into more than 150 phylogenetic groups, each associated with a conserved type of regulator. I systematically described each ECF group in terms of its putative regulator, putative target promoter, taxonomic distribution and putative function. I confirmed these predictions for groups with described members.
Anti-σ factors are the main type of ECF regulator across groups, followed by C-terminal extensions of the ECF protein itself and serine/threonine kinases, which have been suggested to phosphorylate ECFs. I hypothesized new alternative types of regulators for some ECF groups. Using a combination of bioinformatic tools and collaborating with different experimental research groups, I focused on the most important regulatory elements of ECFs to shed light on their mechanism of regulation. In the case of anti-σ factors, I focused on their most common type, class I anti-σ factors, to reveal two shared binding interfaces between ECFs and these inhibitors. Then, I focused on the three largest ECF groups associated with C-terminal extensions, showing a different role of this additional region in the control of ECF activity in the different groups. Lastly, I focused on serine/threonine kinases to find that phosphorylation compensates for the lack of negative charges in one of the main RNA polymerase binding surfaces of ECF σ factors. In summary, this thesis provides the scientific community with a comprehensive overview of ECF σ factor regulation, target promoter and function across phylogenetic groups, and sheds light on some of their most important regulatory mechanisms.

    IDENTIFICATION OF FUNCTIONAL DOMAINS IN NON-CODING RNA

    Long non-coding RNAs (lncRNAs) are known to be important regulators of gene expression and other cellular functions. However, only a very small proportion of lncRNAs have been extensively studied. The remainder exist largely as annotations in a database with no known function, if any. A primary challenge to understanding how lncRNAs function is the poorly understood sequence-to-function relationship relative to protein coding genes. Within lncRNA transcripts, boundaries of functional sequence are not explicitly defined by exon-intron boundaries, and the code by which lncRNAs derive function is not nearly as explicit as in a protein coding reading frame. To address these challenges, we have developed a probabilistic framework, hmmSEEKR, for identifying where within a non-coding RNA functional regions may be located, based on enrichment of short motifs, or k-mers. We used hmmSEEKR to identify functional sequence domains in several lncRNAs that silence gene expression through recruitment of Polycomb, using XIST as a model transcript. These predicted sequence domains share no detectable linear sequence alignment with XIST; however, they share high k-mer based similarity with known functional domains in XIST and coincide precisely with the location of RNA binding protein (RBP) interactions known to be important for Polycomb mediated silencing. Furthermore, we were able to extend our analysis to the entire transcriptome and identify many XIST-like sequence domains throughout the transcriptome that interact with Polycomb-associated RBPs. We have packaged these algorithms into Python-based software and have included an in-depth walk-through of the code and a tutorial on how to analyze sequences using hmmSEEKR. Doctor of Philosophy
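
    The idea of k-mer based similarity without linear sequence alignment, as mentioned in the abstract above, can be illustrated with a minimal sketch. This is not hmmSEEKR and makes no claim about its implementation: the function names, the choice of k = 3, and the use of Pearson correlation between k-mer frequency profiles are all assumptions made for illustration only.

```python
from itertools import product
from math import sqrt


def kmer_profile(seq, k=3, alphabet="ACGT"):
    """Return k-mer frequencies of seq in a fixed (lexicographic) k-mer order.

    Hypothetical helper: windows containing characters outside the
    alphabet are skipped, and U is treated as T so RNA input works.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper().replace("U", "T")
    for i in range(max(len(seq) - k + 1, 0)):
        word = seq[i:i + k]
        if word in counts:
            counts[word] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on short input
    return [counts[w] / total for w in kmers]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def kmer_similarity(seq_a, seq_b, k=3):
    """Compare two sequences by the correlation of their k-mer profiles."""
    return pearson(kmer_profile(seq_a, k), kmer_profile(seq_b, k))


if __name__ == "__main__":
    a = "ACGACGACGTTTTTT"
    b = "TTTTTTACGACGACG"  # same blocks as a, in a different order
    c = "GGGGGGCCCCCCGGG"  # unrelated composition
    print(kmer_similarity(a, b), kmer_similarity(a, c))
```

    In this toy example the two sequences built from the same blocks in a different order score far higher than the unrelated one, which is the intuition behind comparing transcripts by k-mer content when they do not align linearly.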

    Innovative Algorithms and Evaluation Methods for Biological Motif Finding

    Biological motifs are defined as overly recurring sub-patterns in biological systems. Sequence motifs and network motifs are examples of biological motifs. Due to the wide range of applications, many algorithms and computational tools have been developed for efficient search for biological motifs. As a result, there are more computationally derived motifs than experimentally validated motifs, and how to validate the biological significance of the ‘candidate motifs’ becomes an important question. Some sequence motifs are verified by their structural similarities or their functional roles in DNA or protein sequences, and stored in databases. However, the biological role of network motifs has not yet been validated, and currently no databases exist for this purpose. In this thesis, we focus not only on the computational efficiency but also on the biological meanings of the motifs. We provide an efficient way to incorporate biological information with clustering analysis methods: for example, a sparse nonnegative matrix factorization (SNMF) method is used with Chou-Fasman parameters for protein motif finding. Biological network motifs are searched by various clustering algorithms with Gene Ontology (GO) information. Experimental results show that the algorithms perform better than existing algorithms by producing a larger number of high-quality biological motifs. In addition, we apply biological network motifs to the discovery of essential proteins. Essential proteins are defined as a minimum set of proteins which are vital for development to a fertile adult and for cellular life in an organism. We design a new centrality algorithm with biological network motifs, named MCGO, and score proteins in a protein-protein interaction (PPI) network to find essential proteins. MCGO is also combined with other centrality measures to predict essential proteins using machine learning techniques.
This thesis makes three contributions to the study of biological motifs: 1) clustering analysis is efficiently used in this work and biological information is easily integrated with the analysis; 2) we focus more on the biological meanings of motifs by adding biological knowledge to the algorithms and by suggesting biologically related evaluation methods; 3) biological network motifs are successfully applied to a practical application, the prediction of essential proteins.

    Application of machine learning and deep learning for proteomics data analysis
