9 research outputs found

    Automatic and manual functional annotation in a distributed web service environment

    While the number of genomic sequences becoming available is increasing exponentially, most genes are not functionally well characterized. Finding out more about the function of a gene and about functional relationships between genes will be the next big bottleneck in the post-genomic era. On the one hand, improved pipelines and tools are needed in this context, because running experiments for all predicted genes is not feasible. On the other hand, manual curation of the automatic predictions is necessary to judge the reliability of the automatic annotation and to get a more comprehensive view of the function of each individual gene. Automatic functional annotation often applies a homology-based function transfer from functionally characterized genes, using methods like Blast. However, this approach has many drawbacks and makes systematic errors by not taking speciation and duplication events into account. Phylogenomics has been shown to improve the functional prediction accuracy by taking the evolutionary history of genes in a phylogenetic tree context into account. In this thesis the manual process from the assembly of the DNA sequence to the functional characterization of genes and the identification and comparison of shared syntenic regions, including the identification of candidate genes for pathogen resistance in potato chromosome V, is explained and problems discussed. To improve the automatic functional annotation in genome projects, a phylogenomic pipeline, which includes SIFTER, one of the best phylogenomic tools in this area, is introduced, improved and tested in the Medicago truncatula, Sorghum bicolor and Solanum lycopersicum genome projects. To obtain new candidate genes for the development of new drugs and crop protection products, non-plant specific genes, like the transferrin family which is not known in plants yet, are extracted from the M. truncatula and S. bicolor genomes and further investigated.
For further improvement of the annotation, a new phylogenomic approach is developed. This approach makes use of annotated functional attributes to calculate the functional mutation rate between genes and groups of genes in a phylogenetic tree and to find out if the function of a gene can be transferred or not. The new approach is integrated into the SIFTER tool and tested on the blue-light photoreceptor/photolyase family and on a test set of manually curated Arabidopsis thaliana genes. Using both test sets the prediction accuracy could be significantly improved and a more comprehensive view on the gene function could be obtained. But because still no tool is able to annotate all functions of a gene with 100% accuracy, I introduce a system for manual functional annotation, called AFAWE. AFAWE runs different web services for the functional annotation and displays the results and intermediate results in a comprehensive web interface that facilitates comparison. It can be used for any organism and any kind of gene. The inputs are the amino acid sequence and the corresponding organism. Because of its flexible structure, new web services and workflows can be easily integrated. Besides Blast searches against different databases and protein domain prediction tools, AFAWE also includes the phylogenomic pipeline. Different filters help to identify trustworthy results from each analysis. Furthermore a detailed manual annotation can be assigned to each protein, which will be used to update the functional annotation in public databases like MIPSPlantsDB

    SIFTER search: a web server for accurate phylogeny-based protein function prediction.

    We are awash in proteins discovered through high-throughput sequencing projects. As only a minuscule fraction of these have been experimentally characterized, computational methods are widely used for automated annotation. Here, we introduce a user-friendly web interface for accurate protein function prediction using the SIFTER algorithm. SIFTER is a state-of-the-art sequence-based gene molecular function prediction algorithm that uses a statistical model of function evolution to incorporate annotations throughout the phylogenetic tree. Due to the resources needed by the SIFTER algorithm, running SIFTER locally is not trivial for most users, especially for large-scale problems. The SIFTER web server thus provides access to precomputed predictions on 16 863 537 proteins from 232 403 species. Users can explore SIFTER predictions with queries for proteins, species, functions, and homologs of sequences not in the precomputed prediction set. The SIFTER web server is accessible at http://sifter.berkeley.edu/ and the source code can be downloaded

    Distribution, functional impact, and origin mechanisms of copy number variation in the barley genome

    BACKGROUND There is growing evidence for the prevalence of copy number variation (CNV) and its role in phenotypic variation in many eukaryotic species. Here we use array comparative genomic hybridization to explore the extent of this type of structural variation in domesticated barley cultivars and wild barleys. RESULTS A collection of 14 barley genotypes including eight cultivars and six wild barleys were used for comparative genomic hybridization. CNV affects 14.9% of all the sequences that were assessed. Higher levels of CNV diversity are present in the wild accessions relative to cultivated barley. CNVs are enriched near the ends of all chromosomes except 4H, which exhibits the lowest frequency of CNVs. CNV affects 9.5% of the coding sequences represented on the array and the genes affected by CNV are enriched for sequences annotated as disease-resistance proteins and protein kinases. Sequence-based comparisons of CNV between cultivars Barke and Morex provided evidence that DNA repair mechanisms of double-strand breaks via single-stranded annealing and synthesis-dependent strand annealing play an important role in the origin of CNV in barley. CONCLUSIONS We present the first catalog of CNVs in a diploid Triticeae species, which opens the door for future genome diversity research in a tribe that comprises the economically important cereal species wheat, barley, and rye. Our findings constitute a valuable resource for the identification of CNV affecting genes of agronomic importance. 
We also identify potential mechanisms that can generate variation in copy number in plant genomes. This work was financially supported by the following grants: project GABI-BARLEX, German Federal Ministry of Education and Research (BMBF), #0314000 to MP, US, KFXM and NS; Triticeae Coordinated Agricultural Project, USDA-NIFA #2011-68002-30029 to GJM; and Agriculture and Food Research Initiative Plant Genome, Genetics and Breeding Program of USDA’s Cooperative State Research and Extension Service, #2009-65300-05645 to GJM

    Distribution, functional impact, and origin mechanisms of copy number variation in the barley genome

    BACKGROUND: There is growing evidence for the prevalence of copy number variation (CNV) and its role in phenotypic variation in many eukaryotic species. Here we use array comparative genomic hybridization to explore the extent of this type of structural variation in domesticated barley cultivars and wild barleys. RESULTS: A collection of 14 barley genotypes including eight cultivars and six wild barleys were used for comparative genomic hybridization. CNV affects 14.9% of all the sequences that were assessed. Higher levels of CNV diversity are present in the wild accessions relative to cultivated barley. CNVs are enriched near the ends of all chromosomes except 4H, which exhibits the lowest frequency of CNVs. CNV affects 9.5% of the coding sequences represented on the array and the genes affected by CNV are enriched for sequences annotated as disease-resistance proteins and protein kinases. Sequence-based comparisons of CNV between cultivars Barke and Morex provided evidence that DNA repair mechanisms of double-strand breaks via single-stranded annealing and synthesis-dependent strand annealing play an important role in the origin of CNV in barley. CONCLUSIONS: We present the first catalog of CNVs in a diploid Triticeae species, which opens the door for future genome diversity research in a tribe that comprises the economically important cereal species wheat, barley, and rye. Our findings constitute a valuable resource for the identification of CNV affecting genes of agronomic importance. We also identify potential mechanisms that can generate variation in copy number in plant genomes

    Bioinformatics assisted breeding, from QTL to candidate genes

    Over the last decade, the amount of data generated by a single run of an NGS sequencer has come to outperform days of work done with Sanger sequencing. Metabolomics, proteomics and transcriptomics technologies have also evolved, producing more and more information at an ever faster rate. In addition, the number of databases available to biologists and breeders is increasing every year. The challenge for them becomes two-fold, namely: to cope with the increased amount of data produced by these new technologies and to cope with the distribution of the information across the Web. An example of a study with a lot of ~omics data is described in Chapter 2, where more than 600 peaks have been measured using liquid chromatography mass-spectrometry (LCMS) in peel and flesh of a segregating F1 apple population. In total, 669 mQTL were identified in this study. The number of mQTL identified is vast and almost overwhelming. Extracting meaningful information from such an experiment requires appropriate data filtering and data visualization techniques. The visualization of the distribution of the mQTL on the genetic map led to the discovery of QTL hotspots on linkage groups 1, 8, 13 and 16. The mQTL hotspot on linkage group 16 was further investigated and mainly contained compounds involved in the phenylpropanoid pathway. The apple genome sequence and its annotation were used to gain insight into genes potentially regulating this QTL hotspot. This led to the identification of the structural gene leucoanthocyanidin reductase (LAR1) as well as seven genes encoding transcription factors as putative candidates regulating the phenylpropanoid pathway, and thus candidates for the biosynthesis of health beneficial compounds. However, this study also indicated bottlenecks in the availability of biologist-friendly tools to visualize large-scale QTL mapping results and smart ways to mine genes underlying QTL intervals.
In this thesis, we provide bioinformatics solutions to allow exploration of regions of interest on the genome more efficiently. In Chapter 3, we describe MQ2, a tool to visualize results of large-scale QTL mapping experiments. It allows biologists and breeders to use their favorite QTL mapping tool such as MapQTL or R/qtl and visualize the distribution of these QTL among the genetic map used in the analysis with MQ2. MQ2 provides the distribution of the QTL over the markers of the genetic map for a few hundred traits. MQ2 is accessible online via its web interface but can also be used locally via its command line interface. In Chapter 4, we describe Marker2sequence (M2S), a tool to filter out genes of interest from all the genes underlying a QTL. M2S returns the list of genes for a specific genome interval and provides a search function to filter out genes related to the provided keyword(s) by their annotation. Genome annotations often contain cross-references to resources such as the Gene Ontology (GO), or proteins of the UniProt database. Via these annotations, additional information can be gathered about each gene. By integrating information from different resources and offering a way to mine the list of genes present in a QTL interval, M2S provides a way to reduce a list of hundreds of genes to possibly tens of genes or fewer potentially related to the trait of interest. Using semantic web technologies, M2S integrates multiple resources and has the flexibility to extend this integration to more resources as they become available to these technologies. Besides the importance of efficient bioinformatics tools to analyze and visualize data, the work in Chapter 2 also revealed the importance of regulatory elements controlling key genes of pathways. The limitation of M2S is that it only considers genes within the interval.
In genome annotations, transcription factors are not linked to the trait (keyword) and to the genes they control, and these relationships will therefore not be considered. By integrating information about the gene regulatory network of the organism into Marker2sequence, it should be able to include in its list of genes those outside of the QTL interval but regulated by elements present within the QTL interval. In tomato, the genome annotation already lists a number of transcription factors; however, it does not provide any information about their targets. In Chapter 5, we describe how we combined transcriptomics information with six genotypes from an Introgression Line (IL) population to find genes differentially expressed while being in a similar genomic background (i.e., outside of any introgression segments) as the reference genotype (with no introgression). These genes may be differentially expressed as a result of a regulatory element present in an introgression. The promoter regions of these genes have been analyzed for DNA motifs, and putative transcription factor binding sites have been found. The approaches taken in M2S (Chapter 4) are focused on a specific region of the genome, namely the QTL interval. In Chapter 6, we generalized this approach to develop Annotex. Annotex provides a simple way to browse the cross-references existing between biological databases (ChEBI, Rhea, UniProt, GO) and genome annotations. The main concept of Annotex is that, from any type of data present in the databases, one can navigate the cross-references to retrieve the desired type of information. This thesis has resulted in the production of three tools that biologists and breeders can use to speed up their research and build new hypotheses on. This thesis also revealed the state of bioinformatics with regards to data integration.
It also reveals the need for integration into annotations (for example, genome annotations, protein annotations, and pathway annotations) of more ontologies than just the Gene Ontology (GO) currently used. Multiple platforms are arising to build these new ontologies but the process of integrating them into existing resources remains to be done. It also confirms the state of the data in plants, where multiple resources may contain overlapping information. Finally, this thesis also shows what can be achieved when the data is made inter-operable, which should be an incentive to the community to work together and build inter-operable, non-overlapping resources, creating a bioinformatics Web for plant research.

    Identification of developmental functions for Arabidopsis thaliana genes by a reverse genetics approach based on analysis of H3K27me3 distribution

    Polycomb Group (PcG) protein mediated gene repression is essential for normal development in both plants and animals, as demonstrated by severe developmental defects resulting from their loss-of-function. PcG proteins convey repression of target genes by tri-methylation of lysine 27 of histone 3 (H3K27me3). Many H3K27me3 decorated genes encode developmental regulators in Arabidopsis thaliana and developmental functions are particularly overrepresented in tissue specific subsets of H3K27me3 targets. This study identified 105 genes specifically expressed in the shoot apex and floral organs by transcriptional clustering analysis, which are particularly enriched for shoot developmental functions according to Gene Ontology analysis. As half of the genes in this group were not characterised in detail, these were screened for a role in shoot development by analysing loss-of-function mutants and selected candidate gene overexpressor plants. Fourteen putative Development related PcG Targets in the Apex (DPAs) were identified. For five DPA mutants, developmental abnormalities were confirmed to be associated with the respective loci. Among them were genes related to flowering time, leaf size and leaf shape regulation. dpa4 loss-of-function plants display enhanced leaf serrations and enlarged petals, while leaf margins of 35S::DPA4 plants are smooth. DPA4 encodes a putative RAV (Related to ABI3/VP1) transcriptional repressor and is expressed in the lateral organ boundary region and in leaf sinuses. Total leaf area and cell numbers are not altered in dpa4 plants, suggesting that DPA4 regulates leaf margin outgrowth by inhibiting growth towards leaf serrations. DPA4 expression domains widely overlap with those of CUP-SHAPED COTYLEDON 2, known to regulate leaf margin shape. Genome-wide transcriptional profiling in dpa4 apices revealed 77 differentially expressed genes.
An overrepresentation of auxin-response elements in the promoters of these otherwise poorly characterised genes indicates a role for DPA4 in auxin-mediated signalling. This is further supported by an auxin-influx carrier mutant-like phenotype observed for 35S::DPA4 plants displaying left-hand twisted rosette leaves. Taken together, the data confirm that DPA4, which was identified as a candidate by this reverse genetics screen, is a newly identified player in the signalling network controlling leaf serrations in Arabidopsis thaliana

    Computational approaches for identifying inhibitors of protein interactions

    Inter-molecular interaction is at the heart of biological function. Proteins can interact with ligands, peptides, small molecules, and other proteins to serve their structural or functional purpose. With advances in combinatorial chemistry and the development of high throughput binding assays, the available inter-molecular interaction data is increasing exponentially. As the space of testable compounds increases, the complexity and cost of finding a suitable inhibitor for a protein interaction increases. Computational drug discovery plays an important role in minimizing the time and cost needed to study the space of testable compounds. This work focuses on the usage of various computational methods in identifying protein interaction inhibitors and demonstrates the ability of computational drug discovery to contribute to the ever growing field of molecular interaction. A program to predict the location of binding surfaces on proteins, STP (Mehio et al., Bioinformatics, 2010, in press), has been created based on calculating the propensity of triplet-patterns of surface protein atoms that occur in binding sites. The use of STP in predicting ligand binding sites, allosteric binding sites, enzyme classification numbers, and binding details in multi-unit complexes is demonstrated. STP has been integrated into the in-house high throughput drug discovery pipeline, allowing the identification of inhibitors for proteins whose binding sites are unknown. Another computational paradigm is introduced, creating a virtual library of β-turn peptidomimetics, designed to mimic the interaction of the Baff-Receptor (Baff-R) with the B-Lymphocyte Stimulator (Blys). LIDAEUS (Taylor, et al., Br J Pharmacol, 2008; 153, p. S55-S67) is used to identify chemical groups with favorable binding to Blys. Natural and non-natural sidechains are then used to create a library of synthesizable cyclic hexapeptides that would mimic the Blys:Baff-R interaction.
Finally, this work demonstrates the usage and synergy of various in-house computational resources in drug discovery. The ProPep database is a repository used to study trends, motifs, residue pairing frequencies, and amino acid enrichment propensities in protein-peptide interaction. The LHRLL protein-peptide interaction motif is identified and used with UFSRAT (S. Shave, PhD Thesis, University of Edinburgh, 2010) to conduct ligand-based virtual screening and generate a list of possible antagonists from the EDULISS (K. Hsin, PhD Thesis, University of Edinburgh, 2010) compound repository. A high throughput version of AutoDock (Morris, et al., J Comput Chem, 1998; 19, p. 1639-62) was adapted and used for precision virtual screening of these molecules, resulting in a list of compounds that are likely to inhibit the binding of this motif to several Nuclear Receptors