6 research outputs found

    Reranking candidate gene models with cross-species comparison for improved gene prediction

    Get PDF
    Background: Most gene finders score candidate gene models with state-based methods, typically HMMs, by combining local properties (coding potential, splice donor and acceptor patterns, etc). Competing models with similar state-based scores may be distinguishable with additional information. In particular, functional and comparative genomics datasets may help to select among competing models of comparable probability by exploiting features likely to be associated with the correct gene models, such as conserved exon/intron structure or protein sequence features. Results: We have investigated the utility of a simple post-processing step for selecting among a set of alternative gene models, using global scoring rules to rerank competing models for more accurate prediction. For each gene locus, we first generate the K best candidate gene models using the gene finder Evigan, and then rerank these models using comparisons with putative orthologous genes from closely-related species. Candidate gene models with lower scores in the original gene finder may be selected if they exhibit strong similarity to probable orthologs in coding sequence, splice site location, or signal peptide occurrence. Experiments on Drosophila melanogaster demonstrate that reranking based on cross-species comparison outperforms the best gene models identified by Evigan alone, and also outperforms the comparative gene finders GeneWise and Augustus+. Conclusion: Reranking gene models with cross-species comparison improves gene prediction accuracy. This straightforward method can be readily adapted to incorporate additional lines of evidence, as it requires only a ranked source of candidate gene models

    Automated alignment-based curation of gene models in filamentous fungi

    Get PDF
    BACKGROUND: Automated gene-calling is still an error-prone process, particularly for the highly plastic genomes of fungal species. Improvement through quality control and manual curation of gene models is a time-consuming process that requires skilled biologists and is only marginally performed. The wealth of available fungal genomes has not yet been exploited by an automated method that applies quality control of gene models in order to obtain more accurate genome annotations. RESULTS: We provide a novel method named alignment-based fungal gene prediction (ABFGP) that is particularly suitable for plastic genomes like those of fungi. It can assess gene models on a gene-by-gene basis making use of informant gene loci. Its performance was benchmarked on 6,965 gene models confirmed by full-length unigenes from ten different fungi. 79.4% of all gene models were correctly predicted by ABFGP. It improves the output of ab initio gene prediction software due to a higher sensitivity and precision for all gene model components. Applicability of the method was shown by revisiting the annotations of six different fungi, using gene loci from up to 29 fungal genomes as informants. Between 7,231 and 8,337 genes were assessed by ABFGP and for each genome between 1,724 and 3,505 gene model revisions were proposed. The reliability of the proposed gene models is assessed by an a posteriori introspection procedure of each intron and exon in the multiple gene model alignment. The total number and type of proposed gene model revisions in the six fungal genomes is correlated to the quality of the genome assembly, and to sequencing strategies used in the sequencing centre, highlighting different types of errors in different annotation pipelines. The ABFGP method is particularly successful in discovering sequence errors and/or disruptive mutations causing truncated and erroneous gene models. CONCLUSIONS: The ABFGP method is an accurate and fully automated quality control method for fungal gene catalogues that can be easily implemented into existing annotation pipelines. With the exponential release of new genomes, the ABFGP method will help decreasing the number of gene models that require additional manual curation

    Automatic Figure Ranking and User Interfacing for Intelligent Figure Search

    Get PDF
    Figures are important experimental results that are typically reported in full-text bioscience articles. Bioscience researchers need to access figures to validate research facts and to formulate or to test novel research hypotheses. On the other hand, the sheer volume of bioscience literature has made it difficult to access figures. Therefore, we are developing an intelligent figure search engine (http://figuresearch.askhermes.org). Existing research in figure search treats each figure equally, but we introduce a novel concept of "figure ranking": figures appearing in a full-text biomedical article can be ranked by their contribution to the knowledge discovery.We empirically validated the hypothesis of figure ranking with over 100 bioscience researchers, and then developed unsupervised natural language processing (NLP) approaches to automatically rank figures. Evaluating on a collection of 202 full-text articles in which authors have ranked the figures based on importance, our best system achieved a weighted error rate of 0.2, which is significantly better than several other baseline systems we explored. We further explored a user interfacing application in which we built novel user interfaces (UIs) incorporating figure ranking, allowing bioscience researchers to efficiently access important figures. Our evaluation results show that 92% of the bioscience researchers prefer as the top two choices the user interfaces in which the most important figures are enlarged. With our automatic figure ranking NLP system, bioscience researchers preferred the UIs in which the most important figures were predicted by our NLP system than the UIs in which the most important figures were randomly assigned. In addition, our results show that there was no statistical difference in bioscience researchers' preference in the UIs generated by automatic figure ranking and UIs by human ranking annotation.The evaluation results conclude that automatic figure ranking and user interfacing as we reported in this study can be fully implemented in online publishing. The novel user interface integrated with the automatic figure ranking system provides a more efficient and robust way to access scientific information in the biomedical domain, which will further enhance our existing figure search engine to better facilitate accessing figures of interest for bioscientists

    Reranking candidate gene models with cross-species comparison for improved gene prediction

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Most gene finders score candidate gene models with state-based methods, typically HMMs, by combining local properties (coding potential, splice donor and acceptor patterns, etc). Competing models with similar state-based scores may be distinguishable with additional information. In particular, functional and comparative genomics datasets may help to select among competing models of comparable probability by exploiting features likely to be associated with the correct gene models, such as conserved exon/intron structure or protein sequence features.</p> <p>Results</p> <p>We have investigated the utility of a simple post-processing step for selecting among a set of alternative gene models, using global scoring rules to rerank competing models for more accurate prediction. For each gene locus, we first generate the <it>K </it>best candidate gene models using the gene finder Evigan, and then rerank these models using comparisons with putative orthologous genes from closely-related species. Candidate gene models with lower scores in the original gene finder may be selected if they exhibit strong similarity to probable orthologs in coding sequence, splice site location, or signal peptide occurrence. Experiments on <it>Drosophila melanogaster </it>demonstrate that reranking based on cross-species comparison outperforms the best gene models identified by Evigan alone, and also outperforms the comparative gene finders GeneWise and Augustus+.</p> <p>Conclusion</p> <p>Reranking gene models with cross-species comparison improves gene prediction accuracy. This straightforward method can be readily adapted to incorporate additional lines of evidence, as it requires only a ranked source of candidate gene models.</p

    Comparative genomics of Dothideomycete fungi

    Get PDF
    Fungi are a diverse group of eukaryotic micro-organisms particularly suited for comparative genomics analyses. Fungi are important to industry, fundamental science and many of them are notorious pathogens of crops, thereby endangering global food supply. Dozens of fungi have been sequenced in the last decade and with the advances of the next generation sequencing, thousands of new genome sequences will become available in coming years. In this thesis I have used bioinformatics tools to study different biological and evolutionary processes in various genomes with a focus on the genomes of the Dothideomycetefungi Cladosporium fulvum, Dothistroma septosporumand Zymoseptoria tritici. Chapter 1introduces the scientific disciplines of mycology and bioinformatics from a historical perspective. It exemplifies a typical whole-genome sequence analysis of a fungal genome, and focusses in particular on structural gene annotation and detection of transposable elements. In addition it shortly reviews the microRNA pathway as known in animal and plants in the context of the putative existence of similar yet subtle different small RNA pathways in other branches of the eukaryotic tree of life. Chapter 2addresses the novel sequenced genomes of the closely related Dothideomyceteplant pathogenic fungi Cladosporium fulvumand Dothistroma septosporum. Remarkably, it revealed occurrence of a surprisingly high similarity at the protein level combined with striking differences at the DNA level, gene repertoire and gene expression. Most noticeably, the genome of C. fulvumappears to be at least twice as large, which is solely attributable to a much larger content in repetitive sequences. Chapter 3describes a novel alignment-based fungal gene prediction method (ABFGP) that is particularly suitable for plastic genomes like those of fungi. It shows excellent performance benchmarked on a dataset of 7,000 unigene-supported gene models from ten different fungi. Applicability of the method was shown by revisiting the annotations of C. fulvumand D. septosporumand of various other fungal genomes from the first-generation sequencing era. Thousands of gene models were revised in each of the gene catalogues, indeed revealing a correlation to the quality of the genome assembly, and to sequencing strategies used in the sequencing centres, highlighting different types of errors in different annotation pipelines. Chapter 4focusses on the unexpected high number of gene models that were identified by ABFGP that align nicely to informant genes, but only upon toleration of frame shifts and in-frame stop-codons. These discordances could represent sequence errors (SEs) and/or disruptive mutations (DMs) that caused these truncated and erroneous gene models. We revisited the same fungal gene catalogues as in chapter 3, confirmed SEs by resequencing and successively removed those, yielding a high-confidence and large dataset of nearly 1,000 pseudogenes caused by DMs. This dataset of fungal pseudogenes, containing genes listed as bona fide genes in current gene catalogues, does not correspond to various observations previously done on fungal pseudogenes. Moreover, the degree of pseudogenization showing up to a ten-fold variation for the lowest versus the highest affected species, is generally higher in species that reproduce asexually compared to those that in addition reproduce sexually. Chapter 5describes explorative genomics and comparative genomics analyses revealing the presence of introner-like elements (ILEs) in various Dothideomycetefungi including Zymoseptoria triticiin which they had not identified yet, although its genome sequence is already publicly available for several years. ILEs combine hallmark intron properties with the apparent capability of multiplying themselves as repetitive sequence. ILEs strongly associate with events of intron gain, thereby delivering in silico proof of their mobility. Phylogenetic analyses at the intra- and inter-species level showed that most ILEs are related and likely share common ancestry. Chapter 6provides additional evidence that ILE multiplication strongly dominates over other types of intron duplication in fungi. The observed high rate of ILE multiplication followed by rapid sequence degeneration led us to hypothesize that multiplication of ILEs has been the major cause and mechanism of intron gain in fungi, and we speculate that this could be generalized to all eukaryotes. Chapter 7describes a new strategy for miRNA hairpin prediction using statistical distributions of observed biological variation of properties (descriptors) of known miRNA hairpins. We show that the method outperforms miRNA prediction by previous, conventional methods that usually apply threshold filtering. Using this method, several novel candidate miRNAs were assigned in the genomes of Caenorhabditis elegansand two human viruses. Although this chapter is not applied on fungi, the study does provide a flexible method to find evidence for existence of a putative miRNA-like pathway in fungi. Chapter 8provides a general discussion on the advent of bioinformatics in mycological research and its implications. It highlights the necessity of a prioriplanning and integration of functional analysis and bioinformatics in order to achieve scientific excellence, and describes possible scenarios for the near future of fungal (comparative) genomics research. Moreover, it discusses the intrinsic error rate in large-scale, automatically inferred datasets and the implications of using and comparing those.</p

    Comparative genomics of Dothideomycete fungi

    Get PDF
    Fungi are a diverse group of eukaryotic micro-organisms particularly suited for comparative genomics analyses. Fungi are important to industry, fundamental science and many of them are notorious pathogens of crops, thereby endangering global food supply. Dozens of fungi have been sequenced in the last decade and with the advances of the next generation sequencing, thousands of new genome sequences will become available in coming years. In this thesis I have used bioinformatics tools to study different biological and evolutionary processes in various genomes with a focus on the genomes of the Dothideomycetefungi Cladosporium fulvum, Dothistroma septosporumand Zymoseptoria tritici. Chapter 1introduces the scientific disciplines of mycology and bioinformatics from a historical perspective. It exemplifies a typical whole-genome sequence analysis of a fungal genome, and focusses in particular on structural gene annotation and detection of transposable elements. In addition it shortly reviews the microRNA pathway as known in animal and plants in the context of the putative existence of similar yet subtle different small RNA pathways in other branches of the eukaryotic tree of life. Chapter 2addresses the novel sequenced genomes of the closely related Dothideomyceteplant pathogenic fungi Cladosporium fulvumand Dothistroma septosporum. Remarkably, it revealed occurrence of a surprisingly high similarity at the protein level combined with striking differences at the DNA level, gene repertoire and gene expression. Most noticeably, the genome of C. fulvumappears to be at least twice as large, which is solely attributable to a much larger content in repetitive sequences. Chapter 3describes a novel alignment-based fungal gene prediction method (ABFGP) that is particularly suitable for plastic genomes like those of fungi. It shows excellent performance benchmarked on a dataset of 7,000 unigene-supported gene models from ten different fungi. Applicability of the method was shown by revisiting the annotations of C. fulvumand D. septosporumand of various other fungal genomes from the first-generation sequencing era. Thousands of gene models were revised in each of the gene catalogues, indeed revealing a correlation to the quality of the genome assembly, and to sequencing strategies used in the sequencing centres, highlighting different types of errors in different annotation pipelines. Chapter 4focusses on the unexpected high number of gene models that were identified by ABFGP that align nicely to informant genes, but only upon toleration of frame shifts and in-frame stop-codons. These discordances could represent sequence errors (SEs) and/or disruptive mutations (DMs) that caused these truncated and erroneous gene models. We revisited the same fungal gene catalogues as in chapter 3, confirmed SEs by resequencing and successively removed those, yielding a high-confidence and large dataset of nearly 1,000 pseudogenes caused by DMs. This dataset of fungal pseudogenes, containing genes listed as bona fide genes in current gene catalogues, does not correspond to various observations previously done on fungal pseudogenes. Moreover, the degree of pseudogenization showing up to a ten-fold variation for the lowest versus the highest affected species, is generally higher in species that reproduce asexually compared to those that in addition reproduce sexually. Chapter 5describes explorative genomics and comparative genomics analyses revealing the presence of introner-like elements (ILEs) in various Dothideomycetefungi including Zymoseptoria triticiin which they had not identified yet, although its genome sequence is already publicly available for several years. ILEs combine hallmark intron properties with the apparent capability of multiplying themselves as repetitive sequence. ILEs strongly associate with events of intron gain, thereby delivering in silico proof of their mobility. Phylogenetic analyses at the intra- and inter-species level showed that most ILEs are related and likely share common ancestry. Chapter 6provides additional evidence that ILE multiplication strongly dominates over other types of intron duplication in fungi. The observed high rate of ILE multiplication followed by rapid sequence degeneration led us to hypothesize that multiplication of ILEs has been the major cause and mechanism of intron gain in fungi, and we speculate that this could be generalized to all eukaryotes. Chapter 7describes a new strategy for miRNA hairpin prediction using statistical distributions of observed biological variation of properties (descriptors) of known miRNA hairpins. We show that the method outperforms miRNA prediction by previous, conventional methods that usually apply threshold filtering. Using this method, several novel candidate miRNAs were assigned in the genomes of Caenorhabditis elegansand two human viruses. Although this chapter is not applied on fungi, the study does provide a flexible method to find evidence for existence of a putative miRNA-like pathway in fungi. Chapter 8provides a general discussion on the advent of bioinformatics in mycological research and its implications. It highlights the necessity of a prioriplanning and integration of functional analysis and bioinformatics in order to achieve scientific excellence, and describes possible scenarios for the near future of fungal (comparative) genomics research. Moreover, it discusses the intrinsic error rate in large-scale, automatically inferred datasets and the implications of using and comparing those.</p
    corecore