TriAnnot: A Versatile and High Performance Pipeline for the Automated Annotation of Plant Genomes

By Philippe Leroy, Nicolas Guilhot, Hiroaki Sakai, Aurélien Bernard, Frédéric Choulet, Sébastien Theil, Sébastien Reboux, Naoki Amano, Timothée Flutre, Céline Pelegrin, Hajime Ohyanagi, Michael Seidel, Franck Giacomoni, Mathieu Reichstadt, Michael Alaux, Emmanuelle Gicquello, Fabrice Legeai, Lorenzo Cerutti, Hisataka Numa, Tsuyoshi Tanaka, Klaus Mayer, Takeshi Itoh, Hadi Quesneville and Catherine Feuillet

Abstract

In support of the international effort to obtain a reference sequence of the bread wheat genome and to provide plant communities dealing with large and complex genomes with a versatile, easy-to-use online automated tool for annotation, we have developed the TriAnnot pipeline. Its modular architecture allows for the annotation and masking of transposable elements, the structural and functional annotation of protein-coding genes with an evidence-based quality indexing, and the identification of conserved non-coding sequences and molecular markers. The TriAnnot pipeline is parallelized on a 712-CPU computing cluster that can run a 1-Gb sequence annotation in less than 5 days. It is accessible through a web interface for small-scale analyses or through a server for large-scale annotations. The performance of TriAnnot was evaluated in terms of sensitivity, specificity, and general fitness using curated reference sequence sets from rice and wheat. In less than 8 h, TriAnnot was able to predict more than 83% of the 3,748 CDS from rice chromosome 1 with a fitness of 67.4%. On a set of 12 reference Mb-sized contigs from wheat chromosome 3B, TriAnnot predicted and annotated 93.3% of the genes, among which 54% were perfectly identified in accordance with the reference annotation. It also allowed the curation of 12 genes based on new biological evidence, increasing the percentage of perfect gene prediction to 63%. TriAnnot systematically showed a higher fitness than other annotation pipelines that are not improved for wheat. As it is easily adaptable to the annotation of other plant genomes, TriAnnot should become a useful resource for the annotation of large and complex genomes in the future.
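
The sensitivity and specificity figures quoted in the abstract can be illustrated with a minimal sketch. It assumes the convention common in gene prediction benchmarks, where sensitivity is the share of reference gene models that were predicted exactly and specificity is the share of predicted gene models that match the reference exactly. The abstract does not define these measures or the "fitness" score, so the function name, the `(start, end, strand)` tuples, and the toy coordinates below are hypothetical, and fitness is deliberately not computed.

```python
def evaluate_predictions(reference, predicted):
    """Compare two sets of gene models by exact match.

    Assumed convention (not taken from the abstract):
      sensitivity = TP / (TP + FN)
      specificity = TP / (TP + FP)
    """
    true_pos = len(reference & predicted)    # models found in both sets
    false_neg = len(reference - predicted)   # reference models that were missed
    false_pos = len(predicted - reference)   # predictions absent from the reference

    sensitivity = true_pos / (true_pos + false_neg) if reference else 0.0
    specificity = true_pos / (true_pos + false_pos) if predicted else 0.0
    return sensitivity, specificity


# Hypothetical gene models as (start, end, strand); not data from the paper.
reference = {(100, 900, "+"), (1500, 2400, "-"), (3000, 4200, "+"), (5000, 5600, "+")}
predicted = {(100, 900, "+"), (1500, 2400, "-"), (3000, 4100, "+")}

sensitivity, specificity = evaluate_predictions(reference, predicted)
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
```

In this toy case two of the four reference models are recovered exactly and one of the three predictions is wrong, so a prediction with a slightly shifted boundary counts as both a missed gene and a false prediction. That strictness is why a "perfectly identified" rate can sit well below the overall detection rate.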

Topics: Plant Science
Publisher: Frontiers Research Foundation
OAI identifier: oai:pubmedcentral.nih.gov:3355818
Provided by: PubMed Central
