Background: Advances in sequencing technologies challenge the efficient importing and validation of FASTA formatted sequence data, which is still a prerequisite for most bioinformatic tools and pipelines. Comparative analysis of commonly used Bio*-frameworks (BioPerl, BioJava and Biopython) shows that their scalability and accuracy are hampered. Findings: FastaValidator represents a platform-independent, standardized, light-weight software library written in the Java programming language. It targets computer scientists and bioinformaticians writing software that needs to parse large amounts of sequence data quickly and accurately. For end-users FastaValidator includes an interactive out-of-the-box validation of FASTA formatted files, as well as a non-interactive mode designed for high-throughput validation in software pipelines. Conclusions: The accuracy and performance of the FastaValidator library qualify it for large data sets such as those commonly produced by massively parallel (NGS) technologies. It offers scientists a fast, accurate and standardized method for parsing and validating FASTA formatted sequence data.

Gerken, J.

Glöckner, F.

Hankeln, W.

Schweer, T.

Waldmann, J.

English

MPG.PuRe

Waldmann et al. BMC Research Notes 2014, 7:365
http://www.biomedcentral.com/1756-0500/7/365

TECHNICAL NOTE Open Access

FastaValidator: an open-source Java library to parse and validate FASTA formatted sequences

Jost Waldmann1,2†, Jan Gerken1,2†, Wolfgang Hankeln3†, Timmy Schweer1 and Frank Oliver Glöckner1,2*

Abstract

Background: Advances in sequencing technologies challenge the efficient importing and validation of FASTA formatted sequence data, which is still a prerequisite for most bioinformatic tools and pipelines. Comparative analysis of commonly used Bio*-frameworks (BioPerl, BioJava and Biopython) shows that their scalability and accuracy are hampered.

Findings: FastaValidator represents a platform-independent, standardized, light-weight software library written in the Java programming language. It targets computer scientists and bioinformaticians writing software that needs to parse large amounts of sequence data quickly and accurately. For end-users FastaValidator includes an interactive out-of-the-box validation of FASTA formatted files, as well as a non-interactive mode designed for high-throughput validation in software pipelines.

Conclusions: The accuracy and performance of the FastaValidator library qualify it for large data sets such as those commonly produced by massively parallel (NGS) technologies. It offers scientists a fast, accurate and standardized method for parsing and validating FASTA formatted sequence data.

Keywords: FASTA, Data validation, High-throughput

Findings

Background

The introduction of the first DNA sequencing methods [1] established the discipline of bioinformatics with sequences as the primary source of data. With the advent of massively parallel “Next Generation Sequencing (NGS)” technologies [2] the speed of sequence production has now reached petabytes per year. The FASTA format was introduced alongside the first algorithms and tools for biological sequence analysis [3,4].
It defines how sequences are formatted and exchanged in a simple human-readable layout. Today, the FASTA format is the de facto standard for exchanging sequence data between bioinformatic tools. Several common frameworks exist that offer FASTA sequence import and validation [5]. In terms of functionality, many of these frameworks are rather complex and not designed for high-volume FASTA parsing and validation. Another common approach is the implementation of custom solutions. These often have problems recognizing system-specific line endings (Unix, Microsoft, Apple), invalid characters, or even semantically incorrect data. This leads to serious problems in data processing, up to and including invalid results. Furthermore, the focus of bioinformatics has shifted towards (web-based) pipelines that perform a range of consecutive tasks to analyze sequence data. Therefore, easy integration of FASTA import and validation functionality into larger software pipelines or workflows is becoming a common request. To address issues of parsing, validation, integration, scalability and performance, we present the light-weight, open-source FastaValidator library written in Java, which parses and validates sequences in FASTA format. The implementation in the platform-independent Java programming language assures broad usage and easy integration into bioinformatic software and pipelines. The performance of the library in comparison to state-of-the-art frameworks has been evaluated and the ease of integration into web projects has been demonstrated.

*Correspondence: fgloeckn@mpi-bremen.de
†Equal contributors
1Microbial Genomics and Bioinformatics Research Group, Max Planck Institute for Marine Microbiology, Celsiusstrasse 1, 28359 Bremen, Germany
2Jacobs University Bremen gGmbH, Campusring 1, 28759 Bremen, Germany
Full list of author information is available at the end of the article

© 2014 Waldmann et al.; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Implementation

The FastaValidator library implements the IUPAC specifications [6-8] extended by letters necessary to parse aligned sequences (space, dash, dot, asterisk). Based on these specifications four parsing modes are implemented: (1) A universal mode that parses and validates any (multi)FASTA file comprising the nucleotide and amino acid alphabets. (2) A DNA mode, which parses and validates only DNA nucleotide sequences. (3) An RNA mode, which parses and validates only RNA nucleotide sequences. (4) A Protein mode, which parses and validates only amino acid sequences. To implement the FastaValidator library for high performance, well established techniques from compiler construction have been used. A lexical analyzer (lexer) to parse and syntactically validate the FASTA format was generated using the JFlex scanner generator. The lexer first transforms all characters of a given FASTA file into syntactically correct tokens. The parsing mode defines the allowed characters accepted by the lexer. In a second step the correct semantic order of these tokens is validated (e.g. the header must be followed by a comment or sequence). If a FASTA file contains only correct tokens in the right order, it is valid. For every token (end of file (EOF), header-, comment- or sequence line) an event is generated and lines can be transformed into user defined data structures.
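
The two validation steps described above (per-mode character checking, then token-order checking) can be illustrated with a minimal sketch. It is not the FastaValidator API and it does not use JFlex; it only mimics the idea with plain Java. The class FastaSketch, the Mode and Token enums, the exact letter sets, the treatment of lines starting with ';' as comments, the case-insensitive matching and the skipping of blank lines are all assumptions made for illustration. The authoritative alphabets are those of the cited IUPAC recommendations and the library source.

```java
/** Illustrative sketch only; this is NOT the FastaValidator API. */
final class FastaSketch {

    /** Letters for aligned sequences as listed in the text: space, dash, dot, asterisk. */
    static final String ALIGNMENT = " -.*";

    /** Hypothetical parsing modes; the letter sets are assumptions for illustration. */
    enum Mode {
        DNA("ACGTRYSWKMBDHVN"),
        RNA("ACGURYSWKMBDHVN"),
        PROTEIN("ACDEFGHIKLMNPQRSTVWYBXZ"),
        UNIVERSAL("ACDEFGHIKLMNPQRSTVWYBXZU");

        private final String letters;

        Mode(String letters) {
            this.letters = letters + ALIGNMENT;
        }

        /** Assumption: matching is case-insensitive. */
        boolean allows(char c) {
            return letters.indexOf(Character.toUpperCase(c)) >= 0;
        }
    }

    /** Line-level tokens, following the token kinds named in the text. */
    enum Token { HEADER, COMMENT, SEQUENCE }

    /** Step 1: classify a line; sequence lines are checked against the mode's letters. */
    static Token tokenize(String line, Mode mode, int lineNo) {
        if (line.startsWith(">")) return Token.HEADER;
        if (line.startsWith(";")) return Token.COMMENT;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (!mode.allows(c)) {
                throw new IllegalArgumentException(
                        "line " + lineNo + ": invalid character '" + c + "'");
            }
        }
        return Token.SEQUENCE;
    }

    /**
     * Step 2: check the order of tokens and count the accepted sequence
     * characters (alignment characters included). Throws on invalid input.
     */
    static long validate(String fasta, Mode mode) {
        // Accept Windows (\r\n), Unix (\n) and classic Mac (\r) line endings.
        String[] lines = fasta.split("\\r\\n|\\r|\\n");
        Token previous = null;
        long count = 0;
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i];
            if (line.isEmpty()) continue; // assumption: blank lines are ignored
            Token token = tokenize(line, mode, i + 1);
            boolean ok;
            switch (token) {
                case HEADER:  // a header starts the file or follows sequence data
                    ok = previous == null || previous == Token.SEQUENCE;
                    break;
                case COMMENT: // a comment follows a header or another comment
                    ok = previous == Token.HEADER || previous == Token.COMMENT;
                    break;
                default:      // sequence data needs a header somewhere before it
                    ok = previous != null;
                    break;
            }
            if (!ok) {
                throw new IllegalArgumentException(
                        "line " + (i + 1) + ": unexpected " + token);
            }
            if (token == Token.SEQUENCE) count += line.length();
            previous = token;
        }
        if (previous != Token.SEQUENCE) {
            throw new IllegalArgumentException("input ends without sequence data");
        }
        return count;
    }
}
```

For example, FastaSketch.validate(">seq1\nACGT\nNNAC\n", FastaSketch.Mode.DNA) returns 8, while the same input in RNA mode is rejected because T is not in the assumed RNA letter set. The sketch holds the whole input in a String for brevity; a parser meant for multi-gigabyte files would read from a stream instead.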
To compile the FastaValidator from the source code, Java 1.5 or higher, JFlex 1.4.3 or higher (http://www.jflex.de) and Ant 1.8 or higher (http://ant.apache.org) are required.

Performance tests

Automated evaluation tests were carried out on a standard desktop PC (Intel Core i5, 3 GHz, 16 GB RAM) running the 64-bit server version of Ubuntu Linux 12.04. For performance comparison all tests were run with BioJava 3.0.7 (http://biojava.org), Biopython 1.63 (http://biopython.org) and BioPerl 1.6.9 (http://www.bioperl.org). The underlying test environments were OpenJDK 1.7.0_25 for FastaValidator and BioJava, Python 2.7.3 and PyPy 2.2.1 for Biopython, and Perl 5.14.2 for BioPerl.

Six different data sets were used as input data: (A) all protein sequences of Escherichia coli K-12 [9], (B) the complete genome of Escherichia coli K-12 [9], (C) all protein sequences of the SWISSPROT database as of December 2013 [10], (D) one metagenomic sequence set from a sampling site of the Global Ocean Sampling Expedition (JCVI_SMPL_1103283000001) [11], (E) the unaligned rRNA gene sequences of the SILVA database (SILVA release 115, SSU Parc) [12] and (F) the aligned sequences of the SILVA SSU reference database (SILVA release 115, SSU Ref NR) [12].

[Figure 1: bar charts of validation time in seconds, one panel per data set.
A: Escherichia coli K-12 genes (amino acid, 4,146 entries, 1.8 MB)
B: Escherichia coli K-12 genome (DNA, 1 entry, 4.5 MB)
C: UniProtKB/Swiss-Prot 2013_12 (amino acid, 541,954 entries, 248 MB)
D: GOS Sampling Site JCVI_SMPL_1103283000001 (DNA, 644,551 entries, 1.1 GB)
E: SILVA 115 SSU Parc (RNA, 3,808,884 entries, 3.4 GB)
F: SILVA 115 SSU Ref NR (RNA, aligned, 479,726 entries, 21 GB)
Legend: Biojava (OpenJDK 1.7.0_25); Bioperl (Perl 5.14.2); Biopython (PyPy 2.2.1); Biopython (Python 2.7.3); FastaValidator (generic parser; OpenJDK 1.7.0_25); FastaValidator (amino acid/DNA/RNA parser; OpenJDK 1.7.0_25)]

Figure 1 Comparison of validation performance of three bioinformatic frameworks and FastaValidator. Validation performance of three bioinformatic frameworks in comparison to FastaValidator with six different data sets. (A) all protein sequences of Escherichia coli K-12, (B) the complete genome of Escherichia coli K-12, (C) the protein sequences of the SWISSPROT database, (D) the metagenomic sequence set from the sampling site 1103283000001 of the Global Ocean Sampling Expedition, (E) the unaligned complete rRNA genes of the SILVA database and (F) the aligned sequences of the SILVA SSU reference database. Missing bars indicate that the corresponding test failed.

As test scenario the counting of valid letters in the input sequence data was chosen. This included the validation of the input data. Where necessary, the original parsers of the Bio*-frameworks were extended by a few lines of code to perform this validation step based on the available letter alphabets of the respective frameworks. The overall constraint for these extensions was to keep the changes as small as possible to minimize the influence on the original performance. Each test was performed ten times. The test scripts as well as the raw results are available on the project’s website.

FastaValidatorUI

For end-users who do not intend to write their own software the FastaValidatorUI (User Interface) can be downloaded from the project website. It is a platform-independent Java application built on top of the FastaValidator library. With its two modes, command-line and graphical user interface, it can directly be used for high-throughput pipelines as well as for interactive validation without any programming knowledge.
The sources of FastaValidatorUI show how the FastaValidator library can be integrated into self-written tools. They are located in the demo directory of the FastaValidator source code repository.

Results and discussion

The results in Figure 1 show that FastaValidator is on average the fastest validating parser and that it performs especially well on high-volume sequence data sets. Although the other frameworks tested include models for the different sequence letter alphabets in their design, most of them did not use these models properly in their implementations of the FASTA parser. Depending on the input sequence data, the insufficient validation by these frameworks might ultimately lead to invalid sequences, which can cause serious problems in downstream processing or even lead to wrong results. Aligned sequences could only be parsed successfully by BioPerl and FastaValidator, because the modeled alphabets of BioJava and Biopython lack dots, which are commonly found in aligned sequences. Although not used for validation, some of the frameworks are capable of auto-detecting the alphabet, and thereby the type, of an unknown input sequence. These methods cannot be considered robust, because the amino acid and DNA letter alphabets overlap, especially when ambiguities are included.

Conclusions

The accuracy and performance of the FastaValidator library qualify it for large data sets such as those commonly produced by massively parallel (NGS) technologies. The ease of integrating FastaValidator into (web-based) software pipelines and its efficiency are demonstrated in the open-source project CDinFusion [13] and the SILVAngs high-throughput data analysis service for ribosomal RNA gene sequence data (https://www.arb-silva.de/ngs/). For end-users interested in validating their sequence data, the ready-to-use FastaValidatorUI can be downloaded from the project’s website.
In summary, FastaValidator offers scientists a fast, accurate and standardized method for parsing and validating FASTA formatted sequence data.

Availability and requirements

Project name: FastaValidator.
Project home page: http://www.megx.net/FastaValidator
Source code repository: https://github.com/jwaldman/FastaValidator
Operating system(s): Platform-independent.
Programming language: Java.
Other requirements (pre-built): Java 1.5 or higher.
Other requirements (build from scratch): Java 1.5 or higher, JFlex 1.4.3 or higher, Ant 1.8 or higher.
License: Lesser GPL 3 (LGPL 3).
Any restrictions to use by non-academics: None.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

JW and JG designed and implemented the software library and planned and executed the performance tests. WH and JW drafted the manuscript. TS participated in the test script implementation and execution. FOG revised the manuscript critically. All authors read and approved the final manuscript.

Acknowledgements

The authors would like to thank Dr. Ivaylo Kostadinov and Dr. Jörg Peplies for their helpful feedback, and all colleagues who have used the software so far for their input. The software was developed with financial support from the Max Planck Society. Any views expressed here are those of the authors and not necessarily those of the funder.

Author details

1Microbial Genomics and Bioinformatics Research Group, Max Planck Institute for Marine Microbiology, Celsiusstrasse 1, 28359 Bremen, Germany.
2Jacobs University Bremen gGmbH, Campusring 1, 28759 Bremen, Germany.
3Mediomix GmbH, Eupener Straße 139, 50933 Köln, Germany.

Received: 22 January 2014 Accepted: 10 June 2014 Published: 14 June 2014

References

1. Sanger F, Nicklen S, Coulson A: DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 1977, 74(12):5463–5467.
2. Mardis ER: Next-generation DNA sequencing methods. Ann Rev Genomics Hum Genet 2008, 9:387–402.
3. Lipman D, Pearson W: Rapid and sensitive protein similarity searches. Science 1985, 227(4693):1435–1441.
4. Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc Natl Acad Sci 1988, 85(8):2444–2448.
5. Mangalam H: The Bio* toolkits–a brief overview. Brief Bioinform 2002, 3(3):296–302.
6. Cornish-Bowden A: Nomenclature for incompletely specified bases in nucleic acid sequences recommendations. Nucleic Acids Res 1985, 13(9):3021–3030.
7. IUPAC-IUB-JCBN: IUPAC-IUB Joint commission on biochemical nomenclature (JCBN). Nomenclature and symbolism for amino acids and peptides. Recommendations 1983. Eur J Biochem 1984, 138(1):9–37.
8. IUPAC-IUB-JCBN: IUPAC-IUB Joint commission on biochemical nomenclature (JCBN). Nomenclature and symbolism for amino acids and peptides. Corrections to recommendations 1983. Eur J Biochem 1993, 213(1):2.
9. Riley M, Abe T, Arnaud M, Berlyn M, Blattner F, Chaudhuri R, Glasner J, Horiuchi T, Keseler I, Kosuge T, Mori H, Perna N, Plunkett Gr, Rudd K, Serres M, Thomas G, Thomson N, Wishart D, Wanner B: Escherichia coli K-12: a cooperatively developed annotation snapshot–2005. Nucleic Acids Res 2006, 34(1):1–9.
10. Apweiler R, Bairoch A, Wu CH: Protein sequence databases. Curr Opin Chem Biol 2004, 8(1):76–80.
11. Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, Remington K, Eisen JA, Heidelberg KB, Manning G, Li W, Jaroszewski L, Cieplak P, Miller CS, Li H, Mashiyama ST, Joachimiak MP, van Belle C, Chandonia JM, Soergel DA, Zhai Y, Natarajan K, Lee S, Raphael BJ, Bafna V, Friedman R, Brenner SE, Godzik A, Eisenberg D, Dixon JE, Taylor SS, et al.: The sorcerer II global ocean sampling expedition: expanding the universe of protein families. PLoS Biol 2007, 5(3):16.
12. Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Glöckner F: The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res 2013, 41(D1):590–596.
13. Hankeln W, Wendel N, Gerken J, Waldmann J, Buttigieg P, Kostadinov I, Kottmann R, Yilmaz P, Glöckner F: CDinFusion - submission-ready, on-line integration of sequence and contextual data. PLoS ONE 2011, 6(9):24797.

doi:10.1186/1756-0500-7-365
Cite this article as: Waldmann et al.: FastaValidator: an open-source Java library to parse and validate FASTA formatted sequences. BMC Research Notes 2014 7:365.

FastaValidator: an open-source Java library to parse and validate FASTA formatted sequences

http://hdl.handle.net/21.11116/0000-0006-E60D-9
