BG7: A New Approach for Bacterial Genome Annotation Designed for Next Generation Sequencing Data
<div><p>BG7 is a new system for de novo bacterial, archaeal and viral genome annotation, based on an approach specifically designed for annotating genomes sequenced with next generation sequencing technologies. The system is versatile and able to annotate genes even at the preliminary assembly stage of the genome. It is especially efficient at detecting unexpected genes horizontally acquired from distant bacterial or archaeal genomes, phages, plasmids, and mobile elements. From the initial phases of the gene annotation process, BG7 exploits the massive availability of annotated protein sequences in databases. BG7 predicts ORFs and infers their function based on protein similarity with a wide set of reference proteins, integrating the ORF prediction and functional annotation phases in a single step. BG7 is especially tolerant to sequencing errors in start and stop codons, to frameshifts, and to assembly or scaffolding errors. The system is also tolerant to the high level of gene fragmentation frequently found in incompletely assembled genomes. The current version of BG7, which is developed in Java, takes advantage of Amazon Web Services (AWS) cloud computing features, but it can also be run locally on any operating system. BG7 is a fast, automated and scalable system that can cope with the challenge of analyzing the huge number of genomes being sequenced with NGS technologies. Its capabilities and efficiency were demonstrated in the 2011 EHEC outbreak in Germany, in which BG7 was used to obtain the first annotations the day after the first entero-hemorrhagic <i>E. coli</i> genome sequences were made publicly available. The suitability of BG7 for genome annotation has been demonstrated for Illumina, 454, Ion Torrent, and PacBio sequencing technologies. Moreover, thanks to its plasticity, our system could easily be adapted to work with new technologies in the future.</p></div>
Pipeline of BG7.
<p>Java programs are represented by blue ellipses, quality control programs by green trapezoids, and the blue cylinders connect the programs that provide the final results in different formats.</p>
Number of genes predicted for <i>E. coli</i> K12.
<p>Number of genes predicted for <i>E. coli</i> K12.</p>
BG7 detection of NCBI <i>E. coli</i> K12 genes.
<p>BG7 detection of NCBI <i>E. coli</i> K12 genes.</p>
BG7 annotation in different states of completion and error rate of the <i>E. coli</i> O104:H4 TY-2482 genome.
<p>False positive and false negative genes in the BG7 annotation were detected with reference to the genes predicted by the Broad Institute in the annotation available at the “Escherichia coli O104:H4 Sequencing Project”, Broad Institute of Harvard and MIT (<a href="http://www.broadinstitute.org/" target="_blank">http://www.broadinstitute.org/</a>). The gene sequences were downloaded on 20-Aug-2012 from: <a href="http://www.broadinstitute.org/annotation/genome/Ecoli_O104_H4/FeatureSearch.html" target="_blank">http://www.broadinstitute.org/annotation/genome/Ecoli_O104_H4/FeatureSearch.html</a>. We used BLASTN between the nucleotide sequences of the BG7 predicted genes and those from the Broad annotation. The graph shows that the number of genes not detected by BG7 (false negatives) is very similar in two very different states of genome assembly with very different sequence error rates.</p>
ExtraTrain: a database of Extragenic regions and Transcriptional information in prokaryotic organisms-3
<p><b>Copyright information:</b></p><p>Taken from "ExtraTrain: a database of Extragenic regions and Transcriptional information in prokaryotic organisms"</p><p>BMC Microbiology 2006;6:29.</p><p>Published online 15 Mar 2006</p><p>PMCID:PMC1453763.</p><p>Copyright © 2006 Pareja et al; licensee BioMed Central Ltd.</p>ern" button we obtained the positions of this motif in the set of selected extragenic sequences. Table 3 is the copy of the complete content of this window
ExtraTrain: a database of Extragenic regions and Transcriptional information in prokaryotic organisms-2
present more differences but the 17 extragenic sequences conserve the palindromic motif TAC - -ACA- - -
ExtraTrain: a database of Extragenic regions and Transcriptional information in prokaryotic organisms-0
d to the "working set". For extragenic sequences 8, 10, 13, 16 and 17, the check-box for obtaining the reverse complement sequence has been marked. Thus, the 17 upstream extragenic sequences are equally oriented with regard to the start points of the genes. By clicking on the "FASTA SEQUENCES" button, the user obtains the extragenic sequences in FASTA format. By clicking on the "PALINSIGHT" button, the user sends the sequences to the Palinsight viewer
ExtraTrain: a database of Extragenic regions and Transcriptional information in prokaryotic organisms-1
oding AcrR similar proteins. The same palindrome is conserved for all and sequences (extragenic sequences 1–6). Another slightly different palindrome is conserved in and (extragenic sequences 7–10)
Nispero: a cloud-computing based Scala tool specially suited for bioinformatics data processing
<p>It is now widely accepted that bioinformatics data analysis is a real bottleneck in many life sciences research activities. High-throughput technologies such as Next Generation Sequencing (NGS) have completely reshaped the biology and bioinformatics landscape. NGS has enabled important progress in many life sciences fields, but it has also raised challenges in terms of computational capacity and algorithms. Many tasks in NGS data analysis, as in other bioinformatics data analysis, can be computed in a parallel, independent way; taking full advantage of this can help alleviate the analysis bottleneck. </p>
<p>Given the way NGS data is generated, scalability also plays an important role in its analysis. NGS data is generated not continuously but in batches, so computation needs can differ dramatically from one point in time to another. </p>
<p>Cloud computing provides a well-suited framework for systems with these two requirements: parallel and scalable. It also allows computing power to be adjusted on demand, so users are not tied to (and paying for) a fixed compute infrastructure. </p>
<p>Nispero is a Scala library for declaring stateless computations and scaling them using cloud computing, in particular a combination of services from AWS (Amazon Web Services). Some highlights are: </p>
<ul>
<li>strongly typed configuration based on Scala code </li>
<li>CRDT-like semantics (a nispero instance is essentially a morphism between idempotent commutative monoids) </li>
<li>automatic deploy/undeploy </li>
</ul>
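<p>The "morphism between idempotent commutative monoids" remark can be illustrated with a minimal sketch. It is written in Python rather than Scala and does not use Nispero's actual API: the monoid chosen here (sets under union) and the names <code>merge</code>, <code>lift</code> and <code>gc_content</code> are assumptions made for illustration only. The point it tries to show is that when results are combined with an idempotent, commutative operation, processing batches separately, in any order, or more than once gives the same outcome as processing everything at once.</p>

```python
from functools import reduce

EMPTY = frozenset()  # identity element of the monoid


def merge(a: frozenset, b: frozenset) -> frozenset:
    """Monoid operation: set union (associative, commutative, idempotent)."""
    return a | b


def lift(task):
    """Lift a per-item function to a map between sets of items.

    The lifted map preserves the identity and the union operation,
    so it is a monoid morphism for the structure defined above.
    """
    def morphism(items: frozenset) -> frozenset:
        return frozenset(task(x) for x in items)
    return morphism


def gc_content(seq: str) -> tuple:
    """Hypothetical stateless task: count G and C bases in a sequence."""
    return (seq, sum(1 for base in seq if base in "GC"))


process = lift(gc_content)

batch_a = frozenset({"ATGC", "GGCC"})
batch_b = frozenset({"GGCC", "TTAA"})

# Processing the merged input equals merging the processed batches.
whole = process(merge(batch_a, batch_b))
piecewise = merge(process(batch_a), process(batch_b))

# Reordered and repeated batches still yield the same result.
redelivered = reduce(merge, (process(b) for b in (batch_b, batch_a, batch_b)), EMPTY)
```

<p>Under these assumptions, <code>whole</code>, <code>piecewise</code> and <code>redelivered</code> are equal, which is the property that makes duplicated or reordered task messages harmless.</p>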
<p>Nispero relies on the EC2 service (Elastic Compute Cloud) to carry out the computations, on the S3 service (Simple Storage Service) for data storage and on SQS (Simple Queue Service) and SNS (Simple Notification Service) for communication between the different system components. </p>
<p>A Nispero system is composed of: </p>
<ul>
<li>a "console" instance that tracks the status of the whole system, letting the user check the current state of the computations, workers, etc. at any point </li>
<li>a "manager" instance that is in charge of deploying and undeploying the group of workers </li>
<li>a set of "workers" that perform the computations/tasks in a parallel, independent way </li>
<li>SQS queues for "input", "output" and "error" messages </li>
<li>S3 objects for "input" and "output" files </li>
</ul>
<p>The lifecycle of a Nispero system is simple but robust. It starts with the launch of the "console" and "manager" instances. The "manager" then takes the tasks from an S3 object, publishes them in an SQS queue and launches the workers. The workers take the messages containing the tasks from the corresponding SQS queue (i.e. the "input" queue) in an independent, parallel way. Once a worker has finished a computation, it puts the results in S3 objects, publishes a message in the "output" SQS queue and deletes the input message of the corresponding task from the "input" queue. </p>
<p>Nispero is an open-source project released under the AGPLv3 license. The source code is available at https://github.com/ohnosequences/nispero</p>
<p>This project is funded in part by the ITN FP7 project INTERCROSSING (Grant 289974).</p>
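<p>The lifecycle described above can be sketched as a small in-memory simulation. This is not Nispero code and makes no AWS calls: a dictionary stands in for S3, <code>deque</code> objects stand in for the SQS queues, and a single loop stands in for the parallel workers. The function name <code>run_lifecycle</code>, the <code>results/</code> key layout, and the way failures are routed to the "error" queue are all assumptions made for illustration.</p>

```python
from collections import deque


def run_lifecycle(tasks, compute):
    """Simulate the manager/worker message flow with in-memory stand-ins."""
    s3 = {"tasks": list(tasks)}  # stand-in for S3 objects
    input_queue = deque()        # stand-in for the "input" SQS queue
    output_queue = deque()       # stand-in for the "output" SQS queue
    error_queue = deque()        # stand-in for the "error" SQS queue

    # Manager: read the tasks from an S3 object and publish them as messages.
    for task_id, payload in enumerate(s3["tasks"]):
        input_queue.append((task_id, payload))

    # Workers: handled serially here; real workers run independently in parallel.
    while input_queue:
        task_id, payload = input_queue[0]  # receive without deleting yet
        try:
            result = compute(payload)
        except Exception as exc:
            # Assumption: failed tasks are reported on the "error" queue.
            error_queue.append((task_id, repr(exc)))
        else:
            key = f"results/{task_id}"           # hypothetical key layout
            s3[key] = result                     # 1. store the result in S3
            output_queue.append((task_id, key))  # 2. publish an output message
        input_queue.popleft()                    # 3. delete the input message last
    return s3, list(output_queue), list(error_queue)


# Usage: two valid tasks and one that fails (len(None) raises TypeError).
s3, outputs, errors = run_lifecycle(["ATGC", "GGCC", None], len)
```

<p>Deleting the input message only after the result has been stored and announced follows the order given in the text; in this sketch it means that a worker failing midway would leave the task on the queue for another attempt.</p>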