
18 research outputs found

QuorUM: An Error Corrector for Illumina Reads

Author: Aleksey Zimin (756350)
Guillaume Marçais (4484032)
James A. Yorke (242491)
Publication venue
Publication date: 17/06/2015
Field of study

Motivation: Illumina sequencing data can provide high coverage of a genome by relatively short (most often 100 bp to 150 bp) reads at a low cost. Even with a low (advertised 1%) error rate, 100× coverage Illumina data on average has an error in some read at every base in the genome. These errors make handling the data more complicated because they result in a large number of low-count erroneous k-mers in the reads. However, there is enough information in the reads to correct most of the sequencing errors, thus making subsequent use of the data (e.g. for mapping or assembly) easier. Here we use the term “error correction” to denote the reduction in errors due to both changes in individual bases and trimming of unusable sequence. We developed error correction software called QuorUM. QuorUM is mainly aimed at error correcting Illumina reads for subsequent assembly. It is designed around the novel idea of minimizing the number of distinct erroneous k-mers in the output reads and preserving the most true k-mers, and we introduce a composite statistic π that measures how successful we are at achieving this dual goal. We evaluate the performance of QuorUM by correcting actual Illumina reads from genomes for which a reference assembly is available.

Results: We produce trimmed and error-corrected reads that result in assemblies with longer contigs and fewer errors. We compared QuorUM against several published error correctors and found that it is the best performer in most metrics we use. QuorUM is efficiently implemented, making use of current multi-core computing architectures, and it is suitable for large data sets (1 billion bases checked and corrected per day per core). We also demonstrate that a third-party assembler (SOAPdenovo) benefits significantly from using QuorUM error-corrected reads. QuorUM error-corrected reads result in a factor of 1.1 to 4 improvement in N50 contig size compared to using the original reads with SOAPdenovo for the data sets investigated.

Availability: QuorUM is distributed as an independent software package and as a module of the MaSuRCA assembly software. Both are available under the GPL open source license at http://www.genome.umd.edu.

Contact: [email protected]
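The coverage arithmetic and the "low-count erroneous k-mers" mentioned in the abstract can be illustrated with a toy example. The sketch below is not QuorUM's algorithm; the helper names (`kmers`, `count_kmers`), the toy genome, and the choice k = 5 are assumptions made purely for illustration. It shows that 100× coverage at a 1% error rate gives about one erroneous read base per genome position, and that a single substitution in one read introduces up to k k-mers that occur only once.

```python
from collections import Counter

def kmers(seq, k):
    """Yield every k-mer (substring of length k) of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts

# Figures from the abstract: 100x coverage, advertised 1% error rate.
coverage = 100
error_rate = 0.01
expected_errors_per_position = coverage * error_rate  # about one per base

# Toy data (hypothetical): five error-free reads plus one read
# carrying a single substitution at position 8 (T -> A).
k = 5
genome = "ACGTTGCATGGACCTAGT"
bad_read = genome[:8] + "A" + genome[9:]
reads = [genome] * 5 + [bad_read]

counts = count_kmers(reads, k)
# k-mers seen exactly once are the ones created by the error.
low_count = sorted(km for km, c in counts.items() if c == 1)
print(expected_errors_per_position, low_count)
```

In this toy set every true k-mer is seen at least five times, while the five k-mers overlapping the substituted base are seen once, which is why a count threshold can separate most errors from true sequence when coverage is high.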

Directory of Open Access Journals

FigShare

Idealized contig size statistics (in kb).

Author: Aleksey Zimin (756350)
Guillaume Marçais (4484032)
James A. Yorke (242491)

FigShare

Runtime of each program in hours:minutes:seconds, using 16 threads, and memory usage in giga-bytes.

Author: Aleksey Zimin (756350)
Guillaume Marçais (4484032)
James A. Yorke (242491)

The number of bases in each genome is reported in each column.

FigShare

Percentage of the original reads that are perfect after error reduction, and percentage of bases contained in perfect reads compared with bases in original reads.

Author: Aleksey Zimin (756350)
Guillaume Marçais (4484032)
James A. Yorke (242491)

The number in parentheses is the denominator used to compute the percentage: the number of original reads and the amount of sequence in the original reads, respectively.

FigShare

The assembled NGA50 contig size in kilo-bases for SOAPdenovo.

Author: Aleksey Zimin (756350)
Guillaume Marçais (4484032)
James A. Yorke (242491)

“-d0” and “-d1” are parameters to SOAPdenovo instructing the assembler to use all 31-mers or to ignore the 31-mers occurring only once, respectively. For MaSuRCA, which incorporates QuorUM, the result is in parentheses.
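The listing does not define NGA50. A common convention, assumed here, is that N50 is the length L such that pieces of length at least L cover half of the total assembly, that NG50 uses the reference genome size as the total instead, and that NGA50 applies the NG50 rule to aligned blocks obtained after breaking contigs at misassemblies. The minimal sketch below implements only the shared length statistic; the function name `nx50` and the example lengths are hypothetical, and the alignment step is not shown.

```python
def nx50(lengths, total=None):
    """Return the largest length L such that pieces of length >= L
    together cover at least half of `total`.

    With total=None the sum of the lengths is used (N50).
    Passing the reference genome size gives NG50; passing aligned
    block lengths plus the genome size gives an NGA50-style value.
    Returns 0 when the pieces cover less than half of `total`.
    """
    if total is None:
        total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

# Made-up contig lengths in kb, for illustration only.
contigs = [80, 70, 50, 40, 30, 20]
print(nx50(contigs))             # N50 over the assembly itself
print(nx50(contigs, total=400))  # NG50 against a 400 kb reference
```

Because the reference size is fixed, NG50-style values are comparable between assemblies of the same genome, whereas plain N50 changes with the total assembled length.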

FigShare

Percent of false 31-mers remaining and true 31-mers missing in error corrected reads.

Author: Aleksey Zimin (756350)
Guillaume Marçais (4484032)
James A. Yorke (242491)

The numbers for “false remain” and “true missing” in the table are percentages. We list the denominators used for the percentages in the headers of each of these columns. For the “false remain”, this denominator is the number of false 31-mers in the original reads, and for the “true missing”, it is the number of 31-mers in the reference. The “score” π is the product of the “false remain” and “true missing” columns. QuorUM’s π score is the best, by a factor of 30, 15, and 3.5 over the second best for the Rhodobacter, Staphylococcus and Mouse C16 data sets, respectively.
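The π score described in this caption can be sketched from k-mer sets. The denominators below follow the caption; the numerators (false k-mers present in the corrected reads, and reference k-mers absent from them) are an assumption about how the table was computed, and the function name `pi_score` and the toy 3-mer sets are hypothetical.

```python
def pi_score(reference, original, corrected):
    """Return (false_remain_pct, true_missing_pct, pi).

    reference: set of k-mers in the reference genome
    original:  set of k-mers in the uncorrected reads
    corrected: set of k-mers in the error-corrected reads
    Assumes the original reads contain at least one false k-mer.
    """
    false_original = original - reference      # false k-mers before correction
    false_corrected = corrected - reference    # false k-mers after correction
    true_missing = reference - corrected       # true k-mers lost by correction
    false_remain_pct = 100.0 * len(false_corrected) / len(false_original)
    true_missing_pct = 100.0 * len(true_missing) / len(reference)
    return false_remain_pct, true_missing_pct, false_remain_pct * true_missing_pct

# Toy 3-mer sets, made up for illustration (the paper uses 31-mers).
reference = {"AAA", "AAC", "ACG", "CGT"}
original = reference | {"AAT", "TTT", "GGG", "CCC"}
corrected = {"AAA", "AAC", "ACG", "TTT"}

result = pi_score(reference, original, corrected)
print(result)
```

A lower π is better: it falls when either fewer false k-mers survive or fewer true k-mers are lost, so a corrector cannot score well by trimming aggressively at the expense of true sequence.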

FigShare

Number of chimeric reads per 10000 after correction.

Author: Aleksey Zimin (756350)
Guillaume Marçais (4484032)
James A. Yorke (242491)

FigShare

Additional file 6 of Evolution of transcriptional networks in yeast: alternative teams of transcriptional factors for different species

Author: Adriana Muñoz (3491801)
Aleksey Zimin (756350)
Daniella Santos Muñoz (3491798)
James A. Yorke (242491)

Supplementary material overview. We give an overview of the supplementary material and methods provided in this paper. (PDF 74 kb)

FigShare

Additional file 3 of Evolution of transcriptional networks in yeast: alternative teams of transcriptional factors for different species

Author: Adriana Muñoz (3491801)
Aleksey Zimin (756350)
Daniella Santos Muñoz (3491798)
James A. Yorke (242491)

Supplementary material: database of the computed transcription factor binding probabilities. We report the database of the computed binding probabilities of 126 transcription factors to 2557 genes shared by 23 species. (TXT 415 kb)

FigShare

Additional file 1 of Evolution of transcriptional networks in yeast: alternative teams of transcriptional factors for different species

Author: Adriana Muñoz (3491801)
Aleksey Zimin (756350)
Daniella Santos Muñoz (3491798)
James A. Yorke (242491)

Supplementary material: all genes present in each module. We report all modules and, for each module, we list all genes in the module. Each gene entry includes identifier and name. (PDF 91 kb)

FigShare