18 research outputs found

    QuorUM: An Error Corrector for Illumina Reads

    <div><p>Motivation</p><p>Illumina sequencing data can provide high coverage of a genome by relatively short (most often 100 bp to 150 bp) reads at a low cost. Even with a low (advertised 1%) error rate, 100× coverage Illumina data on average has an error in some read at every base in the genome. These errors make handling the data more complicated because they result in a large number of low-count erroneous <i>k</i>-mers in the reads. However, there is enough information in the reads to correct most of the sequencing errors, thus making subsequent use of the data (e.g. for mapping or assembly) easier. Here we use the term “error correction” to denote the reduction in errors due to both changes in individual bases and trimming of unusable sequence. We developed error correction software called QuorUM. QuorUM is mainly aimed at error correcting Illumina reads for subsequent assembly. It is designed around the novel idea of minimizing the number of distinct erroneous <i>k</i>-mers in the output reads while preserving the most true <i>k</i>-mers, and we introduce a composite statistic <i>π</i> that measures how successful we are at achieving this dual goal. We evaluate the performance of QuorUM by correcting actual Illumina reads from genomes for which a reference assembly is available.</p><p>Results</p><p>We produce trimmed and error-corrected reads that result in assemblies with longer contigs and fewer errors. We compared QuorUM against several published error correctors and found that it is the best performer in most metrics we use. QuorUM is efficiently implemented, making use of current multi-core computing architectures, and it is suitable for large data sets (1 billion bases checked and corrected per day per core). We also demonstrate that a third-party assembler (SOAPdenovo) benefits significantly from using QuorUM error-corrected reads. QuorUM error-corrected reads result in a factor of 1.1 to 4 improvement in N50 contig size compared to using the original reads with SOAPdenovo for the data sets investigated.</p><p>Availability</p><p>QuorUM is distributed as an independent software package and as a module of the MaSuRCA assembly software. Both are available under the GPL open source license at <a href="http://www.genome.umd.edu" target="_blank">http://www.genome.umd.edu</a>.</p><p>Contact</p><p><a href="mailto:[email protected]" target="_blank">[email protected]</a>.</p></div>
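
The N50 comparison in the abstract above can be illustrated with a small sketch. It assumes the usual definition of N50: the largest contig length L such that contigs of length at least L contain at least half of the assembled bases. The function name `n50` and the contig lengths are hypothetical, chosen only for illustration; they are not taken from QuorUM, SOAPdenovo, or the paper's data.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (usual definition, assumed)."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the bases are covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (in kb) before and after error correction.
original = [2, 3, 4, 5, 6]
corrected = [10, 6, 4]
improvement = n50(corrected) / n50(original)
```

With these made-up lengths the improvement factor is 2, which would sit inside the 1.1 to 4 range reported in the abstract.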

    Idealized contig size statistics (in kb).

    <p>Idealized contig size statistics (in kb).</p>

    Percentage of the original reads that are perfect after error reduction, and percentage of bases contained in perfect reads compared with bases in original reads.

    <p>The number in parentheses is the denominator used to compute the percentage: the number of original reads and the amount of sequence in the original reads, respectively.</p>

    The assembled NGA50 contig size in kilo-bases for SOAPdenovo.

    <p>The “-d0” and “-d1” settings are SOAPdenovo parameters instructing the assembler to use all 31-mers or to ignore the 31-mers occurring only once, respectively. For MaSuRCA, which incorporates QuorUM, the result is in parentheses.</p>
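
The difference between using all 31-mers and ignoring 31-mers that occur only once can be sketched as a simple counting filter. This is an illustration of the idea only, not SOAPdenovo's implementation: the helper names `count_kmers` and `drop_singletons` are hypothetical, reverse complements are ignored, and the example uses k = 3 so the counts can be checked by hand.

```python
from collections import Counter


def count_kmers(reads, k=31):
    """Count every k-mer occurrence across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def drop_singletons(counts):
    """Keep only k-mers seen more than once (the 'ignore singletons' case)."""
    return {kmer: n for kmer, n in counts.items() if n > 1}


# Two toy reads that differ in their last base.
all_kmers = count_kmers(["ACGTA", "ACGTT"], k=3)
kept = drop_singletons(all_kmers)
```

In this toy case the two 3-mers that differ between the reads occur once each and are filtered out, which is why low-count erroneous k-mers matter less when singletons are ignored.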

    Percent of false 31-mers remaining and true 31-mers missing in error corrected reads.

    <p>The numbers for “false remain” and “true missing” in the table are percentages. We list the denominators used for the percentages in the headers of each of these columns. For “false remain”, the denominator is the number of false 31-mers in the original reads; for “true missing”, it is the number of 31-mers in the reference. The “score” <i>π</i> is the product of the “false remain” and “true missing” columns. QuorUM’s <i>π</i> score is the best, beating the second best by factors of 30, 15, and 3.5 for the Rhodobacter, Staphylococcus and Mouse C16 data sets respectively.</p>
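
The caption's definition of π can be sketched in a few lines. This is a minimal illustration, not QuorUM's code, and it makes several assumptions: a k-mer is "true" if it occurs in the reference and "false" otherwise, "false remain" counts the false k-mers present in the corrected reads, "true missing" counts the reference k-mers absent from the corrected reads, and reverse complements are ignored. The function names and the toy sequences are hypothetical; k = 3 is used so the example can be checked by hand.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers in one sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_set(seqs, k):
    """Return the set of distinct k-mers across several sequences."""
    out = set()
    for s in seqs:
        out |= kmers(s, k)
    return out


def pi_score(reference, original_reads, corrected_reads, k=31):
    """Return (false_remain %, true_missing %, pi) under the stated assumptions."""
    true_kmers = kmers(reference, k)
    false_original = kmer_set(original_reads, k) - true_kmers
    if not true_kmers or not false_original:
        raise ValueError("both denominators must be non-zero")
    corrected = kmer_set(corrected_reads, k)
    # Denominators follow the caption: false k-mers in the original reads,
    # and k-mers in the reference.
    false_remain = 100.0 * len(corrected - true_kmers) / len(false_original)
    true_missing = 100.0 * len(true_kmers - corrected) / len(true_kmers)
    return false_remain, true_missing, false_remain * true_missing


false_remain, true_missing, pi = pi_score(
    "ACGTAC", ["ACGTTC", "CGTAC"], ["ACGT", "CGTA", "GTT"], k=3
)
```

Because π is a product of two percentages, a lower value is better: it is small only when few false k-mers remain and few true k-mers are lost.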

    Number of chimeric reads per 10000 after correction.

    <p>Number of chimeric reads per 10000 after correction.</p>

    Additional file 6 of Evolution of transcriptional networks in yeast: alternative teams of transcriptional factors for different species

    Supplementary material overview. We give an overview of the supplementary material and methods provided in this paper. (PDF 74 kb)

    Additional file 3 of Evolution of transcriptional networks in yeast: alternative teams of transcriptional factors for different species

    Supplementary material: database of the computed transcription factor binding probabilities. We report the database of the computed binding probabilities of 126 transcription factors to 2557 genes shared by 23 species. (TXT 415 kb)

    Additional file 1 of Evolution of transcriptional networks in yeast: alternative teams of transcriptional factors for different species

    Supplementary material: all genes present in each module. We report all modules and, for each module, list all genes in it. Each gene entry includes an identifier and a name. (PDF 91 kb)