62 research outputs found

    A New Rhesus Macaque Assembly and Annotation for Next-Generation Sequencing Analyses

    Get PDF
    BACKGROUND: The rhesus macaque (Macaca mulatta) is a key species for advancing biomedical research. Like all draft mammalian genomes, the draft rhesus assembly (rheMac2) has gaps, sequencing errors and misassemblies that have prevented automated annotation pipelines from functioning correctly. Another rhesus macaque assembly, CR_1.0, is also available but is substantially more fragmented than rheMac2 with smaller contigs and scaffolds. Annotations for these two assemblies are limited in completeness and accuracy. High quality assembly and annotation files are required for a wide range of studies including expression, genetic and evolutionary analyses. RESULTS: We report a new de novo assembly of the rhesus macaque genome (MacaM) that incorporates both the original Sanger sequences used to assemble rheMac2 and new Illumina sequences from the same animal. MacaM has a weighted average (N50) contig size of 64 kilobases, more than twice the size of the rheMac2 assembly and almost five times the size of the CR_1.0 assembly. The MacaM chromosome assembly incorporates information from previously unutilized mapping data and preliminary annotation of scaffolds. Independent assessment of the assemblies using Ion Torrent read alignments indicates that MacaM is more complete and accurate than rheMac2 and CR_1.0. We assembled messenger RNA sequences from several rhesus tissues into transcripts which allowed us to identify a total of 11,712 complete proteins representing 9,524 distinct genes. Using a combination of our assembled rhesus macaque transcripts and human transcripts, we annotated 18,757 transcripts and 16,050 genes with complete coding sequences in the MacaM assembly. Further, we demonstrate that the new annotations provide greatly improved accuracy as compared to the current annotations of rheMac2. Finally, we show that the MacaM genome provides an accurate resource for alignment of reads produced by RNA sequence expression studies. CONCLUSIONS: The MacaM assembly and annotation files provide a substantially more complete and accurate representation of the rhesus macaque genome than rheMac2 or CR_1.0 and will serve as an important resource for investigators conducting next-generation sequencing studies with nonhuman primates. REVIEWERS: This article was reviewed by Dr. Lutz Walter, Dr. Soojin Yi and Dr. Kateryna Makova

    Decoding the massive genome of loblolly pine using haploid DNA and novel assembly strategies

    Get PDF
    BACKGROUND: The size and complexity of conifer genomes has, until now, prevented full genome sequencing and assembly. The large research community and economic importance of loblolly pine, Pinus taeda L., made it an early candidate for reference sequence determination. RESULTS: We develop a novel strategy to sequence the genome of loblolly pine that combines unique aspects of pine reproductive biology and genome assembly methodology. We use a whole genome shotgun approach relying primarily on next generation sequence generated from a single haploid seed megagametophyte from a loblolly pine tree, 20-1010, that has been used in industrial forest tree breeding. The resulting sequence and assembly was used to generate a draft genome spanning 23.2 Gbp and containing 20.1 Gbp with an N50 scaffold size of 66.9 kbp, making it a significant improvement over available conifer genomes. The long scaffold lengths allow the annotation of 50,172 gene models with intron lengths averaging over 2.7 kbp and sometimes exceeding 100 kbp in length. Analysis of orthologous gene sets identifies gene families that may be unique to conifers. We further characterize and expand the existing repeat library based on the de novo analysis of the repetitive content, estimated to encompass 82% of the genome. CONCLUSIONS: In addition to its value as a resource for researchers and breeders, the loblolly pine genome sequence and assembly reported here demonstrates a novel approach to sequencing the large and complex genomes of this important group of plants that can now be widely applied

    An Automated Benchmarking Toolset

    No full text
    . The drive for performance in parallel computing and the need to evaluate platform upgrades or replacements are major reasons frequent running of benchmark codes has become commonplace for application and platform evaluation and tuning. NIST is developing a prototype for an automated benchmarking toolset to reduce the manual effort in running and analyzing the results of such benchmarks. Our toolset consists of three main modules. A Data Collection and Storage module handles the collection of performance data and implements a central repository for such data. Another module provides an integrated mechanism to analyze and visualize the data stored in the repository. An Experiment Control Module assists the user in designing and executing experiments. To reduce the development effort this toolset is built around existing tools and is designed to be easily extensible to support other tools. Keywords. Cluster computing, data collection, database, performance measurement, perf..

    QuorUM: An Error Corrector for Illumina Reads

    No full text
    <div><p>Motivation</p><p>Illumina Sequencing data can provide high coverage of a genome by relatively short (most often 100 bp to 150 bp) reads at a low cost. Even with low (advertised 1%) error rate, 100 Ă— coverage Illumina data on average has an error in some read at every base in the genome. These errors make handling the data more complicated because they result in a large number of low-count erroneous <i>k</i>-mers in the reads. However, there is enough information in the reads to correct most of the sequencing errors, thus making subsequent use of the data (e.g. for mapping or assembly) easier. Here we use the term “error correction” to denote the reduction in errors due to both changes in individual bases and trimming of unusable sequence. We developed an error correction software called QuorUM. QuorUM is mainly aimed at error correcting Illumina reads for subsequent assembly. It is designed around the novel idea of minimizing the number of distinct erroneous <i>k</i>-mers in the output reads and preserving the most true <i>k</i>-mers, and we introduce a composite statistic Ď€ that measures how successful we are at achieving this dual goal. We evaluate the performance of QuorUM by correcting actual Illumina reads from genomes for which a reference assembly is available.</p><p>Results</p><p>We produce trimmed and error-corrected reads that result in assemblies with longer contigs and fewer errors. We compared QuorUM against several published error correctors and found that it is the best performer in most metrics we use. QuorUM is efficiently implemented making use of current multi-core computing architectures and it is suitable for large data sets (1 billion bases checked and corrected per day per core). We also demonstrate that a third-party assembler (SOAPdenovo) benefits significantly from using QuorUM error-corrected reads. QuorUM error corrected reads result in a factor of 1.1 to 4 improvement in N50 contig size compared to using the original reads with SOAPdenovo for the data sets investigated.</p><p>Availability</p><p>QuorUM is distributed as an independent software package and as a module of the MaSuRCA assembly software. Both are available under the GPL open source license at <a href="http://www.genome.umd.edu" target="_blank">http://www.genome.umd.edu</a>.</p><p>Contact</p><p><a href="mailto:[email protected]" target="_blank">[email protected]</a>.</p></div

    Large scale sequence alignment via efficient inference in generative models

    No full text
    Abstract Finding alignments between millions of reads and genome sequences is crucial in computational biology. Since the standard alignment algorithm has a large computational cost, heuristics have been developed to speed up this task. Though orders of magnitude faster, these methods lack theoretical guarantees and often have low sensitivity especially when reads have many insertions, deletions, and mismatches relative to the genome. Here we develop a theoretically principled and efficient algorithm that has high sensitivity across a wide range of insertion, deletion, and mutation rates. We frame sequence alignment as an inference problem in a probabilistic model. Given a reference database of reads and a query read, we find the match that maximizes a log-likelihood ratio of a reference read and query read being generated jointly from a probabilistic model versus independent models. The brute force solution to this problem computes joint and independent probabilities between each query and reference pair, and its complexity grows linearly with database size. We introduce a bucketing strategy where reads with higher log-likelihood ratio are mapped to the same bucket with high probability. Experimental results show that our method is more accurate than the state-of-the-art approaches in aligning long-reads from Pacific Bioscience sequencers to genome sequences

    Percentage of the original reads that are perfect after error reduction, and percentage of bases contained in perfect reads compared with bases in original reads.

    No full text
    <p>The number in parenthesis is the denominator used to compute the percentage, the number of original reads and the amount of sequence in the original reads respectively.</p
    • …
    corecore