2,011 research outputs found
Benchmarking of de novo assembly algorithms for Nanopore data reveals optimal performance of OLC approaches
Improved DNA sequencing methods have transformed the field of genomics over the last decade. This has become possible due to the development of inexpensive short-read sequencing technologies, which have now resulted in three generations of sequencing platforms. More recently, a fourth generation of nanopore-based single-molecule sequencing technology was developed, based on the MinION® sequencer, which is portable, inexpensive and fast. It is capable of generating reads longer than 100 kb. Despite its specific advantages, MinION reads have two major limitations: high error rates and the need to develop downstream pipelines. Algorithms for error correction have already emerged, while the development of pipelines is still at a nascent stage
Nanopore Sequencing Technology and Tools for Genome Assembly: Computational Analysis of the Current State, Bottlenecks and Future Directions
Nanopore sequencing technology has the potential to render other sequencing
technologies obsolete with its ability to generate long reads and provide
portability. However, high error rates of the technology pose a challenge while
generating accurate genome assemblies. The tools used for nanopore sequence
analysis are of critical importance as they should overcome the high error
rates of the technology. Our goal in this work is to comprehensively analyze
current publicly available tools for nanopore sequence analysis to understand
their advantages, disadvantages, and performance bottlenecks. It is important
to understand where the current tools do not perform well to develop better
tools. To this end, we 1) analyze the multiple steps and the associated tools
in the genome assembly pipeline using nanopore sequence data, and 2) provide
guidelines for determining the appropriate tools for each step. We analyze
various combinations of different tools and expose the tradeoffs between
accuracy, performance, memory usage and scalability. We conclude that our
observations can guide researchers and practitioners in making conscious and
effective choices for each step of the genome assembly pipeline using nanopore
sequence data. Also, with the help of bottlenecks we have found, developers can
improve the current tools or build new ones that are both accurate and fast, in
order to overcome the high error rates of the nanopore sequencing technology. Comment: To appear in Briefings in Bioinformatics (BIB), 201
NAD tagSeq reveals that NAD+-capped RNAs are mostly produced from a large number of protein-coding genes in Arabidopsis.
The 5' end of a eukaryotic mRNA transcript generally has a 7-methylguanosine (m7G) cap that protects mRNA from degradation and mediates almost all other aspects of gene expression. Some RNAs in Escherichia coli, yeast, and mammals were recently found to contain an NAD+ cap. Here, we report the development of the method NAD tagSeq for transcriptome-wide identification and quantification of NAD+-capped RNAs (NAD-RNAs). The method uses an enzymatic reaction and then a click chemistry reaction to label NAD-RNAs with a synthetic RNA tag. The tagged RNA molecules can be enriched and directly sequenced using the Oxford Nanopore sequencing technology. NAD tagSeq can allow more accurate identification and quantification of NAD-RNAs, as well as reveal the sequences of whole NAD-RNA transcripts using single-molecule RNA sequencing. Using NAD tagSeq, we found that NAD-RNAs in Arabidopsis were produced by at least several thousand genes, most of which are protein-coding genes, with the majority of these transcripts coming from <200 genes. For some Arabidopsis genes, over 5% of their transcripts were NAD capped. Gene ontology terms overrepresented in the 2,000 genes that produced the highest numbers of NAD-RNAs are related to photosynthesis, protein synthesis, and responses to cytokinin and stresses. The NAD-RNAs in Arabidopsis generally have the same overall sequence structures as the canonical m7G-capped mRNAs, although most of them appear to have a shorter 5' untranslated region (5' UTR). The identification and quantification of NAD-RNAs and revelation of their sequence features can provide essential steps toward understanding the functions of NAD-RNAs
Forensic tri-allelic SNP genotyping using nanopore sequencing
The potential and current state-of-the-art of forensic SNP genotyping using nanopore sequencing was investigated with a panel of 16 tri-allelic single nucleotide polymorphisms (SNPs), multiplexing five samples per sequencing run. The sample set consisted of three single-source human genomic reference control DNA samples and two GEDNAP samples, simulating casework samples. The primers for the multiplex SNP-loci PCR were taken from a study which researched their value in a forensic setting using conventional single-base extension technology. Workflows for multiplexed Oxford Nanopore Technologies 1D and 1D² sequencing were developed that provide correct genotyping of most SNP loci. Loci that are problematic for nanopore sequencing were characterized. When such loci are avoided, nanopore sequencing of forensic tri-allelic SNPs is technically feasible
Models and information-theoretic bounds for nanopore sequencing
Nanopore sequencing is an emerging technology for sequencing DNA, which
can read long fragments of DNA (~50,000 bases) in contrast to most current
short-read sequencing technologies which can only read hundreds of bases. While
nanopore sequencers can acquire long reads, the high error rates (20%-30%) pose
a technical challenge. In a nanopore sequencer, a DNA molecule is migrated through a
nanopore and current variations are measured. The DNA sequence is inferred from
this observed current pattern using an algorithm called a base-caller. In this
paper, we propose a mathematical model for the "channel" from the input DNA
sequence to the observed current, and calculate bounds on the information
extraction capacity of the nanopore sequencer. This model incorporates
impairments such as (non-linear) inter-symbol interference, deletions, and
random response. These information bounds have a two-fold application: (1) The
decoding rate with a uniform input distribution can be used to calculate the
average size of the plausible list of DNA sequences given an observed current
trace. This bound can be used to benchmark existing base-calling algorithms, as
well as serving as a performance objective for designing better nanopores. (2) When
the nanopore sequencer is used as a reader in a DNA storage system, the storage
capacity is quantified by our bounds
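
As an illustration of the kind of channel this abstract describes, the following Python sketch simulates a toy version of it. It is not the model from the paper: the k-mer window (standing in for inter-symbol interference), the pseudo-random level table, the independent deletion probability `p_del`, and the Gaussian noise `noise_sd` (standing in for the random response) are all assumptions chosen for illustration, as is the function name `simulate_nanopore_channel`.

```python
import random


def simulate_nanopore_channel(seq, k=3, p_del=0.1, noise_sd=0.5, rng=None):
    """Toy DNA-to-current channel (illustrative assumptions only).

    - Inter-symbol interference: each output depends on a window of k bases.
    - Deletions: each k-mer event is dropped independently with probability p_del.
    - Random response: Gaussian noise with standard deviation noise_sd is added.
    """
    rng = rng or random.Random(0)

    def level(kmer):
        # Hypothetical level table: seed a generator with the k-mer string so
        # the same k-mer always maps to the same mean current in [0, 100].
        return random.Random(kmer).uniform(0.0, 100.0)

    samples = []
    for i in range(len(seq) - k + 1):
        if rng.random() < p_del:
            continue  # deletion: this event never reaches the output
        samples.append(level(seq[i:i + k]) + rng.gauss(0.0, noise_sd))
    return samples


if __name__ == "__main__":
    trace = simulate_nanopore_channel("ACGTACGTTGCA")
    print(len(trace), "current samples observed")
```

With `p_del=0.0` and `noise_sd=0.0` the channel becomes deterministic and the trace has exactly `len(seq) - k + 1` samples, which makes it easy to see how deletions and noise each blur the mapping that a base-caller has to invert.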
Identification of Structural Variation in Chimpanzees Using Optical Mapping and Nanopore Sequencing.
Recent efforts to comprehensively characterize great ape genetic diversity using short-read sequencing and single-nucleotide variants have led to important discoveries related to selection within species, demographic history, and lineage-specific traits. Structural variants (SVs), including deletions and inversions, comprise a larger proportion of genetic differences between and within species, making them an important yet understudied source of trait divergence. Here, we used a combination of long-read and long-range sequencing approaches to characterize the structural variant landscape of two additional Pan troglodytes verus individuals, one of whom carries 13% admixture from Pan troglodytes troglodytes. We performed optical mapping of both individuals followed by nanopore sequencing of one individual. After filtering for larger variants (>10 kbp) and combining the results with genotyping of SVs using short-read data from the Great Ape Genome Project, we identified 425 deletions and 59 inversions, of which 88 and 36, respectively, were novel. Compared with gene expression in humans, we found a significant enrichment of chimpanzee genes with differential expression in lymphoblastoid cell lines and induced pluripotent stem cells, both within deletions and near inversion breakpoints. We examined chromatin-conformation maps from human and chimpanzee using these same cell types and observed alterations in genomic interactions at SV breakpoints. Finally, we focused on 56 genes impacted by SVs in >90% of chimpanzees and absent in humans and gorillas, which may contribute to chimpanzee-specific features. Sequencing a greater set of individuals from diverse subspecies will be critical to establish the complete landscape of genetic variation in chimpanzees
Single-molecule DNA sequencing technologies for future genomics research
During the current genomics revolution, the genomes of a large number of living organisms have been fully sequenced. However, with the advent of new sequencing technologies, genomics research is now at the threshold of a second revolution. Several second-generation sequencing platforms became available in 2007, but a further revolution in DNA resequencing technologies is being witnessed in 2008, with the launch of the first single-molecule DNA sequencer (Helicos Biosciences), which has already been used to resequence the genome of the M13 virus. This review discusses several single-molecule sequencing technologies that are expected to become available during the next few years and explains how they might impact on genomics research