Minimap2: pairwise alignment for nucleotide sequences
Motivation: Recent advances in sequencing technologies promise ultra-long
reads of about 100 kilobases (kb) on average, full-length mRNA or cDNA reads
in high throughput, and genomic contigs over 100 megabases (Mb) in length.
Existing alignment programs either cannot process such data at scale or do so
inefficiently, which calls for the development of new alignment algorithms.
Results: Minimap2 is a general-purpose alignment program to map DNA or long
mRNA sequences against a large reference database. It works with accurate short
reads of at least 100 bp in length, genomic reads of at least 1 kb at an error
rate of about 15%, full-length noisy Direct RNA or cDNA reads, and assembly
contigs or closely related full chromosomes of hundreds of megabases in length.
Minimap2 does split-read alignment, employs a concave gap cost for long
insertions and deletions (INDELs), and introduces new heuristics to reduce
spurious alignments. It is 3-4 times faster than mainstream short-read mappers
at comparable accuracy and at least 30 times faster at higher accuracy for both
genomic and mRNA reads, surpassing most aligners specialized in one type of
alignment.
Availability and implementation: https://github.com/lh3/minimap2
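The abstract mentions a concave gap cost for long INDELs without giving the formula. One common way to obtain a concave cost is a two-piece affine model that takes the cheaper of two affine penalties, so short gaps are charged by a steep line and long gaps by a shallow one. The sketch below illustrates that idea only; the function names and the parameter values (4, 2, 24, 1) are assumptions chosen for illustration, not values stated in the abstract.

```python
def affine_gap_cost(length, gap_open, gap_extend):
    """Cost of a gap of `length` bases under a single affine model."""
    return gap_open + length * gap_extend


def two_piece_gap_cost(length, q1=4, e1=2, q2=24, e2=1):
    """Concave gap cost: the cheaper of two affine models.

    q1/e1 describe a cheap-to-open, expensive-to-extend gap (short INDELs);
    q2/e2 describe an expensive-to-open, cheap-to-extend gap (long INDELs).
    All four defaults are illustrative assumptions.
    """
    if length <= 0:
        return 0
    return min(
        affine_gap_cost(length, q1, e1),
        affine_gap_cost(length, q2, e2),
    )


# Compare the single affine model with the two-piece model.
for gap_length in (1, 10, 20, 100, 1000):
    single = affine_gap_cost(gap_length, 4, 2)
    concave = two_piece_gap_cost(gap_length)
    print(gap_length, single, concave)
```

With these assumed values the two lines cross at a gap length of 20, so longer gaps are penalized at half the per-base rate of the single affine model.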
Partial bisulfite conversion for unique template sequencing
We introduce a new protocol, mutational sequencing or muSeq, which uses sodium bisulfite to randomly deaminate unmethylated cytosines at a fixed and tunable rate. The muSeq protocol marks each initial template molecule with a unique mutation signature that is present in every copy of the template, and in every fragmented copy of a copy. In the sequenced read data, this signature is observed as a unique pattern of C-to-T or G-to-A nucleotide conversions. Clustering reads with the same conversion pattern enables accurate counting and long-range assembly of initial template molecules from short-read sequence data. We explore counting and low-error sequencing by profiling 135,000 restriction fragments in a PstI representation, demonstrating that muSeq improves copy number inference and significantly reduces sporadic sequencer error. We explore long-range assembly in the context of cDNA, generating contiguous transcript clusters greater than 3,000 bp in length. The muSeq assemblies reveal transcriptional diversity not observable from short-read data alone.
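The clustering step can be pictured with a small sketch: derive each read's C-to-T conversion pattern relative to the reference, then group reads that share a pattern. This is a simplified illustration, not the muSeq pipeline. It assumes reads are already aligned to the reference without indels, handles only the C-to-T strand, and requires exact pattern matches, so it has no tolerance for sequencing error. The function names and the toy sequences are hypothetical.

```python
from collections import defaultdict


def conversion_signature(reference, read):
    """Return the positions where a reference C is read as T.

    Assumes `read` is aligned to `reference` without indels and has the
    same length (a simplification for illustration).
    """
    if len(reference) != len(read):
        raise ValueError("read and reference must have equal length")
    return tuple(
        i
        for i, (ref_base, read_base) in enumerate(zip(reference, read))
        if ref_base == "C" and read_base == "T"
    )


def cluster_by_signature(reference, reads):
    """Group reads that share an identical C-to-T conversion pattern."""
    clusters = defaultdict(list)
    for read in reads:
        clusters[conversion_signature(reference, read)].append(read)
    return dict(clusters)


# Toy data: four reads covering the same 8-base fragment.
reference = "ACCGTCAC"
reads = ["ATCGTCAT", "ATCGTCAT", "ACTGTTAC", "ACCGTCAC"]
clusters = cluster_by_signature(reference, reads)
for signature, members in clusters.items():
    print(signature, len(members))
```

In this toy case the number of distinct signatures stands in for the number of initial template molecules; real data would need a clustering rule that tolerates sequencer errors and partial overlap between reads.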
Improving nanopore read accuracy with the R2C2 method enables the sequencing of highly multiplexed full-length single-cell cDNA.
High-throughput short-read sequencing has revolutionized how transcriptomes are quantified and annotated. However, while Illumina short-read sequencers can be used to analyze entire transcriptomes down to the level of individual splicing events with great accuracy, they fall short of analyzing how these individual events are combined into complete RNA transcript isoforms. Because of this shortfall, long-distance information is required to complement short-read sequencing to analyze transcriptomes on the level of full-length RNA transcript isoforms. While long-read sequencing technology can provide this long-distance information, there are issues with both Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) long-read sequencing technologies that prevent their widespread adoption. Briefly, PacBio sequencers produce low numbers of reads with high accuracy, while ONT sequencers produce higher numbers of reads with lower accuracy. Here, we introduce and validate a long-read ONT-based sequencing method. At the same cost, our Rolling Circle Amplification to Concatemeric Consensus (R2C2) method generates more accurate reads of full-length RNA transcript isoforms than any other available long-read sequencing method. These reads can then be used to generate isoform-level transcriptomes for both genome annotation and differential expression analysis in bulk or single-cell samples.
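The abstract does not describe how the concatemeric consensus is computed. As a rough intuition for why reading the same molecule several times improves accuracy, the sketch below applies a column-wise majority vote to repeated copies of one sequence. This is a deliberately simplified stand-in, not the R2C2 consensus procedure: it assumes the copies have already been split and aligned to equal length, and the function name and toy reads are hypothetical.

```python
from collections import Counter


def majority_consensus(copies):
    """Column-wise majority vote over equal-length, pre-aligned copies."""
    if not copies:
        raise ValueError("need at least one copy")
    length = len(copies[0])
    if any(len(copy) != length for copy in copies):
        raise ValueError("copies must be pre-aligned to the same length")
    consensus = []
    for column in zip(*copies):
        # most_common(1) returns the most frequent base in this column.
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Four noisy copies of the same 8-base molecule, each with one error
# at a different position (toy data).
copies = ["ACGTTGCA", "ACGATGCA", "ACGTTGGA", "TCGTTGCA"]
print(majority_consensus(copies))
```

Because the errors in the toy copies fall at different positions, each one is outvoted by the other three copies and the vote recovers the underlying sequence.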
Single cell transcriptome analysis using next generation sequencing.
The heterogeneity of tissues, especially in cancer research, is a central issue in transcriptome analysis. In recent years, research has primarily focused on the development of methods for single cell analysis. Single cell analysis aims at gaining (novel) insights into biological processes of healthy and diseased cells. Some of the challenges in transcriptome analysis concern the low abundance of sample starting material, the necessary sample amplification steps and the subsequent analysis. In this study, two fundamentally different approaches to amplification were compared using next-generation sequencing analysis: (I) exponential amplification using polymerase chain reaction (PCR) and (II) linear amplification. For both approaches, protocols for single cell extraction, cell lysis, cDNA synthesis, cDNA amplification and preparation of next-generation sequencing libraries were developed. We could successfully show that transcriptome analysis of low numbers of cells is feasible with both exponential and linear amplification. Exponential amplification achieved the highest amplification rates, up to 10^6. The reproducibility of results is a strength of the linear amplification method. The analysis of next-generation sequencing data from single cell samples showed detectable expression in at least 16,000 genes. The variance between samples makes it necessary to work with a larger number of biological replicates. In summary, single cell transcriptome analysis with next-generation sequencing is possible, but improvements leading to a higher yield of transcriptome reads are required. In the near future, comparing single cancer cells with healthy ones, for example, could provide a basis for improved prognosis and diagnosis.
Capturing the ‘ome’: the expanding molecular toolbox for RNA and DNA library construction
All sequencing experiments and most functional genomics screens rely on the generation of libraries to comprehensively capture pools of targeted sequences. In the past decade especially, driven by the progress in the field of massively parallel sequencing, numerous studies have comprehensively assessed the impact of particular manipulations on library complexity and quality, and characterized the activities and specificities of several key enzymes used in library construction. Fortunately, careful protocol design and reagent choice can substantially mitigate many of these biases, and enable reliable representation of sequences in libraries. This review aims to guide the reader through the vast expanse of literature on the subject to promote informed library generation, independent of the application.
Digital gene expression analysis of the zebra finch genome
Background: In order to understand patterns of adaptation and molecular evolution it is important to quantify both variation in gene expression and nucleotide sequence divergence. Gene expression profiling in non-model organisms has recently been facilitated by the advent of massively parallel sequencing technology. Here we investigate tissue specific gene expression patterns in the zebra finch (Taeniopygia guttata) with special emphasis on the genes of the major histocompatibility complex (MHC).
Results: Almost 2 million 454-sequencing reads from cDNA of six different tissues were assembled and analysed. A total of 11,793 zebra finch transcripts were represented in this EST data, indicating a transcriptome coverage of about 65%. There was a positive correlation between the tissue specificity of gene expression and the non-synonymous to synonymous nucleotide substitution ratio of genes, suggesting that genes with a specialised function are evolving at a higher rate (or with less constraint) than genes with a more general function. In line with this, there was also a negative correlation between overall expression levels and expression specificity of contigs. We found evidence for expression of 10 different genes related to the MHC. MHC genes showed relatively tissue-specific expression levels and were in general primarily expressed in spleen. Several MHC genes, including MHC class I, also showed expression in brain. Furthermore, for all genes with highest levels of expression in spleen there was an overrepresentation of several gene ontology terms related to immune function.
Conclusions: Our study highlights the usefulness of next-generation sequence data for quantifying gene expression in the genome as a whole as well as in specific candidate genes. Overall, the data show predicted patterns of gene expression profiles and molecular evolution in the zebra finch genome. Expression of MHC genes in particular corresponds well with expression patterns in other vertebrates.
Single genome sequencing of near full-length HIV-1 RNA using a limiting dilution approach
Sequencing very long stretches of the HIV-1 genome can advance studies on virus evolution and in vivo recombination but remains technically challenging. We developed an efficient procedure to sequence near full-length HIV-1 RNA using a two-amplicon approach. The whole genome was successfully amplified for 107 (88%) of 121 plasma samples, including samples from patients infected with HIV-1 subtype A1, B, C, D, F1, G, H, CRF01_AE and CRF02_AG. For the 17 samples with a viral load below 1000 c/ml and the 104 samples with a viral load above 1000 c/ml, the amplification efficiency was 53% and 94%, respectively. The sensitivity of the method was further evaluated using limiting dilution of RNA extracted from a plasma pool containing an equimolar mixture of three HIV-1 subtypes (B, C and CRF02_AG) and diluted before and after cDNA generation. Both RNA and cDNA dilution showed comparable sensitivity and equal accuracy in reflecting the subtype distribution of the plasma pool. One single event of in vitro recombination was detected amongst the 41 sequences obtained after cDNA dilution, but no indications of in vitro recombination were found after RNA dilution. In conclusion, a two-amplicon strategy and limiting dilution of viral RNA, followed by reverse transcription, nested PCR and Sanger sequencing, allows near full genome sequencing of individual HIV-1 RNA molecules. This method will be a valuable tool in the study of virus evolution and recombination.
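The reported percentages can be cross-checked with a few lines of arithmetic. The sketch below back-calculates the number of successfully amplified samples in each viral-load group from the rounded efficiencies and confirms that they add up to the stated overall figure. The per-group counts are inferred from the rounded percentages, not stated in the abstract, and the helper name is hypothetical.

```python
def successes_from_rate(total, rate):
    """Nearest whole number of successes implied by a rounded rate."""
    return round(total * rate)


low_group = 17    # samples with viral load below 1000 c/ml
high_group = 104  # samples with viral load above 1000 c/ml

# Inferred counts; the abstract reports only the percentages.
low_success = successes_from_rate(low_group, 0.53)
high_success = successes_from_rate(high_group, 0.94)

overall_success = low_success + high_success
overall_percent = round(100 * overall_success / (low_group + high_group))

print(low_success, high_success, overall_success, overall_percent)
```

The inferred counts sum to the 107 of 121 samples reported in the abstract, which is consistent with the stated overall efficiency of 88%.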
De novo assembly of the cattle reference genome with single-molecule sequencing.
Background: Major advances in selection progress for cattle have been made following the introduction of genomic tools over the past 10-12 years. These tools depend upon the Bos taurus reference genome (UMD3.1.1), which was created using now-outdated technologies and is hindered by a variety of deficiencies and inaccuracies.
Results: We present the new reference genome for cattle, ARS-UCD1.2, based on the same animal as the original to facilitate transfer and interpretation of results obtained from the earlier version, but applying a combination of modern technologies in a de novo assembly to increase continuity, accuracy, and completeness. The assembly includes 2.7 Gb and is >250× more continuous than the original assembly, with contig N50 >25 Mb and L50 of 32. We also greatly expanded supporting RNA-based data for annotation that identifies 30,396 total genes (21,039 protein coding). The new reference assembly is accessible in annotated form for public use.
Conclusions: We demonstrate that improved continuity of assembled sequence warrants the adoption of ARS-UCD1.2 as the new cattle reference genome and that increased assembly accuracy will benefit future research on this species.
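For readers unfamiliar with the continuity metrics quoted above, the following sketch computes contig N50 and L50 under their usual definitions: sort contigs from longest to shortest and accumulate lengths until at least half of the total assembly length is covered. N50 is the length of the contig that reaches that point, and L50 is the number of contigs needed to get there. The function name and the toy contig lengths are hypothetical and unrelated to the cattle assembly.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a collection of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig")
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        # The first contig that brings the running sum to half the
        # assembly length defines N50; its rank is L50.
        if running >= half_total:
            return length, count


# Toy assembly: seven contigs totalling 300 units.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = n50_l50(toy_contigs)
print(n50, l50)
```

A higher N50 together with a lower L50 means that fewer, longer contigs account for half of the assembly, which is the sense in which an assembly is described as more continuous.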
