168 research outputs found
Near-optimal RNA-Seq quantification
We present a novel approach to RNA-Seq quantification that is near optimal in
speed and accuracy. Software implementing the approach, called kallisto, can be
used to analyze 30 million unaligned paired-end RNA-Seq reads in less than 5
minutes on a standard laptop computer while providing results as accurate as
those of the best existing tools. This removes a major computational bottleneck
in RNA-Seq analysis.
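The core idea behind kallisto's speed, pseudoalignment, can be illustrated with a toy sketch: each k-mer of a read votes for the set of transcripts containing it, and intersecting those sets yields the read's compatibility class. This is a simplified illustration only; kallisto's actual index is a transcriptome de Bruijn graph with equivalence classes, and the transcript sequences below are invented.

```python
# Toy pseudoalignment: intersect per-k-mer transcript sets.
def build_index(transcripts, k=5):
    """Map each k-mer to the set of transcripts containing it."""
    index = {}
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index.setdefault(seq[i:i + k], set()).add(name)
    return index

def pseudoalign(read, index, k=5):
    """Intersect the compatibility sets of the read's k-mers."""
    compat = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if hits is None:
            continue  # skip k-mers absent from the index (e.g. sequencing errors)
        compat = set(hits) if compat is None else compat & hits
    return compat or set()

transcripts = {"t1": "ACGTACGTTTGCA", "t2": "ACGTACGTCCGGA"}
idx = build_index(transcripts)
print(pseudoalign("ACGTACGTTT", idx))  # {'t1'}
```

Because no alignment coordinates are computed, only set intersections, the per-read cost is tiny, which is what makes quantifying tens of millions of reads on a laptop feasible.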
Error and Error Mitigation in Low-Coverage Genome Assemblies
The recent release of twenty-two new genome sequences has dramatically increased the data available for mammalian comparative genomics, but twenty of these new sequences are currently limited to ~2× coverage. Here we examine the extent of sequencing error in these 2× assemblies, and its potential impact in downstream analyses. By comparing 2× assemblies with high-quality sequences from the ENCODE regions, we estimate the rate of sequencing error to be 1–4 errors per kilobase. While this error rate is fairly modest, sequencing error can still have surprising effects. For example, an apparent lineage-specific insertion in a coding region is more likely to reflect sequencing error than a true biological event, and the length distribution of coding indels is strongly distorted by error. We find that most errors are contributed by a small fraction of bases with low quality scores, in particular, by the ends of reads in regions of single-read coverage in the assembly. We explore several approaches for automatic sequencing error mitigation (SEM), making use of the localized nature of sequencing error, the fact that it is well predicted by quality scores, and information about errors that comes from comparisons across species. Our automatic methods for error mitigation cannot replace the need for additional sequencing, but they do allow substantial fractions of errors to be masked or eliminated at the cost of modest amounts of over-correction, and they can reduce the impact of error in downstream phylogenomic analyses. Our error-mitigated alignments are available for download.
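The quality-score-based masking idea can be sketched minimally as below; the threshold and sequences are hypothetical, and the paper's SEM methods additionally exploit error localization at read ends and cross-species comparisons, which this sketch omits.

```python
# Toy quality-based error masking: hide bases the sequencer itself
# flagged as unreliable, rather than trusting them in downstream
# phylogenomic analyses.
def mask_low_quality(seq, quals, min_q=20):
    """Replace bases with Phred quality below min_q by 'N'."""
    return "".join(b if q >= min_q else "N" for b, q in zip(seq, quals))

print(mask_low_quality("ACGTAC", [30, 30, 10, 30, 5, 30]))  # ACNTNC
```

Masking trades a small loss of usable sequence for a large reduction in spurious lineage-specific differences, mirroring the over-correction trade-off the abstract describes.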
Paired is better: local assembly algorithms for NGS paired reads and applications to RNA-Seq
The analysis of biological sequences is one of the main research areas of Bioinformatics.
Sequencing data are the input for almost all studies of genomic and transcriptomic sequences, and sequencing experiments should be designed specifically for each type of application.
Fundamental biological questions are usually addressed by first aligning or assembling the reads produced by new sequencing technologies. Assembly is the first step when a reference sequence is not available. Alignment of genomic reads to a known genome is fundamental, e.g., to find differences among organisms of related species and to detect mutations characteristic of the so-called "diseases of the
genome". Alignment of transcriptomic reads against a reference genome allows detection of the expressed genes as well as annotation and quantification of alternative transcripts.
In this thesis we review the approaches proposed in the literature for solving the above-mentioned problems. In particular, we analyze the sequence assembly problem in depth, with particular emphasis on genome reconstruction, both from a theoretical point of view and in light of the characteristics of sequencing data produced by state-of-the-art technologies. We also review the main steps in a pipeline for the analysis of the transcriptome, that is, alignment, assembly, and transcript quantification, with particular
emphasis on the opportunities RNA-Seq technologies offer for enhancing precision.
The thesis is divided into two parts, the first devoted to the study of local assembly methods for Next Generation Sequencing data, the second concerning the development of tools for alignment of RNA-Seq reads and transcript quantification. The permanent theme is the use of paired reads in all the applications discussed in this thesis. In particular, we emphasize the benefits of assembling inserts from paired reads in a wide range of applications, from de novo assembly to the analysis of RNA.
The main contribution of this thesis lies in the introduction of innovative tools based on well-studied heuristics fine-tuned on the data. Software is always tested to specifically assess the correctness of prediction. The aim is to produce robust methods that, having a low false-positive rate, produce a certified output characterized by high specificity.
Optical map guided genome assembly
Background: The long reads produced by third generation sequencing technologies have significantly boosted the results of genome assembly, but genome-wide assemblies based solely on read data still cannot be produced. Thus, for example, optical mapping data has been used to further improve genome assemblies, but it has mostly been applied in a post-processing stage after contig assembly. Results: We propose OpticalKermit, which directly integrates genome wide optical maps into contig assembly. We show how genome wide optical maps can be used to localize reads on the genome, and we then adapt the Kermit method, which originally incorporated genetic linkage maps into the miniasm assembler, to use this information in contig assembly. Our experimental results show that incorporating genome wide optical maps into the contig assembly of miniasm increases NGA50 while the number of misassemblies decreases or stays the same. Furthermore, when compared to the Canu assembler, OpticalKermit produces an assembly with almost three times higher NGA50 and a lower number of misassemblies on real A. thaliana reads. Conclusions: OpticalKermit successfully incorporates optical mapping data directly into the contig assembly of eukaryotic genomes. Our results show that this is a promising approach to improve the contiguity of genome assemblies.
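The localization idea can be sketched as matching a read's ordered restriction-fragment lengths against the genome-wide optical map. This toy version uses a simple relative tolerance and invented fragment sizes; real optical-map alignment must also handle sizing error, missed cut sites, and false cuts.

```python
# Toy optical-map localization: slide the read's fragment-length
# pattern along the genome-wide map and report matching positions.
def locate(read_frags, map_frags, tol=0.1):
    """Return map indices where the read's fragment pattern matches
    within a relative tolerance tol."""
    hits = []
    n = len(read_frags)
    for i in range(len(map_frags) - n + 1):
        window = map_frags[i:i + n]
        if all(abs(w - r) <= tol * r for w, r in zip(window, read_frags)):
            hits.append(i)
    return hits

genome_map = [12.0, 7.5, 3.2, 9.9, 7.4, 3.1, 15.0]  # fragment sizes (kb)
print(locate([7.4, 3.2], genome_map))  # [1, 4]
```

A read whose pattern matches a unique position can be assigned to that locus before contig assembly, which is the information OpticalKermit feeds into the miniasm pipeline.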
New Algorithms for Fast and Economic Assembly: Advances in Transcriptome and Genome Assembly
Great efforts have been devoted to decipher the sequence composition of
the genomes and transcriptomes of diverse organisms. Continuing advances in
high-throughput sequencing technologies have led to a decline in associated
costs, facilitating a rapid increase in the amount of available genetic data. In
particular genome studies have undergone a fundamental paradigm shift where
genome projects are no longer limited by sequencing costs, but rather by
computational problems associated with assembly. There is an urgent demand
for more efficient and more accurate methods. Most recently, “hybrid”
methods that integrate short- and long-read data have been devised to address
this need. LazyB is a new, low-cost hybrid genome assembler. It starts from a
bipartite overlap graph between long reads and restrictively filtered short-read
unitigs. This graph is translated into a long-read overlap graph. By design,
unitigs are both unique and almost free of assembly errors. As a consequence,
only few spurious overlaps are introduced into the graph. Instead of the more
conventional approach of removing tips, bubbles, and other local features,
LazyB extracts subgraphs whose global properties approach a disjoint union of
paths in multiple steps, utilizing properties of proper interval graphs. A
prototype implementation of LazyB, entirely written in Python, not only yields
significantly more accurate assemblies of the yeast, fruit fly, and human
genomes compared to state-of-the-art pipelines, but also requires much less
computational effort. An optimized C++ implementation dubbed MuCHSALSA
further significantly reduces resource demands.
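The bipartite-graph translation step can be sketched as follows: two long reads are considered to overlap when they share an anchor unitig. This toy version omits the order and orientation consistency checks the real method performs, and the read and unitig names are hypothetical.

```python
# Toy translation of a bipartite (long read <-> unitig) graph into a
# long-read overlap graph: connect reads sharing an anchor unitig.
from itertools import combinations

def overlap_graph(read_to_unitigs):
    edges = set()
    unitig_to_reads = {}
    # invert the bipartite mapping: which reads contain each unitig?
    for read, unitigs in read_to_unitigs.items():
        for u in unitigs:
            unitig_to_reads.setdefault(u, []).append(read)
    # any two reads anchored on the same unitig are candidate overlaps
    for reads in unitig_to_reads.values():
        for a, b in combinations(sorted(reads), 2):
            edges.add((a, b))
    return edges

anchors = {"r1": {"u1", "u2"}, "r2": {"u2", "u3"}, "r3": {"u3"}}
print(sorted(overlap_graph(anchors)))  # [('r1', 'r2'), ('r2', 'r3')]
```

Because unitigs are unique and nearly error-free, almost every edge produced this way is a true overlap, which is why the resulting graph needs so little local cleanup.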
Advances in RNA-seq have facilitated tremendous insights into the role of
both coding and non-coding transcripts. Yet, the complete and accurate
annotation of the transcriptomes of even model organisms has remained elusive.
RNA-seq produces reads significantly shorter than the average distance
between related splice events, and presents high noise levels and other biases.
The computational reconstruction remains a critical bottleneck.
Ryūtō implements an extension of common splice graphs facilitating the integration
of reads spanning multiple splice sites and paired-end reads bridging distant
transcript parts. The decomposition of read coverage patterns is modeled as a
minimum-cost flow problem. Using phasing information from multi-splice and
paired-end reads, nodes with uncertain connections are decomposed step-wise
via Linear Programming.
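The flow-style decomposition can be illustrated by greedily peeling the widest source-sink paths off a coverage-labeled splice graph. This is a deliberate simplification of Ryūtō's minimum-cost-flow model with phasing constraints; it assumes integer node labels already in topological order, and the coverages are invented.

```python
# Toy decomposition of splice-graph edge coverage into transcript
# paths: repeatedly extract the widest source-sink path and subtract
# its flow from the edges it uses.
def peel_paths(edges, source, sink):
    """edges: {(u, v): coverage}; node labels sorted = topological."""
    paths = []
    while True:
        nodes = {source}
        for u, v in edges:
            nodes.update((u, v))
        order = sorted(nodes)
        best = {n: 0 for n in nodes}   # widest flow reaching each node
        prev = {}
        best[source] = float("inf")
        for u in order:                # widest-path DP over the DAG
            for (a, b), cov in edges.items():
                if a == u:
                    w = min(best[u], cov)
                    if w > best[b]:
                        best[b], prev[b] = w, u
        if best.get(sink, 0) <= 0:
            break
        path, n = [sink], sink         # trace back the widest path
        while n != source:
            n = prev[n]
            path.append(n)
        path.reverse()
        for a, b in zip(path, path[1:]):
            edges[(a, b)] -= best[sink]
        paths.append((path, best[sink]))
    return paths

cov = {(0, 1): 5, (1, 3): 3, (1, 2): 2, (2, 3): 2}
print(peel_paths(cov, 0, 3))  # [([0, 1, 3], 3), ([0, 1, 2, 3], 2)]
```

Each extracted path corresponds to a candidate transcript with an abundance estimate; the phasing information from multi-splice and paired-end reads is what lets the real method resolve nodes where this greedy view would be ambiguous.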
Ryūtō's performance compares favorably with
state-of-the-art methods on both simulated and real-life datasets. Despite
ongoing research and our own contributions, progress on traditional single
sample assembly has brought no major breakthrough. Multi-sample RNA-Seq
experiments provide more information which, however, is challenging to utilize
due to the large amount of accumulating errors. An extension to Ryūtō
enables the reconstruction of consensus transcriptomes from multiple RNA-seq
data sets, incorporating consensus calling at low level features. Benchmarks
show stable improvements already at 3 replicates.
Ryūtō outperforms competing approaches, providing a better and user-adjustable
sensitivity-precision trade-off. Ryūtō consistently improves assembly on
replicates, demonstrable also when mixing conditions or time series and for
differential expression analysis. Ryūtō's approach towards guided assembly is
equally unique. It allows users to adjust results based on the quality of the
guide, even for multi-sample assembly.

1 Preface
1.1 Assembly: A vast and fast evolving field
1.2 Structure of this Work
1.3 Available
2 Introduction
2.1 Mathematical Background
2.2 High-Throughput Sequencing
2.3 Assembly
2.4 Transcriptome Expression
3 From LazyB to MuCHSALSA - Fast and Cheap Genome Assembly
3.1 Background
3.2 Strategy
3.3 Data preprocessing
3.4 Processing of the overlap graph
3.5 Post Processing of the Path Decomposition
3.6 Benchmarking
3.7 MuCHSALSA – Moving towards the future
4 Ryūtō - Versatile, Fast, and Effective Transcript Assembly
4.1 Background
4.2 Strategy
4.3 The Ryūtō core algorithm
4.4 Improved Multi-sample transcript assembly with Ryūtō
5 Conclusion & Future Work
5.1 Discussion and Outlook
5.2 Summary and Conclusion
The khmer software package: enabling efficient nucleotide sequence analysis
The khmer package is a freely available software library for working efficiently with fixed-length DNA words, or k-mers. khmer provides implementations of a probabilistic k-mer counting data structure, a compressible De Bruijn graph representation, De Bruijn graph partitioning, and digital normalization. khmer is implemented in C++ and Python, and is freely available under the BSD license at https://github.com/dib-lab/khmer/
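The probabilistic counting component can be sketched in the spirit of a count-min sketch, as below. Parameters here are toy-sized, and this is not khmer's implementation (which is C++ with much larger tables and canonical k-mer handling); it only illustrates the over-estimating, fixed-memory counting idea.

```python
# Minimal count-min-sketch-style k-mer counter: several small hash
# tables; the reported count is the minimum over tables, so counts can
# be over- but never under-estimated.
import hashlib

class CountMinKmers:
    def __init__(self, num_tables=4, table_size=997):
        self.tables = [[0] * table_size for _ in range(num_tables)]

    def _slots(self, kmer):
        # derive one independent-ish slot per table from a salted hash
        for i, table in enumerate(self.tables):
            h = hashlib.sha256(f"{i}:{kmer}".encode()).digest()
            yield table, int.from_bytes(h[:8], "big") % len(table)

    def add(self, kmer):
        for table, j in self._slots(kmer):
            table[j] += 1

    def count(self, kmer):
        return min(table[j] for table, j in self._slots(kmer))

cms = CountMinKmers()
for km in ["ACGT", "ACGT", "TTTT"]:
    cms.add(km)
print(cms.count("ACGT"))  # 2 (could be higher if hash slots collide)
```

The memory footprint is fixed by the table sizes regardless of how many distinct k-mers are seen, which is what makes this style of structure practical for large sequencing datasets.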
Computational methods for single cell RNA and genome assembly resolution using genetic variation
Genetic variation and natural selection have driven the evolutionary history on this planet and are responsible for creating us and all other life as we know it. Over the past several decades, the genomic revolution has allowed us to assess population variation across humans and other species and use that to link genotypes with phenotypes and infer evolutionary histories. In this thesis, I explore computational methods for using genetic variation to demultiplex and disambiguate complex data.
In single cell RNAseq, batch effects, doublets, and ambient RNA are each sources of noise that impede our ability to infer the functional states of cells and compare them between experiments. One popular new experimental design promising to solve each of these while also reducing experimental costs is mixing multiple individuals' cells into a single experiment. In chapter 2, I present a method for clustering cells by genotype, calling doublets, and using the cross-genotype signal in singletons to estimate and remove ambient RNA. I compare this method to other existing methods, including one that requires a priori information about the genotypes and two that do not. I find that my method outperforms each of these methods across a wide range of data parameters and sample types.
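The genotype-clustering idea can be sketched minimally: assign each cell to the known genotype that best explains the alleles observed in its reads at SNP sites. All data below are invented, and the sketch omits the doublet calling and ambient-RNA estimation central to the actual method.

```python
# Toy genotype-based demultiplexing: score each candidate genotype by
# how many of the cell's observed SNP alleles it explains.
def assign(cell_alleles, genotypes):
    """cell_alleles: {site: allele}. genotypes: {name: {site: allele}}."""
    def score(g):
        return sum(1 for s, a in cell_alleles.items() if g.get(s) == a)
    return max(genotypes, key=lambda name: score(genotypes[name]))

genotypes = {"ind1": {1: "A", 2: "C", 3: "G"},
             "ind2": {1: "T", 2: "C", 3: "A"}}
print(assign({1: "T", 3: "A"}, genotypes))  # ind2
```

A real method replaces this count with a likelihood over read counts and error rates, which is also what exposes doublets: their allele profile fits a mixture of two genotypes better than any single one.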
In genome assembly, the recent higher throughput and lower cost of long read sequencing has revolutionized our ability to create reference-quality genomes and has revitalized the assembly community. Now, massive efforts are taking place in the Darwin Tree of Life project and the Earth Biogenome project to create reference genomes for all multicellular eukaryotic life. This will create a scientific resource for the next generation of biological science, will serve as a conservation of data that could otherwise be lost in this time of mass extinction, and will allow for a much broader understanding of evolution and the evolutionary history of life on Earth. While much progress has been made in data quality and assembly algorithms, some problems still exist. Until recently, the DNA input requirements for long read sequencing technologies made it impossible to sequence single individuals of these species with long reads. Also, high heterozygosity makes assembly more difficult due to the inherent ambiguity between heterozygous and paralogous sequence when confronted with inexact homology. One solution to the DNA input requirements would be to pool individuals, but this only increases the heterozygosity of the sample and reduces assembly quality. In chapter 3, we present the first high-quality assembly of a single mosquito using new library preparation methods with reduced DNA requirements. This reduces the number of haplotypes to two, improving the assembly quality. In chapter 4, we further address the problems brought on by heterozygosity in assembly. I present a suite of tools that use the phasing consistency of multiple heterozygous sequences as a signal for physical linkage, thus using genetic variation to our advantage rather than as a challenge to overcome. This tool creates phased, linked assemblies and phasing-aware scaffolding. Further, I provide a tool for phasing-aware scaffolding on existing assemblies.
This includes a novel haplotype phasing algorithm with some unique beneficial properties. It is robust to non-heterozygous variants in the input, can detect and correct those genotypes, and naturally extends to polyploid genomes.
Facilitated sequence counting and assembly by template mutagenesis
Presently, inferring the long-range structure of DNA templates is limited by short read lengths, and accurate template counts suffer from distortions occurring during PCR amplification. We explore the utility of introducing random mutations into identical or nearly identical templates to create distinguishable patterns that are inherited during subsequent copying. We simulate applications of this process under assumptions of error-free sequencing and perfect mapping, using cytosine deamination as a model for mutation. The simulations demonstrate that, within readily achievable conditions of nucleotide conversion and sequence coverage, we can accurately count the number of otherwise identical molecules as well as connect variants separated by long spans of identical sequence. We discuss many potential applications, such as transcript profiling, isoform assembly, haplotype phasing, and de novo genome assembly.
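The counting idea can be sketched with a small simulation, assuming error-free reads as in the paper's setup: random cytosine deamination gives each otherwise identical molecule a near-unique C→T pattern, so the number of distinct observed patterns approximates the molecule count. The conversion rate and sequence below are invented for illustration.

```python
# Simulate random C->T deamination on identical templates and count
# the distinct patterns that result.
import random

def deaminate(seq, rate, rng):
    """Convert each C to T independently with probability rate."""
    return "".join("T" if b == "C" and rng.random() < rate else b
                   for b in seq)

rng = random.Random(42)
template = "ACCGTTCCAGC" * 5        # 20 identical input molecules
molecules = [deaminate(template, rate=0.3, rng=rng) for _ in range(20)]
distinct = len(set(molecules))
print(distinct)  # close to 20 when enough C sites are converted
```

With many convertible C sites per template, the pattern space is vast, so collisions between molecules are rare; this is the same combinatorics that lets the mutated patterns also bridge long identical spans between distant variants.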