Search CORE

55 research outputs found

Rainbowfish: A Succinct Colored de Bruijn Graph Representation

Author: Almodaresi Fatemeh
Pandey Prashant
Patro Rob
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 17th International Workshop on Algorithms in Bioinformatics (WABI 2017)
Publication date: 15/05/2017
Field of study

Dagstuhl Research Online Publication Server

Perplexity: Evaluating Transcript Abundance Estimation in the Absence of Ground Truth

Author: Chan Skylar
Fan Jason
Patro Rob
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 21st International Workshop on Algorithms in Bioinformatics (WABI 2021)
Publication date: 01/01/2021
Field of study

There has been rapid development of probabilistic models and inference methods for transcript abundance estimation from RNA-seq data. These models aim to accurately estimate transcript-level abundances, to account for different biases in the measurement process, and even to assess uncertainty in resulting estimates that can be propagated to subsequent analyses. The assumed accuracy of the estimates inferred by such methods underpin gene expression based analysis routinely carried out in the lab. Although hyperparameter selection is known to affect the distributions of inferred abundances (e.g. producing smooth versus sparse estimates), strategies for performing model selection in experimental data have been addressed informally at best. Thus, we derive perplexity for evaluating abundance estimates on fragment sets directly. We adapt perplexity from the analogous metric used to evaluate language and topic models and extend the metric to carefully account for corner cases unique to RNA-seq. In experimental data, estimates with the best perplexity also best correlate with qPCR measurements. In simulated data, perplexity is well behaved and concordant with genome-wide measurements against ground truth and differential expression analysis. To our knowledge, our study is the first to make possible model selection for transcript abundance estimation on experimental data in the absence of ground truth

PubMed Central

Dagstuhl Research Online Publication Server

Digital Repository at the University of Maryland

Fast, Parallel, and Cache-Friendly Suffix Array Construction

Author: Dhulipala Laxman
Khan Jamshed
Molloy Erin
Patro Rob
Rubel Tobias
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)
Publication date: 01/01/2023
Field of study

String indexes such as the suffix array (SA) and the closely related longest common prefix (LCP) array are fundamental objects in bioinformatics and have a wide variety of applications. Despite their importance in practice, few scalable parallel algorithms for constructing these are known, and the existing algorithms can be highly non-trivial to implement and parallelize. In this paper we present CaPS-SA, a simple and scalable parallel algorithm for constructing these string indexes inspired by samplesort. Due to its design, CaPS-SA has excellent memory-locality and thus incurs fewer cache misses and achieves strong performance on modern multicore systems with deep cache hierarchies. We show that despite its simple design, CaPS-SA outperforms existing state-of-the-art parallel SA and LCP-array construction algorithms on modern hardware. Finally, motivated by applications in modern aligners where the query strings have bounded lengths, we introduce the notion of a bounded-context SA and show that CaPS-SA can easily be extended to exploit this structure to obtain further speedups

Dagstuhl Research Online Publication Server

TransRate: reference-free quality assessment of de novo transcriptome assemblies.

Author: Boursnell Chris
Hibberd Julian M
Kelly Steven
Patro Rob
Smith-Unna Richard
Publication venue: Genome Res
Publication date: 01/06/2016
Field of study

TransRate is a tool for reference-free quality assessment of de novo transcriptome assemblies. Using only the sequenced reads and the assembly as input, we show that multiple common artifacts of de novo transcriptome assembly can be readily detected. These include chimeras, structural errors, incomplete assembly, and base errors. TransRate evaluates these errors to produce a diagnostic quality score for each contig, and these contig scores are integrated to evaluate whole assemblies. Thus, TransRate can be used for de novo assembly filtering and optimization as well as comparison of assemblies generated using different methods from the same input reads. Applying the method to a data set of 155 published de novo transcriptome assemblies, we deconstruct the contribution that assembly method, read length, read quantity, and read quality make to the accuracy of de novo transcriptome assemblies and reveal that variance in the quality of the input data explains 43% of the variance in the quality of published de novo transcriptome assemblies. Because TransRate is reference-free, it is suitable for assessment of assemblies of all types of RNA, including assemblies of long noncoding RNA, rRNA, mRNA, and mixed RNA samples

Crossref

PubMed Central

Oxford University Research Archive

Apollo (Cambridge)

Fulgor: A Fast and Compact {k-mer} Index for Large-Scale Matching and Color Queries

Author: Fan Jason
Khan Jamshed
Patro Rob
Pibiri Giulio Ermanno
Singh Noor Pratap
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)
Publication date: 01/01/2023
Field of study

The problem of sequence identification or matching - determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence - is relevant for many important tasks in Computational Biology, such as metagenomics and pan-genome analysis. Due to the complex nature of such analyses and the large scale of the reference collections a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe how recent advancements in associative, order-preserving, k-mer dictionaries can be combined with a compressed inverted index to implement a fast and compact colored de Bruijn graph data structure. This index takes full advantage of the fact that unitigs in the colored de Bruijn graph are monochromatic (all k-mers in a unitig have the same set of references of origin, or "color"), leveraging the order-preserving property of its dictionary. In fact, k-mers are kept in unitig order by the dictionary, thereby allowing for the encoding of the map from k-mers to their inverted lists in as little as 1+o(1) bits per unitig. Hence, one inverted list per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for inverted lists, the index achieves very small space. We implement these methods in a tool called Fulgor. Compared to Themisto, the prior state of the art, Fulgor indexes a heterogeneous collection of 30,691 bacterial genomes in 3.8× less space, a collection of 150,000 Salmonella enterica genomes in approximately 2× less space, is at least twice as fast for color queries, and is 2-6 × faster to construct

Dagstuhl Research Online Publication Server

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Swimming downstream: statistical analysis of differential transcript usage following Salmon quantification [version 2; referees: 1 approved, 2 approved with reservations]

Author: Charlotte Soneson
Michael I. Love
Rob Patro
Publication venue: 'F1000 Research Ltd'
Publication date: 01/09/2018
Field of study

Detection of differential transcript usage (DTU) from RNA-seq data is an important bioinformatic analysis that complements differential gene expression analysis. Here we present a simple workflow using a set of existing R/Bioconductor packages for analysis of DTU. We show how these packages can be used downstream of RNA-seq quantification using the Salmon software package. The entire pipeline is fast, benefiting from inference steps by Salmon to quantify expression at the transcript level. The workflow includes live, runnable code chunks for analysis using DRIMSeq and DEXSeq, as well as for performing two-stage testing of DTU using the stageR package, a statistical framework to screen at the gene level and then confirm which transcripts within the significant genes show evidence of DTU. We evaluate these packages and other related packages on a simulated dataset with parameters estimated from real data

Directory of Open Access Journals

Identification of alternative topological domains in chromatin

Author: Duggal Geet
Filippova Darya
Kingsford Carl
Patro Rob
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2014
Field of study

Chromosome conformation capture experiments have led to the discovery of dense, contiguous, megabase-sized topological domains that are similar across cell types and conserved across species. These domains are strongly correlated with a number of chromatin markers and have since been included in a number of analyses. However, functionally-relevant domains may exist at multiple length scales. We introduce a new and efficient algorithm that is able to capture persistent domains across various resolutions by adjusting a single scale parameter. The ensemble of domains we identify allows us to quantify the degree to which the domain structure is hierarchical as opposed to overlapping, and our analysis reveals a pronounced hierarchical structure in which larger stable domains tend to completely contain smaller domains. The identified novel domains are substantially different from domains reported previously and are highly enriched for insulating factor CTCF binding and histone marks at the boundaries

Crossref

Springer - Publisher Connector

PubMed Central

D-Scholarship@Pitt

Alevin efficiently estimates accurate gene abundances from dscRNA-seq data.

Author: Malik Laraib
Patro Rob
Smith Tom
Srivastava Avi
Sudbery Ian
Publication venue: Genome Biol
Publication date: 01/03/2019
Field of study

We introduce alevin, a fast end-to-end pipeline to process droplet-based single-cell RNA sequencing data, performing cell barcode detection, read mapping, unique molecular identifier (UMI) deduplication, gene count estimation, and cell barcode whitelisting. Alevin's approach to UMI deduplication considers transcript-level constraints on the molecules from which UMIs may have arisen and accounts for both gene-unique reads and reads that multimap between genes. This addresses the inherent bias in existing tools which discard gene-ambiguous reads and improves the accuracy of gene abundance estimates. Alevin is considerably faster, typically eight times, than existing gene quantification approaches, while also using less memory

Directory of Open Access Journals

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Apollo (Cambridge)

White Rose Research Online

FigShare

READABILITY ASSESSMENT ON THE TRANSLATION OF USER MANUAL OF SAMSUNG GALAXY TAB 2 7.0 AND THE FACTORS INFLUENCING IT

Author: Carl Kingsford (5455571)
Rob Patro (5455577)
Stephen M. Mount (5455622)
Publication venue
Publication date: 01/05/2014
Field of study

We introduce Sailfish, a computational method for quantifying the abundance of previously annotated RNA isoforms from RNA-seq data. Because Sailfish entirely avoids mapping reads, a time-consuming step in all current methods, it provides quantification estimates much faster than do existing approaches (typically 20 times faster) without loss of accuracy. By facilitating frequent reanalysis of data and reducing the need to optimize parameters, Sailfish exemplifies the potential of lightweight algorithms for efficiently processing sequencing reads.</p

Sebelas Maret Institutional Repository

DifferentialRegulation: a Bayesian hierarchical approach to identify differentially regulated genes

Author: Avalos-Pacheco Alejandra
Cai Peiying
He Dongze
Meili Joël
Patro Rob
Robinson Mark D
Sarkar Hirak
Soneson Charlotte
Tiberi Simone
Publication venue
Publication date: 17/08/2023
Field of study

MOTIVATION: Although transcriptomics data is typically used to analyse mature spliced mRNA, recent attention has focused on jointly investigating spliced and unspliced (or precursor-) mRNA, which can be used to study gene regulation and changes in gene expression production. Nonetheless, most methods for spliced/unspliced inference (such as RNA velocity tools) focus on individual samples, and rarely allow comparisons between groups of samples (e.g., healthy vs . diseased). Furthermore, this kind of inference is challenging, because spliced and unspliced mRNA abundance is characterized by a high degree of quantification uncertainty, due to the prevalence of multi-mapping reads, i.e., reads compatible with multiple transcripts (or genes), and/or with both their spliced and unspliced versions. RESULTS: Here, we present DifferentialRegulation , a Bayesian hierarchical method to discover changes between experimental conditions with respect to the relative abundance of unspliced mRNA (over the total mRNA). We model the quantification uncertainty via a latent variable approach, where reads are allocated to their gene/transcript of origin, and to the respective splice version. We designed several benchmarks where our approach shows good performance, in terms of sensitivity and error control, versus state-of-the-art competitors. Importantly, our tool is flexible, and works with both bulk and single-cell RNA-sequencing data. AVAILABILITY AND IMPLEMENTATION: DifferentialRegulation is distributed as a Bioconductor R package

ZORA