72 research outputs found

Keep Me Around: Intron Retention Detection and Analysis

Author: Conboy John G.
Pachter Lior
Pimentel Harold
Publication date: 02/10/2015

We present a tool, keep me around (kma), a suite of Python scripts and an R package that finds retained introns in RNA-Seq experiments and incorporates biological replicates to reduce the number of false positives when detecting retention events. kma uses the results of existing quantification tools that probabilistically assign multi-mapping reads, thus interfacing easily with transcript quantification pipelines. The data is represented in a convenient, database-style format that allows for easy aggregation across introns, genes, samples, and conditions to support further exploratory analysis

arXiv.org e-Print Archive

Caltech Authors

Near-optimal RNA-Seq quantification

Author: Bray Nicolas
Melsted Páll
Pachter Lior
Pimentel Harold
Publication date: 11/05/2015

We present a novel approach to RNA-Seq quantification that is near optimal in speed and accuracy. Software implementing the approach, called kallisto, can be used to analyze 30 million unaligned paired-end RNA-Seq reads in less than 5 minutes on a standard laptop computer while providing results as accurate as those of the best existing tools. This removes a major computational bottleneck in RNA-Seq analysis. Comment: Added some results (paralog analysis, allele-specific expression analysis, alignment comparison, accuracy analysis with TPMs); switched bootstrap analysis to human sample from SEQC-MAQCIII; provided link to a snakefile that allows for reproducibility of all results and figures in the paper

arXiv.org e-Print Archive

Caltech Authors

Zika infection of neural progenitor cells perturbs transcription in neurodevelopmental pathways

Author: Pachter Lior
Pimentel Harold
Yi Lynn
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2017

Background: A recent study of the gene expression patterns of Zika virus (ZIKV) infected human neural progenitor cells (hNPCs) revealed transcriptional dysregulation and identified cell cycle-related pathways that are affected by infection. However, deeper exploration of the information present in the RNA-Seq data can be used to further elucidate the manner in which Zika infection of hNPCs affects the transcriptome, refining pathway predictions and revealing isoform-specific dynamics. Methodology/Principal findings: We analyzed data published by Tang et al. using state-of-the-art tools for transcriptome analysis. By accounting for the experimental design and estimating technical and inferential variance, we were able to pinpoint pathways affected by Zika infection that highlight Zika’s neural tropism. The examination of differential genes reveals cases of isoform divergence. Conclusions: Transcriptome analysis of Zika-infected hNPCs has the potential to identify the molecular signatures of Zika-infected neural cells. These signatures may be useful for diagnostics and for the resolution of infection pathways that can be used to harvest specific targets for further study

Directory of Open Access Journals

Caltech Authors

FigShare

dotears: Scalable, consistent DAG estimation using observational and interventional data

Author: Pimentel Harold
Rao Jingyou
Sankararaman Sriram
Xue Albert
Publication date: 30/05/2023

Learning causal directed acyclic graphs (DAGs) from data is complicated by a lack of identifiability and the combinatorial space of solutions. Recent work has improved tractability of score-based structure learning of DAGs in observational data, but is sensitive to the structure of the exogenous error variances. On the other hand, learning exogenous variance structure from observational data requires prior knowledge of structure. Motivated by new biological technologies that link highly parallel gene interventions to a high-dimensional observation, we present dotears [doo-tairs], a scalable structure learning framework which leverages observational and interventional data to infer a single causal structure through continuous optimization. dotears exploits predictable structural consequences of interventions to directly estimate the exogenous error structure, bypassing the circular estimation problem. We extend previous work to show, both empirically and analytically, that the inferences of previous methods are driven by exogenous variance structure, but dotears is robust to exogenous variance structure. Across varied simulations of large random DAGs, dotears outperforms state-of-the-art methods in structure estimation. Finally, we show that dotears is a provably consistent estimator of the true DAG under mild assumptions

arXiv.org e-Print Archive

Compositional Data Analysis is necessary for simulating and analyzing RNA-Seq data

Author: McGee Warren A.
Pachter Lior
Pimentel Harold
Wu Jane Y.
Publication date: 02/03/2019

*Seq techniques (e.g. RNA-Seq) generate compositional datasets, i.e. the number of fragments sequenced is not proportional to the total RNA present. Thus, datasets carry only relative information, even though absolute RNA copy numbers are often of interest. Current normalization methods assume most features are not changing, which can lead to misleading conclusions when there are large shifts. However, there are few real datasets and no simulation protocols currently available that can directly benchmark methods when such large shifts occur. We present absSimSeq, an R package that simulates compositional data in the form of RNA-Seq reads. We tested several tools used for RNA-Seq differential analysis: sleuth, DESeq2, edgeR, limma, and ALDEx2 (which explicitly takes a compositional approach). For these tools, we compared their standard normalization to either “compositional normalization”, which uses log-ratios to anchor the data on a set of negative control features, or RUVSeq, another tool that directly uses negative control features. We show that common normalizations result in reduced performance with current methods when there is a large change in the total RNA per cell. Performance improves when spike-ins are included and used by a compositional approach, even if the spike-ins have substantial variation. In contrast, RUVSeq, which normalizes count data rather than compositional data, has poor performance. Further, we show that previous criticisms of spike-ins did not take into account the compositional nature of the data. We conclude that absSimSeq can generate more representative datasets for testing performance, and that spike-ins should be more broadly used in a compositional manner to minimize misleading conclusions from differential analyses

Caltech Authors
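
The "compositional normalization" named in the abstract above (log-ratios anchored on negative control features) can be illustrated with a minimal sketch. This is not the absSimSeq or sleuth implementation; the function name `log_ratio_normalize`, the `pseudocount` parameter, and the toy counts are all hypothetical, and the anchor is assumed to be the geometric mean of the control features in each sample.

```python
import numpy as np

def log_ratio_normalize(abundances, control_rows, pseudocount=0.5):
    """Express each feature as a log-ratio to the negative-control features.

    abundances: array of shape (features, samples).
    control_rows: row indices of the assumed negative controls (e.g. spike-ins).
    pseudocount: hypothetical default added before taking logs to avoid log(0).
    """
    logged = np.log(np.asarray(abundances, dtype=float) + pseudocount)
    # Per-sample anchor: mean log abundance of the controls,
    # i.e. the log of their geometric mean.
    anchor = logged[control_rows, :].mean(axis=0)
    return logged - anchor


# Toy data (made up): sample 2 is sample 1 sequenced at twice the depth.
counts = np.array([[10.0, 20.0],
                   [100.0, 200.0],
                   [50.0, 100.0]])  # last row plays the role of a spike-in
normalized = log_ratio_normalize(counts, control_rows=[2], pseudocount=0.0)
```

Because the two toy samples differ only in sequencing depth, their log-ratios to the control row coincide, which is the sense in which the data are anchored on the controls rather than on total counts.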

The Lair: a resource for exploratory analysis of published RNA-Seq data

Author: Bray Nicolas
Melsted Páll
Pachter Lior
Pimentel Harold
Sturmfels Pascal
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 31/05/2016

Increased emphasis on reproducibility of published research in the last few years has led to the large-scale archiving of sequencing data. While this data can, in theory, be used to reproduce results in papers, it is difficult to use in practice. We introduce a series of tools for processing and analyzing RNA-Seq data in the Sequence Read Archive that together have allowed us to build an easily extendable resource for analysis of data underlying published papers. Our system makes the exploration of data easily accessible and usable without technical expertise. Our database and associated tools can be accessed at The Lair: http://pachterlab.github.io/lair. National Institutes of Health grants R01 HG006129, R01 DK094699 and R01 HG008164. Peer Reviewed

Crossref

Opin visindi

Springer - Publisher Connector

PubMed Central

TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions

Author: Kelley Ryan
Kim Daehwan
Pertea Geo
Pimentel Harold
Salzberg Steven L
Trapnell Cole
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013

TopHat is a popular spliced aligner for RNA-sequence (RNA-seq) experiments. In this paper, we describe TopHat2, which incorporates many significant enhancements to TopHat. TopHat2 can align reads of various lengths produced by the latest sequencing technologies, while allowing for variable-length indels with respect to the reference genome. In addition to de novo spliced alignment, TopHat2 can align reads across fusion breaks, which can occur after genomic translocations. TopHat2 combines the ability to identify novel splice sites with direct mapping to known transcripts, producing sensitive and accurate alignments, even for highly repetitive genomes or in the presence of pseudogenes. TopHat2 is available at http://ccb.jhu.edu/software/tophat

Crossref

Harvard University - DASH

Springer - Publisher Connector

PubMed Central

eScholarship - University of California

Digital Repository at the University of Maryland

Fusion detection and quantification by pseudoalignment

Author: Bray Nicolas
Hateley Shannon
Joseph Isaac Charles
Melsted Páll
Pachter Lior
Pimentel Harold
Publication date: 20/07/2017

RNA sequencing in cancer cells is a powerful technique to detect chromosomal rearrangements, allowing for de novo discovery of actively expressed fusion genes. Here we focus on the problem of detecting gene fusions from raw sequencing data, assembling the reads to define fusion transcripts and their associated breakpoints, and quantifying their abundances. Building on the pseudoalignment idea that simplifies and accelerates transcript quantification, we introduce a novel approach to fusion detection based on inspecting paired reads that cannot be pseudoaligned due to conflicting matches. The method and software, called pizzly, filters false positives, assembles new transcripts from the fusion reads, and reports candidate fusions. With pizzly, fusion detection from raw RNA-Seq reads can be performed in a matter of minutes, making the program suitable for the analysis of large cancer gene expression databases and for clinical use. pizzly is available at https://github.com/pmelsted/pizzly

Caltech Authors