Keep Me Around: Intron Retention Detection and Analysis
We present a tool, keep me around (kma), a suite of python scripts and an R
package that finds retained introns in RNA-Seq experiments and incorporates
biological replicates to reduce the number of false positives when detecting
retention events. kma uses the results of existing quantification tools that
probabilistically assign multi-mapping reads, thus interfacing easily with
transcript quantification pipelines. The data is represented in a convenient,
database style format that allows for easy aggregation across introns, genes,
samples, and conditions to allow for further exploratory analysis.
Near-optimal RNA-Seq quantification
We present a novel approach to RNA-Seq quantification that is near optimal in
speed and accuracy. Software implementing the approach, called kallisto, can be
used to analyze 30 million unaligned paired-end RNA-Seq reads in less than 5
minutes on a standard laptop computer while providing results as accurate as
those of the best existing tools. This removes a major computational bottleneck
in RNA-Seq analysis.
Comment: Added some results (paralog analysis, allele-specific expression
analysis, alignment comparison, accuracy analysis with TPMs); switched
bootstrap analysis to human sample from SEQC-MAQCIII; provided link to a
snakefile that allows for reproducibility of all results and figures in the
paper.
Zika infection of neural progenitor cells perturbs transcription in neurodevelopmental pathways
Background: A recent study of the gene expression patterns of Zika virus (ZIKV)-infected human neural progenitor cells (hNPCs) revealed transcriptional dysregulation and identified cell cycle-related pathways that are affected by infection. However, deeper exploration of the information present in the RNA-Seq data can further elucidate the manner in which Zika infection of hNPCs affects the transcriptome, refining pathway predictions and revealing isoform-specific dynamics.
Methodology/Principal findings: We analyzed data published by Tang et al. using state-of-the-art tools for transcriptome analysis. By accounting for the experimental design and estimating technical and inferential variance, we were able to pinpoint pathways affected by Zika infection that highlight Zika’s neural tropism. Examination of the differential genes reveals cases of isoform divergence.
Conclusions: Transcriptome analysis of Zika-infected hNPCs has the potential to identify the molecular signatures of Zika-infected neural cells. These signatures may be useful for diagnostics and for the resolution of infection pathways that can be used to harvest specific targets for further study.
dotears: Scalable, consistent DAG estimation using observational and interventional data
Learning causal directed acyclic graphs (DAGs) from data is complicated by a
lack of identifiability and the combinatorial space of solutions. Recent work
has improved tractability of score-based structure learning of DAGs in
observational data, but is sensitive to the structure of the exogenous error
variances. On the other hand, learning exogenous variance structure from
observational data requires prior knowledge of structure. Motivated by new
biological technologies that link highly parallel gene interventions to a
high-dimensional observation, we present dotears [doo-tairs], a
scalable structure learning framework which leverages observational and
interventional data to infer a single causal structure through continuous
optimization. dotears exploits predictable structural consequences
of interventions to directly estimate the exogenous error structure, bypassing
the circular estimation problem. We extend previous work to show, both
empirically and analytically, that the inferences of previous methods are
driven by exogenous variance structure, but dotears is robust to
exogenous variance structure. Across varied simulations of large random DAGs,
dotears outperforms state-of-the-art methods in structure
estimation. Finally, we show that dotears is a provably consistent
estimator of the true DAG under mild assumptions.
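The idea that an intervention has a predictable structural consequence can be illustrated with a small simulation. The sketch below is not the dotears estimator. It assumes a linear structural equation model X = XW + E and a hard intervention that cuts every edge into the targeted node while leaving that node's noise variance unchanged; the abstract does not confirm that dotears models interventions exactly this way. The function name `simulate`, the example graph, and the noise scales are hypothetical.

```python
import numpy as np

def simulate(W, sigma, n, rng, target=None):
    """Draw n samples from the linear SEM X = X W + E.

    W[i, j] is the weight of edge i -> j, so the incoming edges of node j
    sit in column j. If `target` is given, a hard intervention removes
    every edge into that node (an assumption made for this sketch).
    """
    W = W.copy()
    if target is not None:
        W[:, target] = 0.0
    d = W.shape[0]
    # Exogenous noise: column j has standard deviation sigma[j].
    E = rng.normal(0.0, sigma, size=(n, d))
    # X (I - W) = E, hence X = E (I - W)^{-1}.
    return E @ np.linalg.inv(np.eye(d) - W)

rng = np.random.default_rng(0)

# Hypothetical chain 0 -> 1 -> 2 with unequal exogenous variances.
W = np.array([[0.0, 2.0, 0.0],
              [0.0, 0.0, -1.5],
              [0.0, 0.0, 0.0]])
sigma = np.array([1.0, 0.5, 2.0])
n = 20_000

# Observational data: the variance of node 1 mixes its own noise with
# the noise inherited from its parent, node 0.
obs_var = simulate(W, sigma, n, rng).var(axis=0)

# Interventional data: once its parents are cut off, the targeted node
# equals its own noise, so its sample variance estimates sigma[k] ** 2.
est_var = np.array([
    simulate(W, sigma, n, rng, target=k)[:, k].var() for k in range(3)
])

print("observational variances:", obs_var.round(2))
print("estimated exogenous variances:", est_var.round(2))
print("true exogenous variances:", sigma ** 2)
```

Under these assumptions, each interventional regime exposes the exogenous variance of its target without knowing the graph, which is the kind of circularity the abstract says interventional data helps to bypass.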
Compositional Data Analysis is necessary for simulating and analyzing RNA-Seq data
*Seq techniques (e.g. RNA-Seq) generate compositional datasets, i.e. the number of fragments sequenced is not proportional to the total RNA present. Thus, datasets carry only relative information, even though absolute RNA copy numbers are often of interest. Current normalization methods assume most features are not changing, which can lead to misleading conclusions when there are large shifts. However, there are few real datasets and no simulation protocols currently available that can directly benchmark methods when such large shifts occur. We present absSimSeq, an R package that simulates compositional data in the form of RNA-Seq reads. We tested several tools used for RNA-Seq differential analysis: sleuth, DESeq2, edgeR, limma, and ALDEx2 (which explicitly takes a compositional approach). For these tools, we compared their standard normalization to either “compositional normalization”, which uses log-ratios to anchor the data on a set of negative control features, or RUVSeq, another tool that directly uses negative control features. We show that common normalizations result in reduced performance with current methods when there is a large change in the total RNA per cell. Performance improves when spike-ins are included and used by a compositional approach, even if the spike-ins have substantial variation. In contrast, RUVSeq, which normalizes count data rather than compositional data, has poor performance. Further, we show that previous criticisms of spike-ins did not take into account the compositional nature of the data. We conclude that absSimSeq can generate more representative datasets for testing performance, and that spike-ins should be more broadly used in a compositional manner to minimize misleading conclusions from differential analyses.
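A toy example may help show why anchoring on negative controls matters. The sketch below is not absSimSeq and not the normalization code of any tool named above. It assumes the reference is the geometric mean of the control features, and the function name `compositional_normalize`, the copy numbers, and the sequencing depth are all hypothetical.

```python
import numpy as np

def compositional_normalize(values, control_idx, pseudocount=0.5):
    """Log-ratio of each feature to the geometric mean of control features.

    values: array of shape (n_features, n_samples).
    control_idx: row indices of the negative control features (e.g. spike-ins).
    """
    logs = np.log(np.asarray(values, dtype=float) + pseudocount)
    # The mean of logs is the log of the geometric mean, computed per sample.
    reference = logs[control_idx, :].mean(axis=0)
    return logs - reference

# Hypothetical absolute copy numbers: three genes that all double between
# sample 0 and sample 1, plus one spike-in (last row) that stays constant.
absolute = np.array([[100.0, 200.0],
                     [50.0, 100.0],
                     [10.0, 20.0],
                     [40.0, 40.0]])

# Sequencing at a fixed depth only reveals proportions.
depth = 1_000_000
observed = absolute / absolute.sum(axis=0) * depth

# Comparing proportions directly understates the doubling and makes the
# unchanged spike-in look as if it had decreased.
naive_lfc = np.log(observed[:, 1] / observed[:, 0])

# Anchoring on the spike-in recovers the absolute log fold changes.
# pseudocount=0.0 keeps the toy arithmetic exact (there are no zeros here).
anchored = compositional_normalize(observed, control_idx=[3], pseudocount=0.0)
anchored_lfc = anchored[:, 1] - anchored[:, 0]

print("naive log fold changes:   ", naive_lfc.round(3))
print("anchored log fold changes:", anchored_lfc.round(3))
```

In this toy setting the anchored log fold change equals log 2 for every gene and 0 for the spike-in, while the naive comparison of proportions does not. Real data would add sampling noise and variation in the spike-ins themselves.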
The Lair: a resource for exploratory analysis of published RNA-Seq data
Increased emphasis on reproducibility of published research in the last few years has led to the large-scale archiving of sequencing data. While this data can, in theory, be used to reproduce results in papers, it is difficult to use in practice. We introduce a series of tools for processing and analyzing RNA-Seq data in the Sequence Read Archive that together have allowed us to build an easily extendable resource for analysis of data underlying published papers. Our system makes the exploration of data easily accessible and usable without technical expertise. Our database and associated tools can be accessed at The Lair: http://pachterlab.github.io/lair.
National Institutes of Health grants R01 HG006129, R01 DK094699 and R01 HG008164. Peer Reviewed
TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions
TopHat is a popular spliced aligner for RNA-sequence (RNA-seq) experiments. In this paper, we describe TopHat2, which incorporates many significant enhancements to TopHat. TopHat2 can align reads of various lengths produced by the latest sequencing technologies, while allowing for variable-length indels with respect to the reference genome. In addition to de novo spliced alignment, TopHat2 can align reads across fusion breaks, which can occur after genomic translocations. TopHat2 combines the ability to identify novel splice sites with direct mapping to known transcripts, producing sensitive and accurate alignments, even for highly repetitive genomes or in the presence of pseudogenes. TopHat2 is available at http://ccb.jhu.edu/software/tophat
Fusion detection and quantification by pseudoalignment
RNA sequencing in cancer cells is a powerful technique to detect chromosomal rearrangements, allowing for de novo discovery of actively expressed fusion genes. Here we focus on the problem of detecting gene fusions from raw sequencing data, assembling the reads to define fusion transcripts and their associated breakpoints, and quantifying their abundances. Building on the pseudoalignment idea that simplifies and accelerates transcript quantification, we introduce a novel approach to fusion detection based on inspecting paired reads that cannot be pseudoaligned due to conflicting matches. The method and software, called pizzly, filters false positives, assembles new transcripts from the fusion reads, and reports candidate fusions. With pizzly, fusion detection from raw RNA-Seq reads can be performed in a matter of minutes, making the program suitable for the analysis of large cancer gene expression databases and for clinical use. pizzly is available at https://github.com/pmelsted/pizzly
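The notion of a read pair that cannot be pseudoaligned because of conflicting matches can be sketched with sets. The code below is not pizzly's implementation and omits its filtering and assembly steps. It assumes each mate has already been assigned a set of compatible transcripts, and the transcript-to-gene map `TX2GENE` and the function `classify_pair` are hypothetical names invented for this illustration.

```python
from typing import FrozenSet, Optional, Tuple

# Hypothetical transcript-to-gene annotation.
TX2GENE = {"T1": "GENE_A", "T2": "GENE_A", "T3": "GENE_B"}

Result = Tuple[str, Optional[Tuple[str, str]]]

def classify_pair(left: FrozenSet[str], right: FrozenSet[str]) -> Result:
    """Classify a read pair from the transcripts compatible with each mate.

    A pair is consistent when at least one transcript is compatible with
    both mates. When no transcript is shared and the mates point to two
    different single genes, the pair is flagged as a fusion candidate.
    """
    if not left or not right:
        return ("unmapped", None)
    if left & right:
        return ("pseudoaligned", None)
    genes_left = {TX2GENE[t] for t in left}
    genes_right = {TX2GENE[t] for t in right}
    if (len(genes_left) == 1 and len(genes_right) == 1
            and genes_left.isdisjoint(genes_right)):
        return ("fusion_candidate",
                (next(iter(genes_left)), next(iter(genes_right))))
    # No shared transcript, but no clear two-gene signal either.
    return ("conflicting", None)

pairs = {
    "pair_1": (frozenset({"T1", "T2"}), frozenset({"T2"})),
    "pair_2": (frozenset({"T1"}), frozenset({"T3"})),
    "pair_3": (frozenset({"T1"}), frozenset({"T2"})),
}
for name, (left, right) in pairs.items():
    print(name, classify_pair(left, right))
```

Only `pair_2` is reported as a candidate here, because its mates are compatible with transcripts of two different genes and share none. A real pipeline would still need to separate such candidates from mapping artifacts.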