Algorithms for Transcriptome Quantification and Reconstruction from RNA-Seq Data
Massively parallel whole transcriptome sequencing, with its ability to generate full transcriptome data at the single transcript level, provides a powerful tool with multiple interrelated applications, including transcriptome reconstruction and gene/isoform expression estimation, also known as transcriptome quantification. As a result, whole transcriptome sequencing has become the technology of choice for performing transcriptome analysis, rapidly replacing array-based technologies. The most commonly used transcriptome sequencing protocol, referred to as RNA-Seq, generates short (single or paired) sequencing tags from the ends of randomly generated cDNA fragments. The RNA-Seq protocol reduces the sequencing cost and significantly increases data throughput, but reconstructing full-length transcripts and accurately estimating their abundances across all cell types from such data is computationally challenging.
We focus on two main problems in transcriptome data analysis, namely, transcriptome reconstruction and quantification. Transcriptome reconstruction, also referred to as novel isoform discovery, is the problem of reconstructing the transcript sequences from the sequencing data. Reconstruction can be done de novo or it can be assisted by existing genome and transcriptome annotations. Transcriptome quantification refers to the problem of estimating the expression level of each transcript. We present genome-guided and annotation-guided transcriptome reconstruction methods as well as methods for transcript and gene expression level estimation. Empirical results on both synthetic and real RNA-Seq datasets show that the proposed methods improve transcriptome quantification and reconstruction accuracy compared to previous methods.
Transcriptional profiling of single fiber cells in a transgenic paradigm of an inherited childhood cataract reveals absence of molecular heterogeneity.
Our recent single-cell transcriptomic analysis has demonstrated that heterogeneous transcriptional activity attends the molecular transition from nascent to terminally differentiated fiber cells in the developing mouse lens. To understand the role of transcriptional heterogeneity in terminal differentiation and the functional phenotype (transparency) of this tissue, here we present a single-cell analysis of the developing lens in a transgenic paradigm of an inherited pathology known as the lamellar cataract. Cataracts hinder transmission of light into the eye. Lamellar cataract is the most prevalent bilateral childhood cataract. In this disease of early infancy, the opacities initially remain confined to a few fiber cells, thus presenting an opportunity to investigate early molecular events that lead to cataractogenesis. We used a previously established paradigm that faithfully recapitulates this disease in transgenic mice. About 500 single fiber cells, manually isolated from a 2-day-old transgenic lens, were interrogated individually for the expression of all 17 known crystallins and 78 other relevant genes using a Biomark HD (Fluidigm). We find that fiber cells from spatially and developmentally discrete regions of the transgenic (cataract) lens show a remarkable absence of heterogeneity in gene expression. Importantly, the molecular variability of cortical fiber cells, the hallmark of the WT lens, is absent in the transgenic cataract, suggesting absence of specific cell-type(s). Interestingly, we find a repetitive pattern of gene activity in progressive states of differentiation in the transgenic lens. This molecular dysfunction portends pathology much before the physical manifestations of the disease.
TRIP: A method for novel transcript reconstruction from paired-end RNA-seq reads
Preliminary experimental results on synthetic datasets generated with various sequencing parameters and distribution assumptions show that TRIP has increased transcriptome reconstruction accuracy compared to previous methods that ignore fragment length distribution information
The Pioneer Advantage: Filling the blank spots on the map of genome diversity in Europe
Documenting genome diversity is important for the local biomedical communities and instrumental in developing precision and personalized medicine. Currently, tens of thousands of whole-genome sequences from Europe are publicly available, but most of these represent populations of developed countries of Europe. The uneven distribution of the available data is further impaired by the lack of data sharing. Recent whole-genome studies in Eastern Europe, one in Ukraine and one in Russia, demonstrated that local genome diversity and population structure from Eastern Europe historically had not been fully represented. An unexpected wealth of genomic variation uncovered in these studies was not so much a consequence of high variation within their population, but rather due to the “pioneer advantage.” We discovered more variants because we were the first to prospect in the Eastern European genome pool. This simple comparison underscores the importance of removing the remaining geographic genome deserts from the rest of the world map of the human genome diversity
Accurate Viral Population Assembly From Ultra-Deep Sequencing Data
Motivation: Next-generation sequencing technologies sequence viruses with ultra-deep coverage, thus promising to revolutionize our understanding of the underlying diversity of viral populations. While the sequencing coverage is high enough that even rare viral variants are sequenced, the presence of sequencing errors makes it difficult to distinguish between rare variants and sequencing errors. Results: In this article, we present a method to overcome the limitations of sequencing technologies and assemble a diverse viral population that allows for the detection of previously undiscovered rare variants. The proposed method consists of a high-fidelity sequencing protocol and an accurate viral population assembly method, referred to as Viral Genome Assembler (VGA). The proposed protocol is able to eliminate sequencing errors by using individual barcodes attached to the sequencing fragments. Highly accurate data in combination with deep coverage allow VGA to assemble rare variants. VGA uses an expectation–maximization algorithm to estimate abundances of the assembled viral variants in the population. Results on both synthetic and real datasets show that our method is able to accurately assemble an HIV viral population and detect rare variants previously undetectable due to sequencing errors. VGA outperforms state-of-the-art methods for genome-wide viral assembly. Furthermore, our method is the first viral assembly method that scales to millions of sequencing reads
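The abstract above states that VGA uses an expectation–maximization algorithm to estimate variant abundances but gives no further detail. The following is a minimal, generic sketch of EM abundance estimation from read-to-variant compatibility, not VGA's actual implementation. The function name `em_abundances`, the compatibility-list input format, and the simplifications (no variant-length normalization, no sequencing-error model) are all assumptions made for illustration.

```python
def em_abundances(compat, n_variants, max_iter=500, tol=1e-10):
    """Estimate relative abundances of variants with a basic EM loop.

    compat[r] lists the indices of the variants that read r is compatible
    with (hypothetical input format). Returns abundances summing to 1.
    """
    reads = [variants for variants in compat if variants]  # skip unassignable reads
    if not reads or n_variants <= 0:
        raise ValueError("need at least one assignable read and one variant")

    theta = [1.0 / n_variants] * n_variants  # start from uniform abundances
    for _ in range(max_iter):
        # E-step: split each read across its compatible variants in
        # proportion to the current abundance estimates.
        counts = [0.0] * n_variants
        for variants in reads:
            total = sum(theta[v] for v in variants)
            for v in variants:
                counts[v] += theta[v] / total
        # M-step: renormalize the expected read counts.
        n_assigned = sum(counts)
        new_theta = [c / n_assigned for c in counts]
        step = max(abs(a - b) for a, b in zip(theta, new_theta))
        theta = new_theta
        if step < tol:
            break
    return theta


# Toy input: 6 reads unique to variant 0, 2 unique to variant 1,
# and 2 reads compatible with both.
toy_reads = [[0]] * 6 + [[1]] * 2 + [[0, 1]] * 2
abundances = em_abundances(toy_reads, n_variants=2)
print(abundances)
```

On this toy input the ambiguous reads end up split in the same 3:1 ratio as the uniquely assigned ones, so the estimates approach 0.75 and 0.25.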
MetaTrinity: Enabling Fast Metagenomic Classification via Seed Counting and Edit Distance Approximation
Metagenomics, the study of genome sequences of diverse organisms cohabiting
in a shared environment, has experienced significant advancements across
various medical and biological fields. Metagenomic analysis is crucial, for
instance, in clinical applications such as infectious disease screening and the
diagnosis and early detection of diseases such as cancer. A key task in
metagenomics is to determine the species present in a sample and their relative
abundances. Currently, the field is dominated by either alignment-based tools,
which offer high accuracy but are computationally expensive, or alignment-free
tools, which are fast but lack the needed accuracy for many applications. In
response to this dichotomy, we introduce MetaTrinity, a tool based on
heuristics, to achieve a fundamental improvement in accuracy-runtime tradeoff
over existing methods. We benchmark MetaTrinity against two leading metagenomic
classifiers, each representing different ends of the performance-accuracy
spectrum. On one end, Kraken2, a tool optimized for performance, shows modest
accuracy yet a rapid runtime. The other end of the spectrum is governed by
Metalign, a tool optimized for accuracy. Our evaluations show that MetaTrinity
achieves accuracy comparable to Metalign while running 4x faster. This
directly equates to a fourfold improvement in the runtime-accuracy tradeoff.
Compared to Kraken2, MetaTrinity requires a 5x
longer runtime yet delivers a 17x improvement in accuracy. This demonstrates a
3.4x enhancement in the accuracy-runtime tradeoff for MetaTrinity. This dual
comparison positions MetaTrinity as a broadly applicable solution for
metagenomic classification, combining advantages of both ends of the spectrum:
speed and accuracy. MetaTrinity is publicly available at
https://github.com/CMU-SAFARI/MetaTrinity
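The two tradeoff figures quoted in the abstract above can be reproduced with simple arithmetic, assuming the tradeoff gain is defined as the accuracy ratio divided by the runtime ratio. The abstract does not state this definition explicitly, so it is an assumption here, as is the helper name `tradeoff_gain`; "comparable accuracy" is treated as an accuracy ratio of 1.

```python
def tradeoff_gain(accuracy_ratio, runtime_ratio):
    """Accuracy gain per unit of runtime relative to a baseline tool.

    Both ratios are new tool / baseline. This definition is an assumption
    inferred from the figures in the abstract, not taken from the paper.
    """
    return accuracy_ratio / runtime_ratio


# Versus Metalign: comparable accuracy (ratio 1), one quarter of the runtime.
gain_vs_metalign = tradeoff_gain(accuracy_ratio=1.0, runtime_ratio=1.0 / 4.0)

# Versus Kraken2: 17x the accuracy at 5x the runtime.
gain_vs_kraken2 = tradeoff_gain(accuracy_ratio=17.0, runtime_ratio=5.0)

print(gain_vs_metalign, gain_vs_kraken2)
```

Under this definition the two comparisons give 4 and 3.4, matching the fourfold and 3.4x figures in the abstract.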
SequenceLab: A Comprehensive Benchmark of Computational Methods for Comparing Genomic Sequences
Computational complexity is a key limitation of genomic analyses. Thus, over
the last 30 years, researchers have proposed numerous fast heuristic methods
that provide computational relief. Comparing genomic sequences is one of the
most fundamental computational steps in most genomic analyses. Due to its high
computational complexity, optimized exact and heuristic algorithms are still
being developed. We find that these methods are highly sensitive to the
underlying data, its quality, and various hyperparameters. Despite their wide
use, no in-depth analysis has been performed, which risks falsely discarding
genetic sequences from further analysis and unnecessarily inflating
computational costs. We provide the first analysis and benchmark of this
heterogeneity. We deliver an actionable overview of the 11 most widely used
state-of-the-art methods for comparing genomic sequences. We also inform
readers about their advantages and downsides using thorough experimental
evaluation and different real datasets from all major manufacturers (i.e.,
Illumina, ONT, and PacBio). SequenceLab is publicly available at
https://github.com/CMU-SAFARI/SequenceLab
CRISPR-Mediated VHL Knockout Generates an Improved Model for Metastatic Renal Cell Carcinoma.
Metastatic renal cell carcinoma (mRCC) is nearly incurable and accounts for most of the mortality associated with RCC. Von Hippel Lindau (VHL) is a tumour suppressor that is lost in the majority of clear cell RCC (ccRCC) cases. Its role in regulating hypoxia-inducible factors-1α (HIF-1α) and -2α (HIF-2α) is well-studied. Recent work has demonstrated that VHL knockdown induces an epithelial-mesenchymal transition (EMT) phenotype. In this study we showed that a CRISPR/Cas9-mediated knockout of VHL in the RENCA model leads to morphologic and molecular changes indicative of EMT, which in turn drives increased metastasis to the lungs. RENCA cells deficient in HIF-1α failed to undergo EMT changes upon VHL knockout. RNA-seq revealed several HIF-1α-regulated genes that are upregulated in our VHL knockout cells and whose overexpression signifies an aggressive form of ccRCC in The Cancer Genome Atlas (TCGA) database. Independent validation in a new clinical dataset confirms the upregulation of these genes in ccRCC samples compared to adjacent normal tissue. Our findings indicate that loss of VHL could be driving tumour cell dissemination through stabilization of HIF-1α in RCC. A better understanding of the mechanisms involved in this phenomenon can guide the search for more effective treatments to combat mRCC.