94 research outputs found

    Algorithms for Transcriptome Quantification and Reconstruction from RNA-Seq Data

    Massively parallel whole-transcriptome sequencing, with its ability to generate full transcriptome data at the single-transcript level, provides a powerful tool with multiple interrelated applications, including transcriptome reconstruction and gene/isoform expression estimation, also known as transcriptome quantification. As a result, whole-transcriptome sequencing has become the technology of choice for transcriptome analysis, rapidly replacing array-based technologies. The most commonly used transcriptome sequencing protocol, referred to as RNA-Seq, generates short (single or paired) sequencing tags from the ends of randomly generated cDNA fragments. The RNA-Seq protocol reduces sequencing cost and significantly increases data throughput, but it makes it computationally challenging to reconstruct full-length transcripts and accurately estimate their abundances across all cell types. We focus on two main problems in transcriptome data analysis: transcriptome reconstruction and transcriptome quantification. Transcriptome reconstruction, also referred to as novel isoform discovery, is the problem of reconstructing the transcript sequences from the sequencing data; it can be done de novo or assisted by existing genome and transcriptome annotations. Transcriptome quantification is the problem of estimating the expression level of each transcript. We present genome-guided and annotation-guided transcriptome reconstruction methods as well as methods for transcript- and gene-level expression estimation. Empirical results on both synthetic and real RNA-Seq datasets show that the proposed methods improve transcriptome quantification and reconstruction accuracy compared to previous methods.
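    The quantification problem described above is commonly attacked with an expectation-maximization approach that fractionally assigns ambiguous reads to transcripts. The following is a minimal illustrative sketch of that general idea, not the authors' implementation; it assumes each read comes with a known set of compatible transcripts and ignores transcript lengths.

```python
# Minimal EM sketch for transcript abundance estimation (illustrative only,
# not the method from the paper). Each read is represented by the set of
# transcript indices it is compatible with; transcript lengths are ignored.

def em_abundances(read_compat, n_transcripts, n_iters=100):
    """Estimate relative transcript abundances from read-compatibility sets."""
    theta = [1.0 / n_transcripts] * n_transcripts  # uniform start
    for _ in range(n_iters):
        counts = [0.0] * n_transcripts
        # E-step: fractionally assign each read to its compatible transcripts
        # in proportion to the current abundance estimates
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            for t in compat:
                counts[t] += theta[t] / total
        # M-step: re-normalize the expected counts into abundances
        n_reads = len(read_compat)
        theta = [c / n_reads for c in counts]
    return theta

# Toy example: 3 transcripts; some reads map uniquely, some ambiguously
reads = [{0}, {0, 1}, {1}, {1, 2}, {2}, {2}]
abund = em_abundances(reads, 3)
```

    Reads that map uniquely anchor the estimates, and the EM iterations then apportion the ambiguous reads accordingly.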

    TRIP: A method for novel transcript reconstruction from paired-end RNA-seq reads

    Preliminary experimental results on synthetic datasets generated with various sequencing parameters and distribution assumptions show that TRIP achieves higher transcriptome reconstruction accuracy than previous methods that ignore fragment length distribution information.
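    The fragment length distribution information that TRIP exploits can be illustrated with a simple likelihood weight. This is an assumption-level sketch, not TRIP itself: a candidate paired-end mapping is scored by how well its implied fragment length fits a normal distribution (the mean and standard deviation here are made-up values).

```python
# Illustrative sketch (not TRIP's implementation): weight a paired-end
# read mapping by the likelihood of the fragment length it implies under
# a normal fragment length distribution. Mean/sd values are hypothetical.
import math

def fld_weight(left_start, right_end, mean=300.0, sd=30.0):
    """Normal-density weight of the fragment length implied by a mapping."""
    frag_len = right_end - left_start
    z = (frag_len - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

# A mapping implying a ~300 bp fragment is far more plausible than one
# implying a ~600 bp fragment, so the former candidate transcript wins
w_good = fld_weight(100, 400)  # fragment length 300
w_bad = fld_weight(100, 700)   # fragment length 600
```

    Methods that ignore this signal treat both candidate mappings as equally plausible, which is exactly the information loss the abstract refers to.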

    The Pioneer Advantage: Filling the blank spots on the map of genome diversity in Europe

    Documenting genome diversity is important for local biomedical communities and instrumental in developing precision and personalized medicine. Currently, tens of thousands of whole-genome sequences from Europe are publicly available, but most of these represent populations of developed European countries. The uneven distribution of the available data is further compounded by a lack of data sharing. Recent whole-genome studies in Eastern Europe, one in Ukraine and one in Russia, demonstrated that local genome diversity and population structure from Eastern Europe historically had not been fully represented. The unexpected wealth of genomic variation uncovered in these studies was not so much a consequence of high variation within these populations, but rather of the “pioneer advantage”: we discovered more variants because we were the first to prospect in the Eastern European genome pool. This comparison underscores the importance of removing the remaining geographic genome deserts from the world map of human genome diversity.

    Accurate Viral Population Assembly From Ultra-Deep Sequencing Data

    Motivation: Next-generation sequencing technologies sequence viruses with ultra-deep coverage, thus promising to revolutionize our understanding of the underlying diversity of viral populations. While the sequencing coverage is high enough that even rare viral variants are sequenced, the presence of sequencing errors makes it difficult to distinguish rare variants from sequencing errors. Results: In this article, we present a method to overcome the limitations of sequencing technologies and assemble a diverse viral population that allows for the detection of previously undiscovered rare variants. The proposed method consists of a high-fidelity sequencing protocol and an accurate viral population assembly method, referred to as Viral Genome Assembler (VGA). The protocol is able to eliminate sequencing errors by using individual barcodes attached to the sequenced fragments. Highly accurate data in combination with deep coverage allow VGA to assemble rare variants. VGA uses an expectation–maximization algorithm to estimate abundances of the assembled viral variants in the population. Results on both synthetic and real datasets show that our method is able to accurately assemble an HIV viral population and detect rare variants previously undetectable due to sequencing errors. VGA outperforms state-of-the-art methods for genome-wide viral assembly. Furthermore, our method is the first viral assembly method that scales to millions of sequencing reads.
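    The barcoding idea behind the high-fidelity protocol can be sketched in a few lines. This is an illustrative simplification, not the VGA protocol itself: reads carrying the same barcode originate from one fragment, so a per-position majority vote across them suppresses independent sequencing errors.

```python
# Illustrative sketch (not the VGA protocol): collapse same-barcode reads
# into a per-position majority consensus, voting out independent errors.
from collections import Counter

def barcode_consensus(reads):
    """Consensus of same-length reads originating from one barcoded fragment."""
    consensus = []
    for column in zip(*reads):  # walk the reads position by position
        base, _ = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)

# Three copies of the same fragment; each copy carries one independent error
copies = ["ACGTACGT", "ACGAACGT", "ACGTACCT"]
true_seq = barcode_consensus(copies)
```

    Because errors land at different positions in different copies, the majority base at each position recovers the true fragment, leaving genuine low-frequency variants (shared by all copies of a fragment) intact.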

    CRISPR-Mediated VHL Knockout Generates an Improved Model for Metastatic Renal Cell Carcinoma.

    Metastatic renal cell carcinoma (mRCC) is nearly incurable and accounts for most of the mortality associated with RCC. Von Hippel Lindau (VHL) is a tumour suppressor that is lost in the majority of clear cell RCC (ccRCC) cases. Its role in regulating hypoxia-inducible factors-1α (HIF-1α) and -2α (HIF-2α) is well studied. Recent work has demonstrated that VHL knockdown induces an epithelial-mesenchymal transition (EMT) phenotype. In this study we showed that a CRISPR/Cas9-mediated knockout of VHL in the RENCA model leads to morphologic and molecular changes indicative of EMT, which in turn drives increased metastasis to the lungs. RENCA cells deficient in HIF-1α failed to undergo EMT changes upon VHL knockout. RNA-seq revealed several HIF-1α-regulated genes that are upregulated in our VHL knockout cells and whose overexpression signifies an aggressive form of ccRCC in The Cancer Genome Atlas (TCGA) database. Independent validation in a new clinical dataset confirms the upregulation of these genes in ccRCC samples compared to adjacent normal tissue. Our findings indicate that loss of VHL could be driving tumour cell dissemination through stabilization of HIF-1α in RCC. A better understanding of the mechanisms involved in this phenomenon can guide the search for more effective treatments to combat mRCC.

    SequenceLab: A Comprehensive Benchmark of Computational Methods for Comparing Genomic Sequences

    Computational complexity is a key limitation of genomic analyses. Thus, over the last 30 years, researchers have proposed numerous fast heuristic methods that provide computational relief. Comparing genomic sequences is one of the most fundamental computational steps in most genomic analyses, and due to its high computational complexity, optimized exact and heuristic algorithms for it are still being developed. We find that these methods are highly sensitive to the underlying data, its quality, and various hyperparameters. Despite their wide use, no in-depth analysis of these methods has been performed; as a result, analyses risk falsely discarding genomic sequences and unnecessarily inflating computational costs. We provide the first analysis and benchmark of this heterogeneity. We deliver an actionable overview of the 11 most widely used state-of-the-art methods for comparing genomic sequences. We also inform readers about their advantages and downsides through thorough experimental evaluation on different real datasets from all major manufacturers (i.e., Illumina, ONT, and PacBio). SequenceLab is publicly available at https://github.com/CMU-SAFARI/SequenceLab
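    The exact comparison that these heuristics approximate or accelerate is the edit (Levenshtein) distance, computed by a classic dynamic program. A minimal reference version, included here only to make the computational baseline concrete:

```python
# Classic O(len(a) * len(b)) edit distance via dynamic programming,
# using two rolling rows instead of the full matrix. This is the exact
# baseline that fast heuristic sequence comparators approximate.

def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions a -> b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match / substitution
        prev = curr
    return prev[-1]

d = edit_distance("GATTACA", "GACTATA")  # two substitutions apart
```

    The quadratic cost of this dynamic program on read-scale inputs is precisely why filters and heuristics dominate practice, and why their sensitivity to data quality and hyperparameters matters.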

    MetaTrinity: Enabling Fast Metagenomic Classification via Seed Counting and Edit Distance Approximation

    Metagenomics, the study of genome sequences of diverse organisms cohabiting in a shared environment, has experienced significant advancements across various medical and biological fields. Metagenomic analysis is crucial, for instance, in clinical applications such as infectious disease screening and the diagnosis and early detection of diseases such as cancer. A key task in metagenomics is to determine the species present in a sample and their relative abundances. Currently, the field is dominated by either alignment-based tools, which offer high accuracy but are computationally expensive, or alignment-free tools, which are fast but lack the accuracy needed for many applications. In response to this dichotomy, we introduce MetaTrinity, a heuristics-based tool that achieves a fundamental improvement in the accuracy-runtime tradeoff over existing methods. We benchmark MetaTrinity against two leading metagenomic classifiers, each representing a different end of the performance-accuracy spectrum. On one end, Kraken2, a tool optimized for performance, shows modest accuracy yet a rapid runtime. The other end of the spectrum is anchored by Metalign, a tool optimized for accuracy. Our evaluations show that MetaTrinity matches the accuracy of Metalign while running 4x faster, a fourfold improvement in the runtime-accuracy tradeoff. Compared to Kraken2, MetaTrinity requires a 5x longer runtime yet delivers a 17x improvement in accuracy, a 3.4x enhancement in the accuracy-runtime tradeoff. This dual comparison positions MetaTrinity as a broadly applicable solution for metagenomic classification, combining the advantages of both ends of the spectrum: speed and accuracy. MetaTrinity is publicly available at https://github.com/CMU-SAFARI/MetaTrinity
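    The seed counting idea named in the title can be illustrated with a toy classifier. This is an assumption-level sketch, not MetaTrinity's implementation: shared k-mers ("seeds") between a read and each reference serve as a cheap proxy for an expensive alignment score.

```python
# Toy seed-counting sketch (illustrative, not MetaTrinity's algorithm):
# count k-mers shared between a read and each reference genome and use
# the counts as a cheap stand-in for alignment-based similarity.

def kmers(seq, k=4):
    """All length-k substrings of seq, as a set of seeds."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def best_reference(read, references, k=4):
    """Name of the reference sharing the most k-mer seeds with the read."""
    read_kmers = kmers(read, k)
    scores = {name: len(read_kmers & kmers(ref, k))
              for name, ref in references.items()}
    return max(scores, key=scores.get)

# Two hypothetical reference genomes and a read drawn from the first one
refs = {
    "speciesA": "ACGTACGTACGTACGT",
    "speciesB": "TTGGCCAATTGGCCAA",
}
hit = best_reference("ACGTACGTAC", refs)
```

    Seed counting alone is what makes alignment-free tools fast but coarse; refining such candidate hits with an edit distance approximation is the general direction the abstract describes for recovering alignment-level accuracy.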