94 research outputs found

    Algorithms for Transcriptome Quantification and Reconstruction from RNA-Seq Data

    Massively parallel whole-transcriptome sequencing, with its ability to generate full transcriptome data at the single-transcript level, provides a powerful tool with multiple interrelated applications, including transcriptome reconstruction and gene/isoform expression estimation, also known as transcriptome quantification. As a result, whole-transcriptome sequencing has become the technology of choice for transcriptome analysis, rapidly replacing array-based technologies. The most commonly used transcriptome sequencing protocol, referred to as RNA-Seq, generates short (single or paired) sequencing tags from the ends of randomly generated cDNA fragments. The RNA-Seq protocol reduces sequencing cost and significantly increases data throughput, but it makes it computationally challenging to reconstruct full-length transcripts and accurately estimate their abundances across all cell types. We focus on two main problems in transcriptome data analysis: transcriptome reconstruction and transcriptome quantification. Transcriptome reconstruction, also referred to as novel isoform discovery, is the problem of reconstructing the transcript sequences from the sequencing data; it can be done de novo or assisted by existing genome and transcriptome annotations. Transcriptome quantification is the problem of estimating the expression level of each transcript. We present genome-guided and annotation-guided transcriptome reconstruction methods as well as methods for transcript- and gene-level expression estimation. Empirical results on both synthetic and real RNA-Seq datasets show that the proposed methods improve transcriptome quantification and reconstruction accuracy compared to previous methods.
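    The quantification problem described above is commonly attacked with an expectation-maximization approach that fractionally assigns ambiguous reads to transcripts. The following is a minimal illustrative sketch of that general idea, not the authors' implementation; it assumes each read comes with a known set of compatible transcripts and ignores transcript lengths.

```python
# Minimal EM sketch for transcript abundance estimation (illustrative only,
# not the method from the paper). Each read is represented by the set of
# transcript indices it is compatible with; transcript lengths are ignored.

def em_abundances(read_compat, n_transcripts, n_iters=100):
    """Estimate relative transcript abundances from read-compatibility sets."""
    theta = [1.0 / n_transcripts] * n_transcripts  # uniform start
    for _ in range(n_iters):
        counts = [0.0] * n_transcripts
        # E-step: fractionally assign each read to its compatible transcripts
        # in proportion to the current abundance estimates
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            for t in compat:
                counts[t] += theta[t] / total
        # M-step: re-normalize the expected counts into abundances
        n_reads = len(read_compat)
        theta = [c / n_reads for c in counts]
    return theta

# Toy example: 3 transcripts; some reads map uniquely, some ambiguously
reads = [{0}, {0, 1}, {1}, {1, 2}, {2}, {2}]
abund = em_abundances(reads, 3)
```

    Reads that map uniquely anchor the estimates, and the EM iterations then apportion the ambiguous reads accordingly.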

    TRIP: A method for novel transcript reconstruction from paired-end RNA-seq reads

    Preliminary experimental results on synthetic datasets generated with various sequencing parameters and distribution assumptions show that TRIP achieves higher transcriptome reconstruction accuracy than previous methods that ignore fragment length distribution information.
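    The fragment length distribution information that TRIP exploits can be illustrated with a simple likelihood weight. This is an assumption-level sketch, not TRIP itself: a candidate paired-end mapping is scored by how well its implied fragment length fits a normal distribution (the mean and standard deviation here are made-up values).

```python
# Illustrative sketch (not TRIP's implementation): weight a paired-end
# read mapping by the likelihood of the fragment length it implies under
# a normal fragment length distribution. Mean/sd values are hypothetical.
import math

def fld_weight(left_start, right_end, mean=300.0, sd=30.0):
    """Normal-density weight of the fragment length implied by a mapping."""
    frag_len = right_end - left_start
    z = (frag_len - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

# A mapping implying a ~300 bp fragment is far more plausible than one
# implying a ~600 bp fragment, so the former candidate transcript wins
w_good = fld_weight(100, 400)  # fragment length 300
w_bad = fld_weight(100, 700)   # fragment length 600
```

    Methods that ignore this signal treat both candidate mappings as equally plausible, which is exactly the information loss the abstract refers to.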

    The Pioneer Advantage: Filling the blank spots on the map of genome diversity in Europe

    Documenting genome diversity is important for local biomedical communities and instrumental in developing precision and personalized medicine. Currently, tens of thousands of whole-genome sequences from Europe are publicly available, but most of these represent populations of developed European countries. The uneven distribution of the available data is further compounded by a lack of data sharing. Recent whole-genome studies in Eastern Europe, one in Ukraine and one in Russia, demonstrated that local genome diversity and population structure from Eastern Europe historically had not been fully represented. The unexpected wealth of genomic variation uncovered in these studies was not so much a consequence of high variation within these populations, but rather of the “pioneer advantage”: we discovered more variants because we were the first to prospect in the Eastern European genome pool. This comparison underscores the importance of removing the remaining geographic genome deserts from the world map of human genome diversity.

    Accurate Viral Population Assembly From Ultra-Deep Sequencing Data

    Motivation: Next-generation sequencing technologies sequence viruses with ultra-deep coverage, thus promising to revolutionize our understanding of the underlying diversity of viral populations. While the sequencing coverage is high enough that even rare viral variants are sequenced, the presence of sequencing errors makes it difficult to distinguish rare variants from sequencing errors. Results: In this article, we present a method to overcome the limitations of sequencing technologies and assemble a diverse viral population that allows for the detection of previously undiscovered rare variants. The proposed method consists of a high-fidelity sequencing protocol and an accurate viral population assembly method, referred to as Viral Genome Assembler (VGA). The protocol is able to eliminate sequencing errors by using individual barcodes attached to the sequenced fragments. Highly accurate data in combination with deep coverage allow VGA to assemble rare variants. VGA uses an expectation–maximization algorithm to estimate abundances of the assembled viral variants in the population. Results on both synthetic and real datasets show that our method is able to accurately assemble an HIV viral population and detect rare variants previously undetectable due to sequencing errors. VGA outperforms state-of-the-art methods for genome-wide viral assembly. Furthermore, our method is the first viral assembly method that scales to millions of sequencing reads.
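    The barcoding idea behind the high-fidelity protocol can be sketched in a few lines. This is an illustrative simplification, not the VGA protocol itself: reads carrying the same barcode originate from one fragment, so a per-position majority vote across them suppresses independent sequencing errors.

```python
# Illustrative sketch (not the VGA protocol): collapse same-barcode reads
# into a per-position majority consensus, voting out independent errors.
from collections import Counter

def barcode_consensus(reads):
    """Consensus of same-length reads originating from one barcoded fragment."""
    consensus = []
    for column in zip(*reads):  # walk the reads position by position
        base, _ = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)

# Three copies of the same fragment; each copy carries one independent error
copies = ["ACGTACGT", "ACGAACGT", "ACGTACCT"]
true_seq = barcode_consensus(copies)
```

    Because errors land at different positions in different copies, the majority base at each position recovers the true fragment, leaving genuine low-frequency variants (shared by all copies of a fragment) intact.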

    CRISPR-Mediated VHL Knockout Generates an Improved Model for Metastatic Renal Cell Carcinoma.

    Metastatic renal cell carcinoma (mRCC) is nearly incurable and accounts for most of the mortality associated with RCC. Von Hippel Lindau (VHL) is a tumour suppressor that is lost in the majority of clear cell RCC (ccRCC) cases. Its role in regulating hypoxia-inducible factors-1α (HIF-1α) and -2α (HIF-2α) is well studied. Recent work has demonstrated that VHL knockdown induces an epithelial-mesenchymal transition (EMT) phenotype. In this study we showed that a CRISPR/Cas9-mediated knockout of VHL in the RENCA model leads to morphologic and molecular changes indicative of EMT, which in turn drives increased metastasis to the lungs. RENCA cells deficient in HIF-1α failed to undergo EMT changes upon VHL knockout. RNA-seq revealed several HIF-1α-regulated genes that are upregulated in our VHL knockout cells and whose overexpression signifies an aggressive form of ccRCC in The Cancer Genome Atlas (TCGA) database. Independent validation in a new clinical dataset confirms the upregulation of these genes in ccRCC samples compared to adjacent normal tissue. Our findings indicate that loss of VHL could be driving tumour cell dissemination through stabilization of HIF-1α in RCC. A better understanding of the mechanisms involved in this phenomenon can guide the search for more effective treatments to combat mRCC.

    SequenceLab: A Comprehensive Benchmark of Computational Methods for Comparing Genomic Sequences

    Computational complexity is a key limitation of genomic analyses. Thus, over the last 30 years, researchers have proposed numerous fast heuristic methods that provide computational relief. Comparing genomic sequences is one of the most fundamental computational steps in most genomic analyses, and due to its high computational complexity, optimized exact and heuristic algorithms for it are still being developed. We find that these methods are highly sensitive to the underlying data, its quality, and various hyperparameters. Despite their wide use, no in-depth analysis of these methods has been performed; as a result, analyses risk falsely discarding genomic sequences and unnecessarily inflating computational costs. We provide the first analysis and benchmark of this heterogeneity. We deliver an actionable overview of the 11 most widely used state-of-the-art methods for comparing genomic sequences. We also inform readers about their advantages and downsides through thorough experimental evaluation on different real datasets from all major manufacturers (i.e., Illumina, ONT, and PacBio). SequenceLab is publicly available at https://github.com/CMU-SAFARI/SequenceLab
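    The exact comparison that these heuristics approximate or accelerate is the edit (Levenshtein) distance, computed by a classic dynamic program. A minimal reference version, included here only to make the computational baseline concrete:

```python
# Classic O(len(a) * len(b)) edit distance via dynamic programming,
# using two rolling rows instead of the full matrix. This is the exact
# baseline that fast heuristic sequence comparators approximate.

def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions a -> b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # match / substitution
        prev = curr
    return prev[-1]

d = edit_distance("GATTACA", "GACTATA")  # two substitutions apart
```

    The quadratic cost of this dynamic program on read-scale inputs is precisely why filters and heuristics dominate practice, and why their sensitivity to data quality and hyperparameters matters.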

    MetaTrinity: Enabling Fast Metagenomic Classification via Seed Counting and Edit Distance Approximation

    Metagenomics, the study of genome sequences of diverse organisms cohabiting in a shared environment, has experienced significant advancements across various medical and biological fields. Metagenomic analysis is crucial, for instance, in clinical applications such as infectious disease screening and the diagnosis and early detection of diseases such as cancer. A key task in metagenomics is to determine the species present in a sample and their relative abundances. Currently, the field is dominated by either alignment-based tools, which offer high accuracy but are computationally expensive, or alignment-free tools, which are fast but lack the accuracy needed for many applications. In response to this dichotomy, we introduce MetaTrinity, a heuristics-based tool that achieves a fundamental improvement in the accuracy-runtime tradeoff over existing methods. We benchmark MetaTrinity against two leading metagenomic classifiers, each representing a different end of the performance-accuracy spectrum. On one end, Kraken2, a tool optimized for performance, shows modest accuracy yet a rapid runtime. The other end of the spectrum is anchored by Metalign, a tool optimized for accuracy. Our evaluations show that MetaTrinity matches the accuracy of Metalign while running 4x faster, a fourfold improvement in the runtime-accuracy tradeoff. Compared to Kraken2, MetaTrinity requires a 5x longer runtime yet delivers a 17x improvement in accuracy, a 3.4x enhancement in the accuracy-runtime tradeoff. This dual comparison positions MetaTrinity as a broadly applicable solution for metagenomic classification, combining the advantages of both ends of the spectrum: speed and accuracy. MetaTrinity is publicly available at https://github.com/CMU-SAFARI/MetaTrinity
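    The seed counting idea named in the title can be illustrated with a toy classifier. This is an assumption-level sketch, not MetaTrinity's implementation: shared k-mers ("seeds") between a read and each reference serve as a cheap proxy for an expensive alignment score.

```python
# Toy seed-counting sketch (illustrative, not MetaTrinity's algorithm):
# count k-mers shared between a read and each reference genome and use
# the counts as a cheap stand-in for alignment-based similarity.

def kmers(seq, k=4):
    """All length-k substrings of seq, as a set of seeds."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def best_reference(read, references, k=4):
    """Name of the reference sharing the most k-mer seeds with the read."""
    read_kmers = kmers(read, k)
    scores = {name: len(read_kmers & kmers(ref, k))
              for name, ref in references.items()}
    return max(scores, key=scores.get)

# Two hypothetical reference genomes and a read drawn from the first one
refs = {
    "speciesA": "ACGTACGTACGTACGT",
    "speciesB": "TTGGCCAATTGGCCAA",
}
hit = best_reference("ACGTACGTAC", refs)
```

    Seed counting alone is what makes alignment-free tools fast but coarse; refining such candidate hits with an edit distance approximation is the general direction the abstract describes for recovering alignment-level accuracy.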