SeqOthello: Querying RNA-Seq Experiments at Scale
We present SeqOthello, an ultra-fast and memory-efficient indexing structure to support arbitrary sequence query against large collections of RNA-seq experiments. It takes SeqOthello only 5 min and 19.1 GB memory to conduct a global survey of 11,658 fusion events against 10,113 TCGA Pan-Cancer RNA-seq datasets. The query recovers 92.7% of tier-1 fusions curated by TCGA Fusion Gene Database and reveals 270 novel occurrences, all of which are present as tumor-specific. By providing a reference-free, alignment-free, and parameter-free sequence search system, SeqOthello will enable large-scale integrative studies using sequence-level data, an undertaking not previously practicable for many individual labs.
NOVEL COMPUTATIONAL METHODS FOR SEQUENCING DATA ANALYSIS: MAPPING, QUERY, AND CLASSIFICATION
Over the past decade, the evolution of next-generation sequencing technology has considerably advanced genomics research. As a consequence, fast and accurate computational methods are needed to analyze the resulting large datasets in different applications. The research presented in this dissertation focuses on three areas: RNA-seq read mapping, large-scale data query, and metagenomics sequence classification.
A critical step of RNA-seq data analysis is to map the RNA-seq reads onto a reference genome. This dissertation presents a novel splice alignment tool, MapSplice3. It achieves high read alignment and base mapping yields and is able to detect splice junctions, gene fusions, and circular RNAs comprehensively at the same time. Based on MapSplice3, we further develop a novel lightweight approach called iMapSplice that enables personalized mRNA transcriptional profiling. The huge amount of RNA-seq data shared through public datasets provides an invaluable resource for researchers, who can test hypotheses by reusing existing datasets. To meet the need for efficiently querying large-scale sequencing data, a novel method called SeqOthello has been developed. It efficiently queries the k-mers of a sequence against large-scale datasets and from the results determines whether the given sequence is present. Metagenomics studies often generate tens of millions of reads to capture the presence of microbial organisms, so efficient and accurate algorithms are in high demand. In this dissertation, we introduce MetaOthello, a probabilistic hashing classifier for metagenomic sequences. It supports efficient query of a taxon using its k-mer signatures.
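The k-mer based existence query described above can be illustrated with a minimal sketch. It is not SeqOthello itself: a plain Python dictionary stands in for the Othello-based index, reverse complements are ignored, and the names `build_index`, `query`, and the `min_fraction` threshold are hypothetical choices made only for this example.

```python
def kmers(seq: str, k: int) -> list[str]:
    """All overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(experiments: dict[str, list[str]], k: int) -> dict[str, set[str]]:
    """Map each k-mer to the IDs of the experiments whose reads contain it."""
    index: dict[str, set[str]] = {}
    for exp_id, reads in experiments.items():
        for read in reads:
            for km in kmers(read, k):
                index.setdefault(km, set()).add(exp_id)
    return index


def query(index: dict[str, set[str]], seq: str, k: int,
          min_fraction: float = 0.8) -> dict[str, float]:
    """Report experiments in which at least min_fraction of the query k-mers occur.

    min_fraction is an assumption of this sketch, not a SeqOthello parameter.
    """
    query_kmers = kmers(seq, k)
    if not query_kmers:
        return {}
    hits: dict[str, int] = {}
    for km in query_kmers:
        for exp_id in index.get(km, ()):
            hits[exp_id] = hits.get(exp_id, 0) + 1
    total = len(query_kmers)
    return {e: c / total for e, c in hits.items() if c / total >= min_fraction}


# Toy data: two "experiments" with one read each.
experiments = {"exp1": ["ACGTACGTGA"], "exp2": ["TTTTGGGGCC"]}
index = build_index(experiments, k=4)
print(query(index, "GTACGTG", k=4))  # {'exp1': 1.0}
```

A real index at this scale would not hold explicit sets per k-mer; the sketch only shows how per-k-mer presence information is combined into a per-experiment answer.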
ULTRA-FAST AND MEMORY-EFFICIENT LOOKUPS FOR CLOUD, NETWORKED SYSTEMS, AND MASSIVE DATA MANAGEMENT
Systems that process big data (e.g., high-traffic networks and large-scale storage) prefer data structures and algorithms with small memory and fast processing speed. Efficient and fast algorithms play an essential role in system design, despite the improvement of hardware. This dissertation is organized around a novel algorithm called Othello Hashing. Othello Hashing supports ultra-fast and memory-efficient key-value lookup, and it fits the requirements of the core algorithms of many large-scale systems and big data applications. Using Othello Hashing, combined with domain expertise in cloud, computer networks, big data, and bioinformatics, I developed the following applications that resolve several major challenges in these areas.
Concise: Forwarding Information Base. A Forwarding Information Base (FIB) is a data structure used by the data plane of a forwarding device to determine the proper forwarding actions for packets. The polymorphic property of Othello Hashing allows the separation of its query and control functionalities, which makes it a natural match for programmable networks such as Software Defined Networks. Using Othello Hashing, we built a fast and scalable FIB named Concise. Extensive evaluation results on three different platforms show that Concise outperforms other FIB designs.
SDLB: Cloud Load Balancer. In a cloud network, a layer-4 load balancer is a device that acts as a reverse proxy and distributes network or application traffic across a number of servers. We built a software load balancer with Othello Hashing techniques named SDLB. SDLB is able to accomplish both functionalities of a load balancer using one Othello query: finding the designated server for packets of ongoing sessions and distributing new or session-free packets.
MetaOthello: Taxonomic Classification of Metagenomic Sequences. Metagenomic read classification is a critical step in the identification and quantification of microbial species sampled by high-throughput sequencing. Due to the growing popularity of metagenomic data in both basic science and clinical applications, as well as the increasing volume of data being generated, efficient and accurate algorithms are in high demand. We built a system to support efficient taxonomic classification of sequences using their k-mer signatures.
SeqOthello: RNA-seq Sequence Search Engine. Advances in the study of functional genomics have produced a vast supply of RNA-seq datasets. However, how to quickly query and extract information from sequencing resources remains a challenging problem and has been the bottleneck for the broader dissemination of sequencing efforts. The challenge resides in both the sheer volume of the data and its unstructured representation. Using Othello Hashing techniques, we built the SeqOthello sequence search engine. SeqOthello is a reference-free, alignment-free, and parameter-free sequence search system that supports arbitrary sequence query against large collections of RNA-seq experiments, which enables large-scale integrative studies using sequence-level data.
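The key-value lookup that underlies the applications above can be sketched as follows. This is a simplified illustration of the general idea as commonly described for Othello Hashing (two arrays whose XOR at two hashed positions encodes the value, built over an acyclic bipartite graph), not the dissertation's implementation. The class name `ToyOthello`, the array sizes (twice the number of keys each), and the use of SHA-256 as the hash are assumptions of this sketch.

```python
import hashlib
from collections import defaultdict


def _h(key: str, seed: int, size: int) -> int:
    """Seeded hash of a key into the range [0, size)."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % size


class ToyOthello:
    """Static key -> small-integer map queried as cells[u] XOR cells[v]."""

    def __init__(self, table: dict[str, int], max_tries: int = 100):
        n = max(1, len(table))
        self.ma = 2 * n  # slots on side A (generous sizing, an assumption)
        self.mb = 2 * n  # slots on side B
        for seed in range(max_tries):
            if self._try_build(table, seed):
                self.seed = seed
                return
        raise RuntimeError("no acyclic hash pair found; enlarge the arrays")

    def _slots(self, key: str, seed: int) -> tuple[int, int]:
        u = _h(key, 2 * seed, self.ma)
        v = self.ma + _h(key, 2 * seed + 1, self.mb)
        return u, v

    def _try_build(self, table: dict[str, int], seed: int) -> bool:
        parent = list(range(self.ma + self.mb))

        def find(x: int) -> int:
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        adj = defaultdict(list)
        for key, value in table.items():
            u, v = self._slots(key, seed)
            ru, rv = find(u), find(v)
            if ru == rv:  # this edge would close a cycle: retry with new hashes
                return False
            parent[ru] = rv
            adj[u].append((v, value))
            adj[v].append((u, value))

        # The graph is a forest, so walking each tree fixes every cell once.
        cells = [0] * (self.ma + self.mb)
        seen = set()
        for root in list(adj):
            if root in seen:
                continue
            seen.add(root)
            stack = [root]
            while stack:
                node = stack.pop()
                for nxt, value in adj[node]:
                    if nxt not in seen:
                        seen.add(nxt)
                        cells[nxt] = cells[node] ^ value
                        stack.append(nxt)
        self.cells = cells
        return True

    def lookup(self, key: str) -> int:
        """Two memory reads and one XOR; keys are not stored."""
        u, v = self._slots(key, self.seed)
        return self.cells[u] ^ self.cells[v]


# Hypothetical forwarding table: destination address -> output port.
fib = ToyOthello({"10.0.0.1": 3, "10.0.0.2": 1, "10.0.0.3": 2, "10.0.0.4": 3})
print(fib.lookup("10.0.0.2"))  # 1
```

Because the keys themselves are never stored, a lookup for a key that was not inserted still returns some value. In this sketch the construction step (`_try_build`) and the query step (`lookup`) touch the same arrays but are otherwise independent, which loosely mirrors the separation of control and query functionality mentioned above.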
Telomere Roles in Fungal Genome Evolution and Adaptation
Telomeres form the ends of linear chromosomes and usually comprise protein complexes that bind to simple repeated sequence motifs that are added to the 3′ ends of DNA by the telomerase reverse transcriptase (TERT). One of the primary functions attributed to telomeres is to solve the “end-replication problem” which, if left unaddressed, would cause gradual, inexorable attrition of sequences from the chromosome ends and, eventually, loss of viability. Telomere-binding proteins also protect the chromosome from 5′ to 3′ exonuclease action, and disguise the chromosome ends from the double-strand break repair machinery whose illegitimate action potentially generates catastrophic chromosome aberrations. Telomeres are of special interest in the blast fungus, Pyricularia, because the adjacent regions are enriched in genes controlling interactions with host plants, and the chromosome ends show enhanced polymorphism and genetic instability. Previously, we showed that telomere instability in some P. oryzae strains is caused by novel retrotransposons (MoTeRs) that insert in telomere repeats, generating interstitial telomere sequences that drive frequent, break-induced rearrangements. Here, we sought to gain further insight on telomeric involvement in shaping Pyricularia genome architecture by characterizing sequence polymorphisms at chromosome ends, and surrounding internalized MoTeR loci (relics) and interstitial telomere repeats. This provided evidence that telomere dynamics have played historical, and likely ongoing, roles in shaping the Pyricularia genome. We further demonstrate that even telomeres lacking MoTeR insertions are poorly preserved, such that the telomere-adjacent sequences exhibit frequent presence/absence polymorphism, as well as exchanges with the genome interior. 
Using TERT knockout experiments, we characterized chromosomal responses to failed telomere maintenance, which suggested that much of the MoTeR relic-/interstitial telomere-associated polymorphism could be driven by compromised telomere function. Finally, we describe three possible examples of a phenomenon known as “Adaptive Telomere Failure,” where spontaneous losses of telomere maintenance drive rapid accumulation of sequence polymorphism with possible adaptive advantages. Together, our data suggest that telomere maintenance is frequently compromised in Pyricularia but the chromosome alterations resulting from telomere failure are not as catastrophic as prior research would predict, and may, in fact, be potent drivers of adaptive polymorphism.
Needle: a fast and space-efficient prefilter for estimating the quantification of very large collections of expression experiments
Motivation
The ever-growing size of sequencing data is a major bottleneck in bioinformatics because advances in hardware development cannot keep up with the data growth. As a result, an enormous amount of data is collected but rarely reused, because it is nearly impossible to find meaningful experiments in the stream of raw data.
Results
As a solution, we propose Needle, a fast and space-efficient index which can be built for thousands of experiments in <2 h and can estimate the quantification of a transcript in these experiments in seconds, thereby outperforming its competitors. The basic idea of the Needle index is to create multiple interleaved Bloom filters that each store a set of representative k-mers depending on their multiplicity in the raw data. These filters are then used to quantify the query.
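A rough sketch of the multi-level idea follows. It is not Needle's implementation: it uses one ordinary Bloom filter per level for a single experiment instead of interleaved Bloom filters over many experiments, it uses all k-mers instead of a representative subset, and it assumes that a level with threshold t stores every k-mer seen at least t times. The thresholds in `LEVELS`, the `min_fraction` cutoff, and all function names are hypothetical.

```python
import hashlib
from collections import Counter

LEVELS = (1, 4, 16)  # hypothetical multiplicity thresholds, ascending


def kmers(seq: str, k: int) -> list[str]:
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class Bloom:
    """Minimal Bloom filter backed by a Python integer used as a bit array."""

    def __init__(self, n_bits: int = 1 << 16, n_hashes: int = 3):
        self.n_bits, self.n_hashes, self.bits = n_bits, n_hashes, 0

    def _positions(self, item: str):
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item: str) -> bool:
        return all((self.bits >> p) & 1 for p in self._positions(item))


def build_levels(reads: list[str], k: int) -> dict[int, Bloom]:
    """One filter per threshold; a k-mer enters every level its count reaches."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    filters = {level: Bloom() for level in LEVELS}
    for km, count in counts.items():
        for level in LEVELS:
            if count >= level:
                filters[level].add(km)
    return filters


def estimate(filters: dict[int, Bloom], transcript: str, k: int,
             min_fraction: float = 0.7) -> int:
    """Highest level at which at least min_fraction of the query k-mers are found."""
    query_kmers = kmers(transcript, k)
    best = 0
    for level in LEVELS:
        found = sum(km in filters[level] for km in query_kmers)
        if query_kmers and found / len(query_kmers) >= min_fraction:
            best = level
    return best


# Toy experiment: one sequence seen 16 times, another seen 4 times.
filters = build_levels(["ACGTAC"] * 16 + ["TTGGCA"] * 4, k=4)
print(estimate(filters, "ACGTAC", k=4), estimate(filters, "TTGGCA", k=4))
```

Bloom filters can report false positives but never false negatives, so an estimate produced this way can overshoot the true level but cannot undershoot it for k-mers that were actually inserted.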
NOVEL COMPUTATIONAL METHODS FOR CANCER GENOMICS DATA ANALYSIS
Cancer is a genetic disease responsible for one in eight deaths worldwide. The advancement of next-generation sequencing (NGS) technology has revolutionized cancer research, allowing comprehensive profiling of the cancer genome at high resolution. Large-scale cancer genomics research has sparked the need for efficient and accurate bioinformatics methods to analyze the data. The research presented in this dissertation focuses on three areas in cancer genomics: cancer somatic mutation detection, cancer driver gene identification, and transcriptome profiling at the single-cell level.
NGS data analysis involves a series of complicated data transformations that convert raw sequencing data into information that is interpretable by cancer researchers. The first project in the dissertation established a robust, reproducible, and scalable cancer genomics data analysis workflow management system that automates best-practice mutation calling pipelines to detect somatic single nucleotide polymorphisms, insertions, deletions, and copy number variations from NGS data. It integrates mutation annotation, clinically actionable therapy prediction, and data visualization, streamlining the sequence-to-report data transformation.
To distinguish driver mutations from the vast pool of passenger mutations produced by a somatic mutation calling project, we developed MEScan in the second project, a novel method that enables genome-scale identification of driver mutations based on a mutual exclusivity test using cancer somatic mutation data. MEScan implements an efficient statistical framework to screen de novo for mutually exclusive patterns while taking into account the patient-specific and gene-specific background mutation rates and adjusting for heterogeneous mutation frequency. It outperforms several existing methods based on simulation studies and real-world datasets. Genome-wide screening using existing TCGA somatic mutation data discovers novel cancer-specific and pan-cancer mutually exclusive patterns.
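To make the notion of a mutual exclusivity test concrete, the sketch below computes a textbook hypergeometric lower-tail p-value for a pair of genes. It is a deliberately simple stand-in and not MEScan's statistical framework: it ignores the patient-specific and gene-specific background mutation rates mentioned above, and the function name and arguments are hypothetical.

```python
from math import comb


def mutual_exclusivity_pvalue(n_samples: int, n_a: int, n_b: int, n_both: int) -> float:
    """P(co-mutated samples <= n_both) if gene A and gene B mutate independently.

    n_a and n_b are the numbers of samples mutated in A and in B, and
    n_both is the observed number mutated in both. A small value means the
    two genes overlap less often than chance would predict.
    """
    lowest = max(0, n_a + n_b - n_samples)  # smallest overlap that is possible
    total = comb(n_samples, n_b)
    tail = sum(
        comb(n_a, x) * comb(n_samples - n_a, n_b - x)
        for x in range(lowest, n_both + 1)
    )
    return tail / total


# Ten tumours, gene A mutated in five, gene B in the other five, no overlap.
p = mutual_exclusivity_pvalue(n_samples=10, n_a=5, n_b=5, n_both=0)
print(round(p, 5))  # 0.00397, i.e. 1/252
```

With fixed marginals this treats every sample as equally likely to carry a mutation, which is exactly the assumption that a background-aware method is designed to relax.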
Bulk RNA sequencing (RNA-Seq) has become one of the most commonly used techniques for transcriptome profiling in a wide spectrum of biomedical and biological research. Analyzing bulk RNA-Seq reads to quantify expression at each gene locus is the first step towards the identification of differentially expressed genes for downstream biological interpretation. Recent advances in single-cell RNA-seq (scRNA-seq) technology allow cancer biologists to profile gene expression at the higher resolution of individual cells. Preprocessing scRNA-seq data to quantify UMI-based gene counts is the key to characterizing intra-tumor cellular heterogeneity and identifying rare cells that govern tumor progression, metastasis, and treatment resistance. Despite its popularity, summarizing gene counts from raw sequencing reads remains one of the most time-consuming steps with existing tools. Current pipelines do not balance efficiency and accuracy in large-scale gene count summarization for both bulk and scRNA-seq experiments. In the third project, we developed a lightweight k-mer based gene counting algorithm, FastCount, to accurately and efficiently quantify gene-level abundance using bulk RNA-seq or UMI-based scRNA-seq data. It achieves at least an order-of-magnitude speed improvement over the current gold standard pipelines while providing competitive accuracy.
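The general shape of k-mer based, UMI-aware gene counting can be sketched as below. This is not FastCount's algorithm, whose details are not given here: the sketch assumes that a read is assigned to the gene receiving the most votes from gene-unique k-mers, that ties are discarded, and that duplicates are collapsed by exact (cell, UMI, gene) match. All function names are hypothetical.

```python
from collections import Counter


def kmers(seq: str, k: int) -> list[str]:
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_gene_index(transcripts: dict[str, str], k: int) -> dict[str, str]:
    """Map each k-mer that occurs in exactly one gene to that gene."""
    owners: dict[str, set[str]] = {}
    for gene, seq in transcripts.items():
        for km in set(kmers(seq, k)):
            owners.setdefault(km, set()).add(gene)
    return {km: next(iter(genes)) for km, genes in owners.items() if len(genes) == 1}


def assign_read(index: dict[str, str], read: str, k: int):
    """Gene with the most k-mer votes, or None if there are no votes or a tie."""
    votes = Counter(index[km] for km in kmers(read, k) if km in index)
    if not votes:
        return None
    (gene, top), *rest = votes.most_common(2)
    if rest and rest[0][1] == top:
        return None
    return gene


def count_umis(index: dict[str, str], records, k: int) -> Counter:
    """Count distinct UMIs per (cell, gene) from (cell, umi, read) records."""
    seen = set()
    counts: Counter = Counter()
    for cell, umi, read in records:
        gene = assign_read(index, read, k)
        if gene is not None and (cell, umi, gene) not in seen:
            seen.add((cell, umi, gene))
            counts[(cell, gene)] += 1
    return counts


# Toy reference and reads; the second record repeats UMI1 for the same gene.
transcripts = {"geneA": "ACGTACGGA", "geneB": "TTGCATTGC"}
index = build_gene_index(transcripts, k=4)
records = [
    ("cell1", "UMI1", "ACGTACG"),
    ("cell1", "UMI1", "GTACGGA"),
    ("cell1", "UMI2", "CGTACGG"),
    ("cell2", "UMI1", "TGCATTG"),
]
print(count_umis(index, records, k=4))
# Counter({('cell1', 'geneA'): 2, ('cell2', 'geneB'): 1})
```

For bulk RNA-seq the same `assign_read` step would apply, with reads counted directly per gene instead of being collapsed by UMI.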
Improving the Compact Bit-Sliced Signature Index COBS for Large Scale Genomic Data
In this thesis we investigate the potential for improving the Compact Bit-Sliced Signature Index (COBS) [BBGI19] for large scale genomic data. COBS was developed by Bingmann et al. and is an inverted text index based on Bloom filters. It can be used to index k-mers of DNA samples or q-grams of plain text data and is queried using approximate pattern matching based on the k-mer (or q-gram) profile of a query. In their work, Bingmann et al. demonstrated several advantages COBS has over other state-of-the-art approximate k-mer-based indices, including extraordinarily fast query and construction times and the fact that COBS can be constructed and queried even if the index does not fit into main memory. This is one of the reasons we decided to look more closely at areas in which COBS could be improved. Our main goal is to make COBS more scalable. Scalability is a very important factor when handling DNA-related data, because the amount of sequenced data stored in publicly available archives nearly doubles every year, making it difficult to handle even from the perspective of resources alone. We focus on two main areas in which we try to improve COBS: index compression through clustering, and distribution. The thesis presents our findings and the improvements achieved in those areas.
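The bit-sliced layout behind a Bloom-filter-based inverted index can be illustrated with a small sketch. It is not the COBS implementation and does not include the clustering or distribution work of the thesis: every document gets a Bloom filter of the same length, the filters are stored row by row so that one row holds the same bit position for all documents, and a query intersects a few rows per k-mer. The constants `N_ROWS` and `N_HASHES` and the function names are hypothetical.

```python
import hashlib

N_ROWS = 1 << 12  # Bloom filter length per document (hypothetical)
N_HASHES = 3      # hash functions per k-mer (hypothetical)


def kmers(seq: str, k: int) -> list[str]:
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def positions(kmer: str) -> list[int]:
    """Row numbers that a k-mer sets in every document's filter."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{kmer}".encode()).digest()[:8], "big") % N_ROWS
        for i in range(N_HASHES)
    ]


def build_index(documents: list[str], k: int) -> list[int]:
    """rows[r] is an integer whose bit d is set if document d has filter bit r set."""
    rows = [0] * N_ROWS
    for d, doc in enumerate(documents):
        for km in set(kmers(doc, k)):
            for r in positions(km):
                rows[r] |= 1 << d
    return rows


def query(rows: list[int], n_docs: int, pattern: str, k: int) -> list[int]:
    """For each document, the number of query k-mers its filter reports present."""
    scores = [0] * n_docs
    for km in kmers(pattern, k):
        mask = (1 << n_docs) - 1
        for r in positions(km):
            mask &= rows[r]  # one AND tests this k-mer against all documents
        for d in range(n_docs):
            scores[d] += (mask >> d) & 1
    return scores


documents = ["ACGTACGTGA", "TTTTGGGGCC"]
rows = build_index(documents, k=4)
scores = query(rows, len(documents), "GTACGTG", k=4)
print(scores)  # the first entry is 4: document 0 contains all four query k-mers
```

Approximate matching then amounts to reporting the documents whose score reaches a chosen fraction of the number of query k-mers. Scores for documents that do not contain a k-mer can be inflated by Bloom filter false positives, but a document that does contain it is never missed.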