3 research outputs found

Compressing Genome Resequencing Data

Author: Farheen Aliya
Publication venue: The Research Repository @ WVU
Publication date: 01/01/2016

Recent improvements in high-throughput next-generation sequencing (NGS) technologies have led to an exponential increase in the number, size and diversity of available complete genome sequences. This poses major problems for the storage, transmission and analysis of such genomic sequence data. Thus, a substantial effort has been made to develop effective data compression techniques that reduce storage requirements, improve transmission speed, and allow the compressed sequences to be analyzed for possible information about genomic structure or for relationships between genomes from multiple organisms.

In this thesis, we study the problem of lossless compression of genome resequencing data using a reference-based approach. The thesis is divided into two major parts. In the first part, we perform a detailed empirical analysis of a recently proposed compression scheme called MLCX (Maximal Longest Common Substring/Subsequence). This led to a novel decomposition technique that resulted in enhanced compression using MLCX. In the second part, we propose SMLCX, a new reference-based lossless compression scheme that builds on MLCX. This scheme performs compression by encoding common substrings based on a sorted order, which significantly improved compression performance over the original MLCX method. Using SMLCX, we compressed the Homo sapiens genome from an original size of 3,080,436,051 bytes to 6,332,488 bytes, for an overall compression ratio of 486. This can be compared to the performance of current state-of-the-art compression methods, with compression ratios of 157 (Wang et al., Nucleic Acids Research, 2011), 171 (Pinho et al., Nucleic Acids Research, 2011) and 360 (Beal et al., BMC Genomics, 2016).
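
The reported ratio follows directly from the two byte counts quoted in the abstract. A minimal arithmetic check, assuming the ratio is defined as original size divided by compressed size (the function name `compression_ratio` is illustrative and not taken from the thesis):

```python
def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
    """Return original size divided by compressed size."""
    if compressed_bytes <= 0:
        raise ValueError("compressed size must be positive")
    return original_bytes / compressed_bytes


# Byte counts as quoted in the abstract for the Homo sapiens genome.
ORIGINAL_BYTES = 3_080_436_051
SMLCX_BYTES = 6_332_488

ratio = compression_ratio(ORIGINAL_BYTES, SMLCX_BYTES)
print(round(ratio))  # prints 486
```

Under this definition a larger value means better compression, which is consistent with how the abstract ranks 486 against 157, 171 and 360.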

The Research Repository @ WVU (West Virginia University)

A New Algorithm for “the LCS problem” with Application in Compressing Genome Resequencing Data

Author: Adjeroh Donald
Afrin Tazin
Beal Richard
Farheen Aliya
Publication venue: The Research Repository @ WVU
Publication date: 01/01/2016

Background: The longest common subsequence (LCS) problem is a classical problem in computer science, and forms the basis of the current best-performing reference-based compression schemes for genome resequencing data.

Methods: First, we present a new algorithm for the LCS problem. Using the generalized suffix tree, we identify the common substrings shared between the two input sequences. Using the maximal common substrings, we construct a directed acyclic graph (DAG), based on which we determine the LCS as the longest path in the DAG. Then, we introduce an LCS-motivated reference-based compression scheme using the components of the LCS, rather than the LCS itself.

Results: Our basic scheme compressed the Homo sapiens genome (with an original size of 3,080,436,051 bytes) to 15,460,478 bytes. An improvement on the basic method further reduced this to 8,556,708 bytes, or an overall compression ratio of 360. This can be compared to the previous state-of-the-art compression ratios of 157 (Wang and Zhang, 2011) and 171 (Pinho, Pratas, and Garcia, 2011).

Conclusion: We propose a new algorithm to address the longest common subsequence problem. Motivated by our LCS algorithm, we introduce a new reference-based compression scheme for genome resequencing data. Comparative results against state-of-the-art reference-based compression algorithms demonstrate the performance of the proposed method.
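
For readers unfamiliar with the LCS problem itself, the textbook dynamic-programming formulation can be sketched as follows. This is not the suffix-tree and DAG algorithm described in the abstract; it is the classical quadratic-time baseline, shown only to make the problem statement concrete. The function name `lcs` and the sample strings are illustrative assumptions.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (classical DP)."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of the prefixes a[:i] and b[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover one optimal subsequence.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


# Toy example: a short "reference" and "target" differing in two positions.
reference = "ACGTACGT"
target = "ACTTAGGT"
print(len(lcs(reference, target)))  # prints 6
```

This baseline needs time and memory proportional to the product of the two sequence lengths, which is impractical for sequences of roughly three billion bytes; that cost is the usual motivation for approaches built on common substrings, such as the one the abstract describes.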

PubMed Central

The Research Repository @ WVU (West Virginia University)

A new algorithm for “the LCS problem” with application in compressing genome resequencing data

Author: A Apostolico
AJ Coxm
AJ Pinho
Aliya Farheen
C Wang
C-E Kuo
CG Nevill-Manning
D Adjeroh
D Gusfield
D Maier
Donald Adjeroh
DS Hirschberg
E Ukkonen
EW Myers
F Hach
G Jacobson
J Aach
J Yang
JW Hunt
M Crochemore
M Fritz
PA Pevzner
R Beal
R Giancarlo
Richard Beal
S Wandelt
Tazin Afrin
TF Smith
TH Cormen
Z Lin
Publication venue: 'Springer Science and Business Media LLC'

Crossref