16 research outputs found

Damming the genomic data flood using a comprehensive analysis and storage data structure

Author: Andrew M.K. Brown
Antofie
Barbasiewicz
Barsky
Butts
Cohn
Colliat
Cooper
De Francesco
Fayyad
Férey
Hong
Jean-Claude Tardif
Marc Bouffard
Michael S. Phillips
Moore
Olund
Phoophakdee
Purcell
Ramakrishnan
Sen
Sharon Marsh
Tibor van Rooij
Wall
Publication venue: Oxford University Press

Data generation, driven by rapid advances in genomic technologies, is fast outpacing our analysis capabilities. Faced with this flood of data, more hardware and software resources are added to accommodate data sets whose structure has not specifically been designed for analysis. This leads to unnecessarily lengthy processing times and excessive data handling and storage costs. Current efforts to address this have centered on developing new indexing schemas and analysis algorithms, whereas the root of the problem lies in the format of the data itself. We have developed a new data structure for storing and analyzing genotype and phenotype data. By leveraging data normalization techniques, database management system capabilities and a novel multi-table, multidimensional database structure, we have eliminated the following: (i) unnecessarily large data set size due to high levels of redundancy, (ii) sequential access to these data sets and (iii) common bottlenecks in analysis times. The resulting data structure horizontally divides the data to circumvent traditional problems associated with the use of databases for very large genomic data sets. The resulting data set required 86% less disk space and performed analytical calculations 6248 times faster than a standard approach, without any loss of information.

Crossref

PubMed Central

Analyzing very large time series using suffix arrays

Author: A Crauser
B Phoophakdee
C-F Cheung
D Bao
D Gusfield
EMM Creight
F Rasheed
Konstantinos F. Xylogiannopoulos
M Barsky
MG Elfeky
P Ko
Panagiotis Karampelas
R Dementiev
Reda Alhajj
Publication venue: Springer Science and Business Media LLC

Crossref

MIRAGE: A Framework for Mining, Exploring and Visualizing Minimal Association Rules

Author: Benjarath Phoophakdee
Mohammed J. Zaki

In this paper we propose the concept of minimal association rules, the most general rules that satisfy a given support and confidence threshold. We present MIRAGE, a new framework for mining and visually exploring the minimal rules. MIRAGE uses a lattice-based interactive rule visualization approach, displaying the rules in a very compact form; all association rules can also be generated if desired. MIRAGE uses a database back-end to store the state of exploration for easy retrieval at a later point in time.
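The support and confidence thresholds mentioned in this abstract can be illustrated with a minimal Python sketch. It is not MIRAGE's algorithm and does not attempt to find minimal rules; the function names (`support`, `confidence`, `satisfies_thresholds`) and the toy transactions are assumptions made for illustration only.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / base


def satisfies_thresholds(transactions, antecedent, consequent,
                         min_support, min_confidence):
    """True if the rule meets both user-chosen thresholds (hypothetical helper)."""
    joint = support(transactions, set(antecedent) | set(consequent))
    conf = confidence(transactions, antecedent, consequent)
    return joint >= min_support and conf >= min_confidence


# Toy data, invented for this example.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
rule_ok = satisfies_thresholds(transactions, {"a"}, {"b"}, 0.5, 0.6)
```

Here the rule a -> b is checked against a minimum support of 0.5 and a minimum confidence of 0.6; a rule miner would run such a check over many candidate rules.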

CiteSeerX

Genome-scale Disk-based Suffix Tree Indexing

Author: Benjarath Phoophakdee
Mohammed J. Zaki

With the exponential growth of biological sequence databases, it has become critical to develop effective techniques for storing, querying, and analyzing these massive data. Suffix trees are widely used to solve many sequence-based problems, and they can be built in linear time and space, provided the resulting tree fits in main memory. To index larger sequences, several external suffix tree algorithms have been proposed in recent years. However, they suffer from several problems such as susceptibility to data skew, non-scalability to genome-scale sequences, and non-existence of suffix links, which are crucial in various suffix tree based algorithms. In this paper, we target DNA sequences and propose a novel disk-based suffix tree algorithm called Trellis which effectively scales up to genome-scale sequences. Specifically, it can index the entire human genome using 2GB of memory, in about 4 hours and can recover all its suffix links within 2 hours. Trellis was compared to various state-of-the-art persistent disk-based suffix tree construction algorithms, and was shown to outperform the best previous methods, both in terms of indexing time and querying time.
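To show what a suffix index answers, here is a minimal in-memory sketch of an uncompressed suffix trie in Python. It is not the Trellis algorithm: it is a naive quadratic-space structure with no disk storage and no suffix links, and the names `build_suffix_trie` and `contains` are hypothetical.

```python
def build_suffix_trie(text):
    """Insert every suffix of `text` into a nested-dict trie (naive, O(n^2) space)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            # Reuse the child for this character, or create it if missing.
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """True if `pattern` occurs in the indexed text.

    Every substring is a prefix of some suffix, so walking the pattern
    from the root succeeds exactly when the pattern occurs.
    """
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


# Toy DNA-like string, invented for this example.
trie = build_suffix_trie("GATTACA")
found = contains(trie, "TTAC")
missing = contains(trie, "TAG")
```

A lookup costs time proportional to the pattern length, independent of the text length. A real suffix tree compresses single-child paths to reach linear space.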

CiteSeerX

TRELLIS+: An Effective Approach for Indexing Genome-Scale Sequences Using Suffix Trees

Author: Benjarath Phoophakdee
Mohammed J. Zaki

With advances in high-throughput sequencing methods, and the corresponding exponential growth in sequence data, it has become critical to develop scalable data management techniques for sequence storage, retrieval and analysis. In this paper we present a novel disk-based suffix tree approach, called Trellis+, that effectively scales to massive amounts of sequence data using only a limited amount of main memory, based on a novel string buffering strategy. We show experimentally that Trellis+ outperforms existing suffix tree approaches; it is able to index genome-scale sequences (e.g., the entire human genome), and it also allows rapid query processing over the disk-based index. Availability: TRELLIS+ source code is available online a

CiteSeerX

Parallel Continuous Flow: A Parallel Suffix Tree Construction Tool for Whole Genomes

Author: Apostolico A.
Gropp W.
Hariharan R.
Matteo Comin
Montse Farreras
Phoophakdee B.
Tata S.
Publication venue: Mary Ann Liebert Inc

Crossref

Generic Pattern Mining via Data Mining Template Library

Author: Benjarath Phoophakdee
Feng Gao
Jeevan Pathuri
Joe Urban
Mohammed J. Zaki
Nagender Parimi
Nilanjana De
Paolo Palmerini

Frequent Pattern Mining (FPM) is a very powerful paradigm for mining informative and useful patterns in massive, complex datasets. In this paper we propose the Data Mining Template Library (DMTL), a collection of generic containers and algorithms for data mining, as well as persistency and database management classes. DMTL provides a systematic solution to a whole class of common FPM tasks like itemset, sequence, tree and graph mining. DMTL is extensible, scalable, and high-performance for rapid response on massive datasets. A detailed set of experiments shows that DMTL is competitive with special-purpose algorithms designed for a particular pattern type, especially as database sizes increase.

CiteSeerX

A simple parallel cartesian tree algorithm and its application to parallel suffix tree construction

Author: Blelloch G. E.
Farach M.
Gusfield D.
Iliopoulos C.
Jaja J.
Kasai T.
Meek C.
Phoophakdee B.
Poon C. K.
Publication venue: Association for Computing Machinery (ACM)

Crossref