14 research outputs found
Large-scale methods in computational genomics
The explosive growth in biological sequence data, coupled with the design and deployment of increasingly high-throughput sequencing technologies, has created a need for methods capable of processing large-scale sequence data in a time- and cost-effective manner. In this dissertation, we address this need through the development of faster algorithms, space-efficient methods, and high-performance parallel computing techniques for some key problems in computational genomics. The first problem addressed is the clustering of DNA sequences based on a measure of sequence similarity. Our clustering method: (i) guarantees linear space complexity, in contrast to the quadratic memory requirements of previously developed methods; (ii) identifies sequence pairs containing long maximal matches in decreasing order of their maximal match lengths in run-time proportional to the sum of input and output sizes; (iii) provides heuristics to significantly reduce the number of pairs evaluated for checking sequence similarity without affecting quality; and (iv) has parallel strategies that provide linear speedup and a proportionate reduction in space per processor. Our approach has significantly enhanced the problem size reach while also drastically reducing the time to solution. The next problem we address is the de novo detection of genomic repeats called Long Terminal Repeat (LTR) retrotransposons. Our algorithm guarantees linear space complexity and produces high-quality candidates for prediction in run-time proportional to the sum of input and output sizes. Validation of our approach on the yeast genome demonstrates both superior quality and performance results when compared to previously developed software. In a genome assembly project, fragments sequenced from a target genome are computationally assembled into numerous supersequences called "contigs", which are then ordered and oriented into "scaffolds". In this dissertation, we introduce a new problem called retroscaffolding for scaffolding contigs based on the knowledge of their LTR retrotransposon content. Through identification of sequencing gaps that span LTR retrotransposons, retroscaffolding provides a mechanism for prioritizing sequencing gaps for finishing purposes. While most of the problems addressed here have been studied previously, the main contribution of this dissertation is the development of methods that can scale to the largest available sequence collections.
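As a rough illustration of item (ii) in the abstract above, the sketch below lists sequence pairs in decreasing order of their longest exact match. It is a naive stand-in under stated assumptions, not the dissertation's algorithm: it compares every pair with a quadratic-time dynamic program, so it has neither the linear space bound nor the input-plus-output run-time claimed above. The function names and the `min_match` parameter are hypothetical.

```python
from itertools import combinations


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest exact match shared by a and b.

    Naive O(len(a) * len(b)) dynamic program that keeps only two rows.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the match that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def promising_pairs(seqs, min_match):
    """Return (match_length, i, j) for pairs whose longest match is at
    least min_match, sorted by decreasing match length."""
    pairs = []
    for i, j in combinations(range(len(seqs)), 2):
        length = longest_common_substring(seqs[i], seqs[j])
        if length >= min_match:
            pairs.append((length, i, j))
    pairs.sort(key=lambda item: -item[0])  # stable sort keeps ties in input order
    return pairs


if __name__ == "__main__":
    toy = ["ACGTACGTAA", "TTACGTACGG", "GGGGCCCC", "CGTACGTAAT"]
    for length, i, j in promising_pairs(toy, min_match=6):
        print(length, i, j)
```

Processing pairs with the longest matches first means the most likely overlaps are examined before weaker candidates, which is what makes early-exit heuristics worthwhile.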
Parallel clustering of expressed sequence tags
Expressed sequence tags, abbreviated ESTs, are DNA molecules experimentally derived from expressed portions of genes. Clustering of ESTs is essential for gene recognition, for understanding important genetic variations such as those resulting in diseases, and for removing redundancies in gene indices. Currently, the software programs most widely used for EST clustering are those developed for the related problem of fragment assembly. Because the two problems differ in nature and in their input, fragment assembly programs are not an ideal match for clustering large EST data sets. In this thesis, we present the design and development of a parallel software system that targets large-scale EST clustering. The novel features of our approach include 1) design of space-efficient algorithms to keep the space requirement linear in the size of the input data set, 2) a combination of algorithmic techniques to reduce the total work without sacrificing the quality of EST clustering, and 3) use of parallel processing to reduce the run-time and facilitate the clustering of large data sets. Using a combination of these techniques, we report the clustering of 144,870 Arabidopsis ESTs in 9.5 minutes on a 64-processor IBM xSeries cluster with 512 MB of memory per processor. CAP3, a state-of-the-art sequential fragment assembly program, cannot complete this problem with 512 MB because of insufficient memory, and it takes 247 minutes when the memory is increased to 1 GB. We also clustered 327,632 rat ESTs in 47 minutes on 64 processors with 512 MB of memory per processor.
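The timings quoted in this abstract can be put side by side with a few lines of arithmetic. The sketch below uses only the figures stated above; the helper names are hypothetical. Note that the wall-clock ratio compares two different programs with different memory budgets, so it should not be read as a parallel speedup or efficiency figure.

```python
def ests_per_minute(n_ests: int, minutes: float) -> float:
    """Average clustering throughput for one reported run."""
    return n_ests / minutes


def wall_clock_ratio(baseline_minutes: float, minutes: float) -> float:
    """How many times shorter one run is than a baseline run."""
    return baseline_minutes / minutes


# Figures taken from the abstract above.
ARABIDOPSIS = (144_870, 9.5)   # ESTs, minutes on 64 processors
RAT = (327_632, 47.0)          # ESTs, minutes on 64 processors
CAP3_MINUTES = 247.0           # sequential CAP3 run with 1 GB of memory

if __name__ == "__main__":
    print(f"Arabidopsis: {ests_per_minute(*ARABIDOPSIS):.0f} ESTs/min")
    print(f"Rat:         {ests_per_minute(*RAT):.0f} ESTs/min")
    print(f"CAP3 / parallel wall-clock ratio: "
          f"{wall_clock_ratio(CAP3_MINUTES, ARABIDOPSIS[1]):.1f}")
```

The lower throughput on the larger rat data set is consistent with clustering work growing faster than linearly in the number of ESTs, although the abstract itself does not state the cause.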
Parallel EST Clustering
Expressed sequence tags, abbreviated ESTs, are DNA fragments experimentally derived from expressed portions of genes. Clustering of ESTs is essential for gene recognition and for understanding important genetic variations such as those resulting in diseases. In this paper, we present the design and development of a parallel software system for EST clustering. The novel features of our approach include 1) space-efficient algorithms to keep the space requirement linear in the size of the input data set, 2) a combination of algorithmic techniques to reduce the total work without sacrificing the quality of EST clustering, and 3) use of parallel processing to reduce the run-time and facilitate the clustering of large data sets. Using a combination of these techniques, we report the clustering of 50,000 maize ESTs in 16 minutes on a 32-processor IBM SP. To our knowledge, this is the first effort to build a parallel software system for EST clustering.
Efficient clustering of large EST data sets on parallel computers
Clustering expressed sequence tags (ESTs) is a powerful strategy for gene identification, gene expression studies and identifying important genetic variations such as single nucleotide polymorphisms. To enable fast clustering of large-scale EST data, we developed PaCE (for Parallel Clustering of ESTs), a software program for EST clustering on parallel computers. In this paper, we report on the design and development of PaCE and its evaluation using Arabidopsis ESTs. The novel features of our approach include: (i) design of memory-efficient algorithms to reduce the memory required to linear in the size of the input, (ii) a combination of algorithmic techniques to reduce the computational work without sacrificing the quality of clustering, and (iii) use of parallel processing to reduce run-time and facilitate clustering of larger data sets. Using a combination of these techniques, we report the clustering of 168 200 Arabidopsis ESTs in 15 min on an IBM xSeries cluster with 30 dual-processor nodes. We also clustered 327 632 rat ESTs in 47 min and 420 694 Triticum aestivum ESTs in 3 h and 15 min. We demonstrate the quality of our software using benchmark Arabidopsis EST data, and by comparing it with CAP3, a software program widely used for EST assembly. Our software allows clustering of much larger EST data sets than is possible with current software. Because of its speed, it also facilitates multiple runs with different parameters, providing biologists a tool to better analyze EST sequence data. Using PaCE, we clustered EST data from 23 plant species and the results are available at the PlantGDB website.
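One way to read item (ii), reducing computational work without changing the clustering, is to skip the expensive similarity check for any candidate pair whose two ESTs already belong to the same cluster. The sketch below shows that idea with a disjoint-set (union-find) structure. It is an assumption about how such a heuristic could look, not a description of PaCE's implementation; `cluster`, `candidate_pairs`, and `is_similar` are hypothetical names, and `is_similar` stands in for whatever alignment test is used.

```python
class DisjointSet:
    """Union-find with path halving and union by size."""

    def __init__(self, n: int):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> bool:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


def cluster(n, candidate_pairs, is_similar):
    """Merge sequences 0..n-1 into clusters.

    candidate_pairs: iterable of (i, j), ideally most promising first.
    is_similar: callable (i, j) -> bool, the expensive check.
    Returns (sorted list of clusters, number of checks performed).
    """
    sets = DisjointSet(n)
    evaluated = 0
    for i, j in candidate_pairs:
        if sets.find(i) == sets.find(j):
            continue  # already clustered together: skip the check
        evaluated += 1
        if is_similar(i, j):
            sets.union(i, j)
    groups = {}
    for x in range(n):
        groups.setdefault(sets.find(x), []).append(x)
    return sorted(groups.values()), evaluated


if __name__ == "__main__":
    pairs = [(0, 1), (1, 2), (0, 2), (3, 4)]
    clusters, checks = cluster(5, pairs, lambda i, j: (i, j) != (3, 4))
    print(clusters, checks)
```

In the usage example, the pair (0, 2) is never checked because 0 and 2 are already linked through 1, so four candidate pairs cost only three similarity checks.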
Atlas of the Radical SAM Superfamily: Divergent Evolution of Function Using a "Plug and Play" Domain.
The radical SAM superfamily contains over 100,000 homologous enzymes that catalyze a remarkably broad range of reactions required for life, including metabolism, nucleic acid modification, and biogenesis of cofactors. While the highly conserved SAM-binding motif responsible for formation of the key 5'-deoxyadenosyl radical intermediate is a structural feature that simplifies identification of superfamily members, our understanding of their structure-function relationships is complicated by the modular nature of their structures, which exhibit varied and complex domain architectures. To gain new insight into these relationships, we classified the entire set of sequences into similarity-based subgroups that could be visualized using sequence similarity networks. This superfamily-wide analysis reveals important features that had not previously been appreciated from studies focused on one or a few members. Functional information mapped to the networks indicates which members have been experimentally or structurally characterized, their known reaction types, and their phylogenetic distribution. Despite the biological importance of radical SAM chemistry, the vast majority of superfamily members have never been experimentally characterized in any way, suggesting that many new reactions remain to be discovered. In addition to 20 subgroups with at least one known function, we identified additional subgroups made up entirely of sequences of unknown function. Importantly, our results indicate that even general reaction types fail to track well with our sequence similarity-based subgroupings, raising major challenges for function prediction for currently identified and new members that continue to be discovered. Interactive similarity networks and other data from this analysis are available from the Structure-Function Linkage Database.
Genome sequence analysis of the model grass Brachypodium distachyon: insights into grass genome evolution
Three subfamilies of grasses, the Ehrhartoideae (rice), the Panicoideae (maize, sorghum, sugar cane and millet), and the Pooideae (wheat, barley and cool season forage grasses), provide the basis of human nutrition and are poised to become major sources of renewable energy. Here we describe the complete genome sequence of the wild grass Brachypodium distachyon (Brachypodium), the first member of the Pooideae subfamily to be completely sequenced. Comparison of the Brachypodium, rice and sorghum genomes reveals a precise sequence-based history of genome evolution across a broad diversity of the grass family and identifies nested insertions of whole chromosomes into centromeric regions as a predominant mechanism driving chromosome evolution in the grasses. The relatively compact genome of Brachypodium is maintained by a balance of retroelement replication and loss. The complete genome sequence of Brachypodium, coupled to its exceptional promise as a model system for grass research, will support the development of new energy and food crops.