Large-scale methods in computational genomics

Author: Kalyanaraman Anantharaman
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2006

The explosive growth in biological sequence data, coupled with the design and deployment of increasingly high-throughput sequencing technologies, has created a need for methods capable of processing large-scale sequence data in a time- and cost-effective manner. In this dissertation, we address this need through the development of faster algorithms, space-efficient methods, and high-performance parallel computing techniques for some key problems in computational genomics. The first problem addressed is the clustering of DNA sequences based on a measure of sequence similarity. Our clustering method: (i) guarantees linear space complexity, in contrast to the quadratic memory requirements of previously developed methods; (ii) identifies sequence pairs containing long maximal matches in decreasing order of their maximal match lengths in run-time proportional to the sum of input and output sizes; (iii) provides heuristics to significantly reduce the number of pairs evaluated for checking sequence similarity without affecting quality; and (iv) has parallel strategies that provide linear speedup and a proportionate reduction in space per processor. Our approach has significantly enhanced the problem size reach while also drastically reducing the time to solution. The next problem we address is the de novo detection of genomic repeats called Long Terminal Repeat (LTR) retrotransposons. Our algorithm guarantees linear space complexity and produces high-quality candidates for prediction in run-time proportional to the sum of input and output sizes. Validation of our approach on the yeast genome demonstrates superior quality and performance when compared to previously developed software. In a genome assembly project, fragments sequenced from a target genome are computationally assembled into numerous supersequences called contigs, which are then ordered and oriented into scaffolds. In this dissertation, we introduce a new problem called retroscaffolding for scaffolding contigs based on the knowledge of their LTR retrotransposon content. Through identification of sequencing gaps that span LTR retrotransposons, retroscaffolding provides a mechanism for prioritizing sequencing gaps for finishing purposes. While most of the problems addressed here have been studied previously, the main contribution in this dissertation is the development of methods that can scale to the largest available sequence collections.

Digital Repository @ Iowa State University (ISU)

Parallel clustering of expressed sequence tags

Author: Kalyanaraman Anantharaman
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2002

Expressed sequence tags, abbreviated ESTs, are DNA molecules experimentally derived from expressed portions of genes. Clustering of ESTs is essential for gene recognition, understanding important genetic variations such as those resulting in diseases, and removing redundancies in gene indices. Currently, the software programs most widely used for EST clustering are those developed for solving the related problem of fragment assembly. Due to the differences in the nature of the problems and the input, fragment assembly programs are not an ideal match for clustering large EST data sets. In this thesis, we present the design and development of a parallel software system that targets large-scale EST clustering. The novel features of our approach include 1) design of space-efficient algorithms to keep the space requirement linear in the size of the input data set, 2) a combination of algorithmic techniques to reduce the total work without sacrificing the quality of EST clustering, and 3) use of parallel processing to reduce the run-time and facilitate the clustering of large data sets. Using a combination of these techniques, we report the clustering of 144,870 Arabidopsis ESTs in 9.5 minutes on a 64-processor IBM xSeries cluster with 512 MB memory per processor. With CAP3, a state-of-the-art sequential fragment assembly program, the same problem fails to run on 512 MB due to insufficient memory and takes 247 minutes when the memory is increased to 1 GB. We also clustered 327,632 rat ESTs in 47 minutes on 64 processors with 512 MB memory per processor.

Digital Repository @ Iowa State University (ISU)

Parallel EST Clustering

Author: Anantharaman Kalyanaraman
Srinivas Aluru
Suresh Kothari
Publication date: 01/01/2002

Expressed sequence tags, abbreviated ESTs, are DNA fragments experimentally derived from expressed portions of genes. Clustering of ESTs is essential for gene recognition and understanding important genetic variations such as those resulting in diseases. In this paper, we present the design and development of a parallel software system for EST clustering. The novel features of our approach include 1) space-efficient algorithms to keep the space requirement linear in the size of the input data set, 2) a combination of algorithmic techniques to reduce the total work without sacrificing the quality of EST clustering, and 3) use of parallel processing to reduce the run-time and facilitate the clustering of large data sets. Using a combination of these techniques, we report the clustering of 50,000 maize ESTs in 16 minutes on a 32-processor IBM SP. To our knowledge, this is the first effort in building a parallel software system for EST clustering.

CiteSeerX

Crossref

Efficient clustering of large EST data sets on parallel computers

Author: Aluru Srinivas
Brendel Volker
Kalyanaraman Anantharaman
Kothari Suresh
Publication venue: Oxford, UK
Publication date: 01/01/2003

Clustering expressed sequence tags (ESTs) is a powerful strategy for gene identification, gene expression studies and identifying important genetic variations such as single nucleotide polymorphisms. To enable fast clustering of large-scale EST data, we developed PaCE (for Parallel Clustering of ESTs), a software program for EST clustering on parallel computers. In this paper, we report on the design and development of PaCE and its evaluation using Arabidopsis ESTs. The novel features of our approach include: (i) design of memory-efficient algorithms to reduce the memory required to linear in the size of the input, (ii) a combination of algorithmic techniques to reduce the computational work without sacrificing the quality of clustering, and (iii) use of parallel processing to reduce run-time and facilitate clustering of larger data sets. Using a combination of these techniques, we report the clustering of 168 200 Arabidopsis ESTs in 15 min on an IBM xSeries cluster with 30 dual-processor nodes. We also clustered 327 632 rat ESTs in 47 min and 420 694 Triticum aestivum ESTs in 3 h and 15 min. We demonstrate the quality of our software using benchmark Arabidopsis EST data, and by comparing it with CAP3, a software program widely used for EST assembly. Our software allows clustering of much larger EST data sets than is possible with current software. Because of its speed, it also facilitates multiple runs with different parameters, providing biologists a tool to better analyze EST sequence data. Using PaCE, we clustered EST data from 23 plant species and the results are available at the PlantGDB website.

CiteSeerX

PubMed Central

Washington State University institutional repository

Atlas of the Radical SAM Superfamily: Divergent Evolution of Function Using a "Plug and Play" Domain.

Author: Akiva
Altschul
Anantharaman
Ashburner
Atkinson
Babbitt
Baker
Barber
Barr
Benjdia
Berman
Betz
Blaszczyk
Booker
Broderick
Brown
Brown
Burroughs
Calhoun
Cicchillo
Cicchillo
Coquille
Dawson
de Beer
Dinis
Dowling
Eddy
Finn
Finn
Furnham
Gerlt
Gizzi
Grell
Hanzelmann
Hermann
Hiratsuka
Holliday
Holliday
Holliday
Kalyanaraman
Kamat
Knappe
LaMattina
Lanz
Laskowski
Lee
Li
Lotierzo
Mahanta
Mahanta
Mancia
Marcotte
Mashiyama
Miller
Moss
Nicolet
Padovani
Pierre
Pilet
Puehringer
Radivojac
Rahman
Reyda
Schnoes
Shannon
Sievers
Smoot
Sofia
Tamuri
Tao
Tipton
UniProt Consortium
Vey
Vey
Wang
Yang
Young
Yu
Zhang
Zhao
Publication venue: eScholarship, University of California
Publication date: 01/01/2018

The radical SAM superfamily contains over 100,000 homologous enzymes that catalyze a remarkably broad range of reactions required for life, including metabolism, nucleic acid modification, and biogenesis of cofactors. While the highly conserved SAM-binding motif responsible for formation of the key 5'-deoxyadenosyl radical intermediate is a key structural feature that simplifies identification of superfamily members, our understanding of their structure-function relationships is complicated by the modular nature of their structures, which exhibit varied and complex domain architectures. To gain new insight about these relationships, we classified the entire set of sequences into similarity-based subgroups that could be visualized using sequence similarity networks. This superfamily-wide analysis reveals important features that had not previously been appreciated from studies focused on one or a few members. Functional information mapped to the networks indicates which members have been experimentally or structurally characterized, their known reaction types, and their phylogenetic distribution. Despite the biological importance of radical SAM chemistry, the vast majority of superfamily members have never been experimentally characterized in any way, suggesting that many new reactions remain to be discovered. In addition to 20 subgroups with at least one known function, we identified additional subgroups made up entirely of sequences of unknown function. Importantly, our results indicate that even general reaction types fail to track well with our sequence similarity-based subgroupings, raising major challenges for function prediction for currently identified and new members that continue to be discovered. Interactive similarity networks and other data from this analysis are available from the Structure-Function Linkage Database.

Crossref

eScholarship - University of California

Genome sequence analysis of the model grass Brachypodium distachyon: insights into grass genome evolution

Author: Abrouk Michael
Anderson Olin D.
Barbazuk Brad
Barry Kerrie
Bartley Laura E.
Baxter Ivan
Belcram Harry
Bevan Michael
Bevan Michael
Bragg Jennifer N.
Bryant Douglas W.
Buchmann Jan P.
Budak Hikmet
Byrne Mary E.
Cao Peijian
Carrington James C.
Cass Cynthia L.
Chalhoub Boulos
Chang Jeff H.
Chapman Elisabeth J.
Charles Mathieu
Dardick Christopher D.
Dvorak Jan
Fahlgren Noah
Febrer Melanie
Ganssmann Matthias
Garvin David F.
German Marcelo
Green Pamela J.
Grimwood Jane
Grotewold Erich
Gundlach Heidrun
Hö
Haberer Georg
Harholt Jesper
Harmon Frank
Harmon-Smith Miranda
Heese Maren
Hematy Kian
Higgins Janet
Hsia An-Ping
Huo Naxin
Idziak Dominika
Inzé
Jiang Ning
Jung Ki-Hong
Kalyanaraman Anantharaman
Kimbrel Jeffrey A.
Lai Jinsheng
Lail Kathleen
Laudencia-Chingcuanco Debbie
Lazo Gerard R.
Lindquist Erika
Luo Ming-Cheng
Ma Jianxin
Maia Luciano da C.
May Greg D.
Mayer Klaus
McKenzie Neil
Messing Joachim
Meyers Blake C.
Mouille Gregory
Mueller Lukas A.
Murat Florent
O'Connor Devin
Oliveira Antonio Costa de
Pelloux Jé
Priest Henry D.
Pritham Ellen
Rokhsar Dan
Ronald Pamela
Rose Jocelyn K. C.
Salse Jerome
Scheller Henrik V.
Schmutz Jeremy
Schnable James
Schnable Patrick S.
Schnittger Arp
Schulman Alan H.
Sedbrook John C.
Sharma Manoj K
Spannagl Manuel
Sullivan Christopher M.
Sun Cheng
Tanskanen Jaakko
Thomson James
Tice Hope
Tuskan Gerald A.
Tyler Ludmila
Ulvskov Peter
Vega-Sanchez Miguel
Vogel John P.
Wang Mei
Wright Jonathan
Wu Haiyan
Wu Jiajie
Yilmaz Alper
You Frank M.
Zhai Jixian
Zhu Liucun
Publication venue: eScholarship, University of California
Publication date: 09/12/2009

Three subfamilies of grasses, the Ehrhartoideae (rice), the Panicoideae (maize, sorghum, sugar cane and millet), and the Pooideae (wheat, barley and cool season forage grasses), provide the basis of human nutrition and are poised to become major sources of renewable energy. Here we describe the complete genome sequence of the wild grass Brachypodium distachyon (Brachypodium), the first member of the Pooideae subfamily to be completely sequenced. Comparison of the Brachypodium, rice and sorghum genomes reveals a precise sequence-based history of genome evolution across a broad diversity of the grass family and identifies nested insertions of whole chromosomes into centromeric regions as a predominant mechanism driving chromosome evolution in the grasses. The relatively compact genome of Brachypodium is maintained by a balance of retroelement replication and loss. The complete genome sequence of Brachypodium, coupled to its exceptional promise as a model system for grass research, will support the development of new energy and food crops.

eScholarship - University of California