BACKGROUND: A computational system for analysis of the repetitive structure of genomic sequences is described. The method uses suffix trees to organize and search the input sequences; this data structure has been used previously for efficient computation of exact and degenerate repeats. RESULTS: The resulting software tool collects all repeat classes and outputs summary statistics as well as a multi-FASTA file of the repeat sequences that can be used as the target of searches. Its use is demonstrated here on several complete microbial genomes, the entire Arabidopsis thaliana genome, and a large collection of rice bacterial artificial chromosome end sequences. CONCLUSIONS: We propose a new clustering method for analysis of the repeat data captured in suffix trees. This method has been incorporated into a system that can find repeats in individual genome sequences or sets of sequences, and that can organize those repeats into classes. It quickly and accurately creates repeat databases from small and large genomes. The associated software (RepeatFinder) should prove helpful in the analysis of repeat structure for both complete and partial genome sequences.
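The idea of finding repeats and then organizing them into classes can be illustrated with a minimal sketch. This is not the RepeatFinder algorithm: the abstract does not give its details, so the code below makes several labeled assumptions. A plain k-mer index stands in for the suffix tree, only exact repeats of a fixed length `k` are found, and two repeats are placed in the same class whenever any of their occurrences overlap on the sequence. The function names `find_exact_repeats` and `cluster_repeats` are hypothetical.

```python
from collections import defaultdict


def find_exact_repeats(seq, k):
    """Return {kmer: [start positions]} for every k-mer seen at least twice.

    Assumption: a dictionary of k-mers replaces the suffix tree used by the
    real tool; it finds the same exact repeats of length k, less efficiently.
    """
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) > 1}


def cluster_repeats(repeats, k):
    """Group repeated k-mers into classes (illustrative merging rule only).

    Two k-mers join the same class when any of their occurrences overlap,
    i.e. when their start positions differ by less than k.
    """
    parent = {kmer: kmer for kmer in repeats}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All occurrences have length k, so after sorting by start position it is
    # enough to compare each occurrence with its immediate neighbour.
    occurrences = sorted((p, kmer) for kmer, pos in repeats.items() for p in pos)
    for (p1, a), (p2, b) in zip(occurrences, occurrences[1:]):
        if p2 - p1 < k:
            parent[find(a)] = find(b)

    classes = defaultdict(set)
    for kmer in repeats:
        classes[find(kmer)].add(kmer)
    return sorted(sorted(members) for members in classes.values())


if __name__ == "__main__":
    sequence = "ACGTATTACGTACCGGCATGGGCA"
    repeats = find_exact_repeats(sequence, 4)
    for members in cluster_repeats(repeats, 4):
        print(members)
```

On the toy sequence, the overlapping repeats ACGT and CGTA fall into one class and GGCA forms a class of its own. A real tool would also have to handle degenerate repeats, variable repeat lengths, and reverse complements, none of which this sketch attempts.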

Haas, Brian J

Salzberg, Steven L

Volfovsky, Natalia

English

PubMed

Digital Repository at the University of Maryland

A clustering method for repeat analysis in DNA sequences.

http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=55324
