A clustering method for repeat analysis in DNA sequences

Author: Haas Brian J
Salzberg Steven L
Volfovsky Natalia
Publication venue: BioMed Central
Publication date: 01/01/2001

BACKGROUND: A computational system for analysis of the repetitive structure of genomic sequences is described. The method uses suffix trees to organize and search the input sequences; this data structure has been used previously for efficient computation of exact and degenerate repeats. RESULTS: The resulting software tool collects all repeat classes and outputs summary statistics as well as a file containing multiple sequences (multi-FASTA) that can be used as the target of searches. Its use is demonstrated here on several complete microbial genomes, the entire Arabidopsis thaliana genome, and a large collection of rice bacterial artificial chromosome end sequences. CONCLUSIONS: We propose a new clustering method for analysis of the repeat data captured in suffix trees. This method has been incorporated into a system that can find repeats in individual genome sequences or sets of sequences, and that can organize those repeats into classes. It quickly and accurately creates repeat databases from small and large genomes. The associated software (RepeatFinder) should prove helpful in the analysis of repeat structure for both complete and partial genome sequences.

CiteSeerX

Digital Repository at the University of Maryland

Bioinformatics: Strategies, Trends, and Perspectives

Author: Adriane Beatriz de Souza Serapião
Carlos Norberto Fischer
Publication venue: IntechOpen
Publication date: 01/03/2010

IntechOpen

Skittle: A 2-Dimensional Genome Visualization Tool

Author: Birney
D Sussillo
E Lieberman-Aiden
EN Trifonov
G Benson
GM Weinstock
GS Baldwin
I López-Villaseñor
J Sánchez
JF Canny
John C Sanford
Josiah D Seaman
M Costantini
MB Gerstein
MK Rudd
P Schieg
S Kurtz
X She
Publication venue: BioMed Central
Publication date: 01/01/2009

BACKGROUND: It is increasingly evident that there are multiple and overlapping patterns within the genome, and that these patterns contain different types of information regarding both genome function and genome history. In order to discover additional genomic patterns which may have biological significance, novel strategies are required. To partially address this need, we introduce a new data visualization tool entitled Skittle. RESULTS: This program first creates a 2-dimensional nucleotide display by assigning four colors to the four nucleotides, and then text-wraps to a user-adjustable width. This nucleotide display is accompanied by a "repeat map" which comprehensively displays all local repeating units, based upon analysis of all possible local alignments. Skittle includes a smooth-zooming interface which allows the user to analyze genomic patterns at any scale. Skittle is especially useful in identifying and analyzing tandem repeats, including repeats not normally detectable by other methods. However, Skittle is also more generally useful for analysis of any genomic data, allowing users to correlate published annotations and observable visual patterns, and allowing for sequence and construct quality control. CONCLUSIONS: Preliminary observations using Skittle reveal intriguing genomic patterns not otherwise obvious, including structured variations inside tandem repeats. The striking visual patterns revealed by Skittle appear to be useful for hypothesis development, and have already led the authors to theorize that imperfect tandem repeats could act as information carriers, and may form tertiary structures within the interphase nucleus.
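
The wrapped nucleotide display described in this abstract can be illustrated with a minimal Python sketch. It is not Skittle's code: the function names and the color palette are hypothetical, since the abstract does not say which colors are assigned to which nucleotide.

```python
# Hypothetical palette: the abstract only says four colors map to four nucleotides.
PALETTE = {"A": "red", "C": "green", "G": "blue", "T": "yellow"}


def wrap_sequence(seq, width):
    """Split a sequence into rows of `width` bases (the last row may be shorter)."""
    if width <= 0:
        raise ValueError("width must be a positive integer")
    seq = seq.upper()
    return [seq[i:i + width] for i in range(0, len(seq), width)]


def color_grid(seq, width, palette=PALETTE, unknown="gray"):
    """Map each base of the wrapped sequence to a color name, row by row."""
    return [[palette.get(base, unknown) for base in row]
            for row in wrap_sequence(seq, width)]


if __name__ == "__main__":
    # A tandem repeat of period 4 lines up in identical rows when width is 4.
    for row in wrap_sequence("ACGTACGTACGTAC", 4):
        print(row)
```

When the wrap width equals the period of a tandem repeat (or a multiple of it), identical bases stack into vertical columns, which is presumably why an adjustable width helps expose such repeats visually.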

Springer - Publisher Connector

Genome Sequencing and Analysis of Yersina pestis KIM D27, an Avirulent Strain Exempt from Select Agent Regulation

Author: Durkin Scott
Hostetler Jessica
Kim Maria
Losada Liliana
Nierman William C.
Radune Diana
Schneewind Olaf
Varga John J.
Publication venue
Publication date: 25/01/2024

Yersinia pestis is the causative agent of the plague. Y. pestis KIM 10+ strain was passaged and selected for loss of the 102 kb pgm locus, resulting in an attenuated strain, KIM D27. In this study, whole genome sequencing was performed on KIM D27 in order to identify any additional differences. Initial assemblies of 454 data were highly fragmented, and various bioinformatic tools detected between 15 and 465 SNPs and INDELs when comparing both strains, the vast majority associated with A or T homopolymer sequences. Consequently, Illumina sequencing was performed to improve the quality of the assembly. Hybrid sequence assemblies were performed and a total of 56 validated SNP/INDELs and 5 repeat differences were identified in the D27 strain relative to the published KIM 10+ sequence. However, further analysis showed that 55 of these SNP/INDELs and 3 repeats were errors in the KIM 10+ reference sequence. We conclude that both 454 and Illumina sequencing were required to obtain the most accurate and rapid sequence results for Y. pestis KIM D27. SNP and INDEL calls were most accurate when both Newbler and CLC Genomics Workbench were employed. For purposes of obtaining high quality genome sequence differences between strains, any identified differences should be verified in both the new and reference genomes.

Knowledge UChicago

Genome Sequencing and Analysis of Yersina pestis KIM D27, an Avirulent Strain Exempt from Select Agent Regulation

Author: Durkin Scott
Hostetler Jessica
Kim Maria
Losada Liliana
Nierman William C.
Radune Diana
Schneewind Olaf
Varga John J.
Publication venue: Public Library of Science
Publication date: 01/04/2011

DNA Data Visualization (DDV): Software for Generating Web-Based Interfaces Supporting Navigation and Analysis of DNA Sequence Data of Entire Genomes

Author: Bordeleau Eric
Brzezinski Ryszard
Burrus Vincent
Neugebauer Tomasz
Publication venue: Public Library of Science (PLoS)
Publication date: 01/01/2015

Data visualization methods are necessary during the exploration and analysis activities of an increasingly data-intensive scientific process. There are few existing visualization methods for raw nucleotide sequences of a whole genome or chromosome. Software for data visualization should allow the researchers to create accessible data visualization interfaces that can be exported and shared with others on the web. Herein, novel software developed for generating DNA data visualization interfaces is described. The software converts DNA data sets into images that are further processed as multi-scale images to be accessed through a web-based interface that supports zooming, panning and sequence fragment selection. Nucleotide composition frequencies and GC skew of a selected sequence segment can be obtained through the interface. The software was used to generate DNA data visualization of human and bacterial chromosomes. Examples of visually detectable features, such as short and long direct repeats, long terminal repeats, mobile genetic elements, and heterochromatic segments in microbial and human chromosomes, are presented. The software and its source code are available for download and further development. The visualization interfaces generated with the software allow for the immediate identification and observation of several types of sequence patterns in genomes of various sizes and origins. The visualization interfaces generated with the software are readily accessible through a web browser. This software is a useful research and teaching tool for genetics and structural genomics.
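
As a rough illustration of the two per-segment statistics this abstract mentions, the Python sketch below computes nucleotide composition frequencies and GC skew. It is not DDV's code: the function names are hypothetical, and the skew uses the common definition (G - C) / (G + C), which the abstract does not confirm as the formula DDV applies.

```python
from collections import Counter


def nucleotide_frequencies(segment):
    """Return the fraction of A, C, G and T among the unambiguous bases."""
    counts = Counter(segment.upper())
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        return {base: 0.0 for base in "ACGT"}
    return {base: counts[base] / total for base in "ACGT"}


def gc_skew(segment):
    """GC skew under the common definition (G - C) / (G + C); 0.0 if no G or C."""
    counts = Counter(segment.upper())
    g, c = counts["G"], counts["C"]
    if g + c == 0:
        return 0.0
    return (g - c) / (g + c)


if __name__ == "__main__":
    selected = "GGGCATAT"  # stand-in for a user-selected sequence segment
    print(nucleotide_frequencies(selected))
    print(gc_skew(selected))
```

Ambiguity codes such as N are ignored in both functions, which is one possible choice; a tool could equally report them as a separate category.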

Concordia University Research Repository

Masking repeats while clustering ESTs

Author: Coward Eivind
Jonassen Inge
Malde Ketil
Schneeberger Korbinian
Publication venue: Oxford University Press
Publication date: 01/01/2005

A problem in EST clustering is the presence of repeat sequences. To avoid false matches, repeats have to be masked. This can be a time-consuming process, and it depends on available repeat libraries. We present a fast and effective method that aims to eliminate the problems repeats cause in the process of clustering. Unlike traditional methods, repeats are inferred directly from the EST data; we do not rely on any external library of known repeats. This makes the method especially suitable for analysing the ESTs from organisms without good repeat libraries. We demonstrate that the result is very similar to performing standard repeat masking before clustering.

CiteSeerX

Public Library of Science (PLOS)

Local Gene Regulation Details a Recognition Code within the LacI Transcriptional Factor Family

Author: A Glasfeld
A Sandelin
A Sarai
A Ureta-Vidal
AE Kazakov
AV Morozov
BM Hall
BW Matthews
C Francke
CE Bell
CG Kalodimos
CI Jørgensen
CO Pabo
EJ Alm
Eric J. Alm
FM Camas
Francisco M. Camas
G Kolesov
G Paillard
Gary D. Stormo
GP Smith
J Boch
J Castresana
J Nardelli
J Sartorius
J Schultz
JL Betz
JO Korbel
JR Desjarlais
Juan F. Poyatos
L Milk
M Lewis
M Perros
M Suzuki
MA Schumacher
MA Schumacher
MJ Moscou
MJ Weickert
MM Gromiha
NC Seeman
NM Luscombe
P Baldi
PB Warren
PV Benos
R Hershberg
RC Edgar
RK Salinas
S Mahony
S Mahony
SA Wolfe
SJ Maerlk
T Sera
TA Desai
V Espinosa Angarica
W Thompson
WW Wasserman
Y Choo
Publication venue: Public Library of Science (PLoS)
Publication date: 01/01/2010

The specific binding of regulatory proteins to DNA sequences exhibits no clear patterns of association between amino acids (AAs) and nucleotides (NTs). This complexity of protein-DNA interactions raises the question of whether a simple set of wide-coverage recognition rules can ever be identified. Here, we analyzed this issue using the extensive LacI family of transcriptional factors (TFs). We searched for recognition patterns by introducing a new approach to phylogenetic footprinting, based on the pervasive presence of local regulation in prokaryotic transcriptional networks. We identified a set of specificity correlations, determined by two AAs of the TFs and two NTs in the binding sites, that is conserved throughout a dominant subgroup within the family regardless of the evolutionary distance, and that acts as a relatively consistent recognition code. The proposed rules are confirmed with data from previous experimental studies and by events of convergent evolution in the phylogenetic tree. The presence of a code emphasizes the stable structural context of the LacI family, while defining a precise blueprint to reprogram TF specificity with many practical applications. Funding: Ministerio de Ciencia e Innovación, Spain (Formación de Profesorado Universitario fellowship); Ministerio de Ciencia e Innovación, Spain (grant BFU2008-03632/BMC); Madrid (Spain : Region) (grant CCG08-CSIC/SAL-3651).

CiteSeerX

DSpace@MIT