
14 research outputs found

Computing regularities in strings

Author: Smyth William
Yusufu M.
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2009

Regularities in strings model many phenomena and thus form the subject of extensive mathematical studies. Perhaps the most conspicuous regularities in strings are those that manifest themselves in the form of repeated subpatterns. In this paper, we study several forms of regularities in strings, namely repeats, multirepeats, repetitions and runs. We present their similarities and differences by discussing their forms and properties, and we explore the existing algorithms for computing them. We also discuss several data structures useful for computing regularities.

Crossref

Research Repository

espace@Curtin

Searching of gapped repeats and subrepetitions in a word

Author: D. Gusfield
G. Brodal
J. Storer
M. Crochemore
P. van Emde Boas
R. Kolpakov
T. Kociumaka
Z. Galil
Publication venue
Publication date: 29/09/2013

A gapped repeat is a factor of the form $uvu$, where $u$ and $v$ are nonempty words. The period of the gapped repeat is defined as $|u|+|v|$. The gapped repeat is maximal if it cannot be extended to the left or to the right by at least one letter while preserving its period. The gapped repeat is called $\alpha$-gapped if its period is not greater than $\alpha |v|$. A $\delta$-subrepetition is a factor whose exponent is less than 2 but not less than $1+\delta$ (the exponent of a factor is the quotient of its length and its minimal period). The $\delta$-subrepetition is maximal if it cannot be extended to the left or to the right by at least one letter while preserving its minimal period. We reveal a close relation between maximal gapped repeats and maximal subrepetitions. Moreover, we show that in a word of length $n$ the number of maximal $\alpha$-gapped repeats is bounded by $O(\alpha^2n)$ and the number of maximal $\delta$-subrepetitions is bounded by $O(n/\delta^2)$. Using the obtained upper bounds, we propose algorithms for finding all maximal $\alpha$-gapped repeats and all maximal $\delta$-subrepetitions in a word of length $n$. The algorithm for finding all maximal $\alpha$-gapped repeats has $O(\alpha^2n)$ time complexity for the case of constant alphabet size and $O(n\log n + \alpha^2n)$ time complexity for the general case. For finding all maximal $\delta$-subrepetitions we propose two algorithms. The first algorithm has $O(\frac{n\log\log n}{\delta^2})$ time complexity for the case of constant alphabet size and $O(n\log n +\frac{n\log\log n}{\delta^2})$ time complexity for the general case. The second algorithm has $O(n\log n+\frac{n}{\delta^2}\log \frac{1}{\delta})$ expected time complexity.
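The definitions of exponent and subrepetition in this abstract can be checked on small words with a short Python sketch. It is only an illustration of the definitions, not any of the algorithms the paper proposes; the function names (`minimal_period`, `exponent`, `is_delta_subrepetition`) are hypothetical, and the minimal period is computed with the standard prefix function rather than anything taken from the paper.

```python
from fractions import Fraction


def minimal_period(w: str) -> int:
    """Smallest p >= 1 with w[i] == w[i + p] for every valid i."""
    n = len(w)
    if n == 0:
        raise ValueError("the minimal period of an empty word is undefined")
    # Prefix function: pi[i] is the length of the longest proper border of w[:i + 1].
    pi = [0] * n
    for i in range(1, n):
        k = pi[i - 1]
        while k > 0 and w[i] != w[k]:
            k = pi[k - 1]
        if w[i] == w[k]:
            k += 1
        pi[i] = k
    # The minimal period is the length minus the longest border.
    return n - pi[-1]


def exponent(w: str) -> Fraction:
    """Quotient of the length and the minimal period, kept exact."""
    return Fraction(len(w), minimal_period(w))


def is_delta_subrepetition(w: str, delta) -> bool:
    """True if the exponent e of w satisfies 1 + delta <= e < 2."""
    return 1 + delta <= exponent(w) < 2
```

For example, `abcab` has the form uvu with u = `ab` and v = `c`; its minimal period is 3 = |u|+|v| and its exponent is 5/3, so `is_delta_subrepetition("abcab", Fraction(1, 2))` holds, whereas `abab` has exponent exactly 2 and is therefore not a subrepetition.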

arXiv.org e-Print Archive

Crossref

CRISPR Recognition Tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats

Author: A Bolotin
A van Belkum
C Pourcel
Charles Bland
D Gusfield
DH Haft
DW Ussery
EPC Rocha
Fareedah Sabree
FJ Mojica
G Achaz
GA Benson
JA Shapiro
JP Schmidt
JS Godde
KS Makarova
Kyndall Brown
M Dsouza
M Hofnung
M.-F Sagot
Micheal Lowe
Nikos C Kyrpides
Philip Hugenholtz
R Jansen
RS Boyer
S Kurtz
SB Needleman
SK Kannan
Teresa L Ramsey
TF Smith
Publication venue: BioMed Central
Publication date: 01/05/2007

Abstract. Background: Clustered Regularly Interspaced Palindromic Repeats (CRISPRs) are a novel type of direct repeat found in a wide range of bacteria and archaea. CRISPRs are beginning to attract attention because of their proposed mechanism; that is, defending their hosts against invading extrachromosomal elements such as viruses. Existing repeat detection tools do a poor job of identifying CRISPRs due to the presence of unique spacer sequences separating the repeats. In this study, a new tool, CRT, is introduced that rapidly and accurately identifies CRISPRs in large DNA strings, such as genomes and metagenomes. Results: CRT was compared to the CRISPR detection tools Patscan and Pilercr. In terms of correctness, CRT was shown to be very reliable, demonstrating significant improvements over Patscan in precision, recall and quality. When compared to Pilercr, CRT showed improved performance in recall and quality. In terms of speed, CRT proved to be a huge improvement over Patscan. CRT and Pilercr were comparable in speed; however, CRT was faster for genomes containing large numbers of repeats. Conclusion: In this paper a new tool was introduced for the automatic detection of CRISPR elements. This tool, CRT, showed some important improvements over current techniques for CRISPR identification. CRT's approach to detecting repetitive sequences is straightforward. It uses a simple sequential scan of a DNA sequence and detects repeats directly without any major conversion or preprocessing of the input. This leads to a program that is easy to describe and understand, yet it is very accurate, fast and memory efficient, being O(n) in space and O(nm/l) in time.
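The idea of a sequential scan that detects spaced repeats directly, with no index built beforehand, can be illustrated with a naive Python sketch. This is not CRT's actual algorithm: the function name `find_spaced_kmer_repeats` and the parameters `k`, `min_dist` and `max_dist` are assumptions made for illustration, and the sketch only reports exact k-mer matches whose start positions lie a bounded distance apart.

```python
def find_spaced_kmer_repeats(seq: str, k: int, min_dist: int, max_dist: int):
    """Report (i, j) pairs where seq[i:i+k] occurs again at j, min_dist <= j - i <= max_dist.

    Only the first such j is reported for each i. All parameters are
    hypothetical knobs for this sketch, not settings of any real tool.
    """
    if k <= 0 or min_dist < 1 or max_dist < min_dist:
        raise ValueError("need k > 0 and 1 <= min_dist <= max_dist")
    n = len(seq)
    hits = []
    for i in range(n - k + 1):
        kmer = seq[i:i + k]
        lo = i + min_dist                 # earliest allowed start of the second copy
        hi = min(i + max_dist, n - k)     # latest allowed start of the second copy
        # str.find searches seq[lo:hi + k], so a match can start no later than hi.
        j = seq.find(kmer, lo, hi + k)
        if j != -1:
            hits.append((i, j))
    return hits
```

On the toy input `"GATTACA" + "CCGCC" + "GATTACA"` with `k=7`, `min_dist=8` and `max_dist=20`, the scan reports the single pair `(0, 12)`: the two copies of the repeat, separated by the spacer.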

Crossref

DigitalCommons@University of Nebraska

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

UNT Digital Library

Browsing repeats in genomes: Pygram and an application to non-coding region analysis

Author: Durand Patrick
Mahé Frédéric
Nicolas Jacques
Valin Anne-Sophie
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: A large number of studies on genome sequences have revealed the major role played by repeated sequences in the structure, function, dynamics and evolution of genomes. In-depth repeat analysis requires specialized methods, including visualization techniques, to achieve optimum exploratory power. RESULTS: This article presents Pygram, a new visualization application for investigating the organization of repeated sequences in complete genome sequences. The application projects data from a repeat index file on the analysed sequences, and by combining this principle with a query system, is capable of locating repeated sequences with specific properties. In short, Pygram provides an efficient, graphical browser for studying repeats. Implementation of the complete configuration is illustrated in an analysis of CRISPR structures in Archaea genomes and the detection of horizontal transfer between Archaea and Viruses. CONCLUSION: By proposing a new visualization environment to analyse repeated sequences, this application aims to increase the efficiency of laboratories involved in investigating repeat organization in single genomes or across several genomes

HAL-CentraleSupelec

Springer - Publisher Connector

INRIA a CCSD electronic archive server

PubMed Central

HAL-INSU

HAL-Rennes 1

Counting Maximal-Exponent Factors in Words

Author: Badkobeh
Bannai
Bell
Brodal
Böckenhauer
Crochemore
Dumitran
Fischer
Gawrychowski
Golnaz Badkobeh
Gusfield
Iliopoulos
Kolpakov
Maxime Crochemore
Robert Mercaş
Rytter
Tanimura
Thue
Publication venue: Elsevier BV
Publication date: 02/03/2016

This article shows tight upper and lower bounds on the number of occurrences of maximal-exponent factors in a word.

Goldsmiths Research Online

Crossref

King's Research Portal

Computing regularities in strings: A survey

Author: Smyth W.F.
Publication venue: Cambridge University Press (CUP)
Publication date: 01/01/2013

The aim of this survey is to provide insight into the sequential algorithms that have been proposed to compute exact “regularities” in strings; that is, covers (or quasiperiods), seeds, repetitions, runs (or maximal periodicities), and repeats. After outlining and evaluating the algorithms that have been proposed for their computation, I suggest possibly productive future directions of research.
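As a concrete illustration of one of these regularities, the following Python sketch enumerates runs by brute force. It assumes the standard definition of a run as a substring s[i..j] with minimal period p and length at least 2p that cannot be extended in either direction while keeping period p. It is a slow reference implementation for small inputs, not one of the algorithms the survey evaluates, and the names `smallest_period` and `runs` are hypothetical.

```python
def smallest_period(w: str) -> int:
    """Smallest p >= 1 such that w[i] == w[i + p] for all valid i (naive check)."""
    n = len(w)
    for p in range(1, n + 1):
        if all(w[i] == w[i + p] for i in range(n - p)):
            return p


def runs(s: str):
    """Return all runs of s as sorted (start, end, period) triples, end inclusive."""
    n = len(s)
    found = []
    for p in range(1, n // 2 + 1):
        t = 0
        while t + p < n:
            if s[t] != s[t + p]:
                t += 1
                continue
            a = t
            # Extend the block of positions t with s[t] == s[t + p] as far as possible.
            while t + p < n and s[t] == s[t + p]:
                t += 1
            # s[a : t + p] has period p and is maximal on both sides for that period.
            # Keep it only if it spans two full periods and p is its minimal period.
            if t - a >= p and smallest_period(s[a:t + p]) == p:
                found.append((a, t + p - 1, p))
    return sorted(found)
```

For `mississippi` this yields `[(1, 7, 3), (2, 3, 1), (5, 6, 1), (8, 9, 1)]`, that is, the run `ississi` with period 3 and the three squares `ss`, `ss` and `pp` with period 1.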

Elsevier - Publisher Connector

Research Repository