14 research outputs found

    Computing regularities in strings

    Regularities in strings model many phenomena and thus form the subject of extensive mathematical studies. Perhaps the most conspicuous regularities in strings are those that manifest themselves in the form of repeated subpatterns. In this paper, we study several forms of regularities in strings, namely repeats, multirepeats, repetitions and runs. We present their similarities and differences by discussing their forms and properties, and we explore the existing algorithms for computing them. We also discuss several data structures useful for computing regularities.

    Searching of gapped repeats and subrepetitions in a word

    A gapped repeat is a factor of the form $uvu$ where $u$ and $v$ are nonempty words. The period of the gapped repeat is defined as $|u|+|v|$. The gapped repeat is maximal if it cannot be extended to the left or to the right by at least one letter while preserving its period. The gapped repeat is called $\alpha$-gapped if its period is not greater than $\alpha|v|$. A $\delta$-subrepetition is a factor whose exponent is less than 2 but not less than $1+\delta$ (the exponent of a factor is the quotient of its length and its minimal period). The $\delta$-subrepetition is maximal if it cannot be extended to the left or to the right by at least one letter while preserving its minimal period. We reveal a close relation between maximal gapped repeats and maximal subrepetitions. Moreover, we show that in a word of length $n$ the number of maximal $\alpha$-gapped repeats is bounded by $O(\alpha^2 n)$ and the number of maximal $\delta$-subrepetitions is bounded by $O(n/\delta^2)$. Using the obtained upper bounds, we propose algorithms for finding all maximal $\alpha$-gapped repeats and all maximal $\delta$-subrepetitions in a word of length $n$. The algorithm for finding all maximal $\alpha$-gapped repeats has $O(\alpha^2 n)$ time complexity for the case of constant alphabet size and $O(n\log n + \alpha^2 n)$ time complexity for the general case. For finding all maximal $\delta$-subrepetitions we propose two algorithms. The first algorithm has $O(\frac{n\log\log n}{\delta^2})$ time complexity for the case of constant alphabet size and $O(n\log n + \frac{n\log\log n}{\delta^2})$ time complexity for the general case. The second algorithm has $O(n\log n + \frac{n}{\delta^2}\log\frac{1}{\delta})$ expected time complexity.
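
The definitions of a gapped repeat, its period, and maximality can be made concrete with a small brute-force sketch. It is only a literal reading of the wording in the abstract, not the algorithm the paper proposes (which is far faster). The function name `maximal_gapped_repeats` and the `(start, arm, period)` output format are assumptions made for illustration.

```python
def maximal_gapped_repeats(w):
    """Brute-force list of maximal gapped repeats u v u in w.

    Each result is (start, arm, period), where arm = |u| and
    period = |u| + |v|, so the gap length |v| is period - arm >= 1.
    Illustrative only: roughly cubic time, not the paper's method.
    """
    n = len(w)
    found = []
    for period in range(2, n):              # |u| + |v|, both nonempty
        for start in range(n - period):
            for arm in range(1, period):    # keeps the gap nonempty
                end = start + period + arm  # one past the right copy of u
                if end > n:
                    break
                if w[start:start + arm] != w[start + period:end]:
                    continue
                # The repeat extends if the letter just outside one copy
                # matches the letter one period away from it.
                ext_left = start > 0 and w[start - 1] == w[start - 1 + period]
                ext_right = end < n and w[start + arm] == w[end]
                if not ext_left and not ext_right:
                    found.append((start, arm, period))
    return found


print(maximal_gapped_repeats("abcab"))  # [(0, 2, 3)]
```

For `"abcab"` the only maximal gapped repeat is `ab·c·ab`, starting at position 0 with arm length 2 and period 3. The $\alpha$-gapped condition from the abstract can be applied afterwards as a filter on the returned triples.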

    CRISPR Recognition Tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats

    Background: Clustered Regularly Interspaced Palindromic Repeats (CRISPRs) are a novel type of direct repeat found in a wide range of bacteria and archaea. CRISPRs are beginning to attract attention because of their proposed mechanism; that is, defending their hosts against invading extrachromosomal elements such as viruses. Existing repeat detection tools do a poor job of identifying CRISPRs due to the presence of unique spacer sequences separating the repeats. In this study, a new tool, CRT, is introduced that rapidly and accurately identifies CRISPRs in large DNA strings, such as genomes and metagenomes. Results: CRT was compared to two CRISPR detection tools, Patscan and Pilercr. In terms of correctness, CRT was shown to be very reliable, demonstrating significant improvements over Patscan on the measures of precision, recall and quality. When compared to Pilercr, CRT showed improved performance for recall and quality. In terms of speed, CRT proved to be a huge improvement over Patscan. CRT and Pilercr were comparable in speed; however, CRT was faster for genomes containing large numbers of repeats. Conclusion: In this paper a new tool was introduced for the automatic detection of CRISPR elements. This tool, CRT, showed some important improvements over current techniques for CRISPR identification. CRT's approach to detecting repetitive sequences is straightforward. It uses a simple sequential scan of a DNA sequence and detects repeats directly without any major conversion or preprocessing of the input. This leads to a program that is easy to describe and understand; yet it is very accurate, fast and memory efficient, being O(n) in space and O(nm/l) in time.

    Browsing repeats in genomes: Pygram and an application to non-coding region analysis

    BACKGROUND: A large number of studies on genome sequences have revealed the major role played by repeated sequences in the structure, function, dynamics and evolution of genomes. In-depth repeat analysis requires specialized methods, including visualization techniques, to achieve optimum exploratory power. RESULTS: This article presents Pygram, a new visualization application for investigating the organization of repeated sequences in complete genome sequences. The application projects data from a repeat index file onto the analysed sequences and, by combining this principle with a query system, is capable of locating repeated sequences with specific properties. In short, Pygram provides an efficient, graphical browser for studying repeats. Implementation of the complete configuration is illustrated in an analysis of CRISPR structures in Archaea genomes and the detection of horizontal transfer between Archaea and Viruses. CONCLUSION: By proposing a new visualization environment to analyse repeated sequences, this application aims to increase the efficiency of laboratories involved in investigating repeat organization in single genomes or across several genomes.

    Counting Maximal-Exponent Factors in Words

    This article shows tight upper and lower bounds on the number of occurrences of maximal-exponent factors in a word.
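
As a rough illustration of what is being counted, the sketch below computes the exponent of a word as its length divided by its smallest period, then scans all factors by brute force for those reaching the largest exponent. The helper names are assumptions made here, and the quadratic scan only shows the definition; it is not the article's counting argument and ignores any restrictions the article places on the words considered.

```python
from fractions import Fraction


def smallest_period(w):
    """Smallest period of a nonempty word, via the KMP border table."""
    n = len(w)
    border = [0] * n          # border[i] = longest proper border of w[:i+1]
    k = 0
    for i in range(1, n):
        while k and w[i] != w[k]:
            k = border[k - 1]
        if w[i] == w[k]:
            k += 1
        border[i] = k
    return n - border[-1]


def exponent(w):
    """Exponent of a nonempty word: length / smallest period (exact)."""
    return Fraction(len(w), smallest_period(w))


def max_exponent_occurrences(w):
    """Largest exponent over all factors of w, with the (start, end)
    slices w[start:end] that attain it. Brute force, for illustration."""
    best, occurrences = Fraction(0), []
    for i in range(len(w)):
        for j in range(i + 1, len(w) + 1):
            e = exponent(w[i:j])
            if e > best:
                best, occurrences = e, [(i, j)]
            elif e == best:
                occurrences.append((i, j))
    return best, occurrences


print(exponent("abaab"))                  # 5/3
print(max_exponent_occurrences("abcab"))  # (Fraction(5, 3), [(0, 5)])
```

Using `Fraction` keeps exponents exact, so ties between factors are detected reliably rather than depending on floating-point rounding.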

    Computing regularities in strings: A survey

    The aim of this survey is to provide insight into the sequential algorithms that have been proposed to compute exact “regularities” in strings; that is, covers (or quasiperiods), seeds, repetitions, runs (or maximal periodicities), and repeats. After outlining and evaluating the algorithms that have been proposed for their computation, I suggest possibly productive future directions of research.
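
Of the regularities listed, covers are perhaps the easiest to show in a few lines. The sketch below assumes the usual definition, in which a word u covers w when every position of w lies inside some occurrence of u. The function names are illustrative, and the brute-force search is not one of the surveyed algorithms, which compute covers much faster.

```python
def is_cover(u, w):
    """True if occurrences of u (possibly overlapping) cover all of w."""
    m = len(u)
    if m == 0 or m > len(w):
        return False
    covered_upto = 0                  # w[:covered_upto] is covered so far
    start = w.find(u)
    while start != -1:
        if start > covered_upto:      # a position is left uncovered
            return False
        covered_upto = start + m
        start = w.find(u, start + 1)  # allow overlapping occurrences
    return covered_upto == len(w)


def shortest_cover(w):
    """Shortest cover of w; any cover must be a prefix of w.
    Returns w itself when no shorter cover exists."""
    for m in range(1, len(w) + 1):
        if is_cover(w[:m], w):
            return w[:m]
    return w


print(shortest_cover("ababa"))  # aba
print(shortest_cover("abc"))    # abc
```

Here `aba` covers `ababa` through two overlapping occurrences at positions 0 and 2, whereas `abc` is covered only by itself.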