6 research outputs found

    IUPACpal: efficient identification of inverted repeats in IUPAC-encoded DNA sequences

    Background: An inverted repeat is a DNA sequence followed downstream by its reverse complement, potentially with a gap in the centre. Inverted repeats are found in both prokaryotic and eukaryotic genomes and they have been linked with countless possible functions. Many international consortia provide a comprehensive description of common genetic variation making alternative sequence representations, such as IUPAC encoding, necessary for leveraging the full potential of such broad variation datasets. Results: We present IUPACpal, an exact tool for efficient identification of inverted repeats in IUPAC-encoded DNA sequences allowing also for potential mismatches and gaps in the inverted repeats. Conclusion: Within the parameters that were tested, our experimental results show that IUPACpal compares favourably to a similar application packaged with EMBOSS. We show that IUPACpal identifies many previously unidentified inverted repeats when compared with EMBOSS, and that this is also performed with orders of magnitude improved speed.
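
To make the definition above concrete, here is a minimal brute-force sketch of inverted-repeat detection over IUPAC symbols. It is not IUPACpal's algorithm. The function names, the parameters (`arm_len`, `max_gap`, `max_mismatches`) and the pairing rule (two symbols pair if some base encoded by one is the complement of some base encoded by the other) are assumptions made for illustration.

```python
# Standard IUPAC nucleotide codes mapped to the bases they may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def can_pair(x, y):
    """Assumed rule: some base encoded by x is complementary to some base encoded by y."""
    return any(COMPLEMENT[b] in IUPAC[y] for b in IUPAC[x])


def inverted_repeats(seq, arm_len, max_gap, max_mismatches):
    """Return (start, arm_len, gap) triples found by exhaustive search."""
    hits = []
    n = len(seq)
    for start in range(n):
        for gap in range(max_gap + 1):
            end = start + 2 * arm_len + gap  # exclusive end of the right arm
            if end > n:
                break
            # Compare the left arm with the right arm read backwards.
            mismatches = sum(
                not can_pair(seq[start + i], seq[end - 1 - i])
                for i in range(arm_len)
            )
            if mismatches <= max_mismatches:
                hits.append((start, arm_len, gap))
    return hits
```

For example, `inverted_repeats("GAATTC", 3, 0, 0)` reports the single repeat starting at position 0. The search costs roughly O(n · max_gap · arm_len) symbol comparisons, so it only serves to illustrate the definition on short inputs.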

    Efficient computation of sequence mappability

    Sequence mappability is an important task in genome resequencing. In the (k, m)-mappability problem, for a given sequence T of length n, the goal is to compute a table whose ith entry is the number of indices j ≠ i such that the length-m substrings of T starting at positions i and j have at most k mismatches. Previous works on this problem focused on heuristics computing a rough approximation of the result or on the case of k = 1. We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that, for k = O(1), works in O(n) space and, with high probability, in O(n · min{m^k, log^k n}) time. Our algorithm requires a careful adaptation of the k-errata trees of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. Our technique can also be applied to solve the all-pairs Hamming distance problem introduced by Crochemore et al. [WABI 2017]. We further develop O(n^2)-time algorithms to compute all (k, m)-mappability tables for a fixed m and all k ∈ {0, …, m} or a fixed k and all m ∈ {k, …, n}. Finally, we show that, for k, m = Θ(log n), the (k, m)-mappability problem cannot be solved in strongly subquadratic time unless the Strong Exponential Time Hypothesis fails. This is an improved and extended version of a paper presented at SPIRE 2018.
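
The (k, m)-mappability table defined above can be computed directly from its definition. The sketch below is a naive baseline, not one of the paper's algorithms; the function name `mappability` is an assumption, and positions are 0-based.

```python
def mappability(text, m, k):
    """Naive (k, m)-mappability: entry i counts positions j != i whose
    length-m substring is within Hamming distance k of the one at i."""
    starts = range(len(text) - m + 1)
    table = []
    for i in starts:
        count = 0
        for j in starts:
            if j == i:
                continue
            mismatches = sum(a != b for a, b in zip(text[i:i + m], text[j:j + m]))
            if mismatches <= k:
                count += 1
        table.append(count)
    return table
```

On `"aabaab"` with m = 3 and k = 0 this yields `[1, 0, 0, 1]`, since only the two occurrences of `aab` match each other. The running time is O(n^2 · m), which is the quadratic behaviour the results above improve upon for constant k.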

    Linear-time computation of cyclic roots and cyclic covers of a string

    Cyclic versions of covers and roots of a string are considered in this paper. A prefix V of a string S is a cyclic root of S if S is a concatenation of cyclic rotations of V. A prefix V of S is a cyclic cover of S if the occurrences of the cyclic rotations of V cover all positions of S. We present O(n)-time algorithms computing all cyclic roots (using number-theoretic tools) and all cyclic covers (using tools related to seeds) of a length-n string over an integer alphabet. Our results improve upon O(n log log n) and O(n log n) time complexities of recent algorithms of Grossi et al. (WALCOM 2023) for the respective problems and provide novel approaches to the problems. As a by-product, we obtain an optimal data structure for Internal Circular Pattern Matching queries that generalize Internal Pattern Matching and Cyclic Equivalence queries of Kociumaka et al. (SODA 2015).
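
The cyclic-root definition above can be checked by brute force, as in the sketch below. This is not the paper's linear-time method; the helper names `is_rotation` and `cyclic_roots` are assumptions for illustration.

```python
def is_rotation(a, b):
    """True if a is a cyclic rotation of b (both of the same length)."""
    return len(a) == len(b) and a in b + b


def cyclic_roots(s):
    """Return every prefix V of s such that s splits into blocks of
    length |V|, each of which is a cyclic rotation of V."""
    n = len(s)
    roots = []
    for length in range(1, n + 1):
        if n % length:
            continue  # blocks must tile s exactly
        v = s[:length]
        if all(is_rotation(s[i:i + length], v) for i in range(0, n, length)):
            roots.append(v)
    return roots
```

For `"abbaab"` the blocks `ab`, `ba`, `ab` are all rotations of `ab`, so the function returns `["ab", "abbaab"]`. Each candidate length costs O(n · |V|) in the worst case, far from the O(n) total stated above.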

    Comparing Degenerate Strings

    Uncertain sequences are compact representations of sets of similar strings. They highlight common segments by collapsing them, and explicitly represent varying segments by listing all possible options. A generalized degenerate string (GD string) is a type of uncertain sequence. Formally, a GD string S is a sequence of n sets of strings of total size N, where the ith set contains strings of the same length k_i but this length can vary between different sets. We denote by W the sum of these lengths k_0, k_1, ..., k_{n-1}. Our main result is an O(N + M)-time algorithm for deciding whether two GD strings of total sizes N and M, respectively, over an integer alphabet, have a non-empty intersection. This result is based on a combinatorial result of independent interest: although the intersection of two GD strings can be exponential in the total size of the two strings, it can be represented in linear space. We then apply our string comparison tool to devise a simple algorithm for computing all palindromes in S in O(min{W, n^2} N)-time. We complement this upper bound by showing a similar conditional lower bound for computing maximal palindromes in S. We also show that a result, which is essentially the same as our string comparison linear-time algorithm, can be obtained by employing an automata-based approach.
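
As a small illustration of what "intersection of two GD strings" means, the sketch below expands each GD string into the set of plain strings it represents and intersects the two sets. This is deliberately the exponential approach that the linear-time result above avoids. The list-of-lists representation and the names `expand` and `gd_intersection` are assumptions.

```python
from itertools import product


def expand(gd):
    """All plain strings represented by a GD string, given as a list of
    sets (here: lists) of equal-length strings."""
    for options in gd:
        # Every string within one set must have the same length.
        assert len({len(s) for s in options}) == 1
    return {"".join(choice) for choice in product(*gd)}


def gd_intersection(gd1, gd2):
    """Strings represented by both GD strings (exponential in the worst case)."""
    return expand(gd1) & expand(gd2)
```

With `gd1 = [["ab", "cd"], ["e"]]` and `gd2 = [["a", "c"], ["be", "bf"]]`, the only common string is `abe`, even though the two inputs are segmented differently.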

    Finding the cyclic covers of a string

    We introduce the concept of cyclic covers, which generalizes the classical notion of covers in strings. Given any nonempty string X of length n, a factor W of X is called a cyclic cover if every position of X belongs to an occurrence of a cyclic shift of W. Two cyclic covers are distinct if one is not a cyclic shift of the other. The cyclic cover problem requires finding all distinct cyclic covers of X. We present an algorithm that solves the cyclic cover problem in O(n log n) time. This is based on finding a well-structured set of standard occurrences of a constant number of factors of a cyclic cover candidate W, computing the regions of X covered by cyclic shifts of W, extending those factors, and taking the union of the results.
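
A direct check of the cyclic-cover condition for one candidate W can be sketched as follows. It only tests a given candidate and does not enumerate all distinct cyclic covers; the function name `is_cyclic_cover` is an assumption, and the sketch does not verify that W is actually a factor of X.

```python
def is_cyclic_cover(x, w):
    """True if every position of x lies inside an occurrence of some
    cyclic shift of w."""
    m = len(w)
    if m == 0 or m > len(x):
        return False
    doubled = w + w  # every length-m substring of w + w is a cyclic shift of w
    covered = [False] * len(x)
    for i in range(len(x) - m + 1):
        if x[i:i + m] in doubled:
            for p in range(i, i + m):
                covered[p] = True
    return all(covered)
```

For `x = "abaab"` and `w = "ab"`, the windows `ab`, `ba` and `ab` at positions 0, 1 and 3 cover all five positions, so the check succeeds, whereas `w = "aa"` covers only positions 2 and 3.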

    Efficient Identification of k-Closed Strings

    A closed string contains a proper factor occurring as both a prefix and a suffix but not elsewhere in the string. Closed strings were introduced by Fici (WORDS 2011) as objects of combinatorial interest. This paper addresses a new problem by extending the closed string problem to the k-closed string problem, for which a level of approximation is permitted up to a number of Hamming distance errors, set by the parameter k. We address the problem of deciding whether or not a given string of length n over an integer alphabet is k-closed and additionally specifying the border resulting in the string being k-closed. Specifically, we present an O(kn)-time and O(n)-space algorithm to achieve this along with the pseudocode of an implementation and proof-of-concept experimental results.
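
The abstract does not spell out the formal k-closed definition, so the sketch below rests on an assumed reading: a string is k-closed if some proper prefix u and equal-length suffix v are within Hamming distance k of each other, and no other window of that length is within distance k of u or of v. With k = 0 this reduces to the closed-string definition quoted above. The function names are hypothetical and the brute-force search is not the paper's algorithm.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def k_closed_border(s, k):
    """Return the shortest border length witnessing that s is k-closed
    under the assumed definition, or None if there is none."""
    n = len(s)
    for length in range(1, n):
        u, v = s[:length], s[n - length:]
        if hamming(u, v) > k:
            continue
        # Windows strictly between the prefix and suffix occurrences.
        internal = (s[i:i + length] for i in range(1, n - length))
        if all(hamming(w, u) > k and hamming(w, v) > k for w in internal):
            return length
    return None
```

For instance, `k_closed_border("abcab", 0)` returns 2 (border `ab`), while `"abcabd"` has no exact border but is 1-closed under this reading via the prefix `abc` and suffix `abd`. The search takes roughly cubic time, in contrast to the bounds stated above.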