298 research outputs found

    Approximate Circular Pattern Matching

    We investigate the complexity of approximate circular pattern matching (CPM, in short) under the Hamming and edit distance. Under each of these two basic metrics, we are given a length-$n$ text $T$, a length-$m$ pattern $P$, and a positive integer threshold $k$, and we are to report all starting positions (called occurrences) of fragments of $T$ that are at distance at most $k$ from some cyclic rotation of $P$. In the decision version of the problem, we are to check if there is any such occurrence. All previous results for approximate CPM were either average-case upper bounds or heuristics, with the exception of the work of Charalampopoulos et al. [CKP+, JCSS'21], who considered only the Hamming distance. For the reporting version of the approximate CPM problem, under the Hamming distance we improve upon the main algorithm of [CKP+, JCSS'21] from $O(n + (n/m) k^4)$ to $O(n + (n/m) k^3 \log\log k)$ time; for the edit distance, we give an $O(nk^2)$-time algorithm. Notably, for the decision versions and wide parameter ranges, we give algorithms whose complexities are almost identical to the state of the art for standard (i.e., non-circular) approximate pattern matching. For the decision version of the approximate CPM problem under the Hamming distance, we obtain an $O(n + (n/m) k^2 \log k / \log\log k)$-time algorithm, which works in $O(n)$ time whenever $k = O(\sqrt{m \log\log m / \log m})$. In comparison, the fastest algorithm for the standard counterpart of the problem, by Chan et al. [CGKKP, STOC'20], runs in $O(n)$ time only for $k = O(\sqrt{m})$. We achieve this result via a reduction to a geometric problem by building on ideas from [CKP+, JCSS'21] and Charalampopoulos et al. [CKW, FOCS'20]. For the decision version of the approximate CPM problem under the edit distance, the $O(nk \log^3 k)$ runtime of our algorithm nearly matches the $O(nk)$ runtime of the Landau-Vishkin algorithm [LV, J. Algorithms'89] for approximate pattern matching under edit distance; the latter algorithm remains the fastest known for $k = \Omega(m^{2/5})$.
    As a stepping stone, we propose an $O(nk \log^3 k)$-time algorithm for solving the Longest Prefix $k'$-Approximate Match problem, proposed by Landau et al. [LMS, SICOMP'98], for all $k' \in \{1, \ldots, k\}$. Our algorithm is based on Tiskin's theory of seaweeds [Tiskin, Math. Comput. Sci.'08], with recent advancements (see Charalampopoulos et al. [CKW, FOCS'22]), and on exploiting the seaweeds' relation to Monge matrices. In contrast, we obtain a conditional lower bound that suggests a polynomial separation between approximate CPM under the Hamming distance over the binary alphabet and its non-circular counterpart. We also show that a strongly subquadratic-time algorithm for the decision version of approximate CPM under edit distance would refute the Strong Exponential Time Hypothesis.
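To make the problem statement above concrete, here is a minimal brute-force sketch of the reporting version under the Hamming distance. It is not the algorithm from the abstract: it simply tries every rotation against every length-$m$ window, which takes roughly $O(n m^2)$ time. The function names `hamming` and `approx_cpm_hamming` are hypothetical and chosen only for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def approx_cpm_hamming(text: str, pattern: str, k: int) -> list[int]:
    """Report every start i such that text[i:i+m] is within Hamming
    distance k of some cyclic rotation of pattern (naive baseline)."""
    m = len(pattern)
    # All distinct cyclic rotations of the pattern.
    rotations = {pattern[r:] + pattern[:r] for r in range(m)}
    occurrences = []
    for i in range(len(text) - m + 1):
        window = text[i:i + m]
        if any(hamming(window, rot) <= k for rot in rotations):
            occurrences.append(i)
    return occurrences
```

For the decision version, it would suffice to stop at the first window that qualifies instead of collecting all of them.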

    k-Approximate Quasiperiodicity under Hamming and Edit Distance

    Quasiperiodicity in strings was introduced almost 30 years ago as an extension of string periodicity. The basic notions of quasiperiodicity are cover and seed. A cover of a text T is a string whose occurrences in T cover all positions of T. A seed of text T is a cover of a superstring of T. In various applications exact quasiperiodicity is still not sufficient due to the presence of errors. We consider approximate notions of quasiperiodicity, for which we allow approximate occurrences in T with a small Hamming, Levenshtein or weighted edit distance. In previous work Sip et al. (2002) and Christodoulakis et al. (2005) showed that computing approximate covers and seeds, respectively, under weighted edit distance is NP-hard. They, therefore, considered restricted approximate covers and seeds which need to be factors of the original string T and presented polynomial-time algorithms for computing them. Further algorithms, considering approximate occurrences with Hamming distance bounded by k, were given in several contributions by Guth et al. They also studied relaxed approximate quasiperiods that do not need to cover all positions of T. In the case of large data, the exponents in polynomial time complexity play a crucial role. We present more efficient algorithms for computing restricted approximate covers and seeds. In particular, we improve upon the complexities of many of the aforementioned algorithms, also for relaxed quasiperiods. Our solutions are especially efficient if the number (or total cost) of allowed errors is bounded. We also show NP-hardness of computing non-restricted approximate covers and seeds under Hamming distance. Approximate covers were studied in three recent contributions at CPM over the last three years. However, these works consider a different definition of an approximate cover of T, that is, the shortest exact cover of a string T' with the smallest Hamming distance from T.
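The definition of a cover, and its relaxation to occurrences within Hamming distance k, can be illustrated with a small checker. This is only a sketch under the assumption that an approximate occurrence is a length-|C| fragment of T within Hamming distance k of the candidate C; it runs in O(n·|C|) time and is not one of the algorithms the abstract refers to. The name `is_approx_cover` is hypothetical.

```python
def is_approx_cover(text: str, cand: str, k: int = 0) -> bool:
    """Check whether the k-mismatch occurrences of cand cover every
    position of text. With k = 0 this is the exact cover test."""
    n, m = len(text), len(cand)
    if m == 0 or m > n:
        return False
    covered_up_to = 0  # positions [0, covered_up_to) are covered so far
    for j in range(n - m + 1):
        if j > covered_up_to:
            # Position covered_up_to cannot be covered by any later start.
            return False
        mismatches = sum(a != b for a, b in zip(text[j:j + m], cand))
        if mismatches <= k:
            covered_up_to = j + m
    return covered_up_to == n
```

For example, `aba` is an exact cover of `abaababaaba`, while for `abaabb` it is a cover only once one mismatch is allowed.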

    On the Benefit of Merging Suffix Array Intervals for Parallel Pattern Matching

    We present parallel algorithms for exact and approximate pattern matching with suffix arrays, using a CREW-PRAM with $p$ processors. Given a static text of length $n$, we first show how to compute the suffix array interval of a given pattern of length $m$ in $O(\frac{m}{p} + \lg p + \lg\lg p \cdot \lg\lg n)$ time for $p \le m$. For approximate pattern matching with $k$ differences or mismatches, we show how to compute all occurrences of a given pattern in $O(\frac{m^k \sigma^k}{p} \max(k, \lg\lg n) + (1 + \frac{m}{p}) \lg p \cdot \lg\lg n + \text{occ})$ time, where $\sigma$ is the size of the alphabet and $p \le \sigma^k m^k$. The workhorse of our algorithms is a data structure for merging suffix array intervals quickly: Given the suffix array intervals for two patterns $P$ and $P'$, we present a data structure for computing the interval of $PP'$ in $O(\lg\lg n)$ sequential time, or in $O(1 + \lg_p \lg n)$ parallel time. All our data structures are of size $O(n)$ bits (in addition to the suffix array).
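As a point of reference for the term "suffix array interval", the following sequential sketch finds it with two binary searches in $O(m \lg n)$ time. It does not implement the parallel algorithms or the interval-merging data structure from the abstract, and the suffix array is built naively by sorting suffixes, which is only suitable for short texts. The names `build_suffix_array` and `sa_interval` are hypothetical.

```python
def build_suffix_array(text: str) -> list[int]:
    """Naive construction: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_interval(text: str, sa: list[int], pattern: str) -> tuple[int, int]:
    """Return the half-open range [left, right) of suffix array entries
    whose suffixes start with pattern; left == right means no occurrence."""
    m = len(pattern)

    def prefix(idx: int) -> str:
        # Length-m prefix of the suffix stored at suffix array index idx.
        return text[sa[idx]:sa[idx] + m]

    lo, hi = 0, len(sa)
    while lo < hi:  # first entry whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    hi = len(sa)
    while lo < hi:  # first entry whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return left, lo
```

The occurrences of the pattern are then the text positions `sa[left:right]`, in suffix array order rather than text order.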

    String Covering: A Survey

    The study of strings is an important combinatorial field that precedes the digital computer. Strings can be very long, trillions of letters, so it is important to find compact representations. Here we first survey various forms of one potential compaction methodology, the cover of a given string x, initially proposed in a simple form in 1990, but increasingly of interest as more sophisticated variants have been discovered. We then consider covering by a seed; that is, a cover of a superstring of x. We conclude with many proposals for research directions that could make significant contributions to string processing in the future.

    A Novel Tree Structure for Pattern Matching in Biological Sequences

    This dissertation proposes a novel tree structure, Error Tree (ET), to more efficiently solve the Approximate Pattern Matching problem, a fundamental problem in bioinformatics and information retrieval. The problem involves different matching measures such as the Hamming distance, edit distance, and wildcard matching. The input is usually a text of length n over a fixed alphabet of size |Σ|, a pattern P of length m, and an integer k. The output is those substrings of the text that are at a distance ≤ k from P by Hamming distance, edit distance, or wildcard matching. An immediate application of approximate pattern matching is the Planted Motif Search, an important problem in many biological applications such as finding promoters, enhancers, locus control regions, transcription factors, etc. The (l, d)-Planted Motif Search is defined as follows: Given n sequences over an alphabet of size |Σ|, each of length m, and two integers l and d, find a motif M of length l such that each sequence contains at least one l-mer (substring of length l) at a Hamming distance of ≤ d from M. Based on the ET structure, our algorithm ET-Motif solves this problem efficiently in time and space. The thesis also discusses how the ET structure may add efficiency when it comes to Genome Assembly and DNA Sequence Compression. Current high-throughput sequencing technologies generate millions or billions of short reads (100-1000 bases) that are sequenced from a genome millions or billions of bases long. The De novo Genome Assembly problem is to assemble the reads into sequences that are as long and as accurate as possible. Although high-quality assemblies can be obtained by assembling multiple paired-end libraries with both short and long insert sizes, the latter are costly to generate. Moreover, the recent GAGE-B study showed that a remarkably good assembly quality can be obtained for bacterial genomes by state-of-the-art assemblers run on a single short-insert library with very high coverage.
    This thesis introduces a novel Hierarchical Genome Assembly (HGA) method that takes further advantage of such high coverage by independently assembling disjoint subsets of reads, combining assemblies of the subsets, and finally re-assembling the combined contigs along with the original reads. We empirically evaluate this methodology for eight leading assemblers using seven GAGE-B bacterial datasets consisting of 100bp Illumina HiSeq and 250bp Illumina MiSeq reads with coverage ranging from 100x to ~200x. The results show that HGA leads to a significant improvement in the quality of the assembly for all evaluated assemblers and datasets. Still, the problem involves a major step, which is overlapping the ends of the reads while allowing a few mismatches (i.e., the approximate matching problem). This requires computing the overlaps between the ends of all reads against all other reads. Computing such overlaps when mismatches are allowed is expensive. The ET structure may further speed up this step. Lastly, due to the significant amount of DNA data generated by next-generation sequencing machines, there is an increasing need to compress such data to reduce storage space and transmission time. Huffman encoding that incorporates DNA sequence characteristics is shown to compress DNA data better. Different implementations of Huffman trees, centering on the selection of frequent repeats, are introduced in this thesis. Experimental results demonstrate improvement in the compression ratios for five genomes with lengths ranging from 5Mbp to 50Mbp, compared with the use of a standard Huffman tree algorithm. Hence, the thesis suggests an improvement for all DNA sequence compression algorithms that employ conventional Huffman encoding. Moreover, approximate repeats can be compressed, further improving the results, by encoding the Hamming or edit distance between these repeats. However, computing such distances requires additional costs in both time and space. These costs can be reduced by using the ET structure.
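The (l, d)-Planted Motif Search definition in this abstract can be made concrete with an exhaustive sketch. It is not the ET-Motif algorithm: it enumerates every l-mer over the alphabet, so its cost grows as |Σ|^l and it is only usable for tiny inputs. The function name `planted_motifs`, the default DNA alphabet, and the assumption that every sequence has length at least l are all illustrative choices.

```python
from itertools import product


def planted_motifs(seqs: list[str], l: int, d: int,
                   alphabet: str = "ACGT") -> list[str]:
    """Return every length-l motif M such that each sequence contains an
    l-mer within Hamming distance d of M (exhaustive baseline).
    Assumes every sequence has length >= l."""

    def min_dist(motif: str, seq: str) -> int:
        # Smallest Hamming distance from motif to any l-mer of seq.
        return min(
            sum(a != b for a, b in zip(motif, seq[i:i + l]))
            for i in range(len(seq) - l + 1)
        )

    motifs = []
    for chars in product(alphabet, repeat=l):
        motif = "".join(chars)
        if all(min_dist(motif, s) <= d for s in seqs):
            motifs.append(motif)
    return motifs
```

With d = 0 this degenerates to finding l-mers common to all sequences; larger d admits motifs that never occur exactly in any sequence.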
