882 research outputs found

    A practical and efficient algorithm for the k-mismatch shortest unique substring finding problem

    This thesis revisits the k-mismatch shortest unique substring (SUS) finding problem and demonstrates that a technique recently presented in the context of solving the k-mismatch average common substring problem can be adapted and combined with parts of the existing solution, resulting in a new algorithm which has expected time complexity of O(n log^k n), while maintaining a practical space complexity of O(kn), where n is the string length. When k > 0, which is the hard case, the new proposal significantly improves the any-case O(n^2) time complexity of the prior best method for k-mismatch SUS finding. Experimental study shows that the new algorithm is practical to implement and demonstrates significant improvements in processing time compared to the prior best solution's implementation when k is small relative to n. For example, the proposed method processes a 200KB sample DNA sequence with k = 1 in just 0.18 seconds compared to 174.37 seconds with the prior best solution. Further, it is observed that significant portions of the adapted technique can be executed in parallel, resulting in further significant practical performance improvement. As an example, when using 8 cores to process a 10MB sample DNA sequence with k = 2, two parallel implementations each achieved processing times less than 1/4 that of the serial implementation. In an age where instances with thousands of gigabytes of RAM are readily available for use through Cloud infrastructure providers, it is likely that trading additional memory usage for significantly improved processing times will be desirable and needed by many users. For example, the best prior solution may require years to process a 200MB DNA sample for any k > 0, while this new proposal, using 24 cores, finished processing a sample of this size with k = 1 in 206.376 seconds with a peak memory usage of 46GB, which is easily available and affordable for many users.
It is expected that this new practical and efficient algorithm for k-mismatch SUS finding will prove useful to those using the measure on long sequences in fields such as computational biology.
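To make the problem statement concrete, the following is a minimal brute-force sketch of k-mismatch SUS finding. It is not the thesis's algorithm and makes no attempt at the stated time bounds. It assumes the common definition in which a substring is k-mismatch unique when no other substring of the same length lies within Hamming distance k of it. The function names and the 0-based, half-open indexing are illustrative choices, not taken from the source.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def is_k_mismatch_unique(text, i, j, k):
    """True if no other length-(j - i) substring is within distance k of text[i:j]."""
    sub = text[i:j]
    m = j - i
    return all(
        s == i or hamming(sub, text[s:s + m]) > k
        for s in range(len(text) - m + 1)
    )


def k_mismatch_sus(text, p, k):
    """Shortest k-mismatch unique substring covering position p.

    Returns a half-open pair (i, j), leftmost on ties, or None for empty text.
    Brute force: tries every length and every start that covers p.
    """
    n = len(text)
    for m in range(1, n + 1):
        for i in range(max(0, p - m + 1), min(p, n - m) + 1):
            if is_k_mismatch_unique(text, i, i + m, k):
                return i, i + m
    return None
```

For instance, under these assumptions `k_mismatch_sus("abcab", 2, 0)` selects the single character `c`, while allowing one mismatch forces a longer answer, because any single character is within distance 1 of every other character.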

    Range Shortest Unique Substring queries

    Let T[1, n] be a string of length n and T[i, j] be the substring of T starting at position i and ending at position j. A substring T[i, j] of T is a repeat if it occurs more than once in T; otherwise, it is a unique substring of T. Repeats and unique substrings are of great interest in computational biology and in information retrieval. Given string T as input, the Shortest Unique Substring problem is to find a shortest substring of T that does not occur elsewhere in T. In this paper, we introduce the range variant of this problem, which we call the Range Shortest Unique Substring problem. The task is to construct a data structure over T answering the following type of online queries efficiently. Given a range [α, β], return a shortest substring T[i, j] of T with exactly one occurrence in [α, β]. We present an O(n log n)-word data structure with O(log_w n) query time, where w = Ω(log n) is the word size. Our construction is based on a non-trivial reduction allowing us to apply a recently introduced optimal geometric data structure [Chan et al. ICALP 2018].
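The query semantics can be illustrated with a brute-force sketch that answers a single range query by scanning the window directly. This is only a reference for what a query should return, not the data structure from the paper, and it assumes that an occurrence counts only when it lies entirely inside the queried range. The function name `range_sus` and the 0-based inclusive indices are illustrative assumptions.

```python
from collections import Counter


def range_sus(text, alpha, beta):
    """Shortest substring with exactly one occurrence inside text[alpha..beta].

    Indices are 0-based and inclusive. Returns (i, j), leftmost on ties,
    or None if the range is empty. Brute force over all substring lengths.
    """
    window = text[alpha:beta + 1]
    n = len(window)
    for m in range(1, n + 1):
        # Count every length-m substring that lies fully inside the window.
        counts = Counter(window[s:s + m] for s in range(n - m + 1))
        for s in range(n - m + 1):
            if counts[window[s:s + m]] == 1:
                return alpha + s, alpha + s + m - 1
    return None
```

The answer depends on the range: in `"aabaab"` the character `b` is unique within positions 0 to 2, but over the whole string the shortest unique substring is the two-character `ba`.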

    A simple and dynamic data structure for pattern matching in texts

    2011 Summer. Includes bibliographical references. The demand for pattern matching algorithms is currently on the rise in diverse areas such as string search, image matching, voice recognition and bioinformatics. In particular, string search or matching algorithms have been growing in popularity as they have been applied to areas such as text editors, search engines and bioinformatics. To satisfy these various demands, many string matching methods have been developed to search for substrings (pattern strings) within a text, and several techniques employ tree data structures, deterministic finite automata, and other structures. The problem of string matching is defined as finding all locations of a pattern string P within a text T, where preprocessing of T is allowed in order to facilitate the queries. There has been significant success in finding a pattern string in O(m+k) time, where m is the length of the pattern string and k is the number of occurrences, using data structures that can be constructed in O(n) time, where n is the length of T. Suffix trees and directed acyclic word graphs are such data structures. All of these data structures index the searched text in O(m+k) time. However, the difficulty of understanding and programming the construction algorithms is rarely mentioned. Also, they have significant space requirements and take Θ(n) time to update even if one character of T is changed. To solve these problems, we propose the augmented position heap. It can be built in O(n) time, and can be used to search for a pattern string in O(m+k) time. Most importantly, when a block of j characters is inserted or deleted, the asymptotic cost of updating it is O((h(T) + j)h(T)), where h(T) is the length of the longest substring X of T that occurs at least ||X|| times in T, where ||X|| is the length of X.
For texts arising from practical applications, h(T) is typically a slowly growing function of ||T||; for a random text T, its expected value is O(log n). Another issue in data structures that must be addressed is the space requirement. The most space-efficient data structure for string search is the suffix array, which uses 2n words and supports searches in O(n log n + m + k). A compact representation of the position heap proposed in this thesis also takes 2n words and can be updated in O((h(T) + j)h(T)) time, but takes O(m^2 + k) time for a search. The best known bound for updating the suffix array or the directed acyclic word graph is O(n), and they both take considerably more space. A compact representation proposed in this thesis for the augmented position heap takes 4n words, can be updated just as efficiently as the position heap, and takes O(m+k) time for a search.

    Can We Recover the Cover?

    Data analysis typically involves error recovery and detection of regularities as two different key tasks. In this paper we show that there are data types for which these two tasks can be powerfully combined. A common notion of regularity in strings is that of a cover. Data describing measures of a natural coverable phenomenon may be corrupted by errors caused by the measurement process, or by the inexact features of the phenomenon itself. For this reason, different variants of approximate covers have been introduced, some of which are NP-hard to compute. In this paper we assume that the Hamming distance metric measures the amount of corruption experienced, and study the problem of recovering the correct cover from data corrupted by mismatch errors, formally defined as the cover recovery problem (CRP). We show that for the Hamming distance metric, coverability is a powerful property that allows detecting the original cover and correcting the data, under suitable conditions. We also study a relaxation of another problem, which is called the approximate cover problem (ACP). Since the ACP is proved to be NP-hard [Amir, Levy, Lubin, Porat, CPM 2017], we study a relaxation, which we call the candidate-relaxation of the ACP, and show it has a polynomial time complexity. As a result, we get that the ACP also has a polynomial time complexity in many practical situations. An important application of our ACP relaxation study is also a polynomial time algorithm for the cover recovery problem (CRP).

    Quasi-Periodicity in Streams

    In this work, we show two streaming algorithms for computing the length of the shortest cover of a string of length n. We start by showing a two-pass algorithm that uses O(log^2 n) space and then show a one-pass streaming algorithm that uses O(√(n log n)) space. Both algorithms run in near-linear time. The algorithms are randomized and compute the answer incorrectly with probability inverse-polynomial in n. We also show that there is no sublinear-space streaming algorithm for computing the length of the shortest seed of a string.
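As a point of reference for the quantity being computed, here is a small offline sketch of the shortest cover length. It is not a streaming algorithm: it stores the whole string and runs in quadratic time or worse, so it only illustrates the definition. It assumes the standard notion that a string c covers s when every position of s lies inside some occurrence of c; the function name is an illustrative choice.

```python
def shortest_cover_length(s):
    """Length of the shortest string whose occurrences cover every position of s.

    A cover must be a prefix of s, so each prefix is tried in order of length.
    """
    n = len(s)
    for m in range(1, n + 1):
        cand = s[:m]
        covered_up_to = 0  # every index below this is inside some occurrence
        pos = s.find(cand)
        # Extend coverage while the next occurrence leaves no gap.
        while pos != -1 and pos <= covered_up_to:
            covered_up_to = pos + m
            pos = s.find(cand, pos + 1)
        if covered_up_to == n:
            return m
    return n  # only reached for the empty string
```

For example, `"abaabaab"` is covered by overlapping occurrences of `"abaab"` at positions 0 and 3, and no shorter prefix covers it.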