69 research outputs found

    Simple and dynamic data structure for pattern matching in texts, A

    2011 Summer. Includes bibliographical references. The demand for pattern matching algorithms is currently on the rise in diverse areas such as string search, image matching, voice recognition and bioinformatics. In particular, string search or matching algorithms have been growing in popularity as they have been applied to areas such as text editors, search engines and bioinformatics. To satisfy these various demands, many string matching methods have been developed to search for substrings (pattern strings) within a text, and several techniques employ tree data structures, deterministic finite automata, and other structures. The problem of string matching is defined as finding all locations of a pattern string P within a text T, where preprocessing of T is allowed in order to facilitate the queries. There has been significant success in finding a pattern string in O(m+k) time, where m is the length of the pattern string and k is the number of occurrences, using data structures that can be constructed in O(n) time, where n is the length of T. Suffix trees and directed acyclic word graphs are such data structures. All of these data structures index the text so that it can be searched in O(m+k) time. However, the difficulty of understanding and programming the construction algorithms is rarely mentioned. Also, they have significant space requirements and take Θ(n) time to update even if only one character of T is changed. To solve these problems, we propose the augmented position heap. It can be built in O(n) time, and can be used to search for a pattern string in O(m+k) time. Most importantly, when a block of j characters is inserted or deleted, the asymptotic time for updating it is O((h(T) + j)h(T)), where h(T) is the length of the longest substring X of T that occurs at least ||X|| times in T, where ||X|| is the length of X.
For texts arising from practical applications, h(T) is typically a slowly growing function of ||T||; for a random text T, its expected value is O(log n). Another issue in data structures that must be addressed is the space requirement. The most space-efficient data structure for string search is the suffix array, which uses 2n words and supports searches in O(nlogn + m + k) time. A compact representation of the position heap proposed in this thesis also takes 2n words and can be updated in O((h(T) + j)h(T)) time, but takes O(m^2 + k) time for a search. The best known bound for updating the suffix array or the directed acyclic word graph is O(n), and they both take considerably more space. A compact representation proposed in this thesis for the augmented position heap takes 4n words, can be updated just as efficiently as the position heap, and takes O(m+k) time for a search.
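
As a rough illustration of the plain (non-augmented) position heap mentioned in the abstract, the following Python sketch builds the heap naively and searches it. It is not the thesis's O(n) construction or its compact representation: the dictionary-based nodes and the names `build_position_heap` and `find_occurrences` are assumptions made for this example, and the search verifies candidates on the path directly against the text, which is where the O(m^2 + k) flavour of the plain position heap comes from.

```python
def build_position_heap(text):
    """Naive position heap: insert suffixes from shortest to longest.

    Each suffix follows existing trie edges as far as possible and then
    adds exactly one new node, labelled with the suffix's start position.
    """
    root = {"pos": None, "children": {}}
    n = len(text)
    for i in range(n - 1, -1, -1):
        node, j = root, i
        while j < n and text[j] in node["children"]:
            node = node["children"][text[j]]
            j += 1
        # The trie is always shallower than the suffix being inserted,
        # so j < n holds here and one new node is created per position.
        node["children"][text[j]] = {"pos": i, "children": {}}
    return root


def find_occurrences(root, text, pattern):
    """Return the sorted start positions of pattern in text."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    m = len(pattern)
    hits = []
    node = root
    for depth, ch in enumerate(pattern):
        child = node["children"].get(ch)
        if child is None:
            return sorted(hits)
        node = child
        # Nodes strictly above depth m are only candidates: check them.
        if depth + 1 < m and text.startswith(pattern, node["pos"]):
            hits.append(node["pos"])
    # The node at depth m and all of its descendants are occurrences.
    stack = [node]
    while stack:
        v = stack.pop()
        hits.append(v["pos"])
        stack.extend(v["children"].values())
    return sorted(hits)


heap = build_position_heap("banana")
print(find_occurrences(heap, "banana", "ana"))  # [1, 3]
```

The augmented variant described in the abstract avoids the per-candidate verification; that part is not reproduced here.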

    Efficient Methods for Multigram Compound Discovery


    Online Algorithms for Constructing Linear-Size Suffix Trie

    Suffix trees are fundamental data structures for various kinds of string processing. The suffix tree of a string T of length n has O(n) nodes and edges, and the string label of each edge is encoded by a pair of positions in T. Thus, even after the tree is built, the input text T needs to be kept stored and random access to T is still needed. The linear-size suffix tries (LSTs), proposed by Crochemore et al. [Linear-size suffix tries, TCS 638:171-178, 2016], are a "stand-alone" alternative to suffix trees. Namely, the LST of a string T of length n occupies O(n) total space, and supports pattern matching and other tasks with the same efficiency as the suffix tree without the need to store the input text T. Crochemore et al. proposed an offline algorithm which transforms the suffix tree of T into the LST of T in O(n log sigma) time and O(n) space, where sigma is the alphabet size. In this paper, we present two types of online algorithms which "directly" construct the LST, from right to left, and from left to right, without constructing the suffix tree as an intermediate structure. Both algorithms construct the LST incrementally when a new symbol is read, and do not access the previously read symbols. The right-to-left construction algorithm works in O(n log sigma) time and O(n) space and the left-to-right construction algorithm works in O(n (log sigma + log n / log log n)) time and O(n) space. The main feature of our algorithms is that the input text does not need to be stored.
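
To make the motivation concrete, here is a small Python sketch of an ordinary suffix tree (not an LST) whose edge labels are stored as pairs of positions in T. It uses a naive quadratic-time insertion rather than any algorithm from the paper, and the names `Node`, `build_suffix_tree` and `contains` are hypothetical. The point is that `contains` has to slice the original text to read each edge label, which is exactly the dependence on T that the LST removes.

```python
class Node:
    def __init__(self, start, end):
        # The edge entering this node is labelled text[start:end].
        self.start = start
        self.end = end
        self.children = {}


def build_suffix_tree(text):
    """Naive O(n^2) suffix tree; text must end with a unique sentinel."""
    if not text or text.count(text[-1]) != 1:
        raise ValueError("text must end with a unique sentinel character")
    n = len(text)
    root = Node(0, 0)
    for i in range(n):
        node, j = root, i
        while True:
            child = node.children.get(text[j])
            if child is None:
                node.children[text[j]] = Node(j, n)  # new leaf
                break
            k = child.start
            while k < child.end and text[k] == text[j]:
                k += 1
                j += 1
            if k == child.end:
                node = child  # edge fully matched, keep descending
                continue
            # Mismatch inside the edge: split it at position k.
            mid = Node(child.start, k)
            node.children[text[child.start]] = mid
            child.start = k
            mid.children[text[k]] = child
            mid.children[text[j]] = Node(j, n)
            break
    return root


def contains(root, text, pattern):
    """Substring test; note that every step reads the label from text."""
    node, j = root, 0
    while j < len(pattern):
        child = node.children.get(pattern[j])
        if child is None:
            return False
        label = text[child.start:child.end]  # random access to T
        step = min(len(label), len(pattern) - j)
        if label[:step] != pattern[j:j + step]:
            return False
        j += step
        node = child
    return True


tree = build_suffix_tree("banana$")
print(contains(tree, "banana$", "nan"))  # True
print(contains(tree, "banana$", "nab"))  # False
```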

    Accurate Cardinality Estimation of Co-occurring Words Using Suffix Trees (Extended Version)

    Estimating the cost of a query plan is one of the hardest problems in query optimization. This includes cardinality estimates of string search patterns, in particular of multi-word strings such as phrases or text snippets. At first sight, suffix trees address this problem. To curb the memory usage of a suffix tree, one often prunes the tree to a certain depth. But this pruning method "takes away" more information from long strings than from short ones. This problem is particularly severe with sets of long strings, the setting studied here. In this article, we propose pruning techniques for this setting. Our approaches remove characters with low information value. The different variants determine a character's information value in different ways, e.g., by using conditional entropy with respect to previous characters in the string. Our experiments show that, in contrast to the well-known pruned suffix tree, our technique provides significantly better estimations when the tree size is reduced by 60% or less. Due to the redundancy of natural language, our pruning techniques yield hardly any error for tree-size reductions of up to 50%.
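
The depth-pruning baseline that the abstract argues against can be sketched in Python as follows. This is only an assumption-laden illustration: it stores substring counts in a flat `Counter` rather than an actual tree, the function names are hypothetical, and the fallback for long patterns is a simple upper bound (the rarest length-`max_depth` window), not the estimator used in the article. It shows why depth pruning loses information for long patterns: beyond `max_depth`, only short windows of the pattern can be looked up.

```python
from collections import Counter


def build_pruned_counts(strings, max_depth):
    """Count every substring of length <= max_depth (a depth-pruned index)."""
    counts = Counter()
    for s in strings:
        for i in range(len(s)):
            for j in range(i + 1, min(len(s), i + max_depth) + 1):
                counts[s[i:j]] += 1
    return counts


def estimate_occurrences(counts, pattern, max_depth):
    """Exact count for short patterns, an upper bound for longer ones."""
    if len(pattern) <= max_depth:
        return counts.get(pattern, 0)
    # Every occurrence of the pattern contains each of its windows,
    # so the rarest window bounds the true count from above.
    windows = (pattern[i:i + max_depth]
               for i in range(len(pattern) - max_depth + 1))
    return min(counts.get(w, 0) for w in windows)


docs = ["to be or not to be"]
index = build_pruned_counts(docs, max_depth=3)
print(estimate_occurrences(index, "to ", 3))    # exact: 2
print(estimate_occurrences(index, "to be", 3))  # upper bound from 3-grams
```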

    EERTREE: An Efficient Data Structure for Processing Palindromes in Strings

    We propose a new linear-size data structure which provides fast access to all palindromic substrings of a string or a set of strings. This structure inherits some ideas from the construction of both the suffix trie and the suffix tree. Using this structure, we present simple and efficient solutions for a number of problems involving palindromes. Comment: 21 pages, 2 figures. Accepted to IWOCA 201
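
A minimal Python sketch of a palindromic tree in the spirit of the structure described above is given below. It follows the commonly used two-root formulation (an imaginary root of length -1 and an empty-string root of length 0) with suffix links; the class layout and the names `Eertree`, `add` and `palindromes` are choices made for this example and are not taken from the paper.

```python
class Eertree:
    def __init__(self):
        # Node 0: imaginary root (length -1). Node 1: empty palindrome.
        self.length = [-1, 0]
        self.link = [0, 0]
        self.edges = [{}, {}]
        self.s = []
        self.last = 1          # longest palindromic suffix of self.s
        self.palindromes = []  # distinct palindromes, in creation order

    def _find(self, v, i):
        # Walk suffix links until s[i] can extend palindrome v on both sides.
        while True:
            j = i - self.length[v] - 1
            if j >= 0 and self.s[j] == self.s[i]:
                return v
            v = self.link[v]

    def add(self, ch):
        """Append ch; return True if a new distinct palindrome appeared."""
        self.s.append(ch)
        i = len(self.s) - 1
        cur = self._find(self.last, i)
        if ch in self.edges[cur]:
            self.last = self.edges[cur][ch]
            return False
        new = len(self.length)
        new_len = self.length[cur] + 2
        self.length.append(new_len)
        self.edges.append({})
        if new_len == 1:
            self.link.append(1)  # single characters link to the empty root
        else:
            parent = self._find(self.link[cur], i)
            self.link.append(self.edges[parent][ch])
        self.edges[cur][ch] = new
        self.last = new
        self.palindromes.append("".join(self.s[i - new_len + 1:i + 1]))
        return True


tree = Eertree()
for c in "eertree":
    tree.add(c)
print(len(tree.palindromes))  # 7 distinct palindromic substrings
```

Each appended character creates at most one new node, which is why the structure stays linear in the length of the string.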