
    A suffix tree or not a suffix tree?

    In this paper we study the structure of suffix trees. Given an unlabeled tree τ on n nodes and suffix links of its internal nodes, we ask the question "Is τ a suffix tree?", i.e., is there a string S whose suffix tree has the same topological structure as τ? We place no restrictions on S; in particular, we do not require that S ends with a unique symbol. This corresponds to considering the more general definition of implicit or extended suffix trees. Such general suffix trees have many applications and are, for example, needed to allow efficient updates when suffix trees are built online. Deciding whether τ is a suffix tree is not an easy task because, with no restrictions on the final symbol, we cannot guess the length of a string that realizes τ from the number of leaves. Without an upper bound on the length of such a string, it is not even clear how to solve the problem by exhaustive search. In this paper, we prove that τ is a suffix tree if and only if it is realized by a string S of length n−1, and we give a linear-time algorithm for inferring S when the first letter on each edge is known. This generalizes the work of I et al. [Discrete Appl. Math. 163, 2014].

    Constructing suffix arrays in linear time

    The time complexity of suffix tree construction has been shown to be equivalent to that of sorting: O(n) for a constant-size alphabet or an integer alphabet and O(n log n) for a general alphabet. However, previous algorithms for constructing suffix arrays have a time complexity of O(n log n) even for a constant-size alphabet. In this paper we present a linear-time algorithm to construct suffix arrays for integer alphabets, which does not use suffix trees as intermediate data structures during its construction. Since the case of a constant-size alphabet can be subsumed in that of an integer alphabet, our result implies that the time complexity of directly constructing suffix arrays matches that of constructing suffix trees.
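    For readers unfamiliar with the data structure, the following is a minimal sketch of what a suffix array contains. It is a naive comparison-sort construction, not the linear-time algorithm of the paper, and the function name `suffix_array` is chosen here purely for illustration.

```python
def suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text`,
    listed in lexicographic order of the suffixes.

    Naive construction: sorts the suffixes by direct comparison,
    which is far slower than the linear-time bound in the abstract.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


# Suffixes of "banana" in sorted order:
# a(5), ana(3), anana(1), banana(0), na(4), nana(2)
print(suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```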

    Multiple Buffering for Parallel Approximate Sequence Matching using Disk-based Suffix Tree on Multi-core CPU

    Suffix trees, which are trie structures that represent the suffixes of sequences (e.g., strings), are widely used for sequence search in different application domains such as text data mining, bioinformatics, and computational biology. In particular, suffix trees are useful in bioinformatics applications because they can search similar sub-sequences and extract frequent sequence patterns efficiently. In recent years, efficient construction of a suffix tree that allows faster sequence searches has become one of the most important challenges, because the number and size of the data stored in sequence databases have been increasing exponentially. This paper proposes a novel parallelization model for approximate sequence matching that uses disk-based suffix trees, which are built on hard disks rather than in memory, on a multi-core CPU. In the proposed parallelization model, we divide an entire sequence database into two or more sub-databases called partitions. For each partition, we build a disk-based suffix tree and define a task as an approximate sequence matching on one disk-based suffix tree. Moreover, the proposed parallelization model involves a multiple buffering management system to avoid conflicts among CPU cores. We evaluated the proposed parallelization model using an actual amino acid sequence database on a PC. The experimental results show a substantial improvement in computation performance.
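    The partition-and-task idea can be illustrated with a small sketch. Everything here is an assumption made for illustration: the disk-based suffix tree is replaced by a plain Hamming-distance scan, the multiple buffering system is omitted, and the names `hamming_matches` and `parallel_search` are hypothetical. Threads are used only to keep the sketch simple; in CPython a process pool would likely be needed for real multi-core speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def hamming_matches(partition, pattern, max_mismatches):
    """One task: scan one partition and return (sequence_id, offset)
    pairs where `pattern` matches with at most `max_mismatches` mismatches.
    Stands in for approximate matching on one per-partition index."""
    hits = []
    m = len(pattern)
    for seq_id, seq in partition:
        for start in range(len(seq) - m + 1):
            window = seq[start:start + m]
            mismatches = sum(a != b for a, b in zip(window, pattern))
            if mismatches <= max_mismatches:
                hits.append((seq_id, start))
    return hits


def parallel_search(database, pattern, max_mismatches, num_partitions=2):
    """Split the database into partitions and run one task per partition."""
    items = list(database.items())
    partitions = [items[k::num_partitions] for k in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        results = pool.map(
            lambda part: hamming_matches(part, pattern, max_mismatches),
            partitions,
        )
        return sorted(hit for part_hits in results for hit in part_hits)


# Toy amino-acid-like data, invented for this example.
database = {"s1": "MKVLAT", "s2": "MKILAT", "s3": "GGGGGG"}
print(parallel_search(database, "KVL", max_mismatches=1))
# [('s1', 1), ('s2', 1)]
```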

    Managing Unbounded-Length Keys in Comparison-Driven Data Structures with Applications to On-Line Indexing

    This paper presents a general technique for optimally transforming any dynamic data structure that operates on atomic and indivisible keys by constant-time comparisons into a data structure that handles unbounded-length keys whose comparison cost is not a constant. Examples of these keys are strings, multi-dimensional points, multiple-precision numbers, multi-key data (e.g., records), XML paths, URL addresses, etc. The technique is more general than what has been done in previous work, as no particular exploitation of the underlying structure is required. The only requirement is that the insertion of a key must identify its predecessor or its successor. Using the proposed technique, an online suffix tree can be constructed in worst-case time O(log n) per input symbol (as opposed to amortized O(log n) time per symbol, achieved by previously known algorithms). To our knowledge, our algorithm is the first that achieves O(log n) worst-case time per input symbol. Searching for a pattern of length m in the resulting suffix tree takes O(min(m log |Σ|, m + log n) + tocc) time, where tocc is the number of occurrences of the pattern. The paper also describes more applications and shows how to obtain alternative methods for dealing with suffix sorting, dynamic lowest common ancestors, and order maintenance.

    Sparse Suffix and LCP Array: Simple, Direct, Small, and Fast

    Sparse suffix sorting is the problem of sorting b = o(n) suffixes of a string of length n. Efficient sparse suffix sorting algorithms have existed for more than a decade. Despite the multitude of works and their justified claims for applications in text indexing, the existing algorithms have not been employed by practitioners. Arguably this is because there are no simple, direct, and efficient algorithms for sparse suffix array construction. We provide two new algorithms for constructing the sparse suffix and LCP arrays that are simultaneously simple, direct, small, and fast. In particular, our algorithms are: simple in the sense that they can be implemented using only basic data structures; direct in the sense that the output arrays are not a byproduct of constructing the sparse suffix tree or an LCE data structure; fast in the sense that they run in O(n log b) time in the worst case, or in O(n) time when the total number of suffixes with an LCP value greater than 2^(⌊log(n/b)⌋+1) − 1 is in O(b / log b), matching the time of optimal yet much more complicated algorithms [Gawrychowski and Kociumaka, SODA 2017; Birenzwige et al., SODA 2020]; and small in the sense that they can be implemented using only 8b + o(b) machine words. We also show that our second algorithm can be trivially amended to work in O(n) time for any uniformly random string. Our algorithms are non-trivial space-efficient adaptations of the Monte Carlo algorithm by I et al. for constructing the sparse suffix tree in O(n log b) time [STACS 2014].

    Sparse Suffix and LCP Array: Simple, Direct, Small, and Fast

    Sparse suffix sorting is the problem of sorting b = o(n) suffixes of a string of length n. Efficient sparse suffix sorting algorithms have existed for more than a decade. Despite the multitude of works and their justified claims for applications in text indexing, the existing algorithms have not been employed by practitioners. Arguably this is because there are no simple, direct, and efficient algorithms for sparse suffix array construction. We provide two new algorithms for constructing the sparse suffix and LCP arrays that are simultaneously simple, direct, small, and fast. In particular, our algorithms are: simple in the sense that they can be implemented using only basic data structures; direct in the sense that the output arrays are not a byproduct of constructing the sparse suffix tree or an LCE data structure; fast in the sense that they run in O(n log b) time in the worst case, or in O(n) time when the total number of suffixes with an LCP value greater than 2^(⌊log(n/b)⌋+1) − 1 is in O(b / log b), matching the time of the optimal yet much more complicated algorithms [Gawrychowski and Kociumaka, SODA 2017; Birenzwige et al., SODA 2020]; and small in the sense that they can be implemented using only 8b + o(b) machine words. Our algorithms are simplified, yet non-trivial, space-efficient adaptations of the Monte Carlo algorithm by I et al. for constructing the sparse suffix tree in O(n log b) time [STACS 2014]. We also provide proof-of-concept experiments to justify our claims on simplicity and efficiency. Comment: 16 pages, 1 figure
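    To make the two output arrays concrete, here is a minimal sketch that defines them by brute force. It is not either of the paper's algorithms and has none of their time or space guarantees; the names `lcp` and `sparse_suffix_and_lcp` are hypothetical.

```python
def lcp(text: str, i: int, j: int) -> int:
    """Length of the longest common prefix of text[i:] and text[j:]."""
    n = len(text)
    k = 0
    while i + k < n and j + k < n and text[i + k] == text[j + k]:
        k += 1
    return k


def sparse_suffix_and_lcp(text: str, positions: list[int]):
    """Sort only the suffixes starting at `positions` (the b chosen
    suffixes) and compute LCP values between neighbours in that order.

    Brute force: each comparison may inspect whole suffixes.
    """
    ssa = sorted(positions, key=lambda i: text[i:])
    slcp = [
        0 if k == 0 else lcp(text, ssa[k - 1], ssa[k])
        for k in range(len(ssa))
    ]
    return ssa, slcp


# Suffixes of "banana" at positions 1, 3, 5 are "anana", "ana", "a".
ssa, slcp = sparse_suffix_and_lcp("banana", [1, 3, 5])
print(ssa)   # [5, 3, 1]
print(slcp)  # [0, 1, 3]
```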

    Text Mining Untuk Pencarian Dokumen Bahasa Inggris Menggunakan Suffix Tree Clustering (Text Mining for English-Language Document Search Using Suffix Tree Clustering)

    A search over a document collection generally returns excerpts of the documents arranged by match rank in a long list. It is not unusual for a search to return tens or even hundreds of document fragments, forcing the user to scroll up and down to examine the snippets one by one. This makes it difficult for the user to determine which documents are relevant to the desired topic. In this Final Project, a web-based document segmentation application was developed using the suffix tree clustering method. The basic concept of this method is to group the documents in the search results into clusters based on the words or phrases they contain. The application takes a search query as input and outputs clusters containing the corresponding documents. Clusters can be nested, depending on the words or phrases that distinguish documents within the same parent cluster. The generated clusters are displayed to the user. When a last-level cluster is selected, the application displays its collection of documents, each consisting of the title and a snippet of the document. With this method, search results are expected to be easier to browse. Keywords: text mining, suffix tree, suffix tree clustering, document grouping
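    The grouping step described above can be sketched as follows. This is an assumption-laden simplification: a dictionary of word n-grams stands in for the suffix tree, cluster merging and nesting are omitted, and the function name `base_clusters` and its parameters are hypothetical rather than taken from the project.

```python
from collections import defaultdict


def base_clusters(snippets, max_phrase_len=3, min_docs=2):
    """Map each phrase (up to `max_phrase_len` words) to the sorted
    indices of the snippets containing it, keeping only phrases
    shared by at least `min_docs` snippets."""
    phrase_docs = defaultdict(set)
    for doc_id, text in enumerate(snippets):
        words = text.lower().split()
        for start in range(len(words)):
            for length in range(1, max_phrase_len + 1):
                if start + length <= len(words):
                    phrase = " ".join(words[start:start + length])
                    phrase_docs[phrase].add(doc_id)
    return {
        phrase: sorted(docs)
        for phrase, docs in phrase_docs.items()
        if len(docs) >= min_docs
    }


snippets = [
    "suffix tree clustering",
    "suffix tree construction",
    "text mining",
]
clusters = base_clusters(snippets)
print(clusters["suffix tree"])  # [0, 1]
```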