A suffix tree or not a suffix tree?
In this paper we study the structure of suffix trees. Given an unlabeled tree τ on n nodes and suffix links of its internal nodes, we ask the question "Is τ a suffix tree?", i.e., is there a string S whose suffix tree has the same topological structure as τ? We place no restrictions on S; in particular, we do not require that S ends with a unique symbol. This corresponds to considering the more general definition of implicit or extended suffix trees. Such general suffix trees have many applications and are, for example, needed to allow efficient updates when suffix trees are built online. Deciding if τ is a suffix tree is not an easy task because, with no restrictions on the final symbol, we cannot guess the length of a string that realizes τ from the number of leaves. And without an upper bound on the length of such a string, it is not even clear how to solve the problem by an exhaustive search. In this paper, we prove that τ is a suffix tree if and only if it is realized by a string S of length n−1, and we give a linear-time algorithm for inferring S when the first letter on each edge is known. This generalizes the work of I et al. [Discrete Appl. Math. 163, 2014].
Constructing suffix arrays in linear time
The time complexity of suffix tree construction has been shown to be equivalent to that of sorting: O(n) for a constant-size alphabet or an integer alphabet and O(n log n) for a general alphabet. However, previous algorithms for constructing suffix arrays have a time complexity of O(n log n) even for a constant-size alphabet. In this paper we present a linear-time algorithm to construct suffix arrays for integer alphabets, which does not use suffix trees as intermediate data structures during its construction. Since the case of a constant-size alphabet can be subsumed in that of an integer alphabet, our result implies that the time complexity of directly constructing suffix arrays matches that of constructing suffix trees.
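For orientation, a suffix array lists the starting positions of all suffixes of a string in lexicographic order. The sketch below builds one by sorting the suffixes directly. It is a naive illustration only, not the linear-time algorithm the abstract describes, and the name `naive_suffix_array` is a hypothetical choice.

```python
def naive_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text` in lexicographic order.

    Naive illustration: comparing suffixes by slicing costs O(n^2 log n) time
    in the worst case, unlike the linear-time construction of the paper.
    """
    # Sort the start positions, comparing the suffixes that begin at each one.
    return sorted(range(len(text)), key=lambda i: text[i:])


print(naive_suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```

For "banana" the sorted suffixes are "a", "ana", "anana", "banana", "na", "nana", which start at positions 5, 3, 1, 0, 4, and 2.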
Multiple Buffering for Parallel Approximate Sequence Matching using Disk-based Suffix Tree on Multi-core CPU
Suffix trees, which are trie structures that present the suffixes of sequences (e.g., strings), are widely used for sequence search in different application domains such as text data mining, bioinformatics, and computational biology. In particular, suffix trees are useful in bioinformatics applications, because they can search similar sub-sequences and extract frequent sequence patterns efficiently. In recent years, efficient construction of a suffix tree that allows faster sequence searches has become one of the most important challenges, because the number and size of the data that are stored in sequence databases have been increasing exponentially. This paper proposes a novel parallelization model for approximate sequence matching that uses disk-based suffix trees, which are built on hard disks, not in memory, on a multi-core CPU. In the proposed parallelization model, we divide an entire sequence database into two or more sub-databases called partitions. For each partition, we build a disk-based suffix tree and define a task as an approximate sequence matching on one disk-based suffix tree. Moreover, the proposed parallelization model involves a multiple buffering management system to avoid conflicts among CPU cores. We evaluated the proposed parallelization model using an actual amino acid sequence database on a PC. The experimental results show a substantial improvement in computation performance.
Managing Unbounded-Length Keys in Comparison-Driven Data Structures with Applications to On-Line Indexing
This paper presents a general technique for optimally transforming any
dynamic data structure that operates on atomic and indivisible keys by
constant-time comparisons into a data structure that handles unbounded-length
keys whose comparison cost is not constant. Examples of such keys are
strings, multi-dimensional points, multiple-precision numbers, multi-key data
(e.g., records), XML paths, and URL addresses. The technique is more general
than previous work, as it requires no particular exploitation of the
underlying structure of the keys. The only requirement is that the insertion
of a key must identify its predecessor or its successor.
Using the proposed technique, an online suffix tree can be constructed with a
worst-case time bound per input symbol (as opposed to the amortized time bound
per symbol achieved by previously known algorithms). To our knowledge,
our algorithm is the first to achieve this worst-case time per input
symbol. Searching for a pattern in the resulting suffix tree takes time that
depends on the length of the pattern and on the number of occurrences of the
pattern. The paper also describes further applications and shows how to obtain
alternative methods for dealing with suffix sorting, dynamic lowest common
ancestors, and order maintenance.
Sparse Suffix and LCP Array: Simple, Direct, Small, and Fast
Sparse suffix sorting is the problem of sorting b = o(n) suffixes of a string of length n. Efficient sparse suffix sorting algorithms have existed for more than a decade. Despite the multitude of works and their justified claims for applications in text indexing, the existing algorithms have not been employed by practitioners. Arguably this is because there are no simple, direct, and efficient algorithms for sparse suffix array construction. We provide two new algorithms for constructing the sparse suffix and LCP arrays that are simultaneously simple, direct, small, and fast. In particular, our algorithms are: simple in the sense that they can be implemented using only basic data structures; direct in the sense that the output arrays are not a byproduct of constructing the sparse suffix tree or an LCE data structure; fast in the sense that they run in O(n log b) time, in the worst case, or in O(n) time, when the total number of suffixes with an LCP value greater than 2^(⌊log(n/b)⌋+1) − 1 is in O(b/log b), matching the time of optimal yet much more complicated algorithms [Gawrychowski and Kociumaka, SODA 2017; Birenzwige et al., SODA 2020]; and small in the sense that they can be implemented using only 8b + o(b) machine words. We also show that our second algorithm can be trivially amended to work in O(n) time for any uniformly random string. Our algorithms are non-trivial space-efficient adaptations of the Monte Carlo algorithm by I et al. for constructing the sparse suffix tree in O(n log b) time [STACS 2014].
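To make the two output arrays concrete, the following sketch computes a sparse suffix array and the matching sparse LCP array by brute force. It is an illustration of the problem statement under simple assumptions, not either of the paper's algorithms, and the names `sparse_suffix_and_lcp` and `positions` are hypothetical.

```python
def sparse_suffix_and_lcp(text: str, positions: list[int]) -> tuple[list[int], list[int]]:
    """Sort only the suffixes starting at `positions` and compute their LCP array.

    Brute-force illustration: it does not achieve the O(n log b) time or the
    8b + o(b) words of space described in the abstract.
    """
    # Sparse suffix array: the chosen start positions in lexicographic suffix order.
    ssa = sorted(positions, key=lambda i: text[i:])

    def lcp(i: int, j: int) -> int:
        # Length of the longest common prefix of the suffixes starting at i and j.
        k = 0
        while i + k < len(text) and j + k < len(text) and text[i + k] == text[j + k]:
            k += 1
        return k

    # Sparse LCP array: entry r compares suffix r with its predecessor (0 for r = 0).
    slcp = [0 if r == 0 else lcp(ssa[r - 1], ssa[r]) for r in range(len(ssa))]
    return ssa, slcp


print(sparse_suffix_and_lcp("banana", [1, 3, 5]))  # ([5, 3, 1], [0, 1, 3])
```

Here the three chosen suffixes "anana", "ana", and "a" sort as "a", "ana", "anana", and neighbouring suffixes share prefixes of length 1 and 3.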
Sparse Suffix and LCP Array: Simple, Direct, Small, and Fast
Sparse suffix sorting is the problem of sorting b = o(n) suffixes of a string
of length n. Efficient sparse suffix sorting algorithms have existed for more
than a decade. Despite the multitude of works and their justified claims for
applications in text indexing, the existing algorithms have not been employed
by practitioners. Arguably this is because there are no simple, direct, and
efficient algorithms for sparse suffix array construction. We provide two new
algorithms for constructing the sparse suffix and LCP arrays that are
simultaneously simple, direct, small, and fast. In particular, our algorithms
are: simple in the sense that they can be implemented using only basic data
structures; direct in the sense that the output arrays are not a byproduct of
constructing the sparse suffix tree or an LCE data structure; fast in the sense
that they run in O(n log b) time, in the worst case, or in O(n)
time, when the total number of suffixes with an LCP value
greater than 2^(⌊log(n/b)⌋+1) − 1 is in
O(b/log b), matching the time of the optimal yet much more
complicated algorithms [Gawrychowski and Kociumaka, SODA 2017; Birenzwige et
al., SODA 2020]; and small in the sense that they can be implemented using only
8b + o(b) machine words. Our algorithms are simplified, yet non-trivial,
space-efficient adaptations of the Monte Carlo algorithm by I et al. for
constructing the sparse suffix tree in O(n log b) time [STACS
2014]. We also provide proof-of-concept experiments to justify our claims on
simplicity and efficiency. Comment: 16 pages, 1 figure
Text Mining Untuk Pencarian Dokumen Bahasa Inggris Menggunakan Suffix Tree Clustering (Text Mining for English Document Search Using Suffix Tree Clustering)
A search over a collection of documents generally returns excerpts of the documents arranged in a long list ordered by match rank. It is not uncommon for a search to return tens or even hundreds of document fragments, forcing the user to scroll up and down to examine the snippets one by one. This makes it difficult for the user to determine which documents are relevant to the desired topic. In this final project, a web-based document segmentation application was developed using the suffix tree clustering method. The basic concept of this method is to group the documents in the search results into clusters based on the words or phrases they contain. The application takes a search query as input and outputs clusters containing the corresponding documents. The clusters can be stratified, depending on the words or phrases that distinguish documents within the same parent cluster. The generated clusters are displayed to the user. When a final cluster is selected, the application displays its collection of documents, each consisting of the title and a snippet of the document. With this method, search results are expected to be easier to trace. Keywords: text mining, suffix tree, suffix tree clustering, document grouping
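The grouping idea described above, clustering snippets by the words or phrases they share, can be sketched as follows. This is a simplified illustration that uses a plain dictionary of phrases in place of an actual suffix tree and omits any later merging or stratifying of clusters; the function name `base_clusters` and the parameter `max_words` are assumptions, not taken from the project.

```python
from collections import defaultdict


def base_clusters(snippets: list[str], max_words: int = 3) -> dict[str, set[int]]:
    """Group snippet indices by shared phrases of up to `max_words` words.

    Simplification: a dictionary keyed by phrase stands in for the suffix tree,
    whose nodes would represent the same shared phrases.
    """
    phrase_docs: dict[str, set[int]] = defaultdict(set)
    for doc_id, snippet in enumerate(snippets):
        words = snippet.lower().split()
        # Enumerate every phrase (run of consecutive words) up to max_words long.
        for start in range(len(words)):
            for end in range(start + 1, min(start + max_words, len(words)) + 1):
                phrase_docs[" ".join(words[start:end])].add(doc_id)
    # Keep only phrases that occur in at least two snippets.
    return {phrase: docs for phrase, docs in phrase_docs.items() if len(docs) >= 2}


snippets = ["suffix tree clustering", "suffix tree construction", "text mining"]
for phrase, docs in sorted(base_clusters(snippets).items()):
    print(phrase, sorted(docs))
```

In this toy input, the first two snippets are grouped under "suffix", "suffix tree", and "tree", while the third snippet shares no phrase and joins no cluster.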