34 research outputs found
Optimal-Time Queries on BWT-Runs Compressed Indexes
Indexing highly repetitive strings (i.e., strings with many repetitions) for fast queries has become a central research topic in string processing, because it has a wide variety of applications in bioinformatics and natural language processing. Although a substantial number of indexes for highly repetitive strings have been proposed thus far, developing compressed indexes that support various queries remains a challenge. The run-length Burrows-Wheeler transform (RLBWT) is a lossless data compression by a reversible permutation of an input string and run-length encoding, and it has received interest for indexing highly repetitive strings. LF and ?^{-1} are two key functions for building indexes on RLBWT, and the best previous result computes LF and ?^{-1} in O(log log n) time with O(r) words of space for the string length n and the number r of runs in RLBWT. In this paper, we improve LF and ?^{-1} so that they can be computed in a constant time with O(r) words of space. Subsequently, we present OptBWTR (optimal-time queries on BWT-runs compressed indexes), the first string index that supports various queries including locate, count, extract queries in optimal time and O(r) words of space
Conversion from RLBWT to LZ77
Converting a compressed format of a string into another compressed format without an explicit decompression is one of the central research topics in string processing. We discuss the problem of converting the run-length Burrows-Wheeler Transform (RLBWT) of a string into Lempel-Ziv 77 (LZ77) phrases of the reversed string. The first results with Policriti and Prezza\u27s conversion algorithm [Algorithmica 2018] were O(n log r) time and O(r) working space for length of the string n, number of runs r in the RLBWT, and number of LZ77 phrases z. Recent results with Kempa\u27s conversion algorithm [SODA 2019] are O(n / log n + r log^{9} n + z log^{9} n) time and O(n / log_{sigma} n + r log^{8} n) working space for the alphabet size sigma of the RLBWT. In this paper, we present a new conversion algorithm by improving Policriti and Prezza\u27s conversion algorithm where dynamic data structures for general purpose are used. We argue that these dynamic data structures can be replaced and present new data structures for faster conversion. The time and working space of our conversion algorithm with new data structures are O(n min{log log n, sqrt{(log r)/(log log r)}}) and O(r), respectively
R-enum: Enumeration of Characteristic Substrings in BWT-runs Bounded Space
Enumerating characteristic substrings (e.g., maximal repeats, minimal unique substrings, and minimal absent words) in a given string has been an important research topic because there are a wide variety of applications in various areas such as string processing and computational biology. Although several enumeration algorithms for characteristic substrings have been proposed, they are not space-efficient in that their space-usage is proportional to the length of an input string. Recently, the run-length encoded Burrows-Wheeler transform (RLBWT) has attracted increased attention in string processing, and various algorithms for the RLBWT have been developed. Developing enumeration algorithms for characteristic substrings with the RLBWT, however, remains a challenge. In this paper, we present r-enum (RLBWT-based enumeration), the first enumeration algorithm for characteristic substrings based on RLBWT. R-enum runs in O(n log log (n/r)) time and with O(r log n) bits of working space for string length n and number r of runs in RLBWT. Here, r is expected to be significantly smaller than n for highly repetitive strings (i.e., strings with many repetitions). Experiments using a benchmark dataset of highly repetitive strings show that the results of r-enum are more space-efficient than the previous results. In addition, we demonstrate the applicability of r-enum to a huge string by performing experiments on a 300-gigabyte string of 100 human genomes
Fully dynamic data structure for LCE queries in compressed space
A Longest Common Extension (LCE) query on a text of length asks for
the length of the longest common prefix of suffixes starting at given two
positions. We show that the signature encoding of size [Mehlhorn et al., Algorithmica 17(2):183-198,
1997] of , which can be seen as a compressed representation of , has a
capability to support LCE queries in time,
where is the answer to the query, is the size of the Lempel-Ziv77
(LZ77) factorization of , and is an integer that can be handled
in constant time under word RAM model. In compressed space, this is the fastest
deterministic LCE data structure in many cases. Moreover, can be
enhanced to support efficient update operations: After processing
in time, we can insert/delete any (sub)string of length
into/from an arbitrary position of in time, where . This yields
the first fully dynamic LCE data structure. We also present efficient
construction algorithms from various types of inputs: We can construct
in time from uncompressed string ; in
time from grammar-compressed string
represented by a straight-line program of size ; and in time from LZ77-compressed string with factors. On top
of the above contributions, we show several applications of our data structures
which improve previous best known results on grammar-compressed string
processing.Comment: arXiv admin note: text overlap with arXiv:1504.0695
Computing NP-Hard Repetitiveness Measures via MAX-SAT
Repetitiveness measures reveal profound characteristics of datasets, and give rise to compressed data structures and algorithms working in compressed space. Alas, the computation of some of these measures is NP-hard, and straight-forward computation is infeasible for datasets of even small sizes. Three such measures are the smallest size of a string attractor, the smallest size of a bidirectional macro scheme, and the smallest size of a straight-line program. While a vast variety of implementations for heuristically computing approximations exist, exact computation of these measures has received little to no attention. In this paper, we present MAX-SAT formulations that provide the first non-trivial implementations for exact computation of smallest string attractors, smallest bidirectional macro schemes, and smallest straight-line programs. Computational experiments show that our implementations work for texts of length up to a few hundred for straight-line programs and bidirectional macro schemes, and texts even over a million for string attractors