An Improved Data Structure for Left-Right Maximal Generic Words Problem
For a set D of documents and a positive integer d, a string w is said to be d-left-right maximal if (1) w occurs in at least d documents in D, and (2) any proper superstring of w occurs in fewer than d documents. The left-right-maximal generic words problem is, given a set D of documents, to preprocess D so that for any string p and any positive integer d, all superstrings of p that are d-left-right maximal can be reported quickly. In this paper, we present an O(n log m)-space data structure (in words) which answers queries in O(|p| + o log log m) time, where n is the total length of the documents in D, m is the number of documents in D, and o is the number of outputs. Our solution improves on the previous one by Nishimoto et al. (PSC 2015), which uses an O(n log n)-space data structure answering queries in O(|p| + r log n + o log^2 n) time, where r is the number of right extensions q of p occurring in at least d documents such that any proper right extension of q occurs in fewer than d documents.
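To make the definition concrete, the following is a brute-force sketch of the query, not the data structure from the paper. It enumerates candidate substrings directly and so takes time polynomial in n per query. The function names `doc_freq` and `d_left_right_maximal` are hypothetical and chosen only for this illustration.

```python
def doc_freq(w, docs):
    """Number of documents that contain w as a substring."""
    return sum(1 for doc in docs if w in doc)


def d_left_right_maximal(docs, p, d):
    """Return all d-left-right maximal superstrings of p (brute force)."""
    # Candidates: every substring of every document that contains p.
    candidates = {
        doc[i:j]
        for doc in docs
        for i in range(len(doc))
        for j in range(i + 1, len(doc) + 1)
        if p in doc[i:j]
    }
    alphabet = set("".join(docs))
    result = []
    for w in candidates:
        if doc_freq(w, docs) < d:
            continue
        # Every proper superstring of w contains c+w or w+c for some
        # character c, and document frequency never grows when a string
        # is extended, so testing one-character extensions is enough.
        if all(doc_freq(c + w, docs) < d and doc_freq(w + c, docs) < d
               for c in alphabet):
            result.append(w)
    return sorted(result)


docs = ["abc", "abd", "xab"]
print(d_left_right_maximal(docs, "a", 2))  # ['ab']
print(d_left_right_maximal(docs, "a", 1))  # ['abc', 'abd', 'xab']
```

With d = 2 the only answer is "ab": it occurs in all three documents, while each of its one-character extensions occurs in at most one.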
Almost Linear Time Computation of Maximal Repetitions in Run Length Encoded Strings
We consider the problem of computing all maximal repetitions contained in a string that is given in run-length encoding. Given a run-length encoding of a string, we show that the maximum number of maximal repetitions contained in the string is at most m+k-1, where m is the size of the run-length encoding, and k is the number of run-length factors whose exponent is at least 2. We also show an algorithm for computing all maximal repetitions in O(m alpha(m)) time and O(m) space, where alpha denotes the inverse Ackermann function.
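The quantities m and k and the notion of a maximal repetition can be illustrated with a small brute-force sketch. This is a naive cubic-time enumeration on the uncompressed string, not the O(m alpha(m))-time algorithm of the paper, and the helper names (`rle`, `rle_bound`, `maximal_repetitions`) are hypothetical. A maximal repetition is taken here in the usual sense: a substring with smallest period p and length at least 2p that cannot be extended to the left or right with the same period.

```python
def rle(s):
    """Run-length encode s as a list of (character, exponent) pairs."""
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out


def rle_bound(s):
    """The bound m + k - 1 stated in the abstract."""
    enc = rle(s)
    m = len(enc)                             # size of the encoding
    k = sum(1 for _, e in enc if e >= 2)     # factors with exponent >= 2
    return m + k - 1


def smallest_period(s):
    """Smallest p >= 1 with s[i] == s[i + p] for all valid i."""
    for p in range(1, len(s) + 1):
        if all(s[i] == s[i + p] for i in range(len(s) - p)):
            return p
    return len(s)


def maximal_repetitions(s):
    """Set of (start, end, period) triples, end inclusive (brute force)."""
    n = len(s)
    runs = set()
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if s[i] != s[i + p]:
                i += 1
                continue
            # Extend the maximal stretch on which s[k] == s[k + p].
            j = i
            while j + p < n and s[j] == s[j + p]:
                j += 1
            start, end = i, j + p - 1
            if end - start + 1 >= 2 * p and smallest_period(s[start:end + 1]) == p:
                runs.add((start, end, p))
            i = j + 1
    return runs


s = "aabaabaa"
print(rle(s))                       # [('a', 2), ('b', 1), ('a', 2), ('b', 1), ('a', 2)]
print(len(maximal_repetitions(s)))  # 4
print(rle_bound(s))                 # 7
```

For "aabaabaa" the encoding has m = 5 factors, k = 3 of them with exponent at least 2, and the string contains 4 maximal repetitions (three occurrences of "aa" and the whole string with period 3), which is within the bound of 7.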
Linear-time Computation of DAWGs, Symmetric Indexing Structures, and MAWs for Integer Alphabets
The directed acyclic word graph (DAWG) of a string y of length n is the smallest (partial) DFA which recognizes all suffixes of y with only O(n) nodes and edges. In this paper, we show how to construct the DAWG for the input string y from the suffix tree for y, in O(n) time for integer alphabets of polynomial size in n. In so doing, we first describe a folklore algorithm which, given the suffix tree for y, constructs the DAWG for the reversed string of y in O(n) time. Then, we present our algorithm that builds the DAWG for y in O(n) time for integer alphabets, from the suffix tree for y. We also show that a straightforward modification to our DAWG construction algorithm leads to the first O(n)-time algorithm for constructing the affix tree of a given string y over an integer alphabet. Affix trees are a text indexing structure supporting bidirectional pattern searches. We then discuss how our constructions can lead to linear-time algorithms for building other text indexing structures, such as linear-size suffix tries and symmetric CDAWGs in linear time in the case of integer alphabets. As a further application to our O(n)-time DAWG construction algorithm, we show that the set MAW(y) of all minimal absent words (MAWs) of y can be computed in optimal, input- and output-sensitive O(n + |MAW(y)|) time and O(n) working space for integer alphabets. Comment: This is an extended version of the paper "Computing DAWGs and Minimal Absent Words in Linear Time for Integer Alphabets" from MFCS 201
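For readers unfamiliar with DAWGs, the sketch below builds one with the classical online (character-by-character) suffix automaton construction. This is a different technique from the suffix-tree-based construction described in the abstract: it stores transitions in hash maps and does not achieve the paper's guarantee for integer alphabets. The class name `Dawg` and its methods are hypothetical.

```python
class Dawg:
    """Suffix automaton of a text, built online (classical construction)."""

    def __init__(self, text):
        self.next = [{}]     # outgoing transitions per state
        self.link = [-1]     # suffix links
        self.length = [0]    # length of the longest string reaching a state
        last = 0
        for ch in text:
            cur = self._new_state(self.length[last] + 1)
            p = last
            while p != -1 and ch not in self.next[p]:
                self.next[p][ch] = cur
                p = self.link[p]
            if p == -1:
                self.link[cur] = 0
            else:
                q = self.next[p][ch]
                if self.length[p] + 1 == self.length[q]:
                    self.link[cur] = q
                else:
                    # Split q: the clone takes over the shorter strings.
                    clone = self._new_state(self.length[p] + 1)
                    self.next[clone] = dict(self.next[q])
                    self.link[clone] = self.link[q]
                    while p != -1 and self.next[p].get(ch) == q:
                        self.next[p][ch] = clone
                        p = self.link[p]
                    self.link[q] = clone
                    self.link[cur] = clone
            last = cur
        # Accepting states: those on the suffix-link path from the last state.
        self.terminal = set()
        p = last
        while p != -1:
            self.terminal.add(p)
            p = self.link[p]

    def _new_state(self, length):
        self.next.append({})
        self.link.append(-1)
        self.length.append(length)
        return len(self.length) - 1

    def _walk(self, w):
        state = 0
        for ch in w:
            state = self.next[state].get(ch)
            if state is None:
                return None
        return state

    def contains(self, w):
        """True if w is a substring of the text."""
        return self._walk(w) is not None

    def accepts_suffix(self, w):
        """True if w is a suffix of the text."""
        return self._walk(w) in self.terminal


dawg = Dawg("abcbc")
print(dawg.contains("bcb"))        # True
print(dawg.accepts_suffix("cbc"))  # True
print(dawg.accepts_suffix("bcb"))  # False
```

The automaton accepts exactly the suffixes of the text, and every path from the initial state spells a substring, which is what makes it usable as a text index.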
Faster STR-IC-LCS Computation via RLE
The constrained LCS problem asks one to find a longest common subsequence of two input strings A and B subject to some constraints. The STR-IC-LCS problem is a variant of the constrained LCS problem in which the solution must include a given constraint string C as a substring. Given two strings A and B of respective lengths M and N, and a constraint string C of length at most min{M, N}, the best known algorithm for the STR-IC-LCS problem, proposed by Deorowicz (Inf. Process. Lett., 11:423-426, 2012), runs in O(MN) time. In this work, we present an O(mN + nM)-time solution to the STR-IC-LCS problem, where m and n denote the sizes of the run-length encodings of A and B, respectively. Since m <= M and n <= N always hold, our algorithm is always at least as fast as Deorowicz's algorithm, and is faster when the input strings are compressible via RLE.
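The problem statement can be made concrete with a simple dynamic-programming baseline on the uncompressed strings. This sketch is neither the RLE-based algorithm of the paper nor a faithful reproduction of Deorowicz's algorithm. It combines a prefix LCS table, a suffix LCS table, and a greedy leftmost embedding of C, and it locates the embeddings naively. All function names are hypothetical, and returning `None` when no solution exists is a convention chosen for this example.

```python
def lcs_table(a, b):
    """t[i][j] = length of an LCS of a[:i] and b[:j]."""
    t = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                t[i][j] = t[i - 1][j - 1] + 1
            else:
                t[i][j] = max(t[i - 1][j], t[i][j - 1])
    return t


def greedy_end(s, start, c):
    """Smallest e such that c is a subsequence of s[start:e] with s[start]
    matched to c[0]; None if there is no such embedding."""
    if s[start] != c[0]:
        return None
    k = 0
    for pos in range(start, len(s)):
        if s[pos] == c[k]:
            k += 1
            if k == len(c):
                return pos + 1
    return None


def str_ic_lcs_length(a, b, c):
    """Length of a longest common subsequence of a and b that contains c
    as a substring, or None if no such subsequence exists."""
    big_m, big_n = len(a), len(b)
    pre = lcs_table(a, b)
    if not c:
        return pre[big_m][big_n]
    # rev[big_m - i][big_n - j] is the LCS length of a[i:] and b[j:].
    rev = lcs_table(a[::-1], b[::-1])
    ends_a = [greedy_end(a, i, c) for i in range(big_m)]
    ends_b = [greedy_end(b, j, c) for j in range(big_n)]
    best = None
    for i, ea in enumerate(ends_a):
        if ea is None:
            continue
        for j, eb in enumerate(ends_b):
            if eb is None:
                continue
            # Prefix part + the constraint itself + suffix part.
            value = pre[i][j] + len(c) + rev[big_m - ea][big_n - eb]
            if best is None or value > best:
                best = value
    return best


print(str_ic_lcs_length("abcde", "aebcd", "bc"))  # 4
print(str_ic_lcs_length("abcde", "aebcd", "e"))   # 2
```

For A = "abcde" and B = "aebcd", the unconstrained LCS "abcd" already contains "bc", so the answer is 4. Requiring "e" forces the solution "ae", of length 2.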
Computing DAWGs and Minimal Absent Words in Linear Time for Integer Alphabets
The directed acyclic word graph (DAWG) of a string y is the smallest (partial) DFA which recognizes all suffixes of y and has only O(n) nodes and edges. We present the first O(n)-time algorithm for computing the DAWG of a given string y of length n over an integer alphabet of polynomial size in n. We also show that a straightforward modification to our DAWG construction algorithm leads to the first O(n)-time algorithm for constructing the affix tree of a given string y over an integer alphabet. Affix trees are a text indexing structure supporting bidirectional pattern searches. As an application of our O(n)-time DAWG construction algorithm, we show that the set MAW(y) of all minimal absent words of y can be computed in optimal O(n + |MAW(y)|) time and O(n) working space for integer alphabets.
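As an illustration of what MAW(y) contains, the following brute-force sketch enumerates minimal absent words directly from the definition: a word is a minimal absent word of y if it does not occur in y but all of its proper substrings do. It runs in polynomial rather than optimal time and does not use a DAWG. The function name `minimal_absent_words` and the optional `alphabet` parameter are hypothetical; by default the alphabet is assumed to be the set of characters occurring in y.

```python
def minimal_absent_words(y, alphabet=None):
    """Sorted list of all minimal absent words of y (brute force)."""
    sigma = sorted(set(y) if alphabet is None else set(alphabet))
    # All substrings of y, including the empty string.
    substrings = {""} | {
        y[i:j] for i in range(len(y)) for j in range(i + 1, len(y) + 1)
    }
    # Single characters of the alphabet that never occur in y.
    maws = {c for c in sigma if c not in substrings}
    # A word a+u+b of length >= 2 is a MAW exactly when a+u and u+b occur
    # in y but a+u+b does not: every proper substring of a+u+b is a
    # substring of a+u or of u+b.
    for u in substrings:
        for a in sigma:
            if a + u not in substrings:
                continue
            for b in sigma:
                if u + b in substrings and a + u + b not in substrings:
                    maws.add(a + u + b)
    return sorted(maws)


print(minimal_absent_words("abaab"))  # ['aaa', 'aaba', 'bab', 'bb']
```

For y = "abaab", the word "aaba" is absent while both "aab" and "aba" occur, so it belongs to MAW(y). The word "abb" is absent but not minimal, because its substring "bb" is already absent.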
Explainable and Local Correction of Classification Models Using Decision Trees
In practical machine learning, models are frequently updated, or corrected, to adapt to new datasets. In this study, we pose two challenges for model correction. First, the effects of corrections on end-users need to be described explicitly, as in standard software, where corrections are described in release notes. Second, the amount of correction needs to be small so that the corrected models perform similarly to the old models. In this study, we propose the first model correction method for classification models that resolves these two challenges. Our idea is to use an additional decision tree to correct the output of the old models. Thanks to the explainability of decision trees, the corrections are describable to end-users, which resolves the first challenge. We resolve the second challenge by incorporating the amount of correction into the training of the additional decision tree so that the effects of the corrections are small. Experiments on real data confirm the effectiveness of the proposed method compared to existing correction methods.
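The general idea of correcting an old model with a small, readable rule while penalizing the number of changed predictions can be sketched as follows. This is a heavily simplified illustration and not the method proposed in the paper. It fits a single depth-1 rule (a decision stump) by exhaustive search. The objective (errors plus `lam` times the number of changed predictions), the parameter `lam`, and all function names are assumptions made for this example.

```python
def apply_rule(rule, row, old):
    """Override the old prediction where the rule fires; otherwise keep it."""
    if rule is None:
        return old
    feature, threshold, side, label = rule
    if side == "<=":
        fires = row[feature] <= threshold
    else:
        fires = row[feature] > threshold
    return label if fires else old


def fit_correction_stump(X, y, old_pred, lam=0.5):
    """Search for one rule (feature, threshold, side, label) minimizing
    errors on (X, y) plus lam * (number of changed predictions).
    Returns None if no rule beats leaving the old model unchanged."""

    def cost(pred):
        errors = sum(p != t for p, t in zip(pred, y))
        changed = sum(p != o for p, o in zip(pred, old_pred))
        return errors + lam * changed

    best_rule, best_cost = None, cost(old_pred)
    labels = sorted(set(y))
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for side in ("<=", ">"):
                for label in labels:
                    rule = (feature, threshold, side, label)
                    pred = [apply_rule(rule, row, o)
                            for row, o in zip(X, old_pred)]
                    c = cost(pred)
                    if c < best_cost:
                        best_rule, best_cost = rule, c
    return best_rule


# Hypothetical toy data: the old model always predicts class 0.
X = [[0.1], [0.2], [0.8], [0.9]]
y = [0, 0, 1, 1]
old_pred = [0, 0, 0, 0]
print(fit_correction_stump(X, y, old_pred, lam=0.5))  # (0, 0.2, '>', 1)
print(fit_correction_stump(X, y, old_pred, lam=2.0))  # None
```

The returned rule reads like a release note: "if feature 0 is greater than 0.2, predict class 1; otherwise keep the old prediction". With a larger penalty, the search prefers to leave the old model untouched.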