Speeding-up q-gram mining on grammar-based compressed texts
We present an efficient algorithm for calculating $q$-gram frequencies on
strings represented in compressed form, namely, as a straight line program
(SLP). Given an SLP $\mathcal{T}$ of size $n$ that represents string $T$, the
algorithm computes the occurrence frequencies of all $q$-grams in $T$, by
reducing the problem to the weighted $q$-gram frequencies problem on a
trie-like structure of size $m = |T| - \mathit{dup}(q,\mathcal{T})$, where
$\mathit{dup}(q,\mathcal{T})$ is a quantity that represents the amount of
redundancy that the SLP captures with respect to $q$-grams. The reduced problem
can be solved in linear time. Since $m = O(qn)$, the running time of our
algorithm is $O(\min\{|T| - \mathit{dup}(q,\mathcal{T}), qn\})$, improving our
previous $O(qn)$ algorithm when $q = \Omega(|T|/n)$.
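As a rough illustration of counting q-grams directly on an SLP, the following Python sketch computes the frequencies without expanding the whole string. It follows the simple boundary-crossing idea (each rule X -> YZ contributes the q-grams that straddle the Y/Z boundary, weighted by how often X occurs in the derivation tree), not the trie-based reduction described in the abstract. The rule encoding, the names `qgram_frequencies`, `expand`, and `RULES`, and the toy grammar are assumptions made for this example only.

```python
from collections import Counter

# Hypothetical toy SLP: a variable maps to a terminal character
# or to a pair (left, right) of variables. X6 derives "abaababa".
RULES = {
    "X1": "a", "X2": "b",
    "X3": ("X1", "X2"), "X4": ("X3", "X1"),
    "X5": ("X4", "X3"), "X6": ("X5", "X4"),
}

def expand(rules, x):
    """Fully decompress variable x (used only to check the result)."""
    r = rules[x]
    if isinstance(r, tuple):
        return expand(rules, r[0]) + expand(rules, r[1])
    return r

def qgram_frequencies(rules, start, q):
    """Count all q-grams of the string derived from `start`."""
    # Post-order over the grammar DAG: children come before parents.
    order, seen = [], set()
    def visit(x):
        if x in seen:
            return
        seen.add(x)
        if isinstance(rules[x], tuple):
            for child in rules[x]:
                visit(child)
        order.append(x)
    visit(start)

    # Prefix and suffix of length at most q-1 for every variable.
    k = q - 1
    pre, suf = {}, {}
    for x in order:
        r = rules[x]
        if isinstance(r, tuple):
            y, z = r
            pre[x] = (pre[y] + pre[z])[:k]
            suf[x] = (suf[y] + suf[z])[-k:] if k else ""
        else:
            pre[x] = r[:k]
            suf[x] = r[-k:] if k else ""

    # Number of occurrences of each variable in the derivation tree.
    occ = Counter({start: 1})
    for x in reversed(order):  # parents before children
        if isinstance(rules[x], tuple):
            for child in rules[x]:
                occ[child] += occ[x]

    # Each q-gram occurrence crosses the boundary of exactly one rule.
    freq = Counter()
    for x in order:
        r = rules[x]
        if isinstance(r, tuple):
            w = suf[r[0]] + pre[r[1]]
            for i in range(len(w) - q + 1):
                freq[w[i:i + q]] += occ[x]
        elif q == 1:
            freq[r] += occ[x]  # 1-grams sit at the leaves
    return freq

print(qgram_frequencies(RULES, "X6", 2))
```

For the toy grammar, which derives "abaababa", the 2-gram counts are ab: 3, ba: 3, aa: 1. The sketch touches at most 2(q-1) characters per rule, which corresponds to the O(qn) behaviour that the abstract cites as the earlier baseline.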
Improved ESP-index: a practical self-index for highly repetitive texts
While several self-indexes for highly repetitive texts exist, developing a
practical self-index applicable to real world repetitive texts remains a
challenge. ESP-index is a grammar-based self-index based on the notion of
edit-sensitive parsing (ESP), an efficient parsing algorithm that guarantees
upper bounds of parsing discrepancies between different appearances of the same
subtexts in a text. Although ESP-index performs efficient top-down searches of
query texts, it relies on binary searches to find appearances of variables for
a query text, which slows down query searches. We present an improved
ESP-index (ESP-index-I) by leveraging the idea
behind succinct data structures for large alphabets. While ESP-index-I keeps
the same types of efficiencies as ESP-index for the top-down searches, it
avoids the binary searches by using fast rank/select operations. We experimentally
test ESP-index-I on the ability to search query texts and extract subtexts from
real-world repetitive texts on a large scale, and we show that ESP-index-I
performs better than other possible approaches.
Comment: This is the full version of a proceeding accepted to the 11th
International Symposium on Experimental Algorithms (SEA2014)
Rank, select and access in grammar-compressed strings
Given a string $S$ of length $N$ on a fixed alphabet of $\sigma$ symbols, a
grammar compressor produces a context-free grammar $G$ of size $n$ that
generates $S$ and only $S$. In this paper we describe data structures to
support the following operations on a grammar-compressed string:
$\mbox{rank}_c(S,i)$ (return the number of occurrences of symbol $c$ before
position $i$ in $S$); $\mbox{select}_c(S,i)$ (return the position of the $i$th
occurrence of $c$ in $S$); and $\mbox{access}(S,i,j)$ (return substring
$S[i,j]$). For rank and select we describe data structures of size
$O(n\sigma\log N)$ bits that support the two operations in $O(\log N)$ time. We
propose another structure that uses $O(n\sigma\log(N/n)\log^{1+\epsilon} n)$
bits and that supports the two queries in $O(\log N/\log\log N)$ time, where
$\epsilon > 0$ is an arbitrary constant. To our knowledge, we are the first to
study the asymptotic complexity of rank and select in the grammar-compressed
setting, and we provide a hardness result showing that significantly improving
the bounds we achieve would imply a major breakthrough on a hard
graph-theoretical problem. Our main result for access is a method that requires
$O(n\log N)$ bits of space and $O(\log N + m/\log_\sigma N)$ time to extract
$m = j-i+1$ consecutive symbols from $S$. Alternatively, we can achieve
$O(\log N/\log\log N + m/\log_\sigma N)$ query time using
$O(n\log(N/n)\log^{1+\epsilon} n)$ bits of space. This matches a lower bound
stated by Verbin and Yu for strings where $N$ is polynomially related to $n$.
Comment: 16 pages
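To make the three operations concrete, here is a minimal Python sketch that answers rank, select, and access by walking down a grammar from its start variable. It is a naive baseline and not one of the data structures from the abstract: it stores a length and a per-symbol count for every variable and takes time proportional to the grammar height. The class name `GrammarString`, the rule encoding, the toy grammar, and the indexing conventions (0-based positions, rank counting occurrences strictly before position i, select taking a 1-based occurrence number, access with inclusive bounds) are assumptions made for this example.

```python
from collections import Counter

# Hypothetical toy grammar: a variable maps to a terminal character
# or to a pair (left, right) of variables. X6 derives "abaababa".
RULES = {
    "X1": "a", "X2": "b",
    "X3": ("X1", "X2"), "X4": ("X3", "X1"),
    "X5": ("X4", "X3"), "X6": ("X5", "X4"),
}

class GrammarString:
    def __init__(self, rules, start):
        self.rules, self.start = rules, start
        self.length, self.counts = {}, {}
        self._prepare(start)

    def _prepare(self, x):
        """Store expansion length and symbol counts for every variable."""
        if x in self.length:
            return
        r = self.rules[x]
        if isinstance(r, tuple):
            y, z = r
            self._prepare(y)
            self._prepare(z)
            self.length[x] = self.length[y] + self.length[z]
            self.counts[x] = self.counts[y] + self.counts[z]
        else:
            self.length[x] = 1
            self.counts[x] = Counter(r)

    def rank(self, c, i):
        """Number of occurrences of c in S[0:i]."""
        x, total = self.start, 0
        while i > 0:
            r = self.rules[x]
            if not isinstance(r, tuple):
                return total + (r == c)  # single character, i == 1
            y, z = r
            if i >= self.length[y]:
                total += self.counts[y][c]  # all of y lies before i
                i -= self.length[y]
                x = z
            else:
                x = y
        return total

    def select(self, c, j):
        """0-based position of the j-th (1-based) occurrence of c."""
        if not 1 <= j <= self.counts[self.start][c]:
            raise ValueError("no such occurrence")
        x, offset = self.start, 0
        while isinstance(self.rules[x], tuple):
            y, z = self.rules[x]
            if j <= self.counts[y][c]:
                x = y
            else:
                j -= self.counts[y][c]
                offset += self.length[y]
                x = z
        return offset

    def access(self, i, j):
        """Substring S[i..j], both ends inclusive, 0 <= i <= j < N."""
        out = []
        def walk(x, lo, hi):  # emit characters lo..hi-1 of x's expansion
            r = self.rules[x]
            if not isinstance(r, tuple):
                out.append(r)
                return
            y, z = r
            ly = self.length[y]
            if lo < ly:
                walk(y, lo, min(hi, ly))
            if hi > ly:
                walk(z, max(lo - ly, 0), hi - ly)
        if i <= j:
            walk(self.start, i, j + 1)
        return "".join(out)

g = GrammarString(RULES, "X6")
print(g.rank("a", 4), g.select("b", 2), g.access(2, 5))
```

Keeping one counter per variable costs space proportional to the number of rules times the alphabet size, which is the same trade-off the rank/select bounds in the abstract reflect.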
Compact q-gram Profiling of Compressed Strings
We consider the problem of computing the q-gram profile of a string $T$ of
size $N$ compressed by a context-free grammar with $n$ production rules. We
present an algorithm that runs in $O(N-\alpha)$ expected time and uses
$O(n+q+k_{T,q})$ space, where $N-\alpha \leq qn$ is the exact number of characters
decompressed by the algorithm and $k_{T,q} \leq N-\alpha$ is the number of distinct
q-grams in $T$. This simultaneously matches the current best known time
bound and improves the best known space bound. Our space bound is
asymptotically optimal in the sense that any algorithm storing the grammar and
the q-gram profile must use $\Omega(n+q+k_{T,q})$ space. To achieve this we
introduce the q-gram graph that space-efficiently captures the structure of a
string with respect to its q-grams, and show how to construct it from a
grammar.
Extended Formulations via Decision Diagrams
We propose a general algorithm for constructing an extended formulation for
any given set of linear constraints with integer coefficients. Our algorithm
consists of two phases: first construct a decision diagram $(V,E)$ that
represents a given $m \times n$ constraint matrix, and then build an equivalent
set of $|E|$ linear constraints over $n+|V|$ variables. That is, the size of
the resultant extended formulation depends not explicitly on the number $m$ of
the original constraints, but on its decision diagram representation.
Therefore, we may significantly reduce the computation time for optimization
problems with integer constraint matrices by solving them under the extended
formulations, especially when we obtain concise decision diagram
representations for the matrices. We can apply our method to $1$-norm
regularized hard margin optimization over the binary instance space
$\{0,1\}^n$, which can be formulated as a linear programming problem with $m$
constraints with $\{-1,0,1\}$-valued coefficients over $n$ variables, where $m$
is the size of the given sample. Furthermore, introducing slack variables over
the edges of the decision diagram, we establish a variant formulation of soft
margin optimization. We demonstrate the effectiveness of our extended
formulations for integer programming and the $1$-norm regularized soft margin
optimization tasks over synthetic and real datasets.