siEDM: an efficient string index and search algorithm for edit distance with moves
Although several self-indexes for highly repetitive text collections exist,
developing an index and search algorithm with editing operations remains a
challenge. Edit distance with moves (EDM) is a string-to-string distance
measure that includes substring moves in addition to ordinal editing operations
to turn one string into another. Although the problem of computing EDM is
intractable, it has a wide range of potential applications, especially in
approximate string retrieval. Despite the importance of computing EDM, there
has been no efficient method for indexing and searching large text collections
based on the EDM measure. We propose the first algorithm, named string index
for edit distance with moves (siEDM), for indexing and searching strings with
EDM. The siEDM algorithm builds an index structure by leveraging the idea
behind edit sensitive parsing (ESP), an efficient algorithm that
approximates EDM with guaranteed upper and lower bounds on the
exact EDM. siEDM efficiently prunes the space for searching query strings by
the proposed method, which enables fast query searches with the same guarantee
as ESP. We experimentally tested the ability of siEDM to index and search
strings on benchmark datasets, and we showed siEDM's efficiency.
Comment: 23 pages
Heuristic algorithms for the Longest Filled Common Subsequence Problem
At CPM 2017, Castelli et al. define and study a new variant of the Longest
Common Subsequence Problem, termed the Longest Filled Common Subsequence
Problem (LFCS). For the LFCS problem, the input consists of two strings A and B
and a multiset of characters M. The goal is to insert the
characters from M into the string B, thus obtaining a new string
B*, such that the Longest Common Subsequence (LCS) between A and B* is
maximized. Castelli et al. show that the problem is NP-hard and provide a
3/5-approximation algorithm for the problem.
In this paper we study the problem from the experimental point of view. We
introduce, implement and test new heuristic algorithms and compare them with
the approximation algorithm of Castelli et al. Moreover, we introduce an Integer
Linear Program (ILP) model for the problem and we use the state-of-the-art ILP
solver, Gurobi, to obtain exact solutions for moderately sized instances.
Comment: Accepted and presented as a proceedings paper at SYNASC 201
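The LFCS objective described in this abstract can be illustrated on tiny inputs with an exhaustive search. The sketch below is not the approximation algorithm, the heuristics, or the ILP model from the paper; it is a brute-force baseline, and the function names `lcs_length` and `lfcs_brute_force` are hypothetical. It inserts every character of the multiset, which is safe because inserting extra characters can never shorten the LCS.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, classic row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lfcs_brute_force(a: str, b: str, extra: str) -> int:
    """Best LCS(a, b*) over all strings b* obtained by inserting the
    characters of `extra` into `b`. Exponential time: tiny inputs only."""
    best = 0

    def fill(current: str, remaining: str) -> None:
        nonlocal best
        if not remaining:
            best = max(best, lcs_length(a, current))
            return
        ch, rest = remaining[0], remaining[1:]
        # Try the next character at every possible position.
        for pos in range(len(current) + 1):
            fill(current[:pos] + ch + current[pos:], rest)

    fill(b, extra)
    return best


if __name__ == "__main__":
    # Filling "ace" with {b, d} can rebuild "abcde" exactly.
    print(lfcs_brute_force("abcde", "ace", "bd"))
```

A baseline like this is only useful for checking heuristic output on very small instances, since the number of filled strings grows factorially.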
Online Pattern Matching for String Edit Distance with Moves
Edit distance with moves (EDM) is a string-to-string distance measure that
includes substring moves in addition to ordinal editing operations to turn one
string into the other. Although optimizing EDM is intractable, it has many
applications, especially in error detection. Edit sensitive parsing (ESP) is an
efficient parsing algorithm that guarantees an upper bound of parsing
discrepancies between different appearances of the same substrings in a string.
ESP can be used for computing an approximate EDM as the L1 distance between
characteristic vectors built by node labels in parsing trees. However, ESP is
not applicable to streaming text data, where the whole text is unknown in
advance. We present an online ESP (OESP) that enables online pattern
matching for EDM. OESP builds a parse tree for a streaming text and computes
the L1 distance between characteristic vectors in an online manner. For the
space-efficient computation of EDM, OESP directly encodes the parse tree into a
succinct representation by leveraging the idea behind recent results of a
dynamic succinct tree. We experimentally test OESP on the ability to compute
EDM in an online manner on benchmark datasets, and we show OESP's efficiency.
Comment: This paper has been accepted to the 21st edition of the International
Symposium on String Processing and Information Retrieval (SPIRE2014)
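The L1 distance between characteristic vectors mentioned in this abstract can be sketched as follows. The code does not perform ESP or OESP parsing: the node labels are assumed to be given, and `l1_distance` and `OnlineL1` are hypothetical names used only for illustration. The class shows one simple way to keep the distance up to date as labels arrive one at a time, which is the flavour of computation an online setting needs.

```python
from collections import Counter


def l1_distance(u: Counter, v: Counter) -> int:
    """L1 distance between two label-count vectors."""
    return sum(abs(u[k] - v[k]) for k in u.keys() | v.keys())


class OnlineL1:
    """Tracks the L1 distance between a fixed pattern vector and a
    text vector that grows by one label at a time."""

    def __init__(self, pattern: Counter) -> None:
        self.pattern = pattern
        self.text = Counter()
        # With an empty text vector the distance is the pattern's total count.
        self.distance = sum(pattern.values())

    def add(self, label: str) -> int:
        # A label still missing from the text closes the gap by one;
        # any other label widens it by one.
        if self.text[label] < self.pattern[label]:
            self.distance -= 1
        else:
            self.distance += 1
        self.text[label] += 1
        return self.distance


if __name__ == "__main__":
    pattern = Counter(["x1", "x1", "x2"])  # hypothetical node labels
    tracker = OnlineL1(pattern)
    for label in ["x1", "x3", "x1", "x2"]:
        print(label, tracker.add(label))
```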
The Graph Motif problem parameterized by the structure of the input graph
The Graph Motif problem was introduced in 2006 in the context of biological
networks. It consists of deciding whether or not a multiset of colors occurs in
a connected subgraph of a vertex-colored graph. Graph Motif has been mostly
analyzed from the standpoint of parameterized complexity. The main parameters
which came into consideration were the size of the multiset and the number of
colors. However, in many applications of Graph Motif, the input graph
originates from real life and has structure. Motivated by this prosaic
observation, we systematically study its complexity relative to graph
structural parameters. For a wide range of parameters, we give new or improved
FPT algorithms, or show that the problem remains intractable. For the FPT
cases, we also give some kernelization lower bounds as well as some ETH-based
lower bounds on the worst case running time. Interestingly, we establish that
Graph Motif is W[1]-hard (while in W[P]) for parameter max leaf number, which
is, to the best of our knowledge, the first problem to behave this way.
Comment: 24 pages, accepted in DAM, conference version in IPEC 201
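The decision problem stated in this abstract can be made concrete with a brute-force checker. This is a sketch for very small graphs only and is unrelated to the FPT algorithms of the paper. It assumes the usual reading of the problem, namely that a vertex set inducing a connected subgraph must carry exactly the motif's multiset of colors. The names `graph_motif`, `is_connected`, `adjacency`, and `colors` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def is_connected(vertices, adjacency) -> bool:
    """True if `vertices` induce a connected subgraph (depth-first search)."""
    vertices = set(vertices)
    if not vertices:
        return False
    start = next(iter(vertices))
    seen = {start}
    stack = [start]
    while stack:
        v = stack.pop()
        for w in adjacency.get(v, ()):
            if w in vertices and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == vertices


def graph_motif(adjacency, colors, motif) -> bool:
    """Check every vertex subset of size |motif|; exponential time."""
    target = Counter(motif)
    k = sum(target.values())
    for subset in combinations(colors, k):
        if Counter(colors[v] for v in subset) != target:
            continue
        if is_connected(subset, adjacency):
            return True
    return False


if __name__ == "__main__":
    # Path a - b - c colored red, blue, red.
    adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
    colors = {"a": "red", "b": "blue", "c": "red"}
    print(graph_motif(adjacency, colors, ["red", "blue"]))
    print(graph_motif(adjacency, colors, ["red", "red"]))
```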
On palimpsests in neural memory: an information theory viewpoint
The finite capacity of neural memory and the
reconsolidation phenomenon suggest it is important to be able
to update stored information as in a palimpsest, where new
information overwrites old information. Moreover, changing
information in memory is metabolically costly. In this paper, we
suggest that information-theoretic approaches may inform the
fundamental limits in constructing such a memory system. In
particular, we define malleable coding, which considers not only
representation length but also ease of representation update,
thereby encouraging some form of recycling to convert an old
codeword into a new one. Malleability cost is the difficulty of
synchronizing compressed versions, and malleable codes are of
particular interest when representing information and modifying
the representation are both expensive. We examine the tradeoff
between compression efficiency and malleability cost, under a
malleability metric defined with respect to a string edit distance.
This introduces a metric topology to the compressed domain. We
characterize the exact set of achievable rates and malleability as
the solution of a subgraph isomorphism problem. This is all done
within the optimization approach to biology framework.
Accepted manuscript
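To make the notion of an update cost between codewords more tangible, the sketch below computes a string edit distance between an old and a new codeword. The abstract does not say which edit distance defines the malleability metric, so the use of Levenshtein distance (unit-cost insertions, deletions, and substitutions) is an assumption, and `edit_distance` is a hypothetical name.

```python
def edit_distance(old: str, new: str) -> int:
    """Levenshtein distance between two codewords, row-by-row DP."""
    prev = list(range(len(new) + 1))
    for i, a in enumerate(old, start=1):
        curr = [i]
        for j, b in enumerate(new, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a symbol of the old codeword
                curr[j - 1] + 1,         # insert a symbol of the new codeword
                prev[j - 1] + (a != b),  # substitute, free if symbols match
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Number of symbol edits needed to turn one codeword into another.
    print(edit_distance("0110100", "0111100"))
```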