Edit Distance with Block Operations
We consider the problem of edit distance in which block operations are allowed, i.e. we ask for the minimal number of (block) operations that are needed to transform a string s into t. We give O(log n) approximation algorithms, where n is the total length of the input strings, for the variants of the problem which allow the following sets of operations: block move; block move and block delete; block move and block copy; block move, block copy, and block uncopy. The results still hold if we additionally allow any of the following operations: character insert, character delete, block reversal, or block involution (involution is a generalisation of the reversal). Previously, algorithms were known only for the first and last variants, and they had approximation ratios O(log n log^* n) and O(log n (log^* n)^2), respectively. The edit distance with block moves is equivalent, up to a constant factor, to the common string partition problem, in which we are given two strings s, t and the goal is to partition s into a minimal number of parts such that they can be permuted in order to obtain t. Thus we also obtain an O(log n) approximation for this problem (compared to the previous O(log n log^* n)).
The results use a simplification of the previously used technique of locally consistent parsing, which groups short substrings of a string into phrases so that similar substrings are guaranteed to be grouped in a similar way. Instead of a sophisticated parsing technique relying on deterministic coin tossing, we use a simple one based on a partition of the alphabet into two subalphabets. In particular, this lowers the running time from O(n log^* n) to O(n). The new algorithms (for block copy or block delete) use a similar algorithm, but the analysis is based on a specially tuned combinatorial function on sets of numbers.
Online Pattern Matching for String Edit Distance with Moves
Edit distance with moves (EDM) is a string-to-string distance measure that
includes substring moves in addition to ordinary editing operations to turn one
string into the other. Although optimizing EDM is intractable, it has many
applications, especially in error detection. Edit sensitive parsing (ESP) is an
efficient parsing algorithm that guarantees an upper bound of parsing
discrepancies between different appearances of the same substrings in a string.
ESP can be used for computing an approximate EDM as the L1 distance between
characteristic vectors built from node labels in parsing trees. However, ESP is
not applicable to streaming text data where the whole text is unknown in
advance. We present an online ESP (OESP) that enables online pattern
matching for EDM. OESP builds a parse tree for a streaming text and computes
the L1 distance between characteristic vectors in an online manner. For the
space-efficient computation of EDM, OESP directly encodes the parse tree into a
succinct representation by leveraging the idea behind recent results of a
dynamic succinct tree. We experimentally test OESP on the ability to compute
EDM in an online manner on benchmark datasets, and we show OESP's efficiency.
Comment: This paper has been accepted to the 21st edition of the International
Symposium on String Processing and Information Retrieval (SPIRE2014)
Computational Performance Evaluation of Two Integer Linear Programming Models for the Minimum Common String Partition Problem
In the minimum common string partition (MCSP) problem two related input
strings are given. "Related" refers to the property that both strings consist
of the same set of letters appearing the same number of times in each of the
two strings. The MCSP seeks a minimum cardinality partitioning of one string
into non-overlapping substrings that is also a valid partitioning for the
second string. This problem has applications in bioinformatics, e.g., in
analyzing related DNA or protein sequences. For strings with lengths less than
about 1000 letters, a previously published integer linear programming (ILP)
formulation yields, when solved with a state-of-the-art solver such as CPLEX,
satisfactory results. In this work, we propose a new, alternative ILP model
that is compared to the former one. While a polyhedral study shows the linear
programming relaxations of the two models to be equally strong, a comprehensive
experimental comparison using real-world as well as artificially created
benchmark instances indicates substantial computational advantages of the new
formulation.
Comment: arXiv admin note: text overlap with arXiv:1405.5646. This paper
version replaces the one submitted on January 10, 2015, due to a detected error
in the calculation of the variables involved in the ILP model.
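The MCSP definition above can be made concrete with a small brute-force sketch. It is not the ILP model from the abstract; it simply enumerates cut positions and is only practical for very short strings. The function names (are_related, is_common_partition, min_common_partition) are hypothetical and chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def are_related(s, t):
    """Two strings are 'related' if every letter occurs equally often in both."""
    return Counter(s) == Counter(t)


def is_common_partition(blocks, s, t):
    """Check that blocks concatenate to s and can be reordered to spell t."""
    if not all(blocks) or "".join(blocks) != s or not are_related(s, t):
        return False
    remaining = Counter(blocks)  # multiset of blocks still available

    def tile(pos):
        # Try to cover t[pos:] with the blocks that are still unused.
        if pos == len(t):
            return True
        for block in list(remaining):
            if remaining[block] > 0 and t.startswith(block, pos):
                remaining[block] -= 1
                if tile(pos + len(block)):
                    return True
                remaining[block] += 1
        return False

    return tile(0)


def min_common_partition(s, t):
    """Exhaustive search for a minimum common partition (exponential time)."""
    if not are_related(s, t):
        raise ValueError("input strings are not related")
    n = len(s)
    for k in range(1, n + 1):  # try partitions with k blocks, smallest first
        for cuts in combinations(range(1, n), k - 1):
            bounds = (0, *cuts, n)
            blocks = [s[a:b] for a, b in zip(bounds, bounds[1:])]
            if is_common_partition(blocks, s, t):
                return blocks
    return []  # only reached for empty strings
```

For example, min_common_partition("abab", "baba") returns a two-block partition, since no single block works and "a" + "bab" can be reordered to "bab" + "a".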
Space-efficient Feature Maps for String Alignment Kernels
String kernels are attractive data analysis tools for analyzing string data.
Among them, alignment kernels are known for their high prediction accuracies in
string classifications when tested in combination with SVM in various
applications. However, alignment kernels have a crucial drawback in that they
scale poorly due to their quadratic computational complexity in the number of
input strings, which limits large-scale applications in practice. We address
this limitation by presenting the first approximation for string alignment kernels,
which we call space-efficient feature maps for edit distance with moves
(SFMEDM), by leveraging a metric embedding named edit sensitive parsing (ESP)
and feature maps (FMs) of random Fourier features (RFFs) for large-scale string
analyses. The original FMs for RFFs consume a huge amount of memory
proportional to the dimension d of input vectors and the dimension D of output
vectors, which prohibits their use in large-scale applications. We present novel
space-efficient feature maps (SFMs) of RFFs for a space reduction from O(dD) of
the original FMs to O(d) of SFMs with a theoretical guarantee with respect to
concentration bounds. We experimentally test SFMEDM on its ability to learn SVM
for large-scale string classifications with various massive string data, and we
demonstrate the superior performance of SFMEDM with respect to prediction
accuracy, scalability and computation efficiency.
Comment: Full version for ICDM'19 paper
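To illustrate where the O(dD) memory cost of the original feature maps comes from, here is a minimal sketch of standard random Fourier features. It assumes a Gaussian kernel purely for illustration and is not the space-efficient SFM construction from the abstract; the names make_rff and gaussian_kernel are hypothetical.

```python
import math
import random


def make_rff(d, D, sigma=1.0, seed=0):
    """Build a standard RFF map R^d -> R^D for a Gaussian kernel (assumed here).

    The projection matrix W holds D * d random numbers, which is the
    O(dD) memory cost mentioned in the abstract.
    """
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(d)] for _ in range(D)]
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(D)]
    scale = math.sqrt(2.0 / D)

    def feature_map(x):
        return [
            scale * math.cos(sum(w_k * x_k for w_k, x_k in zip(w, x)) + b_i)
            for w, b_i in zip(W, b)
        ]

    return feature_map


def gaussian_kernel(x, y, sigma=1.0):
    """Exact kernel value that the feature map approximates."""
    sq_dist = sum((a - c) ** 2 for a, c in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


if __name__ == "__main__":
    z = make_rff(d=3, D=4000)
    x, y = [0.1, 0.2, 0.3], [0.3, 0.0, 0.5]
    approx = sum(p * q for p, q in zip(z(x), z(y)))
    print(approx, gaussian_kernel(x, y))
```

The inner product of two mapped vectors approximates the kernel value, so a linear SVM on the mapped vectors can stand in for a kernel SVM.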
siEDM: an efficient string index and search algorithm for edit distance with moves
Although several self-indexes for highly repetitive text collections exist,
developing an index and search algorithm with editing operations remains a
challenge. Edit distance with moves (EDM) is a string-to-string distance
measure that includes substring moves in addition to ordinary editing operations
to turn one string into another. Although the problem of computing EDM is
intractable, it has a wide range of potential applications, especially in
approximate string retrieval. Despite the importance of computing EDM, there
has been no efficient method for indexing and searching large text collections
based on the EDM measure. We propose the first algorithm, named string index
for edit distance with moves (siEDM), for indexing and searching strings with
EDM. The siEDM algorithm builds an index structure by leveraging the idea
behind the edit sensitive parsing (ESP), an efficient algorithm enabling
approximately computing EDM with guarantees of upper and lower bounds for the
exact EDM. siEDM efficiently prunes the space for searching query strings by
the proposed method, which enables fast query searches with the same guarantee
as ESP. We experimentally tested the ability of siEDM to index and search
strings on benchmark datasets, and we showed siEDM's efficiency.
Comment: 23 pages
Pairwise sequence alignment with block and character edit operations
Pairwise sequence comparison is one of the most fundamental problems in
string processing. The most common metric to quantify the similarity between
sequences S and T is edit distance, d(S,T), which corresponds to the number of
characters that need to be substituted, deleted from, or inserted into S to
generate T. However, fewer edit operations may be sufficient for some string
pairs to transform one string to the other if larger rearrangements are
permitted. Block edit distance accounts for such changes at the substring level
(i.e., blocks): it "penalizes" entire block removals, insertions, copies, and
reversals with the same cost as single-character edits (Lopresti & Tomkins,
1997). Most studies on block edit distance to date have aimed only to
characterize the distance itself for applications in sequence nearest neighbor
search, without reporting the full alignment details. Although a few tools try
to solve block edit distance for genomic sequences, such as GR-Aligner, they
have limited functionality and are no longer maintained.
Here, we present SABER, an algorithm to solve block edit distance that
supports block deletions, block moves, and block reversals in addition to the
classical single-character edit operations. Our algorithm runs in
O(m^2 · n · l_range) time for |S| = m, |T| = n, and a permitted block size range of
l_range, and can report all breakpoints for the block operations. We also
provide an implementation of SABER currently optimized for genomic sequences
(i.e., generated by the DNA alphabet), although the algorithm can theoretically
be used for any alphabet.
SABER is available at http://github.com/BilkentCompGen/sabe
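The classical character-level edit distance d(S,T) that the abstract starts from can be sketched with the textbook dynamic program below. This is the standard algorithm, not SABER; block operations are not handled, and the function name edit_distance is chosen for illustration.

```python
def edit_distance(s, t):
    """Classical edit distance: unit-cost substitutions, deletions, insertions."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, 1):
            cur.append(
                min(
                    prev[j] + 1,               # delete cs from s
                    cur[j - 1] + 1,            # insert ct into s
                    prev[j - 1] + (cs != ct),  # substitute, or match for free
                )
            )
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))  # 3
```

The table has (m+1)(n+1) cells, but only two rows are kept in memory at a time.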
On the k-Hamming and k-Edit Distances
In this paper we consider the weighted k-Hamming and k-Edit distances, which are natural generalizations
of the classical Hamming and Edit distances. As the main results of this paper, we prove that for any k ≥ 2
the DECIS-k-Hamming problem is P-SPACE-complete and the DECIS-k-Edit problem is NEXPTIME-complete. In our formulation, weights are included in the instance description and the cost is not
uniform.
Alignments with non-overlapping moves, inversions and tandem duplications in O(n^4) time
Sequence alignment is a central problem in bioinformatics. The classical dynamic programming algorithm aligns two sequences by optimizing over possible insertions, deletions and substitutions. However, other evolutionary events can be observed, such as inversions, tandem duplications or moves (transpositions). It has been established that the extension of the problem to move operations is NP-complete. Previous work has shown that an extension restricted to non-overlapping inversions can be solved in O(n^3) with a restricted scoring scheme. In this paper, we show that the alignment problem extended to non-overlapping moves can be solved in O(n^5) for general scoring schemes, O(n^4 log n) for concave scoring schemes and O(n^4) for restricted scoring schemes. Furthermore, we show that the alignment problem extended to non-overlapping moves, inversions and tandem duplications can be solved with the same time complexities. Finally, an example of an alignment with non-overlapping moves is provided.
Approximating reversal distance for strings with bounded number of duplicates
For a string A = a_1…a_n, a reversal ρ(i,j), 1 ⩽ i ⩽ j ⩽ n, transforms the string A into a string A′ = a_1…a_{i-1} a_j a_{j-1}…a_i a_{j+1}…a_n, that is, the reversal ρ(i,j) reverses the order of symbols in the substring a_i…a_j of A. In the case of signed strings, where each symbol is given a sign + or -, the reversal operation also flips the sign of each symbol in the reversed substring. Given two strings, A and B, signed or unsigned, sorting by reversals (SBR) is the problem of finding the minimum number of reversals that transform the string A into the string B. Traditionally, the problem was studied for permutations, that is, for strings in which every symbol appears exactly once. We consider a generalization of the problem, k-SBR, and allow each symbol to appear at most k times in each string, for some k ⩾ 1. The main result of the paper is an O(k^2)-approximation algorithm running in time O(n). For instances with 3 < k ⩽ O(log n log^* n), this is the best known approximation algorithm for k-SBR and, moreover, it is faster than the previous best approximation algorithm.
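The reversal operation ρ(i,j) defined in this abstract can be sketched directly. The snippet only applies a single reversal and is not the approximation algorithm itself; the function names and the representation of signed strings as lists of (symbol, sign) pairs are assumptions made for illustration.

```python
def reversal(a, i, j):
    """Apply rho(i, j) to an unsigned string, with 1-based inclusive indices."""
    if not 1 <= i <= j <= len(a):
        raise ValueError("require 1 <= i <= j <= len(a)")
    return a[: i - 1] + a[i - 1 : j][::-1] + a[j:]


def signed_reversal(a, i, j):
    """Apply rho(i, j) to a signed string given as a list of (symbol, sign) pairs.

    The segment is reversed and the sign (+1 or -1) of each symbol in it is flipped.
    """
    if not 1 <= i <= j <= len(a):
        raise ValueError("require 1 <= i <= j <= len(a)")
    segment = [(symbol, -sign) for symbol, sign in reversed(a[i - 1 : j])]
    return a[: i - 1] + segment + a[j:]


if __name__ == "__main__":
    print(reversal("abcde", 2, 4))  # adcbe
    print(signed_reversal([("a", 1), ("b", 1), ("c", -1)], 2, 3))
```

Applying the same reversal twice restores the original string in both the signed and the unsigned case.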