A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance

Author: Bellare Kedar
McCallum Andrew
Pereira Fernando
Publication venue: ScholarWorks@UMass Amherst
Publication date: 01/01/2005

The need to measure sequence similarity arises in information extraction, object identity, data mining, biological sequence analysis, and other domains. This paper presents discriminative string-edit CRFs, a finite-state conditional random field model for edit sequences between strings. Conditional random fields have advantages over generative approaches to this problem, such as pair HMMs or the work of Ristad and Yianilos, because as conditionally-trained methods, they enable the use of complex, arbitrary actions and features of the input strings. As in generative models, the training data does not have to specify the edit sequences between the given string pairs. Unlike generative models, however, our model is trained on both positive and negative instances of string pairs. We present positive experimental results on several data sets.

CiteSeerX

ScholarWorks@UMass Amherst

The k-mismatch problem revisited

Author: Clifford Raphaël
Fontaine Allyx
Porat Ely
Sach Benjamin
Starikovskaya Tatiana
Publication date: 27/08/2015

We revisit the complexity of one of the most basic problems in pattern matching. In the k-mismatch problem we must compute the Hamming distance between a pattern of length m and every m-length substring of a text of length n, as long as that Hamming distance is at most k. Where the Hamming distance is greater than k at some alignment of the pattern and text, we simply output "No". We study this problem in both the standard offline setting and also as a streaming problem. In the streaming k-mismatch problem the text arrives one symbol at a time and we must give an output before processing any future symbols. Our main results are as follows: 1) Our first result is a deterministic $O(n k^2 \log k / m + n \,\text{polylog}\, m)$ time offline algorithm for k-mismatch on a text of length n. This is a factor of k improvement over the fastest previous result of this form from SODA 2000 by Amihood Amir et al. 2) We then give a randomised and online algorithm which runs in the same time complexity but requires only $O(k^2 \,\text{polylog}\, m)$ space in total. 3) Next we give a randomised $(1+\epsilon)$-approximation algorithm for the streaming k-mismatch problem which uses $O(k^2 \,\text{polylog}\, m / \epsilon^2)$ space and runs in $O(\text{polylog}\, m / \epsilon^2)$ worst-case time per arriving symbol. 4) Finally we combine our new results to derive a randomised $O(k^2 \,\text{polylog}\, m)$ space algorithm for the streaming k-mismatch problem which runs in $O(\sqrt{k} \log k + \text{polylog}\, m)$ worst-case time per arriving symbol. This improves the best previous space complexity for streaming k-mismatch from FOCS 2009 by Benny Porat and Ely Porat by a factor of k. We also improve the time complexity of this previous result by an even greater factor to match the fastest known offline algorithm (up to logarithmic factors).
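To make the k-mismatch output format concrete, here is a minimal brute-force sketch in Python. It is not any of the algorithms from the abstract; it only illustrates the problem definition in O(nm) time. The function name `k_mismatch` and the use of `None` to stand for the "No" answer are assumptions made for this illustration.

```python
from typing import List, Optional


def k_mismatch(text: str, pattern: str, k: int) -> List[Optional[int]]:
    """Naive k-mismatch: one entry per alignment of pattern against text.

    Each entry is the Hamming distance at that alignment if it is at most k,
    and None (standing in for the answer "No") otherwise.
    """
    n, m = len(text), len(pattern)
    out: List[Optional[int]] = []
    for start in range(n - m + 1):
        dist = 0
        for a, b in zip(text[start:start + m], pattern):
            if a != b:
                dist += 1
                if dist > k:
                    break  # already over the threshold, stop counting
        out.append(dist if dist <= k else None)
    return out


print(k_mismatch("abcabdabc", "abc", 1))
# [0, None, None, 1, None, None, 0]
```

The early exit once the count exceeds k keeps the sketch close to the problem statement, where distances above k are never reported.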

arXiv.org e-Print Archive

Crossref

Explore Bristol Research

A Simple Algorithm for Approximating the Text-To-Pattern Hamming Distance

Author: Kopelowitz Tsvi
Porat Ely
Publication venue: OASIcs - OpenAccess Series in Informatics. 1st Symposium on Simplicity in Algorithms (SOSA 2018)
Publication date: 01/01/2018

The algorithmic task of computing the Hamming distance between a given pattern of length m and each location in a text of length n, both over a general alphabet $\Sigma$, is one of the most fundamental algorithmic tasks in string algorithms. The fastest known runtime for exact computation is $\tilde{O}(n\sqrt{m})$. We recently introduced a complicated randomized algorithm for obtaining a $(1 \pm \epsilon)$ approximation for each location in the text in $O((n/\epsilon) \log(1/\epsilon) \log n \log m \log|\Sigma|)$ total time, breaking a barrier that stood for 22 years. In this paper, we introduce an elementary and simple randomized algorithm that takes $O((n/\epsilon) \log n \log m)$ time.

Dagstuhl Research Online Publication Server

Approximate Hamming distance in a stream

Author: Clifford Raphael
Starikovskaya Tatiana
Publication date: 01/01/2016

We consider the problem of computing a $(1+\epsilon)$-approximation of the Hamming distance between a pattern of length $n$ and successive substrings of a stream. We first look at the one-way randomised communication complexity of this problem, giving Alice the first half of the stream and Bob the second half. We show the following: (1) If Alice and Bob both share the pattern then there is an $O(\epsilon^{-4} \log^2 n)$ bit randomised one-way communication protocol. (2) If only Alice has the pattern then there is an $O(\epsilon^{-2} \sqrt{n} \log n)$ bit randomised one-way communication protocol. We then go on to develop small space streaming algorithms for $(1+\epsilon)$-approximate Hamming distance which give worst case running time guarantees per arriving symbol. (1) For binary input alphabets there is an $O(\epsilon^{-3} \sqrt{n} \log^{2} n)$ space and $O(\epsilon^{-2} \log n)$ time streaming $(1+\epsilon)$-approximate Hamming distance algorithm. (2) For general input alphabets there is an $O(\epsilon^{-5} \sqrt{n} \log^{4} n)$ space and $O(\epsilon^{-4} \log^{3} n)$ time streaming $(1+\epsilon)$-approximate Hamming distance algorithm. Comment: Submitted to ICALP' 201

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Explore Bristol Research

Distributed Algorithm for Parallel Edit Distance Computation

Author: Sadiq Muhammad Umair
Yousaf Muhammad Murtaza
Publication venue: Institute of Informatics, Slovak Academy of Sciences
Publication date: 12/01/2021

The edit distance is a measure that quantifies the difference between two strings. It is an important concept because it is used in many domains, such as natural language processing, spell checking, genome matching, and pattern recognition. Edit distance is also known as Levenshtein distance. Sequentially, the edit distance is computed with a dynamic-programming-based strategy that may not provide results in reasonable time when the input strings are large. In this work, a distributed algorithm is presented for parallel edit distance computation. The proposed algorithm is both time and space efficient. It is evaluated on a hybrid setup of distributed and shared memory systems. Results suggest that the proposed algorithm achieves significant performance gain over the existing parallel approach.
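For reference, the sequential dynamic-programming baseline mentioned in the abstract can be sketched as follows. This is the textbook Levenshtein recurrence kept to two rows of the table, not the distributed algorithm proposed in the paper; the function name `edit_distance` is an assumption made for this illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic O(len(a) * len(b)) recurrence.

    Only two rows of the DP table are kept, so memory is O(min(len(a), len(b))).
    """
    if len(a) < len(b):
        a, b = b, a  # make b the shorter string so the rows stay small
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

Each cell depends only on its left, upper, and upper-left neighbours, so cells on the same anti-diagonal are independent of one another; that independence is what parallel formulations of the recurrence generally exploit.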

Computing and Informatics (E-Journal - Institute of Informatics, SAS, Bratislava)

Analysis of the Period Recovery Error Bound

Author: Amir Amihood
Boneh Itai
Itzhaki Michael
Kondratovsky Eitan
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 28th Annual European Symposium on Algorithms (ESA 2020)
Publication date: 01/01/2020

Dagstuhl Research Online Publication Server

A multiprocessor parallel approach to bit-parallel approximate string matching

Author: Chibli Elias Anwar
Publication venue: CSUSB ScholarWorks
Publication date: 01/01/2008

The purpose of this project is to show, with empirical results, that a parallel design using multiple processors can be successfully combined with bit-parallel approximate string matching algorithms to solve practical bioinformatics problems. It will demonstrate that nearly optimal speedup can be achieved with a cluster of between two and eight workstations using MPI (Message Passing Interface), directly decreasing the total latency required to solve a string matching problem.
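As an illustration of what "bit-parallel approximate string matching" means, the sketch below implements one well-known member of that family: Shift-And extended to allow up to k mismatches. The abstract does not say which bit-parallel algorithm the project uses, so this choice is an assumption, as is the function name `shift_and_k_mismatches`. The sketch is single-process and does not involve MPI.

```python
from typing import Dict, List


def shift_and_k_mismatches(text: str, pattern: str, k: int) -> List[int]:
    """Return the end positions in text where pattern matches with <= k mismatches.

    state[d] is a bit vector: bit i is set when pattern[0..i] matches the text
    ending at the current position with at most d mismatches.
    """
    m = len(pattern)
    if m == 0:
        return []
    # masks[ch] has bit i set exactly when pattern[i] == ch.
    masks: Dict[str, int] = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)   # bit that signals a full-length match
    full = (1 << m) - 1     # keeps the vectors at m bits
    state = [0] * (k + 1)
    ends: List[int] = []
    for j, ch in enumerate(text):
        b = masks.get(ch, 0)
        prev_old = state[0]  # previous-step value of the level below
        state[0] = ((state[0] << 1) | 1) & b
        for d in range(1, k + 1):
            old = state[d]
            # Extend with a matching character, or spend one mismatch
            # on top of a prefix that matched with at most d - 1 mismatches.
            state[d] = ((((old << 1) | 1) & b) | ((prev_old << 1) | 1)) & full
            prev_old = old
        if state[k] & accept:
            ends.append(j)
    return ends


print(shift_and_k_mismatches("abcabdabc", "abc", 1))  # [2, 5, 8]
```

Python integers are arbitrary precision, so the pattern length is not limited by a machine word here; in a compiled implementation the same updates would typically operate on one or more fixed-width words.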

CSUSB ScholarWorks