
10,864 research outputs found

Querying and Merging Heterogeneous Data by Approximate Joins on Higher-Order Terms

Authors: Flach PA, Price S
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/09/2008

A practical index for approximate dictionary matching with few mismatches

Authors: Cisłak Aleksander, Grabowski Szymon
Publication date: 11/02/2016

Approximate dictionary matching is a classic string matching problem (checking if a query string occurs in a collection of strings) with applications in, e.g., spellchecking, online catalogs, geolocation, and web searchers. We present a surprisingly simple solution called a split index, which is based on the Dirichlet principle, for matching a keyword with few mismatches, and experimentally show that it offers competitive space-time tradeoffs. Our implementation in the C++ language is focused mostly on data compaction, which is beneficial for the search speed (e.g., by being cache friendly). We compare our solution with other algorithms and we show that it performs better for the Hamming distance. Query times in the order of 1 microsecond were reported for one mismatch for the dictionary size of a few megabytes on a medium-end PC. We also demonstrate that a basic compression technique consisting in q-gram substitution can significantly reduce the index size (up to 50% of the input text size for the DNA), while still keeping the query time relatively low.
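The Dirichlet (pigeonhole) idea behind a split index can be sketched for a single mismatch as follows. This is a minimal illustration, not the authors' C++ implementation: the class name `SplitIndex`, the bucket layout, and the choice of Python are assumptions made here, and none of the data compaction or q-gram compression described in the abstract is attempted. Each word is cut into two parts; if a query differs from a word in at most one position, at least one of the two parts must match exactly, so only words sharing a part with the query need to be verified.

```python
from collections import defaultdict


def hamming(a, b):
    """Count mismatching positions of two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


class SplitIndex:
    """Hypothetical sketch of a split index for at most one mismatch."""

    def __init__(self, words):
        # Key: (word length, part number, part text) -> words with that part.
        self.buckets = defaultdict(list)
        for w in set(words):
            mid = len(w) // 2
            self.buckets[(len(w), 0, w[:mid])].append(w)
            self.buckets[(len(w), 1, w[mid:])].append(w)

    def query(self, q):
        """Return all indexed words within Hamming distance 1 of q."""
        mid = len(q) // 2
        found = set()
        for key in ((len(q), 0, q[:mid]), (len(q), 1, q[mid:])):
            # Candidates share one exact part; verify the full word.
            for w in self.buckets.get(key, ()):
                if hamming(w, q) <= 1:
                    found.add(w)
        return sorted(found)


index = SplitIndex(["cat", "cot", "dog", "cart"])
print(index.query("cat"))
print(index.query("dig"))
```

Only words of the same length as the query are considered, which matches the Hamming distance setting. Supporting k mismatches in the same style would require splitting into k + 1 parts.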

arXiv.org e-Print Archive

Computing and Informatics (E-Journal - Institute of Informatics, SAS, Bratislava)

String Indexing for Patterns with Wildcards

Authors: A. Tam, B. Chazelle, D. Harel, D. Tsur, G. Chen, G. Landau, G. Navarro, H.L. Chan, K. Hofmann, L.P. Coelho, M. Lewenstein, M. Maas, M.L. Fredman, P. Bille, P. Clifford, T.-W. Lam, Z. Galil
Publication date: 01/01/2012

We consider the problem of indexing a string $t$ of length $n$ to report the occurrences of a query pattern $p$ containing $m$ characters and $j$ wildcards. Let $occ$ be the number of occurrences of $p$ in $t$, and $\sigma$ the size of the alphabet. We obtain the following results.
- A linear space index with query time $O(m+\sigma^j \log \log n + occ)$. This significantly improves the previously best known linear space index by Lam et al. [ISAAC 2007], which requires query time $\Theta(jn)$ in the worst case.
- An index with query time $O(m+j+occ)$ using space $O(\sigma^{k^2} n \log^k \log n)$, where $k$ is the maximum number of wildcards allowed in the pattern. This is the first non-trivial bound with this query time.
- A time-space trade-off, generalizing the index by Cole et al. [STOC 2004].
We also show that these indexes can be generalized to allow variable length gaps in the pattern. Our results are obtained using a novel combination of well-known and new techniques, which could be of independent interest.
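To make the problem statement concrete, the following brute-force baseline reports the occurrences of a pattern with wildcards by scanning the text directly. It is only a sketch of the query being answered, not any of the indexes from the abstract: the function name `find_with_wildcards` and the use of `*` as the wildcard symbol are assumptions made here, and the running time is proportional to the text length times the number of fixed characters rather than the index bounds quoted above.

```python
def find_with_wildcards(t, p, wildcard="*"):
    """Return start positions in t where p matches.

    A wildcard position in p matches any single character of t.
    """
    n, length = len(t), len(p)
    # Only the non-wildcard positions constrain a match.
    fixed = [(i, c) for i, c in enumerate(p) if c != wildcard]
    return [
        s
        for s in range(n - length + 1)
        if all(t[s + i] == c for i, c in fixed)
    ]


print(find_with_wildcards("abracadabra", "a*a"))
print(find_with_wildcards("abracadabra", "abra"))
```

Here the pattern `a*a` has two characters and one wildcard, and the number of returned positions corresponds to $occ$.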

arXiv.org e-Print Archive

Crossref

Online Research Database In Technology

Upper and lower bounds for dynamic data structures on strings

Authors: Clifford Raphael, Grønlund Allan, Larsen Kasper Green, Starikovskaya Tatiana
Publication date: 01/01/2018

We consider a range of simply stated dynamic data structure problems on strings. An update changes one symbol in the input and a query asks us to compute some function of the pattern of length $m$ and a substring of a longer text. We give both conditional and unconditional lower bounds for variants of exact matching with wildcards, inner product, and Hamming distance computation via a sequence of reductions. As an example, we show that there does not exist an $O(m^{1/2-\varepsilon})$ time algorithm for a large range of these problems unless the online Boolean matrix-vector multiplication conjecture is false. We also provide nearly matching upper bounds for most of the problems we consider. Comment: Accepted at STACS'1
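The update and query model described here can be illustrated with a naive baseline for the Hamming distance variant. This is an assumption-laden sketch, not one of the paper's data structures: the class name `DynamicHamming` and its method names are hypothetical, updates take constant time, and each query simply compares the pattern against the chosen text window in time linear in the pattern length.

```python
class DynamicHamming:
    """Naive dynamic structure: O(1) updates, O(m) queries (illustrative only)."""

    def __init__(self, pattern, text):
        self.pattern = list(pattern)
        self.text = list(text)

    def update_pattern(self, i, symbol):
        """Change one symbol of the pattern."""
        self.pattern[i] = symbol

    def update_text(self, i, symbol):
        """Change one symbol of the text."""
        self.text[i] = symbol

    def query(self, start):
        """Hamming distance between the pattern and text[start:start+m]."""
        m = len(self.pattern)
        window = self.text[start:start + m]
        if start < 0 or len(window) != m:
            raise ValueError("window does not fit inside the text")
        return sum(a != b for a, b in zip(self.pattern, window))


d = DynamicHamming("abc", "xxabcxx")
print(d.query(2))
d.update_text(3, "z")
print(d.query(2))
```

The lower bounds in the abstract concern how far this kind of trade-off between update time and query time can be improved.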

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Explore Bristol Research

siEDM: an efficient string index and search algorithm for edit distance with moves

Authors: Kuboyama Tetsuji, Nakashima Kenta, Sakamoto Hiroshi, Tabei Yasuo, Takabatake Yoshimasa
Publication date: 01/04/2016

Although several self-indexes for highly repetitive text collections exist, developing an index and search algorithm with editing operations remains a challenge. Edit distance with moves (EDM) is a string-to-string distance measure that includes substring moves in addition to ordinal editing operations to turn one string into another. Although the problem of computing EDM is intractable, it has a wide range of potential applications, especially in approximate string retrieval. Despite the importance of computing EDM, there has been no efficient method for indexing and searching large text collections based on the EDM measure. We propose the first algorithm, named string index for edit distance with moves (siEDM), for indexing and searching strings with EDM. The siEDM algorithm builds an index structure by leveraging the idea behind edit sensitive parsing (ESP), an efficient algorithm that computes EDM approximately, with guaranteed upper and lower bounds on the exact EDM. siEDM efficiently prunes the search space for query strings with the proposed method, which enables fast query searches with the same guarantee as ESP. We experimentally tested the ability of siEDM to index and search strings on benchmark datasets, and we showed siEDM's efficiency. Comment: 23 page

arXiv.org e-Print Archive

Directory of Open Access Journals