Search CORE

8 research outputs found

A practical index for approximate dictionary matching with few mismatches

Author: Cisłak Aleksander
Grabowski Szymon
Publication venue
Publication date: 11/02/2016
Field of study

Approximate dictionary matching is a classic string matching problem (checking if a query string occurs in a collection of strings) with applications in, e.g., spellchecking, online catalogs, geolocation, and web searchers. We present a surprisingly simple solution called a split index, which is based on the Dirichlet principle, for matching a keyword with few mismatches, and experimentally show that it offers competitive space-time tradeoffs. Our implementation in the C++ language is focused mostly on data compaction, which is beneficial for the search speed (e.g., by being cache friendly). We compare our solution with other algorithms and we show that it performs better for the Hamming distance. Query times in the order of 1 microsecond were reported for one mismatch for the dictionary size of a few megabytes on a medium-end PC. We also demonstrate that a basic compression technique consisting in

q

-gram substitution can significantly reduce the index size (up to 50% of the input text size for the DNA), while still keeping the query time relatively low

arXiv.org e-Print Archive

Crossref

Computing and Informatics (E-Journal - Institute of Informatics, SAS, Bratislava)

Simple, compact and robust approximate string dictionary

Author: Belazzougui Djamal
Chegrane Ibrahim
Publication venue
Publication date: 22/08/2014
Field of study

This paper is concerned with practical implementations of approximate string dictionaries that allow edit errors. In this problem, we have as input a dictionary

D

d

strings of total length

n

over an alphabet of size

\sigma

. Given a bound

k

and a pattern

x

of length

m

, a query has to return all the strings of the dictionary which are at edit distance at most

k

from

x

, where the edit distance between two strings

x

and

y

is defined as the minimum-cost sequence of edit operations that transform

x

into

y

. The cost of a sequence of operations is defined as the sum of the costs of the operations involved in the sequence. In this paper, we assume that each of these operations has unit cost and consider only three operations: deletion of one character, insertion of one character and substitution of a character by another. We present a practical implementation of the data structure we recently proposed and which works only for one error. We extend the scheme to

2\leq k<m

. Our implementation has many desirable properties: it has a very fast and space-efficient building algorithm. The dictionary data structure is compact and has fast and robust query time. Finally our data structure is simple to implement as it only uses basic techniques from the literature, mainly hashing (linear probing and hash signatures) and succinct data structures (bitvectors supporting rank queries).Comment: Accepted to a journal (19 pages, 2 figures

arXiv.org e-Print Archive

CiteSeerX

Fast Entropy-Bounded String Dictionary Look-Up with Mismatches

Author: Gawrychowski Pawel
Landau Gad M.
Starikovskaya Tatiana
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 43rd International Symposium on Mathematical Foundations of Computer Science (MFCS 2018)
Publication date: 01/01/2018
Field of study

We revisit the fundamental problem of dictionary look-up with mismatches. Given a set (dictionary) of d strings of length m and an integer k, we must preprocess it into a data structure to answer the following queries: Given a query string Q of length m, find all strings in the dictionary that are at Hamming distance at most k from Q. Chan and Lewenstein (CPM 2015) showed a data structure for k = 1 with optimal query time O(m/w + occ), where w is the size of a machine word and occ is the size of the output. The data structure occupies O(w d log^{1+epsilon} d) extra bits of space (beyond the entropy-bounded space required to store the dictionary strings). In this work we give a solution with similar bounds for a much wider range of values k. Namely, we give a data structure that has O(m/w + log^k d + occ) query time and uses O(w d log^k d) extra bits of space

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Dagstuhl Research Online Publication Server

Compressed String Dictionary Search with Edit Distance One

Author: Belazzougui Djamal
Venturini Rossano
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 25/03/2015
Field of study

In this paper we present different solutions for the problem of indexing a dictionary of strings in compressed space. Given a pattern (Formula presented.) , the index has to report all the strings in the dictionary having edit distance at most one with (Formula presented.). Our first solution is able to solve queries in (almost optimal) (Formula presented.) time where (Formula presented.) is the number of strings in the dictionary having edit distance at most one with (Formula presented.). The space complexity of this solution is bounded in terms of the (Formula presented.) th order entropy of the indexed dictionary. A second solution further improves this space complexity at the cost of increasing the query time. Finally, we propose randomized solutions (Monte Carlo and Las Vegas) which achieve simultaneously the time complexity of the first solution and the space complexity of the second one

Crossref

Archivio della Ricerca - Università di Pisa

Pattern Masking for Dictionary Matching:Theory and Practice

Author: Charalampopoulos Panagiotis
Chen Huiping
Christen Peter
Loukides Grigorios
Pisanti Nadia
Pissis Solon P.
Radoszewski Jakub
Publication venue
Publication date: 01/06/2024
Field of study

Data masking is a common technique for sanitizing sensitive data maintained in database systems which is becoming increasingly important in various application areas, such as in record linkage of personal data. This work formalizes the Pattern Masking for Dictionary Matching (PMDM) problem: given a dictionary D of d strings, each of length ℓ, a query string q of length ℓ, and a positive integer z, we are asked to compute a smallest set K⊆{1, …, ℓ}, so that if q[i] is replaced by a wildcard for all i∈K, then q matches at least z strings from D. Solving PMDM allows providing data utility guarantees as opposed to existing approaches. We first show, through a reduction from the well-known k-Clique problem, that a decision version of the PMDM problem is NP-complete, even for binary strings. We thus approach the problem from a more practical perspective. We show a combinatorial O((dℓ)|K|/3+dℓ)-time and O(dℓ)-space algorithm for PMDM for |K|=O(1). In fact, we show that we cannot hope for a faster combinatorial algorithm, unless the combinatorial k-Clique hypothesis fails (Abboud et al. in SIAM J Comput 47:2527–2555, 2018; Lincoln et al., in: 29th ACM-SIAM Symposium on Discrete Algorithms (SODA), 2018). Our combinatorial algorithm, executed with small |K|, is the backbone of a greedy heuristic that we propose. Our experiments on real-world and synthetic datasets show that our heuristic finds nearly-optimal solutions in practice and is also very efficient. We also generalize this algorithm for the problem of masking multiple query strings simultaneously so that every string has at least z matches in D. PMDM can be viewed as a generalization of the decision version of the dictionary matching with mismatches problem: by querying a PMDM data structure with string q and z=1, one obtains the minimal number of mismatches of q with any string from D. The query time or space of all known data structures for the more restricted problem of dictionary matching with at most k mismatches incurs some exponential factor with respect to k. A simple exact algorithm for PMDM runs in time O(2ℓd). We present a data structure for PMDM that answers queries over D in time O(2ℓ/2(2ℓ/2+τ)ℓ) and requires space O(2ℓd2/τ2+2ℓ/2d), for any parameter τ∈[1, d]. We complement our results by showing a two-way polynomial-time reduction between PMDM and the Minimum Union problem [Chlamtáč et al., ACM-SIAM Symposium on Discrete Algorithms (SODA) 2017]. This gives a polynomial-time O(d1/4+ϵ)-approximation algorithm for PMDM, which is tight under a plausible complexity conjecture. This is an extended version of a paper that was presented at International Symposium on Algorithms and Computation (ISAAC) 2021

University of Birmingham Research Portal

Pattern masking for dictionary matching

Author: Ahn Hee-Kap
Charalampopoulos Panagiotis
Chen Huiping
Christen Peter
Loukides Grigorios
Pisanti Nadia
Pissis Solon P.
Radoszewski Jakub
Sadakane Kunihiko
Publication venue
Publication date: 29/06/2020
Field of study

Data masking is a common technique for sanitizing sensitive data maintained in database systems, and it is also becoming increasingly important in various application areas, such as in record linkage of personal data. This work formalizes the Pattern Masking for Dictionary Matching (PMDM) problem. In PMDM, we are given a dictionary of d strings, each of length , a query string q of length , and a positive integer z, and we are asked to compute a smallest set K ⊆ {1,…,}, so that if q[i] is replaced by a wildcard for all i ∈ K, then q matches at least z strings from . Solving PMDM allows providing data utility guarantees as opposed to existing approaches. We first show, through a reduction from the well-known k-Clique problem, that a decision version of the PMDM problem is NP-complete, even for strings over a binary alphabet. We thus approach the problem from a more practical perspective. We show a combinatorial ((d)^{|K|/3}+d)-time and (d)-space algorithm for PMDM for |K| = (1). In fact, we show that we cannot hope for a faster combinatorial algorithm, unless the combinatorial k-Clique hypothesis fails [Abboud et al., SIAM J. Comput. 2018; Lincoln et al., SODA 2018]. We also generalize this algorithm for the problem of masking multiple query strings simultaneously so that every string has at least z matches in . Note that PMDM can be viewed as a generalization of the decision version of the dictionary matching with mismatches problem: by querying a PMDM data structure with string q and z = 1, one obtains the minimal number of mismatches of q with any string from . The query time or space of all known data structures for the more restricted problem of dictionary matching with at most k mismatches incurs some exponential factor with respect to k. A simple exact algorithm for PMDM runs in time (2^ d). We present a data structure for PMDM that answers queries over in time (2^{/2}(2^{/2}+τ)) and requires space (2^ d²/τ²+2^{/2}d), for any parameter τ ∈ [1,d]. We complement our results by showing a two-way polynomial-time reduction between PMDM and the Minimum Union problem [Chlamtáč et al., SODA 2017]. This gives a polynomial-time (d^{1/4+ε})-approximation algorithm for PMDM, which is tight under a plausible complexity conjecture. </p

arXiv.org e-Print Archive

VU Research Portal

CWI's Institutional Repository

University of Birmingham Research Portal

Archivio della Ricerca - Università di Pisa

Dagstuhl Research Online Publication Server

King's Research Portal

43rd International Symposium on Mathematical Foundations of Computer Science: MFCS 2018, August 27-31, 2018, Liverpool, United Kingdom

Author: International Symposium on Mathematical Foundations of Computer Science <43. 2018, Liverpool>
Publication venue: Schloss Dagstuhl - Leibniz-Zentrum für Informatik GmbH, Dagstuhl Publishing
Publication date: 01/08/2018
Field of study

Digitale Bibliothek Thüringen