124 research outputs found

Pattern masking for dictionary matching

Author: Ahn Hee-Kap
Charalampopoulos Panagiotis
Chen Huiping
Christen Peter
Loukides Grigorios
Pisanti Nadia
Pissis Solon P.
Radoszewski Jakub
Sadakane Kunihiko
Publication date: 29/06/2020

Data masking is a common technique for sanitizing sensitive data maintained in database systems, and it is also becoming increasingly important in various application areas, such as in record linkage of personal data. This work formalizes the Pattern Masking for Dictionary Matching (PMDM) problem. In PMDM, we are given a dictionary 𝒟 of d strings, each of length ℓ, a query string q of length ℓ, and a positive integer z, and we are asked to compute a smallest set K ⊆ {1,…,ℓ}, so that if q[i] is replaced by a wildcard for all i ∈ K, then q matches at least z strings from 𝒟. Solving PMDM allows providing data utility guarantees as opposed to existing approaches. We first show, through a reduction from the well-known k-Clique problem, that a decision version of the PMDM problem is NP-complete, even for strings over a binary alphabet. We thus approach the problem from a more practical perspective. We show a combinatorial O((dℓ)^{|K|/3}+dℓ)-time and O(dℓ)-space algorithm for PMDM for |K| = O(1). In fact, we show that we cannot hope for a faster combinatorial algorithm, unless the combinatorial k-Clique hypothesis fails [Abboud et al., SIAM J. Comput. 2018; Lincoln et al., SODA 2018]. We also generalize this algorithm for the problem of masking multiple query strings simultaneously so that every string has at least z matches in 𝒟. Note that PMDM can be viewed as a generalization of the decision version of the dictionary matching with mismatches problem: by querying a PMDM data structure with string q and z = 1, one obtains the minimal number of mismatches of q with any string from 𝒟. The query time or space of all known data structures for the more restricted problem of dictionary matching with at most k mismatches incurs some exponential factor with respect to k. A simple exact algorithm for PMDM runs in time O(2^ℓ d). We present a data structure for PMDM that answers queries over 𝒟 in time O(2^{ℓ/2}(2^{ℓ/2}+τ)ℓ) and requires space O(2^ℓ d²/τ²+2^{ℓ/2}d), for any parameter τ ∈ [1,d].
We complement our results by showing a two-way polynomial-time reduction between PMDM and the Minimum Union problem [Chlamtáč et al., SODA 2017]. This gives a polynomial-time O(d^{1/4+ε})-approximation algorithm for PMDM, which is tight under a plausible complexity conjecture.
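
As an illustration of the problem statement in the abstract above, the following is a minimal brute-force sketch in Python. It is not the authors' algorithm or data structure: it simply tries candidate position sets K in order of increasing size and counts matching dictionary strings, in the spirit of the simple exponential-time exact approach the abstract mentions, and it makes no attempt to meet the stated time bound. The function name `pmdm_bruteforce` and the example dictionary are hypothetical and chosen only for this sketch.

```python
from itertools import combinations


def pmdm_bruteforce(dictionary, q, z):
    """Return a smallest set K of 1-based positions of q to mask so that
    the masked q matches at least z strings of `dictionary`, or None if
    no such set exists (that is, when z exceeds the dictionary size).

    Hypothetical helper for illustration only; exponential in len(q).
    """
    length = len(q)
    if any(len(s) != length for s in dictionary):
        raise ValueError("all dictionary strings must have the same length as q")

    # Try masks of size 0, 1, 2, ... so the first hit is a smallest one.
    for size in range(length + 1):
        for positions in combinations(range(length), size):
            masked = set(positions)
            # A string matches if it agrees with q at every unmasked position.
            matches = sum(
                all(i in masked or s[i] == q[i] for i in range(length))
                for s in dictionary
            )
            if matches >= z:
                return {i + 1 for i in positions}  # report 1-based positions
    return None


# Example usage with a small binary dictionary (made-up data).
example_dictionary = ["0101", "0111", "1101", "0000"]
smallest_mask = pmdm_bruteforce(example_dictionary, "0100", 2)
```

With z = 1, the size of the returned set equals the minimum number of mismatches between q and any dictionary string, which is the connection to dictionary matching with mismatches noted in the abstract.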

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

King's Research Portal

Statistical physics of subgraph identification problem

Author: Bradde Serena
Publication venue: Trieste
Publication date: 30/07/2010

Discovery of Unconventional Patterns for Sequence Analysis: Theory and Algorithms

Author: Battaglia Giovanni
Publication venue: Pisa University Press
Publication date: 19/12/2011

The biology community is collecting a large amount of raw data, such as the genome sequences of organisms, microarray data, and interaction data such as gene-protein and protein-protein interactions. This amount is rapidly increasing, and the process of understanding the data is lagging behind the process of acquiring it. An inevitable first step towards making sense of the data is to study their regularities, focusing on the non-random structures that appear surprisingly often in the input sequences: patterns. In this thesis we discuss three incarnations of the pattern discovery task, exploring three types of patterns that can model different regularities of the input dataset. While mask patterns have been designed to model short repeated biological sequences, showing a high conservation of their content at some specific positions, permutation patterns have been designed to detect repeated patterns whose parts maintain their physical adjacency but not their ordering in all the pattern occurrences. Transposons, instead, model mobile sequences in the input dataset, which can be discovered by comparing different copies of the same input string and detecting large insertions and deletions in their alignment.