Search CORE

13 research outputs found

String Sanitization Under Edit Distance

Author: Bernardini Giulia
Chen Huiping
Gortz Inge Li
Loukides Grigorios
Pisanti Nadia
Pissis Solon P.
Stougie Leen
Sweering Michelle
Weimann Oren
Publication date: 01/01/2020

Let W be a string of length n over an alphabet Σ, k be a positive integer, and S be a set of length-k substrings of W. The ETFS problem asks us to construct a string X_{ED} such that: (i) no string of S occurs in X_{ED}; (ii) the order of all other length-k substrings over Σ is the same in W and in X_{ED}; and (iii) X_{ED} has minimal edit distance to W. When W represents an individual's data and S represents a set of confidential substrings, algorithms solving ETFS can be applied for utility-preserving string sanitization [Bernardini et al., ECML PKDD 2019]. Our first result here is an algorithm to solve ETFS in O(kn²) time, which improves on the state of the art [Bernardini et al., arXiv 2019] by a factor of |Σ|. Our algorithm is based on a non-trivial modification of the classic dynamic programming algorithm for computing the edit distance between two strings. Notably, we also show that ETFS cannot be solved in O(n^{2-δ}) time, for any δ>0, unless the strong exponential time hypothesis is false. To achieve this, we reduce the edit distance problem, which is known to admit the same conditional lower bound [Bringmann and Künnemann, FOCS 2015], to ETFS.
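For orientation, the classic dynamic programming algorithm that the abstract refers to can be sketched as follows. This is only the textbook unit-cost edit distance between two strings, not the ETFS algorithm itself; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Textbook unit-cost (Levenshtein) edit distance, O(len(a) * len(b)) time."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]
```

Keeping only two rows of the table is enough when just the distance is needed; recovering an optimal edit script would require the full table or a traceback structure.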

Archivio istituzionale della ricerca - Università di Trieste

VU Research Portal

CWI's Institutional Repository

University of Birmingham Research Portal

INRIA a CCSD electronic archive server

Dagstuhl Research Online Publication Server

Combinatorial Algorithms for String Sanitization

Author: Bernardini Giulia
Chen Huiping
Conte Alessio
Grossi Roberto
Loukides Grigorios
Pisanti Nadia
Pissis Solon P.
Rosone Giovanna
Sweering Michelle
Publication date: 28/12/2019

String data are often disseminated to support applications such as location-based service provision or DNA sequence analysis. This dissemination, however, may expose sensitive patterns that model confidential knowledge. In this paper, we consider the problem of sanitizing a string by concealing the occurrences of sensitive patterns, while maintaining data utility, in two settings that are relevant to many common string processing tasks. In the first setting, we aim to generate the minimal-length string that preserves the order of appearance and frequency of all non-sensitive patterns. Such a string allows accurately performing tasks based on the sequential nature and pattern frequencies of the string. To construct such a string, we propose a time-optimal algorithm, TFS-ALGO. We also propose another time-optimal algorithm, PFS-ALGO, which preserves a partial order of appearance of non-sensitive patterns but produces a much shorter string that can be analyzed more efficiently. The strings produced by either of these algorithms are constructed by concatenating non-sensitive parts of the input string. However, it is possible to detect the sensitive patterns by "reversing" the concatenation operations. In response, we propose a heuristic, MCSR-ALGO, which replaces letters in the strings output by the algorithms with carefully selected letters, so that sensitive patterns are not reinstated, implausible patterns are not introduced, and occurrences of spurious patterns are prevented. In the second setting, we aim to generate a string that is at minimal edit distance from the original string, in addition to preserving the order of appearance and frequency of all non-sensitive patterns. To construct such a string, we propose an algorithm, ETFS-ALGO, based on solving specific instances of approximate regular expression matching. Comment: Extended version of a paper accepted to ECML/PKDD 201
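The idea of concatenating non-sensitive parts of the input can be illustrated with a deliberately naive sketch. It is not TFS-ALGO or PFS-ALGO and makes no attempt at minimal length; it assumes that all sensitive patterns have the same length k, that the input has length at least k, and that a separator letter (here `#`, a hypothetical choice) does not belong to the alphabet.

```python
def naive_sanitize(w: str, k: int, sensitive: set, sep: str = "#") -> str:
    """Join maximal fragments of w whose length-k substrings are all non-sensitive.

    Naive baseline: the order and frequency of non-sensitive length-k
    substrings are kept, but the output is generally not the shortest possible.
    """
    fragments = []
    start = None  # start index of the current run of non-sensitive k-mers
    for i in range(len(w) - k + 1):
        if w[i:i + k] in sensitive:
            if start is not None:
                # The run ends with the k-mer starting at i - 1.
                fragments.append(w[start:i + k - 1])
                start = None
        elif start is None:
            start = i
    if start is not None:
        fragments.append(w[start:])
    return sep.join(fragments)
```

For example, with w = "abcab", k = 2 and the sensitive pattern "bc", the sketch yields "ab#cab": the non-sensitive substrings "ab", "ca", "ab" still appear in their original order, and "bc" no longer occurs.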

arXiv.org e-Print Archive

CWI's Institutional Repository

University of Birmingham Research Portal

INRIA a CCSD electronic archive server

Archivio della Ricerca - Università di Pisa

String Sanitization: A Combinatorial Approach

Author: B Cazaux
CC Aggarwal
D Pissinger
J Gallant
M Crochemore
O Abul
R Grossi
SP Pissis
VS Verykios
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 30/04/2020

Crossref

CWI's Institutional Repository

Constructing Strings Avoiding Forbidden Substrings

Author: Bernardini Giulia
Marchetti-Spaccamela Alberto
Pissis Solon P.
Stougie Leen
Sweering Michelle
Publication venue: HAL CCSD
Publication date: 05/07/2021

We consider the problem of constructing strings over an alphabet Σ that start with a given prefix u, end with a given suffix v, and avoid occurrences of a given set of forbidden substrings. In the decision version of the problem, given a set S_k of forbidden substrings, each of length k, over Σ, we are asked to decide whether there exists a string x over Σ such that u is a prefix of x, v is a suffix of x, and no s ∈ S_k occurs in x. Our first result is an O(|u| + |v| + k|S_k|)-time algorithm to decide this problem. In the more general optimization version of the problem, given a set S of forbidden arbitrary-length substrings over Σ, we are asked to construct a shortest string x over Σ such that u is a prefix of x, v is a suffix of x, and no s ∈ S occurs in x. Our second result is an O(|u| + |v| + ||S|| · |Σ|)-time algorithm to solve this problem, where ||S|| denotes the total length of the elements of S. Interestingly, our results can be directly applied to solve the reachability and shortest path problems in complete de Bruijn graphs in the presence of forbidden edges or of forbidden paths. Our algorithms are motivated by data privacy, and in particular, by the data sanitization process. In the context of strings, sanitization consists in hiding forbidden substrings from a given string by introducing the least amount of spurious information. We consider the following problem. Given a string w of length n over Σ, an integer k, and a set S_k of forbidden substrings, each of length k, over Σ, construct a shortest string y over Σ such that no s ∈ S_k occurs in y and the sequence of all other length-k fragments occurring in w is a subsequence of the sequence of the length-k fragments occurring in y. Our third result is an O(nk|S_k| · |Σ|)-time algorithm to solve this problem.
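To make the problem statement concrete, here is a brute-force sketch for the case where all forbidden substrings have length k. It runs a breadth-first search over string suffixes, in the spirit of walking a de Bruijn graph with forbidden edges removed. It is exponential in max(k - 1, |v|) and is not the linear-time algorithm from the abstract; the function name and parameters are illustrative assumptions.

```python
from collections import deque
from typing import Optional, Set


def shortest_avoiding(u: str, v: str, forbidden: Set[str], k: int,
                      alphabet: str) -> Optional[str]:
    """Shortest x with prefix u and suffix v containing no forbidden length-k substring.

    Brute-force BFS; returns None when no such string exists.
    """
    # The prefix u must itself be free of forbidden substrings.
    if any(u[i:i + k] in forbidden for i in range(len(u) - k + 1)):
        return None
    # The last m letters decide both future forbidden occurrences and the suffix test.
    m = max(k - 1, len(v))

    def state_of(s: str) -> str:
        return s[max(0, len(s) - m):]

    start = state_of(u)
    seen = {start}
    queue = deque([(start, u)])
    while queue:
        state, full = queue.popleft()
        if full.endswith(v):
            return full
        for c in alphabet:
            # Appending c creates exactly one new length-k substring (if long enough).
            if len(state) >= k - 1 and state[len(state) - (k - 1):] + c in forbidden:
                continue
            nxt = state_of(state + c)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, full + c))
    return None
```

With alphabet "ab", k = 2 and forbidden substrings "aa" and "bb", calling `shortest_avoiding("ab", "ba", {"aa", "bb"}, 2, "ab")` returns "aba", while asking for the suffix "bb" returns `None` because the suffix itself is forbidden.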

INRIA a CCSD electronic archive server

Constructing strings avoiding forbidden substrings

Author: Bernardini G. (Giulia)
Marchetti Spaccamela A. (Alberto)
Pissis S. (Solon)
Stougie L. (Leen)
Sweering M.J.M. (Michelle)
Publication date: 01/01/2021

We consider the problem of constructing strings over an alphabet Σ that start with a given prefix u, end with a given suffix v, and avoid occurrences of a given set of forbidden substrings. In the decision version of the problem, given a set S_k of forbidden substrings, each of length k, over Σ, we are asked to decide whether there exists a string x over Σ such that u is a prefix of x, v is a suffix of x, and no s ∈ S_k occurs in x. Our first result is an O(|u| + |v| + k|S_k|)-time algorithm to decide this problem. In the more general optimization version of the problem, given a set S of forbidden arbitrary-length substrings over Σ, we are asked to construct a shortest string x over Σ such that u is a prefix of x, v is a suffix of x, and no s ∈ S occurs in x. Our second result is an O(|u| + |v| + ||S|| · |Σ|)-time algorithm to solve this problem, where ||S|| denotes the total length of the elements of S. Interestingly, our results can be directly applied to solve the reachability and shortest path problems in complete de Bruijn graphs in the presence of forbidden edges or of forbidden paths. Our algorithms are motivated by data privacy, and in particular, by the data sanitization process. In the context of strings, sanitization consists in hiding forbidden substrings from a given string by introducing the least amount of spurious information. We consider the following problem. Given a string w of length n over Σ, an integer k, and a set S_k of forbidden substrings, each of length k, over Σ, construct a shortest string y over Σ such that no s ∈ S_k occurs in y and the sequence of all other length-k fragments occurring in w is a subsequence of the sequence of the length-k fragments occurring in y. Our third result is an O(nk|S_k| · |Σ|)-time algorithm to solve this problem.
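The two conditions of the sanitization problem stated at the end of the abstract can be checked directly. The sketch below only verifies that a candidate string y is feasible; it does not construct y and does not test that y is shortest. All function names are illustrative assumptions.

```python
def kmers(s: str, k: int) -> list:
    """All length-k fragments of s, in order of occurrence."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def is_subsequence(needle: list, haystack: list) -> bool:
    """True if needle can be obtained from haystack by deleting elements."""
    it = iter(haystack)
    # `x in it` advances the iterator, so matches must occur left to right.
    return all(x in it for x in needle)


def is_feasible_sanitization(w: str, y: str, k: int, forbidden: set) -> bool:
    """Check both conditions on y (feasibility only, not minimal length)."""
    y_kmers = kmers(y, k)
    # Condition 1: no forbidden substring occurs in y.
    if any(f in forbidden for f in y_kmers):
        return False
    # Condition 2: the non-forbidden fragments of w appear, in order, in y.
    kept = [f for f in kmers(w, k) if f not in forbidden]
    return is_subsequence(kept, y_kmers)
```

For w = "abcab", k = 2 and the forbidden substring "bc", the candidate y = "abacab" is feasible, whereas y = "abab" is not because the fragment "ca" is lost.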

VU Research Portal

CWI's Institutional Repository

INRIA a CCSD electronic archive server

Dagstuhl Research Online Publication Server

String Sanitization Under Edit Distance: Improved and Generalized

Author: Mieno Takuya
Pissis Solon
Stougie Leen
Sweering Michelle
Publication venue: HAL CCSD
Publication date: 05/07/2021

Let W be a string of length n over an alphabet Σ, k be a positive integer, and S be a set of length-k substrings of W. The ETFS problem asks us to construct a string X_{ED} such that: (i) no string of S occurs in X_{ED}; (ii) the order of all other length-k substrings over Σ is the same in W and in X_{ED}; and (iii) X_{ED} has minimal edit distance to W. When W represents an individual's data and S represents a set of confidential patterns, the ETFS problem asks for transforming W to preserve its privacy and its utility [Bernardini et al., ECML PKDD 2019]. ETFS can be solved in O(n²k) time [Bernardini et al., CPM 2020]. The same paper shows that ETFS cannot be solved in O(n^{2-δ}) time, for any δ > 0, unless the Strong Exponential Time Hypothesis (SETH) is false. Our main results can be summarized as follows: (i) an O(n² log² k)-time algorithm to solve ETFS; and (ii) an O(n² log² n)-time algorithm to solve AETFS, a generalization of ETFS in which the elements of S can have arbitrary lengths.

INRIA a CCSD electronic archive server

String Sanitization Under Edit Distance: Improved and Generalized

Author: Mieno T. (Takuya)
Pissis S. (Solon)
Stougie L. (Leen)
Sweering M.J.M. (Michelle)
Publication date: 01/01/2020

Let W be a string of length n over an alphabet Σ, k be a positive integer, and S be a set of length-k substrings of W. The ETFS problem asks us to construct a string X_{ED} such that: (i) no string of S occurs in X_{ED}; (ii) the order of all other length-k substrings over Σ is the same in W and in X_{ED}; and (iii) X_{ED} has minimal edit distance to W. When W represents an individual's data and S represents a set of confidential patterns, the ETFS problem asks for transforming W to preserve its privacy and its utility [Bernardini et al., ECML PKDD 2019]. ETFS can be solved in O(n²k) time [Bernardini et al., CPM 2020]. The same paper shows that ETFS cannot be solved in O(n^{2-δ}) time, for any δ>0, unless the Strong Exponential Time Hypothesis (SETH) is false. Our main results can be summarized as follows: (i) an O(n² log² k)-time algorithm to solve ETFS; and (ii) an O(n² log² n)-time algorithm to solve AETFS, a generalization of ETFS in which the elements of S can have arbitrary lengths. Our algorithms are thus optimal up to polylogarithmic factors, unless SETH fails. Let us also stress that our algorithms work under edit distance with arbitrary weights at no extra cost. As a bonus, we show how to modify some known techniques, which speed up the standard edit distance computation, to be applied to our problems. Beyond string sanitization, our techniques may inspire solutions to other problems related to regular expressions or context-free grammars.
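As a point of reference for "edit distance with arbitrary weights", the standard dynamic program extends to per-letter costs as sketched below. This is the generic weighted edit distance between two strings, not the ETFS or AETFS algorithm; the cost functions `ins_cost`, `del_cost` and `sub_cost` are hypothetical parameters, and matching letters are assumed to cost nothing.

```python
from typing import Callable


def weighted_edit_distance(
    a: str,
    b: str,
    ins_cost: Callable[[str], float],
    del_cost: Callable[[str], float],
    sub_cost: Callable[[str, str], float],
) -> float:
    """Weighted edit distance between a and b in O(len(a) * len(b)) time."""
    # First row: build each prefix of b from the empty string by insertions.
    prev = [0.0]
    for cb in b:
        prev.append(prev[-1] + ins_cost(cb))
    for ca in a:
        curr = [prev[0] + del_cost(ca)]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + del_cost(ca),      # delete ca
                curr[j - 1] + ins_cost(cb),  # insert cb
                # Assumption: equal letters are matched at zero cost.
                prev[j - 1] + (0.0 if ca == cb else sub_cost(ca, cb)),
            ))
        prev = curr
    return prev[-1]
```

With unit costs this reduces to the ordinary edit distance. If a substitution costs more than a deletion plus an insertion, the recurrence picks the cheaper pair automatically.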

arXiv.org e-Print Archive

VU Research Portal

CWI's Institutional Repository

INRIA a CCSD electronic archive server

HAL Descartes

Dagstuhl Research Online Publication Server

Hal-Diderot

String Sanitization Under Edit Distance: Improved and Generalized

Author: Mieno T. (Takuya)
Pissis S. (Solon)
Stougie L. (Leen)
Sweering M.J.M. (Michelle)
Publication date: 01/01/2020

CWI's Institutional Repository

Location histogram privacy by sensitive location hiding and target histogram avoidance/resemblance

Author: Loukides Grigorios
Theodorakopoulos Georgios
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 29/11/2019

A location histogram is comprised of the number of times a user has visited locations as they move in an area of interest, and it is often obtained from the user in the context of applications such as recommendation and advertising. However, a location histogram that leaves a user's computer or device may threaten privacy when it contains visits to locations that the user does not want to disclose (sensitive locations), or when it can be used to profile the user in a way that leads to price discrimination and unsolicited advertising (e.g. as 'wealthy' or 'minority member'). Our work introduces two privacy notions to protect a location histogram from these threats: sensitive location hiding, which aims at concealing all visits to sensitive locations, and target avoidance/resemblance, which aims at concealing the similarity/dissimilarity of the user's histogram to a target histogram that corresponds to an undesired/desired profile. We formulate an optimization problem around each notion: Sensitive Location Hiding (SLH), which seeks to construct a histogram that is as similar as possible to the user's histogram but associates all visits with nonsensitive locations, and Target Avoidance/Resemblance (TA/TR), which seeks to construct a histogram that is as dissimilar/similar as possible to a given target histogram but remains useful for getting a good response from the application that analyzes the histogram. We develop an optimal algorithm for each notion, which operates on a notion-specific search space graph and finds a shortest or longest path in the graph that corresponds to a solution histogram. In addition, we develop a greedy heuristic for the TA/TR problem, which operates directly on a user's histogram. Our experiments demonstrate that all algorithms are effective at preserving the distribution of locations in a histogram and the quality of location recommendation. They also demonstrate that the heuristic produces near-optimal solutions while being orders of magnitude faster than the optimal algorithm for TA/TR.

arXiv.org e-Print Archive

Online Research @ Cardiff

King's Research Portal