String Sanitization Under Edit Distance
Let W be a string of length n over an alphabet Σ, k be a positive integer, and S be a set of length-k substrings of W. The ETFS problem asks us to construct a string X_{ED} such that: (i) no string of S occurs in X_{ED}; (ii) the order of all other length-k substrings over Σ is the same in W and in X_{ED}; and (iii) X_{ED} has minimal edit distance to W. When W represents an individual's data and S represents a set of confidential substrings, algorithms solving ETFS can be applied for utility-preserving string sanitization [Bernardini et al., ECML PKDD 2019]. Our first result here is an algorithm to solve ETFS in O(kn²) time, which improves on the state of the art [Bernardini et al., arXiv 2019] by a factor of |Σ|. Our algorithm is based on a non-trivial modification of the classic dynamic programming algorithm for computing the edit distance between two strings. Notably, we also show that ETFS cannot be solved in O(n^{2-δ}) time, for any δ>0, unless the strong exponential time hypothesis is false. To achieve this, we reduce the edit distance problem, which is known to admit the same conditional lower bound [Bringmann and Künnemann, FOCS 2015], to ETFS.
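For orientation, the classic dynamic program mentioned in this abstract can be sketched as follows. This is only the textbook unit-cost edit distance between two strings, not the authors' modified algorithm for ETFS; the function name `edit_distance` is chosen here purely for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Textbook O(|a| * |b|) dynamic program for unit-cost edit distance."""
    # prev[j] is the distance between the already processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]
```

Only two rows of the table are kept, so the sketch uses O(|b|) space; recovering an actual edit script would require the full table or a divide-and-conquer variant.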
String Sanitization Under Edit Distance: Improved and Generalized
Let W be a string of length n over an alphabet Σ, k be a positive integer, and S be a set of length-k substrings of W. The ETFS problem asks us to construct a string X_{ED} such that: (i) no string of S occurs in X_{ED}; (ii) the order of all other length-k substrings over Σ is the same in W and in X_{ED}; and (iii) X_{ED} has minimal edit distance to W. When W represents an individual's data and S represents a set of confidential patterns, the ETFS problem asks for transforming W to preserve its privacy and its utility [Bernardini et al., ECML PKDD 2019].
ETFS can be solved in O(n²k) time [Bernardini et al., CPM 2020]. The same paper shows that ETFS cannot be solved in O(n^{2-δ}) time, for any δ>0, unless the Strong Exponential Time Hypothesis (SETH) is false. Our main results can be summarized as follows: (i) an O(n² log² k)-time algorithm to solve ETFS; and (ii) an O(n² log² n)-time algorithm to solve AETFS, a generalization of ETFS in which the elements of S can have arbitrary lengths. Our algorithms are thus optimal up to polylogarithmic factors, unless SETH fails. Let us also stress that our algorithms work under edit distance with arbitrary weights at no extra cost. As a bonus, we show how to modify some known techniques, which speed up the standard edit distance computation, to be applied to our problems. Beyond string sanitization, our techniques may inspire solutions to other problems related to regular expressions or context-free grammars.
String Sanitization Under Edit Distance: Improved and Generalized
Let W be a string of length n over an alphabet Σ, k be a positive integer, and S be a set of length-k substrings of W. The ETFS problem asks us to construct a string X_{ED} such that: (i) no string of S occurs in X_{ED}; (ii) the order of all other length-k substrings over Σ is the same in W and in X_{ED}; and (iii) X_{ED} has minimal edit distance to W. When W represents an individual's data and S represents a set of confidential patterns, the ETFS problem asks for transforming W to preserve its privacy and its utility [Bernardini et al., ECML PKDD 2019]. ETFS can be solved in O(n²k) time [Bernardini et al., CPM 2020]. The same paper shows that ETFS cannot be solved in O(n^{2-δ}) time, for any δ>0, unless the Strong Exponential Time Hypothesis (SETH) is false. Our main results can be summarized as follows: (i) an O(n² log² k)-time algorithm to solve ETFS; and (ii) an O(n² log² n)-time algorithm to solve AETFS, a generalization of ETFS in which the elements of S can have arbitrary lengths.
Reverse-Safe Data Structures for Text Indexing
We introduce the notion of reverse-safe data structures. These are data structures that prevent the reconstruction of the data they encode (i.e., they cannot be easily reversed). A data structure D is called z-reverse-safe when there exist at least z datasets with the same set of answers as the ones stored by D. The main challenge is to ensure that D stores as many answers to useful queries as possible, is constructed efficiently, and has size close to the size of the original dataset it encodes. Given a text of length n and an integer z, we propose an algorithm which constructs a z-reverse-safe data structure that has size O(n) and answers pattern matching queries of length at most d optimally, where d is maximal for any such z-reverse-safe data structure. The construction algorithm takes O(n^ω log d) time, where ω is the matrix multiplication exponent. We show that, despite the n^ω factor, our engineered implementation takes only a few minutes to finish for million-letter texts. We further show that plugging our method in data analysis applications gives insignificant or no data utility loss. Finally, we show how our technique can be extended to support applications under a realistic adversary model.
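The definition of z-reverse-safety can be made concrete with a brute-force sketch. It assumes, which the abstract does not state, that an "answer" is the occurrence count of a pattern and that the stored queries are all patterns of length 1 to d. The helper names `answers`, `consistent_texts` and `largest_safe_d` are hypothetical, and the enumeration is exponential, so this has nothing in common with the O(n^ω log d) construction beyond the definition it checks.

```python
from collections import Counter
from itertools import product


def answers(text: str, d: int) -> Counter:
    """Occurrence counts of every substring of length 1..d (assumed query model)."""
    return Counter(
        text[i:i + m]
        for m in range(1, d + 1)
        for i in range(len(text) - m + 1)
    )


def consistent_texts(text: str, d: int) -> int:
    """Count the texts that yield exactly the same answers as `text`.

    For d >= 1 the single-letter counts already fix the length and the
    letters used, so enumerating same-length strings over text's own
    alphabet is exhaustive. Exponential time: tiny inputs only.
    """
    target = answers(text, d)
    alphabet = sorted(set(text))
    return sum(
        answers("".join(cand), d) == target
        for cand in product(alphabet, repeat=len(text))
    )


def largest_safe_d(text: str, z: int) -> int:
    """Largest d for which at least z texts share the answers (0 if none)."""
    d = 0
    while d < len(text) and consistent_texts(text, d + 1) >= z:
        d += 1
    return d
```

For the text `aab`, the letter counts alone are shared by three texts (`aab`, `aba`, `baa`), while adding the length-2 counts leaves only `aab` itself, so under these assumptions the largest safe d for z = 2 is 1.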
Combinatorial Algorithms for String Sanitization
String data are often disseminated to support applications such as
location-based service provision or DNA sequence analysis. This dissemination,
however, may expose sensitive patterns that model confidential knowledge. In
this paper, we consider the problem of sanitizing a string by concealing the
occurrences of sensitive patterns, while maintaining data utility, in two
settings that are relevant to many common string processing tasks.
In the first setting, we aim to generate the minimal-length string that
preserves the order of appearance and frequency of all non-sensitive patterns.
Such a string allows accurately performing tasks based on the sequential nature
and pattern frequencies of the string. To construct such a string, we propose a
time-optimal algorithm, TFS-ALGO. We also propose another time-optimal
algorithm, PFS-ALGO, which preserves a partial order of appearance of
non-sensitive patterns but produces a much shorter string that can be analyzed
more efficiently. The strings produced by either of these algorithms are
constructed by concatenating non-sensitive parts of the input string. However,
it is possible to detect the sensitive patterns by "reversing" the
concatenation operations. In response, we propose a heuristic, MCSR-ALGO, which
replaces letters in the strings output by the algorithms with carefully
selected letters, so that sensitive patterns are not reinstated, implausible
patterns are not introduced, and occurrences of spurious patterns are
prevented. In the second setting, we aim to generate a string that is at
minimal edit distance from the original string, in addition to preserving the
order of appearance and frequency of all non-sensitive patterns. To construct
such a string, we propose an algorithm, ETFS-ALGO, based on solving specific
instances of approximate regular expression matching. Comment: Extended version of a paper accepted to ECML/PKDD 201
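The first setting can be illustrated with a small checker for candidate outputs. It is a sketch under assumptions that are not in this abstract: patterns are taken to be length-k substrings (as in the ETFS abstracts in this listing), and the output may contain a separator letter `#` that lies outside the alphabet. The names `k_substrings` and `is_valid_sanitization` are hypothetical; this only verifies a candidate and is not TFS-ALGO itself.

```python
def k_substrings(s: str, k: int) -> list[str]:
    """All length-k substrings of s, in order of appearance (with repeats)."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def is_valid_sanitization(w: str, x: str, k: int, sensitive: set[str],
                          sep: str = "#") -> bool:
    """Check that x hides the sensitive patterns of w and keeps all others.

    The non-sensitive length-k substrings of w must appear in x in the same
    order and with the same frequency; substrings of x that contain the
    assumed separator are ignored. Since the kept sequence contains no
    sensitive pattern, equality also rules out sensitive patterns in x.
    """
    keep = [p for p in k_substrings(w, k) if p not in sensitive]
    seen = [p for p in k_substrings(x, k) if sep not in p]
    return seen == keep
```

With w = `aabab`, k = 3 and the sensitive pattern `aba`, the candidate `aab#bab` passes the check, whereas `aabab` itself fails because `aba` still occurs.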
Matching Patterns with Variables Under Edit Distance
A pattern α is a string of variables and terminal letters. We say that α
matches a word w, consisting only of terminal letters, if w can be
obtained by replacing the variables of α by terminal words. The matching
problem, i.e., deciding whether a given pattern matches a given word, was
heavily investigated: it is NP-complete in general, but can be solved
efficiently for classes of patterns with restricted structure. If we are
interested in what is the minimum Hamming distance between w and any word u
obtained by replacing the variables of α by terminal words (so matching
under Hamming distance), one can devise efficient algorithms and matching
conditional lower bounds for the class of regular patterns (in which no
variable occurs twice), as well as for classes of patterns where we allow
unbounded repetitions of variables, but restrict the structure of the pattern,
i.e., the way the occurrences of different variables can be interleaved.
Moreover, under Hamming distance, if a variable occurs more than once and its
occurrences can be interleaved arbitrarily with those of other variables, even
if each of these occurs just once, the matching problem is intractable. In this
paper, we consider the problem of matching patterns with variables under edit
distance. We still obtain efficient algorithms and matching conditional lower
bounds for the class of regular patterns, but show that the problem becomes, in
this case, intractable already for unary patterns, consisting of repeated
occurrences of a single variable interleaved with terminals.
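The exact (distance-zero) matching problem described at the start of this abstract can be sketched with backreferences in Python's `re` module. The encoding is an assumption made for illustration: uppercase letters are variables, every other character is a terminal, and variables are replaced by non-empty words. The backtracking search may take exponential time, in line with the NP-completeness of the general problem; the sketch does not cover matching under Hamming or edit distance.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a pattern with variables into a regex with backreferences."""
    seen = set()
    parts = []
    for tok in pattern:
        if tok.isupper():                      # assumed encoding of a variable
            if tok in seen:
                parts.append(f"(?P={tok})")    # repeated variable: same word again
            else:
                seen.add(tok)
                parts.append(f"(?P<{tok}>.+)")  # first occurrence: any non-empty word
        else:
            parts.append(re.escape(tok))       # terminal letter, matched literally
    return re.compile("".join(parts), re.DOTALL)


def matches(pattern: str, word: str) -> bool:
    """True if the variables of pattern can be replaced so that it equals word."""
    return pattern_to_regex(pattern).fullmatch(word) is not None
```

For example, `matches("XaX", "bab")` holds with X replaced by `b`, while `matches("XaX", "bac")` does not, since both occurrences of X must receive the same word.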
Search-based Multi-Vulnerability Testing of XML Injections in Web Applications
Modern web applications often interact with internal web services, which are not directly accessible to users. However, malicious user inputs can be used to exploit security vulnerabilities in web services through the application front-ends. Therefore, testing techniques have been proposed to reveal security flaws in the interactions with back-end web services, e.g., XML Injections (XMLi). Given a potentially malicious message between a web application and web services, search-based techniques have been used to find input data to mislead the web application into sending such a message, possibly compromising the target web service. However, state-of-the-art techniques focus on (search for) one single malicious message at a time.
Since, in practice, there can be many different kinds of malicious messages, only a few of which can be generated by a given front-end, searching for one single message at a time is ineffective and may not scale. To overcome these limitations, we propose a novel co-evolutionary algorithm (COMIX) that is tailored to our problem and uncovers multiple vulnerabilities at the same time. Our experiments show that COMIX outperforms a single-target search approach for XMLi and other multi-target search algorithms originally defined for white-box unit testing.