13 research outputs found

    Convergence of the number of period sets in strings

    Consider words of length n. The set of all periods of a word of length n is a subset of {0,1,2,…,n−1}. However, not every subset of {0,1,2,…,n−1} is a valid set of periods. In a seminal paper in 1981, Guibas and Odlyzko proposed encoding the set of periods of a word as a binary string of length n, called an autocorrelation, where a one at position i denotes that i is a period. They considered the question of recognizing a valid period set, and also studied the number of valid period sets for length n, denoted κ_n. They conjectured that ln(κ_n) asymptotically converges to a constant times ln^2(n). Although improved lower bounds for ln(κ_n)/ln^2(n) were proposed in 2001, the question of a tight upper bound has remained open since Guibas and Odlyzko's paper. Here, we exhibit an upper bound for this fraction, which implies its convergence and closes this long-standing conjecture. Moreover, we extend our result to find similar bounds for the number of correlations: a generalization of autocorrelations which encodes the overlaps between two strings
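
    As a small illustration of the definitions above (not the method of the paper), the following Python sketch computes the autocorrelation of a word and counts distinct autocorrelations by brute force for tiny n. The function names are hypothetical, and the enumeration assumes that a binary alphabet is enough to realise every period set.

```python
from itertools import product


def autocorrelation(word: str) -> str:
    """Return the autocorrelation of `word` as a binary string.

    Position i holds '1' exactly when i is a period of the word, i.e.
    word[j] == word[j + i] for every valid j. Period 0 always holds.
    """
    n = len(word)
    return "".join(
        "1" if all(word[j] == word[j + i] for j in range(n - i)) else "0"
        for i in range(n)
    )


def count_period_sets(n: int, alphabet: str = "ab") -> int:
    """Count distinct autocorrelations of length-n words (exponential time)."""
    return len({autocorrelation("".join(w)) for w in product(alphabet, repeat=n)})


print(autocorrelation("abab"))  # periods 0 and 2, so "1010"
print([count_period_sets(n) for n in range(1, 6)])
```

    The enumeration is only practical for very small n; it is meant to make κ_n concrete, not to compute it efficiently.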

    String Sanitization Under Edit Distance: Improved and Generalized

    Let W be a string of length n over an alphabet Σ, k be a positive integer, and S be a set of length-k substrings of W. The ETFS problem asks us to construct a string X_ED such that: (i) no string of S occurs in X_ED; (ii) the order of all other length-k substrings over Σ is the same in W and in X_ED; and (iii) X_ED has minimal edit distance to W. When W represents an individual's data and S represents a set of confidential patterns, the ETFS problem asks for transforming W to preserve its privacy and its utility [Bernardini et al., ECML PKDD 2019]. ETFS can be solved in O(n^2 k) time [Bernardini et al., CPM 2020]. The same paper shows that ETFS cannot be solved in O(n^{2−δ}) time, for any δ>0, unless the Strong Exponential Time Hypothesis (SETH) is false. Our main results can be summarized as follows: (i) an O(n^2 log^2 k)-time algorithm to solve ETFS; and (ii) an O(n^2 log^2 n)-time algorithm to solve AETFS, a generalization of ETFS in which the elements of S can have arbitrary lengths. Our algorithms are thus optimal up to polylogarithmic factors, unless SETH fails. Let us also stress that our algorithms work under edit distance with arbitrary weights at no extra cost. As a bonus, we show how to modify some known techniques, which speed up the standard edit distance computation, to be applied to our problems. Beyond string sanitization, our techniques may inspire solutions to other problems related to regular expressions or context-free grammars
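
    For readers unfamiliar with edit distance under arbitrary weights, the sketch below shows the textbook quadratic dynamic program with caller-supplied costs. It is not the ETFS algorithm from the abstract; the function name and the cost-callback parameters are assumptions made for illustration.

```python
from typing import Callable


def weighted_edit_distance(
    a: str,
    b: str,
    ins_cost: Callable[[str], float] = lambda c: 1.0,
    del_cost: Callable[[str], float] = lambda c: 1.0,
    sub_cost: Callable[[str, str], float] = lambda x, y: 0.0 if x == y else 1.0,
) -> float:
    """Minimum total cost of insertions, deletions and substitutions turning a into b."""
    # prev[j] = cost of turning the processed prefix of a into b[:j]
    prev = [0.0]
    for ch in b:
        prev.append(prev[-1] + ins_cost(ch))
    for x in a:
        cur = [prev[0] + del_cost(x)]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + del_cost(x),         # delete x
                cur[j - 1] + ins_cost(y),      # insert y
                prev[j - 1] + sub_cost(x, y),  # match or substitute
            ))
        prev = cur
    return prev[-1]


print(weighted_edit_distance("kitten", "sitting"))  # unit costs: 3.0
```

    Keeping only two rows gives O(|b|) space while the running time stays proportional to |a|·|b|.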

    Constructing strings avoiding forbidden substrings

    We consider the problem of constructing strings over an alphabet Σ that start with a given prefix u, end with a given suffix v, and avoid occurrences of a given set of forbidden substrings. In the decision version of the problem, given a set S_k of forbidden substrings, each of length k, over Σ, we are asked to decide whether there exists a string x over Σ such that u is a prefix of x, v is a suffix of x, and no s ∈ S_k occurs in x. Our first result is an O(|u| + |v| + k|S_k|)-time algorithm to decide this problem. In the more general optimization version of the problem, given a set S of forbidden arbitrary-length substrings over Σ, we are asked to construct a shortest string x over Σ such that u is a prefix of x, v is a suffix of x, and no s ∈ S occurs in x. Our second result is an O(|u| + |v| + ||S|| · |Σ|)-time algorithm to solve this problem, where ||S|| denotes the total length of the elements of S. Interestingly, our results can be directly applied to solve the reachability and shortest path problems in complete de Bruijn graphs in the presence of forbidden edges or of forbidden paths. Our algorithms are motivated by data privacy, and in particular, by the data sanitization process. In the context of strings, sanitization consists in hiding forbidden substrings from a given string by introducing the least amount of spurious information. We consider the following problem. Given a string w of length n over Σ, an integer k, and a set S_k of forbidden substrings, each of length k, over Σ, construct a shortest string y over Σ such that no s ∈ S_k occurs in y and the sequence of all other length-k fragments occurring in w is a subsequence of the sequence of the length-k fragments occurring in y. Our third result is an O(nk|S_k| · |Σ|)-time algorithm to solve this problem
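
    To make the first problem concrete, here is a naive breadth-first search that finds a shortest string with prefix u and suffix v avoiding a set of length-k forbidden substrings. It explores states explicitly and can take time exponential in k, so it is only a sketch of the problem statement and not the linear-time algorithm claimed in the abstract; the function and parameter names are hypothetical.

```python
from collections import deque
from typing import Optional, Set


def shortest_avoiding(u: str, v: str, forbidden: Set[str], k: int,
                      alphabet: str) -> Optional[str]:
    """Shortest x with prefix u and suffix v containing no length-k string of `forbidden`."""
    if any(u[i:i + k] in forbidden for i in range(len(u) - k + 1)):
        return None  # the mandatory prefix already contains a forbidden substring
    keep = max(k - 1, len(v))  # trailing characters a state has to remember

    def tail(s: str) -> str:
        return s[max(0, len(s) - keep):]

    start = tail(u)
    queue = deque([(start, "")])  # (state, letters appended after u)
    seen = {start}
    while queue:
        state, added = queue.popleft()
        if state.endswith(v):
            return u + added
        for c in alphabet:
            ext = state + c
            if len(ext) >= k and ext[-k:] in forbidden:
                continue  # appending c would create a forbidden substring
            nxt = tail(ext)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, added + c))
    return None  # no admissible string exists


print(shortest_avoiding("a", "b", {"ab"}, 2, "abc"))  # "acb"
```

    Each state is the last max(k−1, |v|) characters of the string built so far, which mirrors walking a de Bruijn graph with forbidden edges.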

    On breaking truss-based communities

    A k-truss is a graph such that each edge is contained in at least k-2 triangles. This notion has attracted much attention, because it models meaningful cohesive subgraphs of a graph. We introduce the problem of identifying a smallest edge subset of a given graph whose removal makes the graph k-truss-free. We also introduce a problem variant where the identified subset contains only edges incident to a given set of nodes and ensures that these nodes are not contained in any k-truss. These problems are directly applicable in communication networks: the identified edges correspond to vital network connections; or in social networks: the identified edges can be hidden by users or sanitized from the output graph. We show that these problems are NP-hard. We thus develop exact exponential-time algorithms to solve them. To process large networks, we also develop heuristics sped up by an efficient data structure for updating the truss decomposition under edge deletions. We complement our heuristics with a lower bound on the size of an optimal solution to rigorously evaluate their effectiveness. Extensive experiments on 10 real-world graphs show that our heuristics are effective (close to the optimal or to the lower bound) and also efficient (up to two orders of magnitude faster than a natural baseline)
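
    The k-truss definition above can be checked with a simple peeling procedure: repeatedly delete every edge that lies in fewer than k−2 triangles. The sketch below is a naive version of that idea, written for illustration; it is not the heuristic or data structure from the paper, and the function name is hypothetical.

```python
from typing import Dict, FrozenSet, Hashable, Iterable, Set, Tuple


def k_truss_edges(edges: Iterable[Tuple[Hashable, Hashable]],
                  k: int) -> Set[FrozenSet[Hashable]]:
    """Return the edges of the maximal subgraph in which every edge is in >= k-2 triangles."""
    adj: Dict[Hashable, Set[Hashable]] = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    changed = True
    while changed:  # peel until no edge violates the support requirement
        changed = False
        for u in list(adj):
            for v in list(adj[u]):
                # common neighbours of u and v = triangles containing edge {u, v}
                if len(adj[u] & adj[v]) < k - 2:
                    adj[u].discard(v)
                    adj[v].discard(u)
                    changed = True
    return {frozenset((u, v)) for u in adj for v in adj[u]}


k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(len(k_truss_edges(k4, 4)))      # the complete graph on 4 nodes is a 4-truss
print(len(k_truss_edges(k4[1:], 4)))  # deleting edge (0, 1) leaves no 4-truss
```

    A graph is k-truss-free exactly when this function returns an empty set, which is the condition the edge-removal problem aims to reach with as few deletions as possible.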

    A universal error measure for input predictions applied to online graph problems

    We introduce a novel measure for quantifying the error in input predictions. The error is based on a minimum-cost hyperedge cover in a suitably defined hypergraph and provides a general template which we apply to online graph problems. The measure captures errors due to absent predicted requests as well as unpredicted actual requests; hence, predicted and actual inputs can be of arbitrary size. We achieve refined performance guarantees for previously studied network design problems in the online-list model, such as Steiner tree and facility location. Further, we initiate the study of learning-augmented algorithms for online routing problems, such as the online traveling salesperson problem and the online dial-a-ride problem, where (transportation) requests arrive over time (online-time model). We provide a general algorithmic framework and we give error-dependent performance bounds that improve upon known worst-case barriers, when given accurate predictions, at the cost of slightly increased worst-case bounds when given predictions of arbitrary quality

    Hide and mine in strings: Hardness, algorithms, and experiments

    Data sanitization and frequent pattern mining are two well-studied topics in data mining. Our work initiates a study on the fundamental relation between data sanitization and frequent pattern mining in the context of sequential (string) data. Current methods for string sanitization hide confidential patterns. This, however, may lead to spurious patterns that harm the utility of frequent pattern mining. The main computational problem is to minimize this harm. Our contribution here is as follows. First, we present several hardness results, for different variants of this problem, essentially showing that these variants cannot be solved or even be approximated in polynomial time. Second, we propose integer linear programming formulations for these variants and algorithms to solve them, which work in polynomial time under realistic assumptions on the input parameters. We complement the integer linear programming algorithms with a greedy heuristic. Third, we present an extensive experimental study, using both synthetic and real-world datasets, that demonstrates the effectiveness and efficiency of our methods. Beyond sanitization, the process of missing value replacement may also lead to spurious patterns. Interestingly, our results apply in this context as well

    Elastic-degenerate string matching with 1 error

    An elastic-degenerate (ED) string is a sequence of n finite sets of strings of total length N, introduced to represent a set of related DNA sequences, also known as a pangenome. The ED string matching (EDSM) problem consists in reporting all occurrences of a pattern of length m in an ED text. The EDSM problem has recently received some attention by the combinatorial pattern matching community, culminating in an Õ(nm^{ω−1}) + O(N)-time algorithm [Bernardini et al., SIAM J. Comput. 2022], where ω denotes the matrix multiplication exponent and the Õ(⋅) notation suppresses polylog factors. In the k-EDSM problem, the approximate version of EDSM, we are asked to report all pattern occurrences with at most k errors. k-EDSM can be solved in O(k^2 mG + kN) time under edit distance, where G denotes the total number of strings in the ED text [Bernardini et al., Theor. Comput. Sci. 2020]. Unfortunately, G is only bounded by N, and so even for k=1, the existing algorithm runs in Ω(mN) time in the worst case. Here we make progress in this direction. We show that 1-EDSM can be solved in O((nm^2 + N) log m) or O(nm^3 + N) time under edit distance. For the decision version of the problem, we present a faster O(nm^2 √(log m) + N log log m)-time algorithm. Our algorithms rely on non-trivial reductions from 1-EDSM to special instances of classic computational geometry problems (2d rectangle stabbing or range emptiness), which we show how to solve efficiently