
    Towards a Definitive Measure of Repetitiveness

    Unlike in statistical compression, where Shannon's entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size z of the Lempel–Ziv parse are frequently used to estimate repetitiveness. Recently, a more principled measure, the size γ of the smallest string attractor, was introduced. The measure γ lower bounds all the previous relevant ones (including z), yet length-n strings can be represented and efficiently indexed within space O(γ log(n/γ)), which also upper bounds most measures (including z). While γ is certainly a better measure of repetitiveness than z, it is NP-complete to compute, and no o(γ log n)-space representation of strings is known. In this paper, we study a smaller measure, δ ≤ γ, which can be computed in linear time. We show that δ better captures the compressibility of repetitive strings. For every length n and every value δ ≥ 2, we construct a string such that γ = Ω(δ log(n/δ)). Still, we show a representation of any string S in O(δ log(n/δ)) space that supports direct access to any character S[i] in time O(log(n/δ)) and finds the occ occurrences of any pattern P[1..m] in time O(m log n + occ log^ε n) for any constant ε > 0. Further, we prove that no o(δ log n)-space representation exists: for every length n and every value 2 ≤ δ ≤ n^{1−ε}, we exhibit a string family whose elements can only be encoded in Ω(δ log(n/δ)) space. We complete our characterization of δ by showing that, although γ, z, and other repetitiveness measures are always O(δ log(n/δ)), for strings of any length n the smallest context-free grammar can be of size Ω(δ log²n / log log n). No such separation is known for γ.
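The measure z discussed above is easy to state operationally. As a rough illustration (not the paper's algorithm), here is a naive quadratic-time sketch of a greedy, self-referential Lempel–Ziv parse in which each phrase copies the longest previously occurring prefix and then appends one explicit character; the function name and this particular LZ variant are assumptions chosen for illustration.

```python
def lz_parse_size(s: str) -> int:
    """Count the phrases z of a greedy Lempel-Ziv-style parse.

    Each phrase copies the longest prefix of the unparsed suffix that
    already occurs starting at an earlier position (overlaps allowed),
    then appends one explicit character. Naive O(n^2)-time sketch.
    """
    n = len(s)
    i = z = 0
    while i < n:
        length = 0
        # extend the match while s[i:i+length+1] occurs starting before i;
        # the end bound i+length forces every candidate start to be < i
        while i + length < n and s.find(s[i:i + length + 1], 0, i + length) != -1:
            length += 1
        i += length + 1  # copied part plus the explicit character
        z += 1
    return z
```

For example, "abababab" parses into a | b | ababab, giving z = 3.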

    Towards a Definitive Compressibility Measure for Repetitive Sequences

    Unlike in statistical compression, where Shannon's entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size z of the Lempel–Ziv parse are frequently used to estimate it. The size b ≤ z of the smallest bidirectional macro scheme captures better what can be achieved via copy-paste processes, though it is NP-complete to compute and it is not monotonic upon symbol appends. Recently, a more principled measure, the size γ of the smallest string attractor, was introduced. The measure γ ≤ b lower bounds all the previous relevant ones, yet length-n strings can be represented and efficiently indexed within space O(γ log(n/γ)), which also upper bounds most measures. While γ is certainly a better measure of repetitiveness than b, it is also NP-complete to compute and not monotonic, and it is unknown if one can always represent a string in o(γ log n) space. In this paper, we study an even smaller measure, δ ≤ γ, which can be computed in linear time, is monotonic, and allows encoding every string in O(δ log(n/δ)) space because z = O(δ log(n/δ)). We show that δ better captures the compressibility of repetitive strings. Concretely, we show that (1) δ can be strictly smaller than γ, by up to a logarithmic factor; (2) there are string families needing Ω(δ log(n/δ)) space to be encoded, so this space is optimal for every n and δ; (3) one can build run-length context-free grammars of size O(δ log(n/δ)), whereas the smallest (non-run-length) grammar can be up to Θ(log n / log log n) times larger; and (4) within O(δ log(n/δ)) space we can not only...

    Substring Complexity in Sublinear Space

    Shannon's entropy is a definitive lower bound for statistical compression. Unfortunately, no such clear measure exists for the compressibility of repetitive strings. Thus, ad-hoc measures are employed to estimate the repetitiveness of strings, e.g., the size z of the Lempel–Ziv parse or the number r of equal-letter runs of the Burrows–Wheeler transform. A more recent one is the size γ of a smallest string attractor. Unfortunately, Kempa and Prezza [STOC 2018] showed that computing γ is NP-hard. Kociumaka et al. [LATIN 2020] considered a new measure that is based on the function S_T counting the cardinalities of the sets of substrings of each length of T, also known as the substring complexity. This new measure is defined as δ = sup{S_T(k)/k, k ≥ 1} and lower bounds all the measures previously considered. In particular, δ ≤ γ always holds and δ can be computed in O(n) time using Ω(n) working space. Kociumaka et al. showed that if δ is given, one can construct an O(δ log(n/δ))-sized representation of T supporting efficient direct access and efficient pattern matching queries on T. Given that for highly compressible strings δ is significantly smaller than n, it is natural to pose the following question: Can we compute δ efficiently using sublinear working space? It is straightforward to show that any algorithm computing δ using O(b) space requires Ω(n^{2−o(1)}/b) time through a reduction from the element distinctness problem [Yao, SIAM J. Comput. 1994]. We present the following results: an O(n³/b²)-time and O(b)-space algorithm to compute δ, for any b ∈ [1,n]; and an Õ(n²/b)-time and O(b)-space algorithm to compute δ, for any b ∈ [n^{2/3}, n].
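The definition of δ above translates directly into code. The following is a naive sketch with a hypothetical function name (quadratic time and space, nothing like the linear-time algorithms the abstracts refer to) that evaluates S_T(k) with a hash set per length:

```python
def substring_complexity_delta(t: str) -> float:
    """delta = max over k >= 1 of S_T(k)/k, where S_T(k) is the number
    of distinct length-k substrings of t. Since S_T(k) = 0 for k > n,
    checking k in 1..n suffices. Naive O(n^2)-space sketch."""
    n = len(t)
    best = 0.0
    for k in range(1, n + 1):
        s_k = len({t[i:i + k] for i in range(n - k + 1)})
        best = max(best, s_k / k)
    return best
```

On "abab" the supremum is reached at k = 1 (two distinct symbols), so δ = 2.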

    L-Systems for Measuring Repetitiveness

    In order to use them for compression, we extend L-systems (without ε-rules) with two parameters d and n, and also a coding τ, which determines unambiguously a string w = τ(φ^d(s))[1:n], where φ is the morphism of the system, and s is its axiom. The length of the shortest description of an L-system generating w is known as ℓ, and it is arguably a relevant measure of repetitiveness that builds on the self-similarities that arise in the sequence. In this paper, we deepen the study of the measure ℓ and its relation with a better-established measure called δ, which builds on substring complexity. Our results show that ℓ and δ are largely orthogonal, in the sense that one can be much larger than the other, depending on the case. This suggests that both mechanisms capture different kinds of regularities related to repetitiveness. We then show that the recently introduced NU-systems, which combine the capabilities of L-systems with bidirectional macro schemes, can be asymptotically strictly smaller than both mechanisms for the same fixed string family, which makes the size ν of the smallest NU-system the unique smallest reachable repetitiveness measure to date. We conclude that in order to achieve better compression, we should combine morphism substitution with copy-paste mechanisms.
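The construction w = τ(φ^d(s))[1:n] described above can be simulated directly for small deterministic L-systems. The sketch below is illustrative only (the measure ℓ concerns the size of the system's description, not this expansion), and its names are hypothetical; it exploits the fact that when the morphism has no ε-rules, the length-n prefix of each iterate depends only on the length-n prefix of the previous one:

```python
def l_system_prefix(axiom: str, rules: dict, coding: dict, d: int, n: int) -> str:
    """Compute tau(phi^d(axiom))[1:n] for an epsilon-free L-system.

    `rules` maps each symbol to its nonempty image under the morphism
    phi; `coding` maps each symbol to one output symbol (tau). Because
    phi is non-erasing, truncating to n symbols each round is safe.
    """
    w = axiom
    for _ in range(d):
        w = "".join(rules[c] for c in w[:n])
    return "".join(coding[c] for c in w[:n])
```

With the Thue–Morse morphism a → ab, b → ba, axiom a, the identity coding, d = 4 and n = 10, this yields "abbabaabba".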

    Substring Complexity in Sublinear Space

    Shannon's entropy is a definitive lower bound for statistical compression. Unfortunately, no such clear measure exists for the compressibility of repetitive strings. Thus, ad hoc measures are employed to estimate the repetitiveness of strings, e.g., the size z of the Lempel–Ziv parse or the number r of equal-letter runs of the Burrows–Wheeler transform. A more recent one is the size γ of a smallest string attractor. Let T be a string of length n. A string attractor of T is a set of positions of T capturing the occurrences of all the substrings of T. Unfortunately, Kempa and Prezza [STOC 2018] showed that computing γ is NP-hard. Kociumaka et al. [LATIN 2020] considered a new measure of compressibility that is based on the function S_T(k) counting the number of distinct substrings of length k of T, also known as the substring complexity of T. This new measure is defined as δ = sup{S_T(k)/k, k ≥ 1} and lower bounds all the relevant ad hoc measures previously considered. In particular, δ ≤ γ always holds and δ can be computed in O(n) time using Θ(n) working space. Kociumaka et al. showed that one can construct an O(δ log(n/δ))-sized representation of T supporting efficient direct access and efficient pattern matching queries on T. Given that for highly compressible strings δ is significantly smaller than n, it is natural to pose the following question: Can we compute δ efficiently using sublinear working space? It is straightforward to show that in the comparison model, any algorithm computing δ using O(b) space requires Ω(n^{2−o(1)}/b) time through a reduction from the element distinctness problem [Yao, SIAM J. Comput. 1994]. We thus wanted to investigate whether we can indeed match this lower bound. We address this algorithmic challenge by showing the following bounds to compute δ:
    - O((n³ log b)/b²) time using O(b) space, for any b ∈ [1,n], in the comparison model.
    - Õ(n²/b) time using Õ(b) space, for any b ∈ [√n, n], in the word RAM model.
    This gives an Õ(n^{1+ε})-time and Õ(n^{1−ε})-space algorithm to compute δ, for any 0 < ε ≤ 1/2. Let us remark that our algorithms compute S_T(k), for all k, within the same complexities.
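The attractor definition quoted above (a set of positions such that every distinct substring of T has an occurrence crossing one of them) is straightforward to verify naively. A small sketch with hypothetical names and 0-based positions, usable only for short strings:

```python
def is_string_attractor(t: str, positions: set) -> bool:
    """Check whether `positions` (0-based) is a string attractor of t:
    every distinct substring must have at least one occurrence that
    contains a position of the set. Roughly cubic naive sketch."""
    n = len(t)
    seen = set()
    for i in range(n):
        for j in range(i + 1, n + 1):
            sub = t[i:j]
            if sub in seen:
                continue
            seen.add(sub)
            start, covered = t.find(sub), False
            while start != -1 and not covered:
                covered = any(start <= p < start + len(sub) for p in positions)
                start = t.find(sub, start + 1)
            if not covered:
                return False
    return True
```

For instance, {1, 2} is an attractor of "abab", while no single position is.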

    Feeling good about myself: real-time hermeneutics and its consequences

    Questions concerning the way in which digital games produce meaning and the possibility that their reconfigurability influences the process of interpretation have been debated since the very beginning of contemporary game studies. Based on general agreement between scholars, two areas of inquiry have been distinguished: the story produced by a game, and game mechanics, or rather all the information necessary to operate within them. The so-called "Game vs. Story division" has been analysed from multiple perspectives and theoretical standpoints. Among the scholars adopting the hermeneutical angle, there seems to be a consensus regarding the two distinct interpretative processes that occur while a game is played, although they do not agree about which should be considered the primary one. Scholars arguing for the unique character of digital games tend to focus on the interpretation created while the game is played that relates to aspects of gameplay. They stress the importance of so-called "real-time hermeneutics", as this is unprecedented in other media. In turn, researchers questioning the specificity of games as a medium claim that a proper interpretation should concern itself with the stories produced through playing, rendering such interpretation similar to every other hermeneutical process. Therefore, the process of understanding a game could be explained within the existing hermeneutical framework without any need to introduce media-specific interventions. In this paper, I will investigate the process of understanding video games, following the detailed, step-by-step description of interpretation provided by Paul Ricoeur in his American lectures. In doing so, I will supplement the concept of "real-time hermeneutics" by narrowing the gap between interpreting game stories and gameplay situations. 
While such a perspective will bring me closer to a stance which denies any specificity to video games (at least regarding interpretation), I will also describe the key difference between understanding a video game and a traditional text, and briefly point towards its possible consequences, building upon Charles Taylor’s concept of ethics of authenticity

    Computing NP-Hard Repetitiveness Measures via MAX-SAT

    Repetitiveness measures reveal profound characteristics of datasets, and give rise to compressed data structures and algorithms working in compressed space. Alas, the computation of some of these measures is NP-hard, and straightforward computation is infeasible for datasets of even small sizes. Three such measures are the smallest size of a string attractor, the smallest size of a bidirectional macro scheme, and the smallest size of a straight-line program. While a vast variety of implementations for heuristically computing approximations exist, exact computation of these measures has received little to no attention. In this paper, we present MAX-SAT formulations that provide the first non-trivial implementations for exact computation of smallest string attractors, smallest bidirectional macro schemes, and smallest straight-line programs. Computational experiments show that our implementations work for texts of length up to a few hundred for straight-line programs and bidirectional macro schemes, and even over a million for string attractors.
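To make the problem concrete: here is what exact computation looks like if one ignores efficiency entirely. This brute-force sketch (hypothetical names, exponential time, workable only for strings of a dozen or so characters) finds a smallest string attractor by trying position subsets in increasing size; the MAX-SAT formulations discussed above replace exactly this kind of exhaustive search:

```python
from itertools import combinations

def smallest_attractor_size(t: str) -> int:
    """Size of a smallest string attractor of t, by exhaustive search.

    A candidate set works iff every distinct substring has an occurrence
    containing one of its positions. Exponential time: illustration only.
    """
    n = len(t)
    subs = {t[i:j] for i in range(n) for j in range(i + 1, n + 1)}

    def occurrence_ranges(sub):
        start, out = t.find(sub), []
        while start != -1:
            out.append(set(range(start, start + len(sub))))
            start = t.find(sub, start + 1)
        return out

    occ = {sub: occurrence_ranges(sub) for sub in subs}
    for k in range(1, n + 1):
        for cand in combinations(range(n), k):
            c = set(cand)
            if all(any(c & o for o in occ[sub]) for sub in subs):
                return k
    return n
```

E.g. "abab" needs two positions, while "aaaa" needs only one.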

    Organisational Change in Europe: National Models or the Diffusion of a New "One Best Way"?

    Drawing on the results of the third European Survey on Working Conditions undertaken in the 15 member nations of the European Union in 2000, this paper offers one of the first systematic comparisons of the adoption of new organisational forms across Europe. The paper is divided into five sections. The first describes the variables used to characterise work organisation in the 15 countries of the European Union and presents the results of the factor analysis and hierarchical clustering used to construct a 4-way typology of organisational forms, labelled the 'learning', 'lean', 'taylorist' and 'traditional' forms. The second section examines how the relative importance of the different organisational forms varies according to sector, firm size, occupational category, and certain demographic characteristics of the survey population. The third section makes use of multinomial logit analysis to assess the importance of national effects in the adoption of the different organisational forms. The results demonstrate significant international differences in the adoption of organisational forms characterised by strong learning dynamics and high problem-solving activity. The fourth section takes up the issue of HRM complementarities by examining the relation between organisational forms and the use of particular pay and training policies. The concluding section explores the relation between national differences in the use of the four organisational forms and differences in the way labour markets are regulated and in such research and technology measures as patenting and R&D expenditures. The results show that the relative importance of the learning form of organisation is positively correlated both with the extent of labour market regulation, as measured by the OECD's overall employment protection legislation index, and with innovative performance, as measured by the number of EPO patent applications per million inhabitants.
    Keywords: firm organisation; learning; Europe

    Indexing Highly Repetitive String Collections

    Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated through applications like bioinformatics, the string collections experienced a growth that outperforms Moore's Law and challenges our ability to handle them even in compressed form. It turns out, fortunately, that many of these rapidly growing string collections are highly repetitive, so that their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed in order to properly exploit it. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this survey we cover the algorithmic developments that have led to these data structures. We describe the distinct compression paradigms that have been used to exploit repetitiveness, the fundamental algorithmic ideas that form the base of all the existing indexes, and the various structures that have been proposed, comparing them in both theoretical and practical aspects. We conclude with the current challenges in this fascinating field.

    Novel Results on the Number of Runs of the Burrows-Wheeler-Transform

    The Burrows-Wheeler-Transform (BWT), a reversible string transformation, is one of the fundamental components of many current data structures in string processing. It is central in data compression, as well as in efficient query algorithms for sequence data, such as webpages, genomic and other biological sequences, or indeed any textual data. The BWT lends itself well to compression because its number of equal-letter runs (usually referred to as r) is often considerably lower than that of the original string; in particular, it is well suited for strings with many repeated factors. In fact, much attention has been paid to the r parameter as a measure of repetitiveness, especially to evaluate the performance in terms of both space and time of compressed indexing data structures. In this paper, we investigate ρ(v), the ratio of r and of the number of runs of the BWT of the reverse of v. Kempa and Kociumaka [FOCS 2020] gave the first non-trivial upper bound as ρ(v) = O(log²(n)), for any string v of length n. However, nothing is known about the tightness of this upper bound. We present infinite families of binary strings for which ρ(v) = Θ(log n) holds, thus giving the first non-trivial lower bound on ρ(n), the maximum over all strings of length n. Our results suggest that r is not an ideal measure of the repetitiveness of the string, since the number of repeated factors is invariant between the string and its reverse. We believe that there is a more intricate relationship between the number of runs of the BWT and the string's combinatorial properties.
    Comment: 14 pages, 2 figures
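The quantities in this abstract are easy to reproduce naively for short strings. The sketch below builds the BWT by sorting all rotations of the string extended with a sentinel (an assumption: the paper's exact conventions, e.g. whether a terminator is appended, may differ), counts the equal-letter runs r, and forms the ratio against the reversed string:

```python
def bwt_runs(s: str) -> int:
    """Number r of equal-letter runs of the BWT of s, computed naively
    by sorting all rotations of s plus a minimal sentinel character."""
    t = s + "\x00"  # sentinel assumed absent in s and smaller than all its symbols
    bwt = "".join(rot[-1] for rot in sorted(t[i:] + t[:i] for i in range(len(t))))
    return 1 + sum(bwt[i] != bwt[i - 1] for i in range(1, len(bwt)))

def run_ratio(v: str) -> float:
    """rho(v): runs of the BWT of v divided by runs of the BWT of its reverse."""
    return bwt_runs(v) / bwt_runs(v[::-1])
```

For "banana" the BWT (with sentinel) is "annb\x00aa", so r = 5; a string equal to its own reverse, like "aaaa", trivially has ρ = 1.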