Towards a Definitive Measure of Repetitiveness
Unlike in statistical compression, where Shannon's entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size z of the Lempel–Ziv parse are frequently used to estimate repetitiveness. Recently, a more principled measure, the size γ of the smallest string attractor, was introduced. The measure γ lower bounds all the previous relevant ones (including z), yet length-n strings can be represented and efficiently indexed within space O(γ log(n/γ)), which also upper bounds most measures (including z). While γ is certainly a better measure of repetitiveness than z, it is NP-complete to compute, and no o(γ log n)-space representation of strings is known. In this paper, we study a smaller measure, δ ≤ γ, which can be computed in linear time. We show that δ better captures the compressibility of repetitive strings. For every length n and every value δ ≥ 2, we construct a string such that γ = Ω(δ log(n/δ)). Still, we show a representation of any string S in O(δ log(n/δ)) space that supports direct access to any character S[i] in time O(log(n/δ)) and finds the occ occurrences of any pattern P[1..m] in time O(m log n + occ log^ε n) for any constant ε > 0. Further, we prove that no o(δ log n)-space representation exists: for every length n and every value 2 ≤ δ ≤ n^{1-ε}, we exhibit a string family whose elements can only be encoded in Ω(δ log(n/δ)) space. We complete our characterization of δ by showing that, although γ, z, and other repetitiveness measures are always O(δ log(n/δ)), for strings of any length n, the smallest context-free grammar can be of size Ω(δ log² n/log log n). No such separation is known for γ.
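The measure δ can be computed directly from its definition by counting distinct substrings of each length. A minimal sketch in Python (quadratic time and space, for illustration only; the linear-time algorithm mentioned in the abstract is more involved):

```python
def delta(s: str) -> float:
    """Substring complexity measure: delta = max over k of S(k)/k,
    where S(k) is the number of distinct substrings of s of length k."""
    n = len(s)
    best = 0.0
    for k in range(1, n + 1):
        s_k = {s[i:i + k] for i in range(n - k + 1)}  # distinct length-k substrings
        best = max(best, len(s_k) / k)
    return best

# Repetitive strings have small delta relative to their length:
print(delta("abababababab"))  # -> 2.0
print(delta("abcdefghijkl"))  # -> 12.0
```

On highly repetitive inputs δ stays bounded while n grows, which is what makes it attractive as a repetitiveness measure.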
Towards a Definitive Compressibility Measure for Repetitive Sequences
Unlike in statistical compression, where Shannon's entropy is a definitive
lower bound, no such clear measure exists for the compressibility of repetitive
sequences. Since statistical entropy does not capture repetitiveness, ad-hoc
measures like the size z of the Lempel–Ziv parse are frequently used to
estimate it. The size b ≤ z of the smallest bidirectional macro scheme
captures better what can be achieved via copy-paste processes, though it is
NP-complete to compute and it is not monotonic upon symbol appends. Recently, a
more principled measure, the size γ of the smallest string
attractor, was introduced. The measure γ ≤ b lower bounds all
the previous relevant ones, yet length-n strings can be represented and
efficiently indexed within space O(γ log(n/γ)), which also
upper bounds most measures. While γ is certainly a better measure of
repetitiveness than z, it is also NP-complete to compute and not monotonic,
and it is unknown if one can always represent a string in o(γ log n)
space.
In this paper, we study an even smaller measure, δ ≤ γ, which
can be computed in linear time, is monotonic, and allows encoding every string
in O(δ log(n/δ)) space because z = O(δ log(n/δ)). We show that δ better captures the
compressibility of repetitive strings. Concretely, we show that (1) δ
can be strictly smaller than γ, by up to a logarithmic factor; (2) there
are string families needing Ω(δ log(n/δ)) space to be
encoded, so this space is optimal for every n and δ; (3) one can build
run-length context-free grammars of size O(δ log(n/δ)),
whereas the smallest (non-run-length) grammar can be up to log n/log log n times larger; and (4) within
O(δ log(n/δ)) space we can not only..
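The Lempel–Ziv parse size z mentioned in the abstracts above can be computed with a greedy left-to-right scan. A quadratic-time Python sketch of one common variant (each phrase is the longest previously occurring prefix plus one fresh symbol; other LZ variants give slightly different counts):

```python
def lz_parse_size(s: str) -> int:
    """Count phrases of a greedy LZ77-style parse: each phrase is the
    longest prefix of s[i:] with an occurrence starting strictly before i,
    followed by one extra symbol (omitted if the string ends first)."""
    n = len(s)
    i, z = 0, 0
    while i < n:
        l = 0
        # grow the match while s[i:i+l+1] occurs starting strictly before i
        while i + l < n and s.find(s[i:i + l + 1], 0, i + l) != -1:
            l += 1
        i += l + 1  # match plus one fresh symbol
        z += 1
    return z

print(lz_parse_size("abababab"))  # -> 3: phrases "a", "b", "ababab"
print(lz_parse_size("aaaa"))      # -> 2: phrases "a", "aaa"
```

The `s.find(sub, 0, i + l)` call bounds the occurrence to end before position i + l, which forces its starting position to be strictly before i while still allowing the self-referential overlap that LZ77 permits.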
Substring Complexity in Sublinear Space
Shannon's entropy is a definitive lower bound for statistical compression.
Unfortunately, no such clear measure exists for the compressibility of
repetitive strings. Thus, ad-hoc measures are employed to estimate the
repetitiveness of strings, e.g., the size z of the Lempel-Ziv parse or the
number r of equal-letter runs of the Burrows-Wheeler transform. A more recent
one is the size γ of a smallest string attractor. Unfortunately, Kempa
and Prezza [STOC 2018] showed that computing γ is NP-hard. Kociumaka et
al. [LATIN 2020] considered a new measure that is based on the function S_T(k)
counting the cardinalities of the sets of substrings of each length k of T,
also known as the substring complexity. This new measure is defined as
δ = sup{S_T(k)/k : k ≥ 1} and lower bounds all the measures previously
considered. In particular, δ ≤ γ always holds and δ can be
computed in O(n) time using Θ(n) working space. Kociumaka et
al. showed that if δ is given, one can construct an O(δ log(n/δ))-sized representation of T supporting efficient direct
access and efficient pattern matching queries on T. Given that for highly
compressible strings, δ is significantly smaller than n, it is natural
to pose the following question: Can we compute δ efficiently using
sublinear working space?
It is straightforward to show that any algorithm computing δ using
O(b) space requires Ω(n^{2-o(1)}/b) time through a reduction
from the element distinctness problem [Yao, SIAM J. Comput. 1994]. We present
the following results: an O((n³ log b)/b²)-time and
O(b)-space algorithm to compute δ, for any b ∈ [1, n]; and
an Õ(n²/b)-time and Õ(b)-space algorithm to
compute δ, for any b ∈ [√n, n].
L-Systems for Measuring Repetitiveness
In order to use them for compression, we extend L-systems (without ε-rules) with two parameters d and n, and also a coding τ, which determines unambiguously a string w = τ(φ^d(s))[1:n], where φ is the morphism of the system, and s is its axiom. The length ℓ of the shortest description of an L-system generating w is arguably a relevant measure of repetitiveness that builds on the self-similarities that arise in the sequence.
In this paper, we deepen the study of the measure ℓ and its relation with a better-established measure called δ, which builds on substring complexity. Our results show that ℓ and δ are largely orthogonal, in the sense that one can be much larger than the other, depending on the case. This suggests that both mechanisms capture different kinds of regularities related to repetitiveness.
We then show that the recently introduced NU-systems, which combine the capabilities of L-systems with bidirectional macro schemes, can be asymptotically strictly smaller than both mechanisms for the same fixed string family, which makes the size ν of the smallest NU-system the unique smallest reachable repetitiveness measure to date. We conclude that in order to achieve better compression, we should combine morphism substitution with copy-paste mechanisms.
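The extended L-systems described above are easy to simulate: apply the morphism d times to the axiom, apply the coding, and truncate to length n. A minimal Python sketch (the Thue-Morse morphism is used here only as an illustrative example system, not one taken from the paper):

```python
def l_system(axiom: str, phi: dict, tau: dict, d: int, n: int) -> str:
    """w = tau(phi^d(axiom))[:n], where phi is the morphism and tau the coding."""
    s = axiom
    for _ in range(d):
        s = "".join(phi[c] for c in s)  # one parallel rewriting step
    return "".join(tau[c] for c in s)[:n]

# Example: the Thue-Morse morphism a -> ab, b -> ba with a binary coding
phi = {"a": "ab", "b": "ba"}
tau = {"a": "0", "b": "1"}
print(l_system("a", phi, tau, d=4, n=10))  # -> 0110100110
```

Words like Thue-Morse are highly self-similar yet poor in copy-paste regularities, which is the kind of string family where morphism-based measures shine.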
Substring Complexity in Sublinear Space
Shannon's entropy is a definitive lower bound for statistical compression. Unfortunately, no such clear measure exists for the compressibility of repetitive strings. Thus, ad hoc measures are employed to estimate the repetitiveness of strings, e.g., the size z of the Lempel–Ziv parse or the number r of equal-letter runs of the Burrows–Wheeler transform. A more recent one is the size γ of a smallest string attractor. Let T be a string of length n. A string attractor of T is a set of positions of T capturing the occurrences of all the substrings of T. Unfortunately, Kempa and Prezza [STOC 2018] showed that computing γ is NP-hard. Kociumaka et al. [LATIN 2020] considered a new measure of compressibility that is based on the function S_T(k) counting the number of distinct substrings of length k of T, also known as the substring complexity of T. This new measure is defined as δ = sup{S_T(k)/k : k ≥ 1} and lower bounds all the relevant ad hoc measures previously considered. In particular, δ ≤ γ always holds and δ can be computed in O(n) time using Θ(n) working space. Kociumaka et al. showed that one can construct an O(δ log(n/δ))-sized representation of T supporting efficient direct access and efficient pattern matching queries on T. Given that for highly compressible strings, δ is significantly smaller than n, it is natural to pose the following question: Can we compute δ efficiently using sublinear working space? It is straightforward to show that in the comparison model, any algorithm computing δ using O(b) space requires Ω(n^{2-o(1)}/b) time through a reduction from the element distinctness problem [Yao, SIAM J. Comput. 1994]. We thus wanted to investigate whether we can indeed match this lower bound. We address this algorithmic challenge by showing the following bounds to compute δ:
- O((n³ log b)/b²) time using O(b) space, for any b ∈ [1, n], in the comparison model.
- Õ(n²/b) time using Õ(b) space, for any b ∈ [√n, n], in the word RAM model.
This gives an Õ(n^{1+ε})-time and Õ(n^{1-ε})-space algorithm to compute δ, for any 0 < ε ≤ 1/2. Let us remark that our algorithms compute S_T(k), for all k, within the same complexities.
Feeling good about myself: real-time hermeneutics and its consequences
Questions concerning the way in which digital games produce meaning and the possibility that their reconfigurability influences the process of interpretation have been debated since the very beginning of contemporary game studies. Based on general agreement between scholars, two areas of inquiry have been distinguished: the story produced by a game, and game mechanics, or rather all the information necessary to operate within them. The so-called "Game vs. Story division" has been analysed from multiple perspectives and theoretical standpoints. Among the scholars adopting the hermeneutical angle, there seems to be a consensus regarding the two distinct interpretative processes that occur while a game is played, although they do not agree about which should be considered the primary one. Scholars arguing for the unique character of digital games tend to focus on the interpretation created while the game is played that relates to aspects of gameplay. They stress the importance of so-called "real-time hermeneutics", as this is unprecedented in other media. In turn, researchers questioning the specificity of games as a medium claim that a proper interpretation should concern itself with the stories produced through playing, rendering such interpretation similar to every other hermeneutical process. Therefore, the process of understanding a game could be explained within the existing hermeneutical framework without any need to introduce media-specific interventions. In this paper, I will investigate the process of understanding video games, following the detailed, step-by-step description of interpretation provided by Paul Ricoeur in his American lectures. In doing so, I will supplement the concept of "real-time hermeneutics" by narrowing the gap between interpreting game stories and gameplay situations.
While such a perspective will bring me closer to a stance which denies any specificity to video games (at least regarding interpretation), I will also describe the key difference between understanding a video game and a traditional text, and briefly point towards its possible consequences, building upon Charles Taylor’s concept of ethics of authenticity
Computing NP-Hard Repetitiveness Measures via MAX-SAT
Repetitiveness measures reveal profound characteristics of datasets, and give rise to compressed data structures and algorithms working in compressed space. Alas, the computation of some of these measures is NP-hard, and straightforward computation is infeasible for datasets of even small sizes. Three such measures are the smallest size of a string attractor, the smallest size of a bidirectional macro scheme, and the smallest size of a straight-line program. While a vast variety of implementations for heuristically computing approximations exist, exact computation of these measures has received little to no attention. In this paper, we present MAX-SAT formulations that provide the first non-trivial implementations for exact computation of smallest string attractors, smallest bidirectional macro schemes, and smallest straight-line programs. Computational experiments show that our implementations work for texts of length up to a few hundred for straight-line programs and bidirectional macro schemes, and texts of length even over a million for string attractors.
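As a baseline for the exact methods discussed above, the string attractor definition can be checked directly, and a smallest attractor found by exhaustive search over position subsets. A Python sketch (exponential time, only feasible for tiny strings; the MAX-SAT formulations of the paper scale much further):

```python
from itertools import combinations

def is_attractor(s: str, positions: set) -> bool:
    """Every distinct substring of s must have at least one occurrence
    s[i:i+k] that covers some attractor position p (i <= p < i+k)."""
    n = len(s)
    for k in range(1, n + 1):
        for sub in {s[i:i + k] for i in range(n - k + 1)}:
            if not any(
                any(i <= p < i + k for p in positions)
                for i in range(n - k + 1)
                if s[i:i + k] == sub
            ):
                return False
    return True

def smallest_attractor_size(s: str) -> int:
    """Exhaustive search by increasing subset size (exponential!)."""
    for k in range(1, len(s) + 1):
        for cand in combinations(range(len(s)), k):
            if is_attractor(s, set(cand)):
                return k
    return 0

print(smallest_attractor_size("aaaa"))  # -> 1
print(smallest_attractor_size("abab"))  # -> 2
```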
Organisational Change in Europe: National Models or the Diffusion of a New "One Best Way"?
Drawing on the results of the third European Survey on Working Conditions undertaken in the 15 member nations of the European Union in 2000, this paper offers one of the first systematic comparisons of the adoption of new organisation forms across Europe. The paper is divided into five sections. The first describes the variables used to characterise work organisation in the 15 countries of the European Union and presents the results of the factor analysis and hierarchical clustering used to construct a 4-way typology of organisational forms, labelled the 'learning', 'lean', 'taylorist' and 'traditional' forms. The second section examines how the relative importance of the different organisational forms varies according to sector, firm size, occupational category, and certain demographic characteristics of the survey population. The third section makes use of multinomial logit analysis to assess the importance of national effects in the adoption of the different organisational forms. The results demonstrate significant international differences in the adoption of organisational forms characterised by strong learning dynamics and high problem-solving activity. The fourth section takes up the issue of HRM complementarities by examining the relation between organisation forms and the use of particular pay and training policies. The concluding section explores the relation between national differences in the use of the four organisational forms and differences in the way labour markets are regulated and in such research and technology measures as patenting and R&D expenditures. The results show that the relative importance of the learning form of organisation is both positively correlated with the extent of labour market regulation, as measured by the OECD's overall employment protection legislation index, and with innovative performance, as measured by the number of EPO patent applications per million inhabitants.
Firm organisation; learning; Europe
Indexing Highly Repetitive String Collections
Two decades ago, a breakthrough in indexing string collections made it
possible to represent them within their compressed space while at the same time
offering indexed search functionalities. As this new technology permeated
through applications like bioinformatics, the string collections experienced a
growth that outperforms Moore's Law and challenges our ability to handle them
even in compressed form. It turns out, fortunately, that many of these rapidly
growing string collections are highly repetitive, so that their information
content is orders of magnitude lower than their plain size. The statistical
compression methods used for classical collections, however, are blind to this
repetitiveness, and therefore a new set of techniques has been developed in
order to properly exploit it. The resulting indexes form a new generation of
data structures able to handle the huge repetitive string collections that we
are facing.
In this survey we cover the algorithmic developments that have led to these
data structures. We describe the distinct compression paradigms that have been
used to exploit repetitiveness, the fundamental algorithmic ideas that form the
base of all the existing indexes, and the various structures that have been
proposed, comparing them both in theoretical and practical aspects. We conclude
with the current challenges in this fascinating field.
Novel Results on the Number of Runs of the Burrows-Wheeler-Transform
The Burrows-Wheeler-Transform (BWT), a reversible string transformation, is
one of the fundamental components of many current data structures in string
processing. It is central in data compression, as well as in efficient query
algorithms for sequence data, such as webpages, genomic and other biological
sequences, or indeed any textual data. The BWT lends itself well to compression
because its number of equal-letter-runs (usually referred to as r) is often
considerably lower than that of the original string; in particular, it is well
suited for strings with many repeated factors. In fact, much attention has been
paid to the parameter r as a measure of repetitiveness, especially to evaluate
the performance in terms of both space and time of compressed indexing data
structures.
In this paper, we investigate ρ(s), the ratio of r(s) and of the number
of runs of the BWT of the reverse of s. Kempa and Kociumaka [FOCS 2020] gave
the first non-trivial upper bound as ρ(s) = O(log² n), for any string s
of length n. However, nothing is known about the tightness of this upper
bound. We present infinite families of binary strings for which
ρ(s) = Θ(log n) holds, thus giving the first non-trivial lower bound on
ρ(n), the maximum over all strings of length n.
Our results suggest that r is not an ideal measure of the repetitiveness of
the string, since the number of repeated factors is invariant between the
string and its reverse. We believe that there is a more intricate relationship
between the number of runs of the BWT and the string's combinatorial
properties.
Comment: 14 pages, 2 figures
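The behaviour discussed above can be observed experimentally with a naive implementation: build the BWT by sorting rotations and count equal-letter runs for a string and its reverse. A Python sketch (quadratic space, using the usual '$' end-marker convention; practical tools use suffix arrays instead):

```python
def bwt(s: str) -> str:
    """Naive Burrows-Wheeler transform: sort all rotations of s + '$'
    and concatenate their last characters."""
    t = s + "$"
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rot[-1] for rot in rotations)

def runs(s: str) -> int:
    """Number of maximal equal-letter runs in s."""
    return sum(1 for i in range(len(s)) if i == 0 or s[i] != s[i - 1])

t = "abababab"
print(bwt(t), runs(bwt(t)))              # -> bbbb$aaaa 3
print(bwt(t[::-1]), runs(bwt(t[::-1])))  # -> abbbbaaa$ 4
```

Even on this tiny example the run count differs between the string and its reverse, which is exactly the ratio the paper studies.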