15,848 research outputs found

    Universal Compressed Text Indexing

    Get PDF
    The rise of repetitive datasets has lately generated a lot of interest in compressed self-indexes based on dictionary compression, a rich and heterogeneous family that exploits text repetitions in different ways. For each such compression scheme, several different indexing solutions have been proposed in the last two decades. To date, the fastest indexes for repetitive texts are based on the run-length compressed Burrows-Wheeler transform and on the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on the other hand, are based on the Lempel-Ziv parsing and on grammar compression. Indexes for more universal schemes such as collage systems and macro schemes have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed that all dictionary compressors can be interpreted as approximation algorithms for the smallest string attractor, that is, a set of text positions capturing all distinct substrings. Starting from this observation, in this paper we develop the first universal compressed self-index, that is, the first indexing data structure based on string attractors, which can therefore be built on top of any dictionary-compressed text representation. Let γ\gamma be the size of a string attractor for a text of length nn. Our index takes O(γlog(n/γ))O(\gamma\log(n/\gamma)) words of space and supports locating the occocc occurrences of any pattern of length mm in O(mlogn+occlogϵn)O(m\log n + occ\log^{\epsilon}n) time, for any constant ϵ>0\epsilon>0. This is, in particular, the first index for general macro schemes and collage systems. Our result shows that the relation between indexing and compression is much deeper than what was previously thought: the simple property standing at the core of all dictionary compressors is sufficient to support fast indexed queries.Comment: Fixed with reviewer's comment

    Distributed PCP Theorems for Hardness of Approximation in P

    Get PDF
    We present a new distributed model of probabilistically checkable proofs (PCP). A satisfying assignment x{0,1}nx \in \{0,1\}^n to a CNF formula φ\varphi is shared between two parties, where Alice knows x1,,xn/2x_1, \dots, x_{n/2}, Bob knows xn/2+1,,xnx_{n/2+1},\dots,x_n, and both parties know φ\varphi. The goal is to have Alice and Bob jointly write a PCP that xx satisfies φ\varphi, while exchanging little or no information. Unfortunately, this model as-is does not allow for nontrivial query complexity. Instead, we focus on a non-deterministic variant, where the players are helped by Merlin, a third party who knows all of xx. Using our framework, we obtain, for the first time, PCP-like reductions from the Strong Exponential Time Hypothesis (SETH) to approximation problems in P. In particular, under SETH we show that there are no truly-subquadratic approximation algorithms for Bichromatic Maximum Inner Product over {0,1}-vectors, Bichromatic LCS Closest Pair over permutations, Approximate Regular Expression Matching, and Diameter in Product Metric. All our inapproximability factors are nearly-tight. In particular, for the first two problems we obtain nearly-polynomial factors of 2(logn)1o(1)2^{(\log n)^{1-o(1)}}; only (1+o(1))(1+o(1))-factor lower bounds (under SETH) were known before

    Computation of distances for regular and context-free probabilistic languages

    Get PDF
    Several mathematical distances between probabilistic languages have been investigated in the literature, motivated by applications in language modeling, computational biology, syntactic pattern matching and machine learning. In most cases, only pairs of probabilistic regular languages were considered. In this paper we extend the previous results to pairs of languages generated by a probabilistic context-free grammar and a probabilistic finite automaton.PostprintPeer reviewe

    Learning probability distributions generated by finite-state machines

    Get PDF
    We review methods for inference of probability distributions generated by probabilistic automata and related models for sequence generation. We focus on methods that can be proved to learn in the inference in the limit and PAC formal models. The methods we review are state merging and state splitting methods for probabilistic deterministic automata and the recently developed spectral method for nondeterministic probabilistic automata. In both cases, we derive them from a high-level algorithm described in terms of the Hankel matrix of the distribution to be learned, given as an oracle, and then describe how to adapt that algorithm to account for the error introduced by a finite sample.Peer ReviewedPostprint (author's final draft

    Some Historical Aspects of Error Calculus by Dirichlet Forms

    Get PDF
    We discuss the main stages of development of the error calculation since the beginning of XIX-th century by insisting on what prefigures the use of Dirichlet forms and emphasizing the mathematical properties that make the use of Dirichlet forms more relevant and efficient. The purpose of the paper is mainly to clarify the concepts. We also indicate some possible future research.Comment: 18 page

    Predictability, complexity and learning

    Full text link
    We define {\em predictive information} Ipred(T)I_{\rm pred} (T) as the mutual information between the past and the future of a time series. Three qualitatively different behaviors are found in the limit of large observation times TT: Ipred(T)I_{\rm pred} (T) can remain finite, grow logarithmically, or grow as a fractional power law. If the time series allows us to learn a model with a finite number of parameters, then Ipred(T)I_{\rm pred} (T) grows logarithmically with a coefficient that counts the dimensionality of the model space. In contrast, power--law growth is associated, for example, with the learning of infinite parameter (or nonparametric) models such as continuous functions with smoothness constraints. There are connections between the predictive information and measures of complexity that have been defined both in learning theory and in the analysis of physical systems through statistical mechanics and dynamical systems theory. Further, in the same way that entropy provides the unique measure of available information consistent with some simple and plausible conditions, we argue that the divergent part of Ipred(T)I_{\rm pred} (T) provides the unique measure for the complexity of dynamics underlying a time series. Finally, we discuss how these ideas may be useful in different problems in physics, statistics, and biology.Comment: 53 pages, 3 figures, 98 references, LaTeX2

    Coding-theorem Like Behaviour and Emergence of the Universal Distribution from Resource-bounded Algorithmic Probability

    Full text link
    Previously referred to as `miraculous' in the scientific literature because of its powerful properties and its wide application as optimal solution to the problem of induction/inference, (approximations to) Algorithmic Probability (AP) and the associated Universal Distribution are (or should be) of the greatest importance in science. Here we investigate the emergence, the rates of emergence and convergence, and the Coding-theorem like behaviour of AP in Turing-subuniversal models of computation. We investigate empirical distributions of computing models in the Chomsky hierarchy. We introduce measures of algorithmic probability and algorithmic complexity based upon resource-bounded computation, in contrast to previously thoroughly investigated distributions produced from the output distribution of Turing machines. This approach allows for numerical approximations to algorithmic (Kolmogorov-Chaitin) complexity-based estimations at each of the levels of a computational hierarchy. We demonstrate that all these estimations are correlated in rank and that they converge both in rank and values as a function of computational power, despite fundamental differences between computational models. In the context of natural processes that operate below the Turing universal level because of finite resources and physical degradation, the investigation of natural biases stemming from algorithmic rules may shed light on the distribution of outcomes. We show that up to 60\% of the simplicity/complexity bias in distributions produced even by the weakest of the computational models can be accounted for by Algorithmic Probability in its approximation to the Universal Distribution.Comment: 27 pages main text, 39 pages including supplement. Online complexity calculator: http://complexitycalculator.com
    corecore