150 research outputs found

    Expected Number of Distinct Subsequences in Randomly Generated Binary Strings

    Full text link
    When considering binary strings, it's natural to wonder how many distinct subsequences might exist in a given string. Given that there is an existing algorithm which provides a straightforward way to compute the number of distinct subsequences in a fixed string, we might next be interested in the expected number of distinct subsequences in random strings. This expected value is already known for random binary strings where each letter in the string is, independently, equally likely to be a 1 or a 0. We generalize this result to random strings where the letter 1 appears independently with probability α∈[0,1]\alpha \in [0,1]. Also, we make some progress in the case of random strings from an arbitrary alphabet as well as when the string is generated by a two-state Markov chain.Comment: 10 page

    A Proof of Entropy Minimization for Outputs in Deletion Channels via Hidden Word Statistics

    Get PDF
    From the output produced by a memoryless deletion channel from a uniformly random input of known length nn, one obtains a posterior distribution on the channel input. The difference between the Shannon entropy of this distribution and that of the uniform prior measures the amount of information about the channel input which is conveyed by the output of length mm, and it is natural to ask for which outputs this is extremized. This question was posed in a previous work, where it was conjectured on the basis of experimental data that the entropy of the posterior is minimized and maximized by the constant strings 000…\texttt{000}\ldots and 111…\texttt{111}\ldots and the alternating strings 0101…\texttt{0101}\ldots and 1010…\texttt{1010}\ldots respectively. In the present work we confirm the minimization conjecture in the asymptotic limit using results from hidden word statistics. We show how the analytic-combinatorial methods of Flajolet, Szpankowski and Vall\'ee for dealing with the hidden pattern matching problem can be applied to resolve the case of fixed output length and n→∞n\rightarrow\infty, by obtaining estimates for the entropy in terms of the moments of the posterior distribution and establishing its minimization via a measure of autocorrelation.Comment: 11 pages, 2 figure

    On the Complexity and Performance of Parsing with Derivatives

    Full text link
    Current algorithms for context-free parsing inflict a trade-off between ease of understanding, ease of implementation, theoretical complexity, and practical performance. No algorithm achieves all of these properties simultaneously. Might et al. (2011) introduced parsing with derivatives, which handles arbitrary context-free grammars while being both easy to understand and simple to implement. Despite much initial enthusiasm and a multitude of independent implementations, its worst-case complexity has never been proven to be better than exponential. In fact, high-level arguments claiming it is fundamentally exponential have been advanced and even accepted as part of the folklore. Performance ended up being sluggish in practice, and this sluggishness was taken as informal evidence of exponentiality. In this paper, we reexamine the performance of parsing with derivatives. We have discovered that it is not exponential but, in fact, cubic. Moreover, simple (though perhaps not obvious) modifications to the implementation by Might et al. (2011) lead to an implementation that is not only easy to understand but also highly performant in practice.Comment: 13 pages; 12 figures; implementation at http://bitbucket.org/ucombinator/parsing-with-derivatives/ ; published in PLDI '16, Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, June 13 - 17, 2016, Santa Barbara, CA, US

    n-Subword Complexity Measure of DNA Sequences

    Get PDF
    String complexity has many definitions: Kolmogorov complexity [30]; Lempel-Ziv complexity [14] [27]; Linguistic complexity [42], Subword complexity [10] etc. In this thesis we will consider the n-subword complexity studied in [2] and [13]. The n-subword complexity Pw(n) of a genomic sequence w was defined in [13] as the number of distinct factors (subwords) of length n that occur in w. In [2] a new measure called the n-subword deficit was defined as the difference between the number of subwords of length n of a genomic sequence w and of a random genomic sequence of the same length. This definition was applied to short sequences (2000 base pairs). In this thesis, we will expand this definition to be applied, in addition to short sequences, also to very long sequences (from 100 base pairs to 200,000 base pairs). The aim of our work is to answer the following questions: 1. Do biological sequences show an n-subword deficit, and is their n- subword deficit length dependent? 2. Is the n-subword deficit gene specific? 3. Is the n-subword deficit genome specific? Our results indicate that the answers to questions 1 — 3 appears to be Yes, No, and No respectively. Moreover, it was found that the insects Apis mellifera and Drosophila melanogaster have genomes with the lowest maximal n-subword deficit value among other genomes in all experiments that have been conducted

    Universal quantum information compression and degrees of prior knowledge

    Get PDF
    We describe a universal information compression scheme that compresses any pure quantum i.i.d. source asymptotically to its von Neumann entropy, with no prior knowledge of the structure of the source. We introduce a diagonalisation procedure that enables any classical compression algorithm to be utilised in a quantum context. Our scheme is then based on the corresponding quantum translation of the classical Lempel-Ziv algorithm. Our methods lead to a conceptually simple way of estimating the entropy of a source in terms of the measurement of an associated length parameter while maintaining high fidelity for long blocks. As a by-product we also estimate the eigenbasis of the source. Since our scheme is based on the Lempel-Ziv method, it can be applied also to target sequences that are not i.i.d.Comment: 17 pages, no figures. A preliminary version of this work was presented at EQIS '02, Tokyo, September 200
    • …
    corecore