Expected Number of Distinct Subsequences in Randomly Generated Binary Strings
When considering binary strings, it is natural to ask how many distinct
subsequences a given string contains. Since an existing algorithm provides a
straightforward way to compute the number of distinct subsequences in a fixed
string, we might next be interested in the expected number of distinct
subsequences in random strings. This expected value is already known for random
binary strings in which each letter is, independently, equally likely to be a 1
or a 0. We generalize this result to random strings in which the letter 1
appears independently with probability q. We also make some progress in the
case of random strings over an arbitrary alphabet, as well as when the string
is generated by a two-state Markov chain.
Comment: 10 pages
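The "existing algorithm" for a fixed string is presumably the standard dynamic program that counts distinct subsequences in linear time. A minimal sketch (the function name and the convention of including the empty subsequence are our choices, not necessarily the paper's):

```python
def count_distinct_subsequences(s: str) -> int:
    # total = number of distinct subsequences of the prefix read so far,
    # including the empty subsequence
    total = 1
    last = {}  # last[c] = value of total just before c's previous occurrence
    for ch in s:
        prev = total
        # appending ch to every subsequence doubles the count, but
        # subsequences already created at ch's previous occurrence
        # would be double-counted, so subtract them
        total = 2 * total - last.get(ch, 0)
        last[ch] = prev
    return total
```

For example, "abab" has 12 distinct subsequences counting the empty one: "", a, b, aa, ab, ba, bb, aab, aba, abb, bab, abab. Averaging this quantity over random strings gives the expectation the abstract studies.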
A Proof of Entropy Minimization for Outputs in Deletion Channels via Hidden Word Statistics
From the output produced by a memoryless deletion channel acting on a
uniformly random input of known length n, one obtains a posterior distribution
on the channel input. The difference between the Shannon entropy of this
distribution and that of the uniform prior measures the amount of information
about the channel input conveyed by an output of length m, and it is natural
to ask for which outputs this quantity is extremized. This question was posed
in a previous work, where it was conjectured on the basis of experimental data
that the entropy of the posterior is minimized by the constant strings 000...
and 111... and maximized by the alternating strings 0101... and 1010... In the
present work we confirm the minimization conjecture in the asymptotic limit
using results from hidden word statistics. We show how the
analytic-combinatorial methods of Flajolet, Szpankowski and Vallée for the
hidden pattern matching problem can be applied to resolve the case of fixed
output length and n → ∞, by obtaining estimates for the entropy in terms of
the moments of the posterior distribution and establishing its minimization
via a measure of autocorrelation.
Comment: 11 pages, 2 figures
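For small lengths the conjectured extremizers can be checked by brute force. With a uniform prior on binary inputs of length n, the posterior probability of an input x given an output y is proportional to the number of embeddings of y as a subsequence of x (the per-symbol deletion probabilities are the same for every input of length n and cancel). A sketch under that setup (function names are ours):

```python
from itertools import product
from math import log2

def embeddings(y: str, x: str) -> int:
    # number of ways y occurs as a subsequence of x, by dynamic programming:
    # dp[j] = number of embeddings of y[:j] in the prefix of x read so far
    dp = [1] + [0] * len(y)
    for ch in x:
        for j in range(len(y), 0, -1):  # descending so each ch is used once
            if y[j - 1] == ch:
                dp[j] += dp[j - 1]
    return dp[len(y)]

def posterior_entropy(y: str, n: int) -> float:
    # Shannon entropy (bits) of the posterior over binary inputs of
    # length n, given output y; weights are the embedding counts
    weights = [embeddings(y, "".join(x)) for x in product("01", repeat=n)]
    total = sum(weights)
    return -sum(w / total * log2(w / total) for w in weights if w > 0)
```

For n = 3 and outputs of length 2, the constant output "00" already yields a strictly lower posterior entropy than the alternating output "01", consistent with the minimization conjecture.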
On the Complexity and Performance of Parsing with Derivatives
Current algorithms for context-free parsing impose a trade-off between ease
of understanding, ease of implementation, theoretical complexity, and practical
performance. No algorithm achieves all of these properties simultaneously.
Might et al. (2011) introduced parsing with derivatives, which handles
arbitrary context-free grammars while being both easy to understand and simple
to implement. Despite much initial enthusiasm and a multitude of independent
implementations, its worst-case complexity has never been proven to be better
than exponential. In fact, high-level arguments claiming it is fundamentally
exponential have been advanced and even accepted as part of the folklore.
Performance ended up being sluggish in practice, and this sluggishness was
taken as informal evidence of exponentiality.
In this paper, we reexamine the performance of parsing with derivatives. We
have discovered that it is not exponential but, in fact, cubic. Moreover,
simple (though perhaps not obvious) modifications to the implementation by
Might et al. (2011) lead to an implementation that is not only easy to
understand but also highly performant in practice.
Comment: 13 pages; 12 figures; implementation at
http://bitbucket.org/ucombinator/parsing-with-derivatives/ ; published in
PLDI '16, Proceedings of the 37th ACM SIGPLAN Conference on Programming
Language Design and Implementation, June 13-17, 2016, Santa Barbara, CA,
USA
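For intuition, here is the regular-language core of the idea, a Brzozowski-derivative matcher. Parsing with derivatives generalizes this to arbitrary context-free grammars, which additionally requires laziness, memoization, and least-fixed-point computations that this sketch omits (the class and function names are our own):

```python
from dataclasses import dataclass

class Empty: pass   # the empty language: matches nothing
class Eps: pass     # matches only the empty string

@dataclass
class Char:
    c: str

@dataclass
class Alt:          # union of two languages
    l: object
    r: object

@dataclass
class Seq:          # concatenation of two languages
    l: object
    r: object

@dataclass
class Star:         # Kleene star
    r: object

def nullable(r) -> bool:
    # does r accept the empty string?
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, Alt):
        return nullable(r.l) or nullable(r.r)
    if isinstance(r, Seq):
        return nullable(r.l) and nullable(r.r)
    return False  # Empty, Char

def derive(r, c):
    # the derivative of r with respect to c: the language { w : c·w in L(r) }
    if isinstance(r, Char):
        return Eps() if r.c == c else Empty()
    if isinstance(r, Alt):
        return Alt(derive(r.l, c), derive(r.r, c))
    if isinstance(r, Seq):
        head = Seq(derive(r.l, c), r.r)
        # if the left part can match empty, c may also begin in the right part
        return Alt(head, derive(r.r, c)) if nullable(r.l) else head
    if isinstance(r, Star):
        return Seq(derive(r.r, c), r)
    return Empty()  # Empty, Eps

def matches(r, s: str) -> bool:
    # differentiate by each character in turn, then test nullability
    for c in s:
        r = derive(r, c)
    return nullable(r)
```

Naively, the derivative terms grow with each step; the paper's cubic bound for the context-free case rests on memoizing and compacting these structures rather than rebuilding them.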
n-Subword Complexity Measure of DNA Sequences
String complexity has many definitions: Kolmogorov complexity [30], Lempel-Ziv complexity [14][27], linguistic complexity [42], subword complexity [10], etc. In this thesis we consider the n-subword complexity studied in [2] and [13]. The n-subword complexity Pw(n) of a genomic sequence w was defined in [13] as the number of distinct factors (subwords) of length n that occur in w. In [2], a new measure called the n-subword deficit was defined as the difference between the number of subwords of length n of a genomic sequence w and that of a random genomic sequence of the same length. That definition was applied to short sequences (2,000 base pairs). In this thesis, we extend the definition to apply not only to short sequences but also to very long ones (from 100 base pairs to 200,000 base pairs). The aim of our work is to answer the following questions: 1. Do biological sequences show an n-subword deficit, and is their n-subword deficit length dependent? 2. Is the n-subword deficit gene specific? 3. Is the n-subword deficit genome specific? Our results indicate that the answers to questions 1-3 appear to be Yes, No, and No, respectively. Moreover, we found that the insects Apis mellifera and Drosophila melanogaster have genomes with the lowest maximal n-subword deficit values among all genomes in all experiments conducted.
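The n-subword complexity is straightforward to compute with a sliding window, and the deficit can be approximated by comparing against random sequences. A sketch (the Monte Carlo estimator is our illustration; the thesis's exact definition of the deficit may differ in detail):

```python
import random

def subword_complexity(w: str, n: int) -> int:
    # P_w(n): number of distinct factors (contiguous subwords)
    # of length n occurring in w
    return len({w[i:i + n] for i in range(len(w) - n + 1)})

def subword_deficit(w: str, n: int, trials: int = 20, seed: int = 0) -> float:
    # Monte Carlo approximation: average P(n) over random sequences of the
    # same length and alphabet as w, minus P_w(n). A repetitive biological
    # sequence has fewer distinct subwords than random, hence a positive deficit.
    rng = random.Random(seed)
    alphabet = sorted(set(w))
    rand_avg = sum(
        subword_complexity("".join(rng.choices(alphabet, k=len(w))), n)
        for _ in range(trials)
    ) / trials
    return rand_avg - subword_complexity(w, n)
```

For example, the highly repetitive sequence "ACAC...AC" has only two distinct 3-subwords (ACA and CAC), far below a random sequence over {A, C} of the same length, so its deficit is positive.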
Universal quantum information compression and degrees of prior knowledge
We describe a universal information compression scheme that compresses any
pure quantum i.i.d. source asymptotically to its von Neumann entropy, with no
prior knowledge of the structure of the source. We introduce a diagonalisation
procedure that enables any classical compression algorithm to be utilised in a
quantum context. Our scheme is then based on the corresponding quantum
translation of the classical Lempel-Ziv algorithm. Our methods lead to a
conceptually simple way of estimating the entropy of a source in terms of the
measurement of an associated length parameter while maintaining high fidelity
for long blocks. As a by-product we also estimate the eigenbasis of the source.
Since our scheme is based on the Lempel-Ziv method, it can be applied also to
target sequences that are not i.i.d.
Comment: 17 pages, no figures. A preliminary version of this work was
presented at EQIS '02, Tokyo, September 2002
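The classical algorithm underlying the scheme can be illustrated by LZ78 phrase parsing, where the input is split into incremental phrases and each phrase is emitted as (index of its longest known prefix, next symbol); the growth of the phrase count with input length is the "length parameter" from which an entropy estimate can be read off. A generic textbook sketch, not the paper's exact quantum variant:

```python
def lz78_compress(s: str):
    # parse s into phrases; emit (dictionary index, next symbol) pairs,
    # where index 0 denotes the empty phrase
    dictionary = {"": 0}
    phrase, out = "", []
    for ch in s:
        if phrase + ch in dictionary:
            phrase += ch  # extend the current phrase while it is known
        else:
            out.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:  # flush a trailing phrase that is already in the dictionary
        out.append((dictionary[phrase[:-1]], phrase[-1]))
    return out

def lz78_decompress(pairs):
    # rebuild the phrase dictionary in the same order the compressor did
    phrases = [""]
    out = []
    for idx, ch in pairs:
        p = phrases[idx] + ch
        phrases.append(p)
        out.append(p)
    return "".join(out)
```

For a stationary ergodic source, the number of LZ78 phrases c(N) in an input of length N satisfies c(N) log c(N) / N → entropy rate, which is the kind of length-parameter measurement the quantum translation exploits.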