Expected Number of Distinct Subsequences in Randomly Generated Binary Strings
When considering binary strings, it is natural to ask how many distinct
subsequences a given string contains. Since an existing algorithm provides a
straightforward way to compute the number of distinct subsequences in a fixed
string, we might next be interested in the expected number of distinct
subsequences in random strings. This expected value is already known for random
binary strings in which each letter is, independently, equally likely to be a 1
or a 0. We generalize this result to random strings in which the letter 1
appears independently with probability q. We also make some progress in the
case of random strings over an arbitrary alphabet, as well as when the string
is generated by a two-state Markov chain.
Comment: 10 pages
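The "existing algorithm" for a fixed string is presumably the standard dynamic program that counts distinct subsequences in linear time. A minimal sketch (the function name and the convention of including the empty subsequence are our choices, not necessarily the paper's):

```python
def count_distinct_subsequences(s: str) -> int:
    # total = number of distinct subsequences of the prefix read so far,
    # including the empty subsequence
    total = 1
    last = {}  # last[c] = value of total just before c's previous occurrence
    for ch in s:
        prev = total
        # appending ch to every subsequence doubles the count, but
        # subsequences already created at ch's previous occurrence
        # would be double-counted, so subtract them
        total = 2 * total - last.get(ch, 0)
        last[ch] = prev
    return total
```

For example, "abab" has 12 distinct subsequences counting the empty one: "", a, b, aa, ab, ba, bb, aab, aba, abb, bab, abab. Averaging this quantity over random strings gives the expectation the abstract studies.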
A Proof of Entropy Minimization for Outputs in Deletion Channels via Hidden Word Statistics
From the output produced by a memoryless deletion channel acting on a
uniformly random input of known length n, one obtains a posterior distribution
on the channel input. The difference between the Shannon entropy of this
distribution and that of the uniform prior measures the amount of information
about the channel input conveyed by an output of length m, and it is natural
to ask for which outputs this quantity is extremized. This question was posed
in a previous work, where it was conjectured on the basis of experimental data
that the entropy of the posterior is minimized by the constant strings 000...
and 111... and maximized by the alternating strings 0101... and 1010... In the
present work we confirm the minimization conjecture in the asymptotic limit
using results from hidden word statistics. We show how the
analytic-combinatorial methods of Flajolet, Szpankowski and Vallée for the
hidden pattern matching problem can be applied to resolve the case of fixed
output length and n → ∞, by obtaining estimates for the entropy in terms of
the moments of the posterior distribution and establishing its minimization
via a measure of autocorrelation.
Comment: 11 pages, 2 figures
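For small lengths the conjectured extremizers can be checked by brute force. With a uniform prior on binary inputs of length n, the posterior probability of an input x given an output y is proportional to the number of embeddings of y as a subsequence of x (the per-symbol deletion probabilities are the same for every input of length n and cancel). A sketch under that setup (function names are ours):

```python
from itertools import product
from math import log2

def embeddings(y: str, x: str) -> int:
    # number of ways y occurs as a subsequence of x, by dynamic programming:
    # dp[j] = number of embeddings of y[:j] in the prefix of x read so far
    dp = [1] + [0] * len(y)
    for ch in x:
        for j in range(len(y), 0, -1):  # descending so each ch is used once
            if y[j - 1] == ch:
                dp[j] += dp[j - 1]
    return dp[len(y)]

def posterior_entropy(y: str, n: int) -> float:
    # Shannon entropy (bits) of the posterior over binary inputs of
    # length n, given output y; weights are the embedding counts
    weights = [embeddings(y, "".join(x)) for x in product("01", repeat=n)]
    total = sum(weights)
    return -sum(w / total * log2(w / total) for w in weights if w > 0)
```

For n = 3 and outputs of length 2, the constant output "00" already yields a strictly lower posterior entropy than the alternating output "01", consistent with the minimization conjecture.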
On the Complexity and Performance of Parsing with Derivatives
Current algorithms for context-free parsing impose a trade-off between ease
of understanding, ease of implementation, theoretical complexity, and practical
performance. No algorithm achieves all of these properties simultaneously.
Might et al. (2011) introduced parsing with derivatives, which handles
arbitrary context-free grammars while being both easy to understand and simple
to implement. Despite much initial enthusiasm and a multitude of independent
implementations, its worst-case complexity has never been proven to be better
than exponential. In fact, high-level arguments claiming it is fundamentally
exponential have been advanced and even accepted as part of the folklore.
Performance ended up being sluggish in practice, and this sluggishness was
taken as informal evidence of exponentiality.
In this paper, we reexamine the performance of parsing with derivatives. We
have discovered that it is not exponential but, in fact, cubic. Moreover,
simple (though perhaps not obvious) modifications to the implementation by
Might et al. (2011) lead to an implementation that is not only easy to
understand but also highly performant in practice.
Comment: 13 pages; 12 figures; implementation at
http://bitbucket.org/ucombinator/parsing-with-derivatives/ ; published in
PLDI '16, Proceedings of the 37th ACM SIGPLAN Conference on Programming
Language Design and Implementation, June 13-17, 2016, Santa Barbara, CA,
USA
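For intuition, here is the regular-language core of the idea, a Brzozowski-derivative matcher. Parsing with derivatives generalizes this to arbitrary context-free grammars, which additionally requires laziness, memoization, and least-fixed-point computations that this sketch omits (the class and function names are our own):

```python
from dataclasses import dataclass

class Empty: pass   # the empty language: matches nothing
class Eps: pass     # matches only the empty string

@dataclass
class Char:
    c: str

@dataclass
class Alt:          # union of two languages
    l: object
    r: object

@dataclass
class Seq:          # concatenation of two languages
    l: object
    r: object

@dataclass
class Star:         # Kleene star
    r: object

def nullable(r) -> bool:
    # does r accept the empty string?
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, Alt):
        return nullable(r.l) or nullable(r.r)
    if isinstance(r, Seq):
        return nullable(r.l) and nullable(r.r)
    return False  # Empty, Char

def derive(r, c):
    # the derivative of r with respect to c: the language { w : c·w in L(r) }
    if isinstance(r, Char):
        return Eps() if r.c == c else Empty()
    if isinstance(r, Alt):
        return Alt(derive(r.l, c), derive(r.r, c))
    if isinstance(r, Seq):
        head = Seq(derive(r.l, c), r.r)
        # if the left part can match empty, c may also begin in the right part
        return Alt(head, derive(r.r, c)) if nullable(r.l) else head
    if isinstance(r, Star):
        return Seq(derive(r.r, c), r)
    return Empty()  # Empty, Eps

def matches(r, s: str) -> bool:
    # differentiate by each character in turn, then test nullability
    for c in s:
        r = derive(r, c)
    return nullable(r)
```

Naively, the derivative terms grow with each step; the paper's cubic bound for the context-free case rests on memoizing and compacting these structures rather than rebuilding them.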
n-Subword Complexity Measure of DNA Sequences
String complexity has many definitions: Kolmogorov complexity [30], Lempel-Ziv complexity [14][27], linguistic complexity [42], subword complexity [10], etc. In this thesis we consider the n-subword complexity studied in [2] and [13]. The n-subword complexity Pw(n) of a genomic sequence w was defined in [13] as the number of distinct factors (subwords) of length n that occur in w. In [2], a new measure called the n-subword deficit was defined as the difference between the number of subwords of length n of a genomic sequence w and that of a random genomic sequence of the same length. That definition was applied to short sequences (2,000 base pairs). In this thesis, we extend the definition to apply not only to short sequences but also to very long ones (from 100 base pairs to 200,000 base pairs). The aim of our work is to answer the following questions: 1. Do biological sequences show an n-subword deficit, and is their n-subword deficit length dependent? 2. Is the n-subword deficit gene specific? 3. Is the n-subword deficit genome specific? Our results indicate that the answers to questions 1-3 appear to be Yes, No, and No, respectively. Moreover, we found that the insects Apis mellifera and Drosophila melanogaster have genomes with the lowest maximal n-subword deficit values among all genomes in all experiments conducted.
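The n-subword complexity is straightforward to compute with a sliding window, and the deficit can be approximated by comparing against random sequences. A sketch (the Monte Carlo estimator is our illustration; the thesis's exact definition of the deficit may differ in detail):

```python
import random

def subword_complexity(w: str, n: int) -> int:
    # P_w(n): number of distinct factors (contiguous subwords)
    # of length n occurring in w
    return len({w[i:i + n] for i in range(len(w) - n + 1)})

def subword_deficit(w: str, n: int, trials: int = 20, seed: int = 0) -> float:
    # Monte Carlo approximation: average P(n) over random sequences of the
    # same length and alphabet as w, minus P_w(n). A repetitive biological
    # sequence has fewer distinct subwords than random, hence a positive deficit.
    rng = random.Random(seed)
    alphabet = sorted(set(w))
    rand_avg = sum(
        subword_complexity("".join(rng.choices(alphabet, k=len(w))), n)
        for _ in range(trials)
    ) / trials
    return rand_avg - subword_complexity(w, n)
```

For example, the highly repetitive sequence "ACAC...AC" has only two distinct 3-subwords (ACA and CAC), far below a random sequence over {A, C} of the same length, so its deficit is positive.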
Universal quantum information compression and degrees of prior knowledge
We describe a universal information compression scheme that compresses any
pure quantum i.i.d. source asymptotically to its von Neumann entropy, with no
prior knowledge of the structure of the source. We introduce a diagonalisation
procedure that enables any classical compression algorithm to be utilised in a
quantum context. Our scheme is then based on the corresponding quantum
translation of the classical Lempel-Ziv algorithm. Our methods lead to a
conceptually simple way of estimating the entropy of a source in terms of the
measurement of an associated length parameter while maintaining high fidelity
for long blocks. As a by-product we also estimate the eigenbasis of the source.
Since our scheme is based on the Lempel-Ziv method, it can be applied also to
target sequences that are not i.i.d.
Comment: 17 pages, no figures. A preliminary version of this work was
presented at EQIS '02, Tokyo, September 2002
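The classical algorithm underlying the scheme can be illustrated by LZ78 phrase parsing, where the input is split into incremental phrases and each phrase is emitted as (index of its longest known prefix, next symbol); the growth of the phrase count with input length is the "length parameter" from which an entropy estimate can be read off. A generic textbook sketch, not the paper's exact quantum variant:

```python
def lz78_compress(s: str):
    # parse s into phrases; emit (dictionary index, next symbol) pairs,
    # where index 0 denotes the empty phrase
    dictionary = {"": 0}
    phrase, out = "", []
    for ch in s:
        if phrase + ch in dictionary:
            phrase += ch  # extend the current phrase while it is known
        else:
            out.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:  # flush a trailing phrase that is already in the dictionary
        out.append((dictionary[phrase[:-1]], phrase[-1]))
    return out

def lz78_decompress(pairs):
    # rebuild the phrase dictionary in the same order the compressor did
    phrases = [""]
    out = []
    for idx, ch in pairs:
        p = phrases[idx] + ch
        phrases.append(p)
        out.append(p)
    return "".join(out)
```

For a stationary ergodic source, the number of LZ78 phrases c(N) in an input of length N satisfies c(N) log c(N) / N → entropy rate, which is the kind of length-parameter measurement the quantum translation exploits.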