The Expected Variation of Random Bounded Integer Sequences of Finite Length
From the enumerative generating function of an abstract adjacency statistic, we deduce the mean and variance of the variation on random permutations, rearrangements, compositions, and bounded integer sequences of finite length.
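As a rough illustration of the quantity studied, the sketch below assumes that the "variation" of a sequence is the sum of absolute differences between adjacent entries, and that a random bounded integer sequence has independent entries uniform on {1, ..., m}. Both are assumptions, not statements taken from the paper. Under them, linearity of expectation gives a mean of (n - 1)(m^2 - 1)/(3m), which the code checks against brute-force enumeration. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def variation(seq):
    """Sum of absolute differences between adjacent entries (assumed definition)."""
    return sum(abs(b - a) for a, b in zip(seq, seq[1:]))


def mean_variation_bruteforce(n, m):
    """Exact mean variation over all m**n sequences of length n on {1, ..., m}."""
    total = sum(variation(s) for s in product(range(1, m + 1), repeat=n))
    return Fraction(total, m ** n)


def mean_variation_closed_form(n, m):
    """(n - 1) * E|X - Y| for X, Y independent and uniform on {1, ..., m}."""
    return Fraction((n - 1) * (m * m - 1), 3 * m)
```

The brute-force version is only practical for small n and m; it is there to sanity-check the closed form, not to replace the generating-function approach of the abstract.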
Universal lossless source coding with the Burrows Wheeler transform
The Burrows Wheeler transform (1994) is a reversible sequence transformation used in a variety of practical lossless source-coding algorithms. In each, the BWT is followed by a lossless source code that attempts to exploit the natural ordering of the BWT coefficients. BWT-based compression schemes are widely touted as low-complexity algorithms giving lossless coding rates better than those of the Ziv-Lempel codes (commonly known as LZ'77 and LZ'78) and almost as good as those achieved by prediction by partial matching (PPM) algorithms. To date, the coding performance claims have been made primarily on the basis of experimental results. This work gives a theoretical evaluation of BWT-based coding. The main results of this theoretical evaluation include: (1) statistical characterizations of the BWT output on both finite strings and sequences of length n → ∞, (2) a variety of very simple new techniques for BWT-based lossless source coding, and (3) proofs of the universality and bounds on the rates of convergence of both new and existing BWT-based codes for finite-memory and stationary ergodic sources. The end result is a theoretical justification and validation of the experimentally derived conclusions: BWT-based lossless source codes achieve universal lossless coding performance that converges to the optimal coding performance more quickly than the rate of convergence observed in Ziv-Lempel style codes and, for some BWT-based codes, within a constant factor of the optimal rate of convergence for finite-memory sources.
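To make the reversibility of the transform concrete, here is a minimal sketch of the textbook rotation-sorting BWT and its naive inverse. It is not one of the codes analysed in the paper, and it is far slower than the suffix-array constructions used in practice. The sentinel character `$` and the function names are assumptions made for illustration.

```python
def bwt(text, sentinel="$"):
    """Forward transform: sort all rotations of text + sentinel, keep the last column."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, sentinel="$"):
    """Naive inverse: rebuild the sorted rotation table one column at a time."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        # Prepending the last column and re-sorting recovers one more column.
        table = sorted(last_column[i] + table[i] for i in range(n))
    # The rotation that ends with the sentinel is the original string.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]
```

For example, `bwt("banana")` groups equal symbols together in the output, which is the ordering a subsequent lossless code tries to exploit, and `inverse_bwt` recovers the input exactly.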
Poisson approximation for search of rare words in DNA sequences
Using recent results on the occurrence times of a string of symbols in a
stochastic process with mixing properties, we present a new method for the
search of rare words in biological sequences generally modelled by a Markov
chain. We obtain a bound on the error between the distribution of the number of
occurrences of a word in a sequence (under a Markov model) and its Poisson
approximation. A global bound is already given by the Chen-Stein method. Our
approach, the psi-mixing method, gives local bounds. Since we only need the
error in the tails of the distribution, the global uniform bound of Chen-Stein is
too large, and it is better to consider local bounds. We search for two
thresholds on the number of occurrences from which we can regard the studied
word as an over-represented or an under-represented one. A biological role is
suggested for these over- or under-represented words. Our method gives such
thresholds for a panel of words much broader than the Chen-Stein method.
Comparing the methods, we observe a better accuracy for the psi-mixing method
for the bound of the tails of distribution. We also present the software PANOW
(available at http://stat.genopole.cnrs.fr/software/panowdir/) dedicated to the
computation of the error term and the thresholds for a studied word.
Comment: 29 pages, 0 figures
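The idea of two thresholds on the number of occurrences can be sketched with plain Poisson tails. The code below is not PANOW and implements neither the psi-mixing nor the Chen-Stein error bound; it ignores the approximation error entirely. The expected count `lam` (which would come from the Markov model) and the tail level `alpha` are hypothetical inputs, as are the function names.

```python
import math


def poisson_pmf(k, lam):
    """P(N = k) for N ~ Poisson(lam), computed in log space for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def poisson_thresholds(lam, alpha=0.01):
    """Return (lower, upper) occurrence thresholds under a pure Poisson model.

    upper: smallest k with P(N >= k) <= alpha (word looks over-represented).
    lower: largest k with P(N <= k) <= alpha, or None if no such k exists
           (word looks under-represented).
    """
    if lam <= 0 or not 0 < alpha < 0.5:
        raise ValueError("need lam > 0 and 0 < alpha < 0.5")
    lower = None
    cdf = 0.0  # invariant at loop top: cdf == P(N <= k - 1)
    k = 0
    while True:
        if 1.0 - cdf <= alpha:
            return lower, k
        cdf += poisson_pmf(k, lam)
        if cdf <= alpha:
            lower = k
        k += 1
```

A word observed at least `upper` times, or at most `lower` times, would be flagged under this simplified model. In the method described above, the thresholds would additionally account for the bound on the error of the Poisson approximation in the tails.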
About Adaptive Coding on Countable Alphabets: Max-Stable Envelope Classes
In this paper, we study the problem of lossless universal source coding for
stationary memoryless sources on countably infinite alphabets. This task is
generally not achievable without restricting the class of sources over which
universality is desired. Building on our prior work, we propose natural
families of sources characterized by a common dominating envelope. We
particularly emphasize the notion of adaptivity, which is the ability to
perform as well as an oracle knowing the envelope, without actually knowing it.
This is closely related to the notion of hierarchical universal source coding,
but with the important difference that families of envelope classes are not
discretely indexed and not necessarily nested.
Our contribution is to extend the classes of envelopes over which adaptive
universal source coding is possible, namely by including max-stable
(heavy-tailed) envelopes which are excellent models in many applications, such
as natural language modeling. We derive a minimax lower bound on the redundancy
of any code on such envelope classes, including an oracle that knows the
envelope. We then propose a constructive code that does not use knowledge of
the envelope. The code is computationally efficient and is structured to use an
Expanding Threshold for Auto-Censoring, and we therefore dub it the
ETAC-code. We prove that the ETAC-code achieves the lower
bound on the minimax redundancy within a factor logarithmic in the sequence
length, and can therefore be qualified as a near-adaptive code over families of
heavy-tailed envelopes. For finite and light-tailed envelopes the penalty is
even less, and the same code follows closely previous results that explicitly
made the light-tailed assumption. Our technical results are founded on methods
from regular variation theory and concentration of measure.
Optimal Berry-Esseen rates on the Wiener space: the barrier of third and fourth cumulants
Let {F_n} be a normalized sequence of random variables in some fixed Wiener
chaos associated with a general Gaussian field, and assume that E[F_n^4] -->
E[N^4]=3, where N is a standard Gaussian random variable. Our main result is
the following general bound: there exist two finite constants c,C>0 such that,
for n sufficiently large, c max(|E[F_n^3]|, E[F_n^4]-3) < d(F_n,N) < C
max(|E[F_n^3]|, E[F_n^4]-3), where d(F_n,N) = sup |E[h(F_n)] - E[h(N)]|, and h
runs over the class of all real functions with a second derivative bounded by
1. This shows that the deterministic sequence max(|E[F_n^3]|, E[F_n^4]-3)
completely characterizes the rate of convergence (with respect to smooth
distances) in CLTs involving chaotic random variables. These results are used
to determine optimal rates of convergence in the Breuer-Major central limit
theorem, with specific emphasis on fractional Gaussian noise.
Comment: 29 pages
About adaptive coding on countable alphabets
This paper sheds light on universal coding with respect to classes of
memoryless sources over a countable alphabet defined by an envelope function
with finite and non-decreasing hazard rate. We prove that the auto-censuring AC
code introduced by Bontemps (2011) is adaptive with respect to the collection
of such classes. The analysis builds on the tight characterization of universal
redundancy rate in terms of metric entropy of small source classes by Opper
and Haussler (1997) and on a careful analysis of the performance of the
AC-coding algorithm. The latter relies on non-asymptotic bounds for maxima of
samples from discrete distributions with finite and non-decreasing hazard rate.
Necessary and sufficient conditions for consistent root reconstruction in Markov models on trees
We establish necessary and sufficient conditions for consistent root
reconstruction in continuous-time Markov models with countable state space on
bounded-height trees. Here a root state estimator is said to be consistent if
the probability that it returns the true root state converges to 1 as the
number of leaves tends to infinity. We also derive quantitative bounds on the
error of reconstruction. Our results answer a question of Gascuel and Steel and
have implications for ancestral sequence reconstruction in a classical
evolutionary model of nucleotide insertion and deletion.
Comment: 30 pages, 3 figures, title of reference [FR] is updated