Identifying statistical dependence in genomic sequences via mutual information estimates
Questions of understanding and quantifying the representation and amount of
information in organisms have become a central part of biological research, as
they potentially hold the key to fundamental advances. In this paper, we
demonstrate the use of information-theoretic tools for the task of identifying
segments of biomolecules (DNA or RNA) that are statistically correlated. We
develop a precise and reliable methodology, based on the notion of mutual
information, for finding and extracting statistical as well as structural
dependencies. A simple threshold function is defined, and its use in
quantifying the level of significance of dependencies between biological
segments is explored. These tools are used in two specific applications. First,
for the identification of correlations between different parts of the maize
zmSRp32 gene. There, we find significant dependencies between the 5'
untranslated region in zmSRp32 and its alternatively spliced exons. This
observation may indicate the presence of as-yet unknown alternative splicing
mechanisms or structural scaffolds. Second, using data from the FBI's Combined
DNA Index System (CODIS), we demonstrate that our approach is particularly well
suited for the problem of discovering short tandem repeats, an application of
importance in genetic profiling.
Comment: Preliminary version. Final version in EURASIP Journal on Bioinformatics and Systems Biology. See http://www.hindawi.com/journals/bsb
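As a rough illustration of the mutual information machinery described above (not the paper's exact estimator or threshold function), a plug-in estimate over paired symbols can be sketched as follows; the example segments are invented for the demo:

```python
# Sketch: plug-in (empirical) mutual information, in bits, between two
# aligned DNA segments, as one way to quantify statistical dependence.
# The segment strings are illustrative, not taken from the paper.
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))  # joint symbol counts
    px = Counter(xs)            # marginal counts
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in pxy.items():
        mi += (c / n) * log2((c * n) / (px[x] * py[y]))
    return mi

seg_a = "ACGTACGTACGT"
seg_b = "ACGTACGTACGT"   # identical segment -> maximal dependence
seg_c = "GGGGCCCCAAAA"   # segment with no positional relation to seg_a

print(mutual_information(seg_a, seg_b))  # 2.0 bits (uniform 4-letter alphabet)
print(mutual_information(seg_a, seg_c))  # 0.0 bits
```

A significance threshold, as in the paper, would then be compared against such estimates to decide whether an observed dependence is meaningful rather than sampling noise.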
Structural Complexity of Random Binary Trees
Abstract — For each positive integer n, let Tn be a random rooted binary tree having finitely many vertices and exactly n leaves. We can view H(Tn), the entropy of Tn, as a measure of the structural complexity of the tree Tn in the sense that approximately H(Tn) bits suffice to construct Tn. We are interested in determining conditions on the sequence (Tn : n = 1, 2, ...) under which H(Tn)/n converges to a limit as n → ∞. We exhibit some of our progress on the way to the solution of this problem.
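For intuition, one concrete special case (an illustrative choice, not the general sequences studied in the paper) is Tn drawn uniformly from all rooted binary trees with n leaves in which every internal node has exactly two children. There are Catalan(n-1) such trees, so H(Tn) = log2(Catalan(n-1)) bits, and H(Tn)/n approaches 2:

```python
# Sketch: entropy rate H(T_n)/n for T_n uniform over the Catalan(n-1)
# rooted binary trees with n leaves; the limit is 2 bits per leaf.
from math import comb, log2

def catalan(k):
    """k-th Catalan number, C(2k, k) / (k + 1)."""
    return comb(2 * k, k) // (k + 1)

def entropy_rate(n):
    """H(T_n)/n in bits for the uniform distribution on n-leaf trees."""
    return log2(catalan(n - 1)) / n

for n in (10, 100, 1000):
    print(n, entropy_rate(n))  # approaches 2 as n grows
```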
Data driven consistency (working title)
We are motivated by applications that need rich model classes to represent
them. Examples of rich model classes include distributions over large,
countably infinite supports, slow mixing Markov processes, etc. But such rich
classes may be too complex to admit estimators that converge to the truth with
convergence rates that can be uniformly bounded over the entire model class as
the sample size increases (uniform consistency). However, these rich classes
may still allow for estimators with pointwise guarantees whose performance can
be bounded in a model dependent way. The pointwise angle, of course, has the
drawback that the estimator's performance is a function of the very unknown
model being estimated, and is therefore itself unknown: even if the estimator
is consistent, how well it is doing may not be clear no matter what the sample
size is. Departing from the dichotomy of uniform and pointwise consistency, a
new analysis framework is explored by characterizing rich model classes that
may only admit pointwise guarantees, yet in which all the information about
the model needed to gauge estimator accuracy can be inferred from the sample
at hand. To retain focus, we analyze the universal compression problem in this
data driven pointwise consistency framework.
Comment: Working paper. Please email authors for the current version
Low Space External Memory Construction of the Succinct Permuted Longest Common Prefix Array
The longest common prefix (LCP) array is a versatile auxiliary data structure
in indexed string matching. It can be used to speed up searching using the
suffix array (SA) and provides an implicit representation of the topology of an
underlying suffix tree. The LCP array of a string of length n can be
represented as an array of n words, or, in the presence of the SA, as
a bit vector of 2n bits plus asymptotically negligible support data
structures. External memory construction algorithms for the LCP array have been
proposed, but those proposed so far have a space requirement of O(n) words
(i.e. O(n log n) bits) in external memory. This space requirement is in some
practical cases prohibitively expensive. We present an external memory
algorithm for constructing the 2n bit version of the LCP array which uses
O(n log σ) bits of additional space in external memory when given a
(compressed) BWT with alphabet size σ and a sampled inverse suffix array
at sampling rate O(log n). This is often a significant space gain in
practice where σ is usually much smaller than n or even constant. We
also consider the case of computing succinct LCP arrays for circular strings.
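To make concrete what the LCP array contains, here is the standard in-memory Kasai algorithm for building it from a string and its suffix array; this is only an illustration of the data structure itself, not the external memory construction of the paper, which works quite differently:

```python
# Sketch: Kasai's algorithm computes the LCP array in O(n) time given
# the string and its suffix array. The naive suffix_array below is
# O(n^2 log n) and is included only to keep the demo self-contained.
def suffix_array(s):
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s, sa):
    """lcp[i] = length of the longest common prefix of the suffixes
    starting at sa[i-1] and sa[i] (lcp[0] = 0)."""
    n = len(s)
    rank = [0] * n
    for i, p in enumerate(sa):
        rank[p] = i
    lcp = [0] * n
    h = 0
    for i in range(n):          # walk suffixes in text order
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h:
                h -= 1          # key invariant: lcp drops by at most 1
        else:
            h = 0
    return lcp

s = "banana"
sa = suffix_array(s)        # [5, 3, 1, 0, 4, 2]
print(lcp_array(s, sa))     # [0, 1, 3, 0, 0, 2]
```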
Leadership Statistics in Random Structures
The largest component (``the leader'') in evolving random structures often
exhibits universal statistical properties. This phenomenon is demonstrated
analytically for two ubiquitous structures: random trees and random graphs. In
both cases, lead changes are rare, as the average number of lead changes
increases only quadratically with the logarithm of the system size. As a
function of time, the number of lead changes is self-similar. Additionally,
the probability that no lead change ever occurs decays exponentially with the
average number of lead changes.
Comment: 5 pages, 3 figures
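The rarity of lead changes can be observed directly by simulation. The sketch below grows an Erdős–Rényi random graph edge by edge and counts how often a different component overtakes the current largest one; the edge ordering and the tie-breaking rule (the initial leader is vertex 0) are illustrative choices, not part of the paper's analysis:

```python
# Sketch: count lead changes of the largest component in an evolving
# random graph, using union-find with union by size and path halving.
import random

def count_lead_changes(n, seed=0):
    """Add all n*(n-1)/2 possible edges in random order; count how many
    times a different component strictly overtakes the current leader."""
    rng = random.Random(seed)
    parent = list(range(n))
    size = [1] * n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    edges = [(u, v) for u in range(n) for v in range(u + 1, n)]
    rng.shuffle(edges)
    leader, best, changes = 0, 1, 0
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            continue
        if size[ru] < size[rv]:
            ru, rv = rv, ru
        parent[rv] = ru            # union by size
        size[ru] += size[rv]
        if size[ru] > best:
            if find(leader) != ru:  # a different component took the lead
                changes += 1
            leader, best = ru, size[ru]
    return changes

print(count_lead_changes(200, seed=1))  # typically a handful of changes
```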
Scaled penalization of Brownian motion with drift and the Brownian ascent
We study a scaled version of a two-parameter Brownian penalization model
introduced by Roynette-Vallois-Yor in arXiv:math/0511102. The original model
penalizes Brownian motion with drift h ∈ ℝ by the weight process
(exp(νS_t) : t ≥ 0), where ν ∈ ℝ and (S_t : t ≥ 0)
is the running maximum of the Brownian motion. It was
shown there that the resulting penalized process exhibits three distinct phases
corresponding to different regions of the (ν, h)-plane. In this paper, we
investigate the effect of penalizing the Brownian motion concurrently with
scaling and identify the limit process. This extends a result of Roynette-Yor
for the h = 0 case to the whole parameter plane and reveals two
additional "critical" phases occurring at the boundaries between the parameter
regions. One of these novel phases is Brownian motion conditioned to end at its
maximum, a process we call the Brownian ascent. We then relate the Brownian
ascent to some well-known Brownian path fragments and to a random scaling
transformation of Brownian motion recently studied by Rosenbaum-Yor.
Comment: 32 pages; made additions to Section
Stability Analysis of Frame Slotted Aloha Protocol
Frame Slotted Aloha (FSA) protocol has been widely applied in Radio Frequency
Identification (RFID) systems as the de facto standard in tag identification.
However, very limited work has been done on the stability of FSA despite its
fundamental importance both on the theoretical characterisation of FSA
performance and its effective operation in practical systems. In order to
bridge this gap, we devote this paper to investigating the stability properties
of FSA by focusing on two physical layer models of practical importance, the
models with single packet reception and multipacket reception capabilities.
Technically, we model the FSA system backlog as a Markov chain with its states
being backlog size at the beginning of each frame. The objective is to analyze
the ergodicity of the Markov chain and demonstrate its properties in different
regions, particularly the instability region. By employing drift analysis, we
obtain the closed-form conditions for the stability of FSA and show that the
stability region is maximised when the frame length equals the backlog size in
the single packet reception model and when the ratio of the backlog size to
frame length equals in order of magnitude the maximum multipacket reception
capacity in the multipacket reception model. Furthermore, to characterise
system behavior in the instability region, we mathematically demonstrate the
transience of the backlog Markov chain.
Comment: 14 pages, submitted to IEEE Transactions on Information Theory
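The single packet reception result (frame length equal to backlog size maximises throughput) can be checked numerically. The Monte Carlo sketch below simulates one FSA frame at a time; the backlog value and trial count are illustrative, and this is a throughput check only, not the paper's Markov chain drift analysis:

```python
# Sketch: with single packet reception, the per-slot success rate of a
# Frame Slotted Aloha frame peaks when frame length equals backlog size.
import random
from collections import Counter

def successes_in_frame(backlog, frame_len, rng):
    """Each backlogged tag picks one of frame_len slots uniformly at
    random; a slot yields a success iff exactly one tag chose it."""
    hits = Counter(rng.randrange(frame_len) for _ in range(backlog))
    return sum(1 for c in hits.values() if c == 1)

def mean_per_slot(backlog, frame_len, trials=2000, seed=0):
    """Average number of successful slots per slot over many frames."""
    rng = random.Random(seed)
    total = sum(successes_in_frame(backlog, frame_len, rng)
                for _ in range(trials))
    return total / (trials * frame_len)

B = 50
for L in (25, 50, 100):
    print(L, mean_per_slot(B, L))  # per-slot throughput is highest at L = B
```

The expected per-slot throughput is (B/L)(1 - 1/L)^(B-1), which is maximised over L at L = B, matching the simulation.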
The Number of Symbol Comparisons in QuickSort and QuickSelect
We revisit the classical QuickSort and QuickSelect algorithms, under a complexity model that fully takes into account the elementary comparisons between symbols composing the records to be processed. Our probabilistic models belong to a broad category of information sources that encompasses memoryless (i.e., independent-symbols) and Markov sources, as well as many unbounded-correlation sources. We establish that, under our conditions, the average-case complexity of QuickSort is O(n log² n) [rather than O(n log n), classically], whereas that of QuickSelect remains O(n). Explicit expressions for the implied constants are provided by our combinatorial-analytic methods.
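The symbol-comparison cost model can be made tangible by instrumenting the comparison operator. The sketch below counts individual symbol comparisons while QuickSort runs on random bitstrings (a memoryless source); the input distribution and string length are illustrative choices:

```python
# Sketch: count symbol comparisons (not record comparisons) made by a
# plain QuickSort on strings, the cost model discussed above.
import random

class CountingStr:
    """Wraps a string; every symbol inspected by < bumps a counter."""
    symbol_comparisons = 0

    def __init__(self, s):
        self.s = s

    def __lt__(self, other):
        for a, b in zip(self.s, other.s):
            CountingStr.symbol_comparisons += 1
            if a != b:
                return a < b
        return len(self.s) < len(other.s)

def quicksort(items):
    if len(items) <= 1:
        return items
    pivot, rest = items[0], items[1:]
    left = [x for x in rest if x < pivot]
    right = [x for x in rest if not (x < pivot)]
    return quicksort(left) + [pivot] + quicksort(right)

rng = random.Random(0)
words = [CountingStr("".join(rng.choice("01") for _ in range(20)))
         for _ in range(1000)]
out = quicksort(words)
print(CountingStr.symbol_comparisons)  # grows like n log^2 n for such sources
```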