91 research outputs found

    Identifying statistical dependence in genomic sequences via mutual information estimates

    Questions of understanding and quantifying the representation and amount of information in organisms have become a central part of biological research, as they potentially hold the key to fundamental advances. In this paper, we demonstrate the use of information-theoretic tools for the task of identifying segments of biomolecules (DNA or RNA) that are statistically correlated. We develop a precise and reliable methodology, based on the notion of mutual information, for finding and extracting statistical as well as structural dependencies. A simple threshold function is defined, and its use in quantifying the level of significance of dependencies between biological segments is explored. These tools are used in two specific applications. The first is the identification of correlations between different parts of the maize zmSRp32 gene, where we find significant dependencies between the 5' untranslated region in zmSRp32 and its alternatively spliced exons. This observation may indicate the presence of as-yet unknown alternative splicing mechanisms or structural scaffolds. Second, using data from the FBI's Combined DNA Index System (CODIS), we demonstrate that our approach is particularly well suited for the problem of discovering short tandem repeats, an application of importance in genetic profiling. Comment: Preliminary version. Final version in EURASIP Journal on Bioinformatics and Systems Biology. See http://www.hindawi.com/journals/bsb
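
As a rough illustration of the quantity this abstract builds on, the sketch below computes a plug-in (empirical-frequency) estimate of the mutual information between two aligned symbol sequences. The function name `mutual_information` and the choice of a plug-in estimator are assumptions made for illustration; the paper's own estimator and its threshold function are not reproduced here.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate (in bits) of I(X; Y) from paired observations.

    Hypothetical helper: xs and ys are equal-length sequences of symbols,
    for example aligned nucleotide positions from two segments.
    """
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    joint = Counter(zip(xs, ys))   # empirical joint counts
    px = Counter(xs)               # empirical marginal counts of X
    py = Counter(ys)               # empirical marginal counts of Y
    mi = 0.0
    for (x, y), c in joint.items():
        # (c / n) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi
```

With identical sequences the estimate equals the empirical entropy of the sequence, and with empirically independent sequences it is zero. On short sequences a plug-in estimate is biased upward, which is one reason a significance threshold of the kind the abstract mentions is needed.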

    Structural Complexity of Random Binary Trees

    For each positive integer n, let Tn be a random rooted binary tree having finitely many vertices and exactly n leaves. We can view H(Tn), the entropy of Tn, as a measure of the structural complexity of tree Tn in the sense that approximately H(Tn) bits suffice to construct Tn. We are interested in determining conditions on the sequence (Tn: n = 1, 2, ...) under which H(Tn)/n converges to a limit as n → ∞. We exhibit some of our progress on the way to the solution of this problem.

    Data driven consistency (working title)

    We are motivated by applications that need rich model classes to represent them. Examples of rich model classes include distributions over large, countably infinite supports, slow mixing Markov processes, etc. But such rich classes may be too complex to admit estimators that converge to the truth with convergence rates that can be uniformly bounded over the entire model class as the sample size increases (uniform consistency). However, these rich classes may still allow for estimators with pointwise guarantees whose performance can be bounded in a model-dependent way. The pointwise angle of course has the drawback that the estimator performance is a function of the very unknown model that is being estimated, and is therefore unknown. Therefore, even if the estimator is consistent, how well it is doing may not be clear no matter what the sample size is. Departing from the dichotomy of uniform and pointwise consistency, a new analysis framework is explored by characterizing rich model classes that may only admit pointwise guarantees, yet all the information about the model needed to gauge estimator accuracy can be inferred from the sample at hand. To retain focus, we analyze the universal compression problem in this data-driven pointwise consistency framework. Comment: Working paper. Please email authors for the current version.

    Exploiting a Computation Reuse Cache to Reduce Energy in Network Processors


    Low Space External Memory Construction of the Succinct Permuted Longest Common Prefix Array

    The longest common prefix (LCP) array is a versatile auxiliary data structure in indexed string matching. It can be used to speed up searching using the suffix array (SA) and provides an implicit representation of the topology of an underlying suffix tree. The LCP array of a string of length $n$ can be represented as an array of length $n$ words, or, in the presence of the SA, as a bit vector of $2n$ bits plus asymptotically negligible support data structures. External memory construction algorithms for the LCP array have been proposed, but those proposed so far have a space requirement of $O(n)$ words (i.e. $O(n \log n)$ bits) in external memory. This space requirement is in some practical cases prohibitively expensive. We present an external memory algorithm for constructing the $2n$ bit version of the LCP array which uses $O(n \log \sigma)$ bits of additional space in external memory when given a (compressed) BWT with alphabet size $\sigma$ and a sampled inverse suffix array at sampling rate $O(\log n)$. This is often a significant space gain in practice where $\sigma$ is usually much smaller than $n$ or even constant. We also consider the case of computing succinct LCP arrays for circular strings.
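
To make the 2n-bit representation concrete, here is a small in-memory sketch. It is not the external-memory algorithm of the paper. It computes the permuted LCP (PLCP) array, in which entry i is the LCP of suffix i with its lexicographic predecessor, and then stores it as a bit vector whose i-th set bit sits at position PLCP[i] + 2i. This works because PLCP[i+1] >= PLCP[i] - 1, so those positions are strictly increasing and stay below 2n. All function names are hypothetical, and the suffix array is built naively for brevity.

```python
def suffix_array(s):
    # Naive construction by sorting suffixes; acceptable only for toy inputs.
    return sorted(range(len(s)), key=lambda i: s[i:])


def plcp_array(s, sa):
    """PLCP[i] = LCP of suffix i and the suffix preceding it in sorted order."""
    n = len(s)
    prev = [-1] * n                      # prev[i]: lexicographic predecessor of suffix i
    for rank in range(1, n):
        prev[sa[rank]] = sa[rank - 1]
    plcp = [0] * n
    l = 0
    for i in range(n):                   # text order, so l can be reused
        j = prev[i]
        if j < 0:
            l = 0                        # the smallest suffix has no predecessor
        else:
            while i + l < n and j + l < n and s[i + l] == s[j + l]:
                l += 1
        plcp[i] = l
        l = max(l - 1, 0)                # PLCP[i + 1] >= PLCP[i] - 1
    return plcp


def encode_plcp(plcp):
    """Store PLCP in 2n bits: the i-th one is at position PLCP[i] + 2i."""
    bits = [0] * (2 * len(plcp))
    for i, v in enumerate(plcp):
        bits[v + 2 * i] = 1
    return bits


def decode_plcp(bits):
    """Recover PLCP from the bit vector (a linear scan stands in for select)."""
    ones = [pos for pos, b in enumerate(bits) if b]
    return [pos - 2 * i for i, pos in enumerate(ones)]
```

A practical implementation would replace the linear scan in `decode_plcp` with a select structure on the bit vector, which corresponds to the "asymptotically negligible support data structures" mentioned in the abstract.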

    Leadership Statistics in Random Structures

    The largest component ("the leader") in evolving random structures often exhibits universal statistical properties. This phenomenon is demonstrated analytically for two ubiquitous structures: random trees and random graphs. In both cases, lead changes are rare, as the average number of lead changes increases quadratically with the logarithm of the system size. As a function of time, the number of lead changes is self-similar. Additionally, the probability that no lead change ever occurs decays exponentially with the average number of lead changes. Comment: 5 pages, 3 figures

    Scaled penalization of Brownian motion with drift and the Brownian ascent

    We study a scaled version of a two-parameter Brownian penalization model introduced by Roynette-Vallois-Yor in arXiv:math/0511102. The original model penalizes Brownian motion with drift $h\in\mathbb{R}$ by the weight process $\big(\exp(\nu S_t):t\geq 0\big)$ where $\nu\in\mathbb{R}$ and $\big(S_t:t\geq 0\big)$ is the running maximum of the Brownian motion. It was shown there that the resulting penalized process exhibits three distinct phases corresponding to different regions of the $(\nu,h)$-plane. In this paper, we investigate the effect of penalizing the Brownian motion concurrently with scaling and identify the limit process. This extends a result of Roynette-Yor for the $\nu<0,~h=0$ case to the whole parameter plane and reveals two additional "critical" phases occurring at the boundaries between the parameter regions. One of these novel phases is Brownian motion conditioned to end at its maximum, a process we call the Brownian ascent. We then relate the Brownian ascent to some well-known Brownian path fragments and to a random scaling transformation of Brownian motion recently studied by Rosenbaum-Yor. Comment: 32 pages; made additions to Section

    Stability Analysis of Frame Slotted Aloha Protocol

    The Frame Slotted Aloha (FSA) protocol has been widely applied in Radio Frequency Identification (RFID) systems as the de facto standard in tag identification. However, very limited work has been done on the stability of FSA despite its fundamental importance both to the theoretical characterisation of FSA performance and to its effective operation in practical systems. In order to bridge this gap, we devote this paper to investigating the stability properties of FSA by focusing on two physical layer models of practical importance, the models with single packet reception and multipacket reception capabilities. Technically, we model the FSA system backlog as a Markov chain whose states are the backlog sizes at the beginning of each frame. The objective is to analyze the ergodicity of the Markov chain and demonstrate its properties in different regions, particularly the instability region. By employing drift analysis, we obtain the closed-form conditions for the stability of FSA and show that the stability region is maximised when the frame length equals the backlog size in the single packet reception model and when the ratio of the backlog size to frame length equals in order of magnitude the maximum multipacket reception capacity in the multipacket reception model. Furthermore, to characterise system behavior in the instability region, we mathematically demonstrate the existence of transience of the backlog Markov chain. Comment: 14 pages, submitted to IEEE Transactions on Information Theory
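
The role of the "frame length equals backlog size" condition can be seen in a simple slot-occupancy calculation, sketched below. This is the textbook single-frame efficiency under single packet reception, assuming each backlogged tag independently picks one slot uniformly at random. It is not the drift analysis or the Markov chain model of the paper, and the function names are hypothetical.

```python
def fsa_slot_efficiency(backlog, frame_length):
    """Expected fraction of slots in one frame that hold exactly one tag.

    Assumption: each of `backlog` tags independently chooses one of
    `frame_length` slots uniformly at random (single packet reception).
    """
    if backlog < 0 or frame_length < 1:
        raise ValueError("need backlog >= 0 and frame_length >= 1")
    if backlog == 0:
        return 0.0
    p = 1.0 / frame_length
    # A given slot is a success when exactly one of the tags selects it.
    return backlog * p * (1.0 - p) ** (backlog - 1)


def best_frame_length(backlog, max_frame_length):
    """Frame length in 1..max_frame_length with the highest slot efficiency."""
    return max(
        range(1, max_frame_length + 1),
        key=lambda length: fsa_slot_efficiency(backlog, length),
    )
```

Under these assumptions the efficiency peaks when the frame length matches the backlog, where it approaches 1/e for large backlogs.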

    The Number of Symbol Comparisons in QuickSort and QuickSelect

    We revisit the classical QuickSort and QuickSelect algorithms, under a complexity model that fully takes into account the elementary comparisons between symbols composing the records to be processed. Our probabilistic models belong to a broad category of information sources that encompasses memoryless (i.e., independent-symbols) and Markov sources, as well as many unbounded-correlation sources. We establish that, under our conditions, the average-case complexity of QuickSort is O(n log^2 n) [rather than O(n log n), classically], whereas that of QuickSelect remains O(n). Explicit expressions for the implied constants are provided by our combinatorial–analytic methods.
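
The cost model in this abstract can be illustrated by instrumenting a plain QuickSort so that it tallies symbol comparisons separately from key comparisons. The sketch below is only an illustration of the counting, with keys as Python strings, a uniformly random pivot, and hypothetical names throughout. It does not reproduce the paper's probabilistic sources or its analysis.

```python
import random


def compare_symbols(a, b, counts):
    """Three-way string comparison that tallies each symbol comparison."""
    for x, y in zip(a, b):
        counts["symbols"] += 1
        if x != y:
            return -1 if x < y else 1
    # One string is a prefix of the other: the shorter one sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))


def quicksort(keys, counts, rng):
    """Return a sorted copy of keys, updating counts['keys'] and counts['symbols']."""
    if len(keys) <= 1:
        return list(keys)
    pivot_index = rng.randrange(len(keys))
    pivot = keys[pivot_index]
    rest = keys[:pivot_index] + keys[pivot_index + 1:]
    smaller, larger = [], []
    for key in rest:
        counts["keys"] += 1              # one key comparison against the pivot
        if compare_symbols(key, pivot, counts) < 0:
            smaller.append(key)
        else:
            larger.append(key)
    return quicksort(smaller, counts, rng) + [pivot] + quicksort(larger, counts, rng)


def random_binary_keys(n, length, rng):
    """Hypothetical test data: n keys drawn from a memoryless binary source."""
    return ["".join(rng.choice("01") for _ in range(length)) for _ in range(n)]
```

Calling `quicksort(keys, {"keys": 0, "symbols": 0}, random.Random(0))` on keys from `random_binary_keys` shows the symbol count running ahead of the key count, since each key comparison scans the shared prefix of the two keys before it can decide.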