98 research outputs found

    Asymptotic Improvement of the Gilbert-Varshamov Bound on the Size of Binary Codes

    Given positive integers $n$ and $d$, let $A_2(n,d)$ denote the maximum size of a binary code of length $n$ and minimum distance $d$. The well-known Gilbert-Varshamov bound asserts that $A_2(n,d) \geq 2^n/V(n,d-1)$, where $V(n,d) = \sum_{i=0}^{d} {n \choose i}$ is the volume of a Hamming sphere of radius $d$. We show that, in fact, there exists a positive constant $c$ such that $A_2(n,d) \geq c \frac{2^n}{V(n,d-1)} \log_2 V(n,d-1)$ whenever $d/n \le 0.499$. The result follows by recasting the Gilbert-Varshamov bound into a graph-theoretic framework and using the fact that the corresponding graph is locally sparse. Generalizations and extensions of this result are briefly discussed. Comment: 10 pages, 3 figures; to appear in the IEEE Transactions on Information Theory, submitted August 12, 2003, revised March 28, 200
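As a rough numerical illustration of the quantities in this abstract, the following Python sketch evaluates the sphere volume $V(n,d)$, the classical Gilbert-Varshamov lower bound, and the extra factor $\log_2 V(n,d-1)$ that the improved bound gains. The function names are hypothetical, and the constant $c$ is not specified in the abstract, so the improved bound itself is not computed.

```python
from math import comb, log2


def hamming_ball_volume(n: int, r: int) -> int:
    """V(n, r): number of binary words within Hamming distance r of a fixed word."""
    return sum(comb(n, i) for i in range(r + 1))


def gv_lower_bound(n: int, d: int) -> float:
    """Classical Gilbert-Varshamov bound: A_2(n, d) >= 2^n / V(n, d - 1)."""
    return 2**n / hamming_ball_volume(n, d - 1)


def improvement_factor(n: int, d: int) -> float:
    """The factor log2 V(n, d - 1) gained by the improved bound (up to the unknown constant c)."""
    return log2(hamming_ball_volume(n, d - 1))


if __name__ == "__main__":
    n, d = 100, 20
    print("GV bound:", gv_lower_bound(n, d))
    print("extra log factor:", improvement_factor(n, d))
```

Because $V(n,d-1)$ grows exponentially in $n$ for fixed $d/n$, the extra factor is linear in $n$, which is what makes the improvement asymptotically significant.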

    The Cycle Structure of LFSR with Arbitrary Characteristic Polynomial over Finite Fields

    We determine the cycle structure of a linear feedback shift register with an arbitrary monic characteristic polynomial over any finite field. For each cycle, a method to find a state and a new way to represent the state are proposed. Comment: An extended abstract containing preliminary results was presented at SETA 201
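To make "cycle structure" concrete, here is a minimal brute-force sketch in Python. It is not the paper's method: it simply enumerates all states of a small binary LFSR and counts cycles by length. It assumes the field is GF(2) and that the constant coefficient $c_0$ equals 1, so the state update is a permutation; the function name and coefficient ordering are illustrative choices.

```python
from collections import Counter
from itertools import product


def lfsr_cycle_lengths(coeffs):
    """Count the cycles of a binary LFSR by length.

    coeffs = (c_0, ..., c_{n-1}) are the low-order coefficients of the
    characteristic polynomial x^n + c_{n-1} x^{n-1} + ... + c_0 over GF(2).
    Assumes c_0 == 1, so every state lies on a cycle.
    """
    if coeffs[0] != 1:
        raise ValueError("this sketch assumes c_0 = 1 (nonsingular LFSR)")
    n = len(coeffs)
    seen = set()
    lengths = Counter()
    for start in product((0, 1), repeat=n):
        if start in seen:
            continue
        state, length = start, 0
        # Follow the orbit until it closes up at the starting state.
        while state not in seen:
            seen.add(state)
            feedback = sum(c * s for c, s in zip(coeffs, state)) % 2
            state = state[1:] + (feedback,)
            length += 1
        lengths[length] += 1
    return dict(lengths)


if __name__ == "__main__":
    # x^3 + x + 1 is primitive: the zero cycle plus one cycle of length 7.
    print(lfsr_cycle_lengths((1, 1, 0)))
    # x^4 + x^3 + x^2 + x + 1 divides x^5 - 1: the zero cycle plus three cycles of length 5.
    print(lfsr_cycle_lengths((1, 1, 1, 1)))
```

Exhaustive enumeration takes time exponential in the register length, which is why a closed-form description of the cycle structure is of interest.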

    Codes and Sequences for Information Retrieval and Stream Ciphers

    Given a self-similar structure in codes and de Bruijn sequences, recursive techniques may be used to analyze and construct them. Batch codes partition the indices of code words into m buckets, where recovery of t symbols is accomplished by accessing at most tau in each bucket. This finds use in the retrieval of information spread over several devices. We introduce the concept of optimal batch codes, showing that binary Hamming codes and first order Reed-Muller codes are optimal. Then we study batch properties of binary Reed-Muller codes which have order less than half their length. Cartesian codes are defined by the evaluation of polynomials at a subset of points in F_q. We partition F_q into buckets defined by the quotient with a subspace V. Several properties equivalent to (V intersect ) = {0} for all i,j between 1 and mu are explored. With this framework, a code in F_q^(mu-1) capable of reconstructing mu indices is expanded to one in F_q^(mu) capable of reconstructing mu+1 indices. Using a base case in F_q^3, we are able to prove batch properties for codes in F_q. We generalize this to Cartesian codes with a limit on the degree mu of the polynomials. De Bruijn sequences are cyclic sequences of length q^n that contain every q-ary word of length n exactly once. The pseudorandom properties of such sequences make them useful for stream ciphers. Under a particular homomorphism, the preimages of a binary de Bruijn sequence form two cycles. We examine a method for identifying points where these sequences may be joined to make a de Bruijn sequence of order n. Using the recursive structure of this construction, we are able to calculate sums of subsequences in O(n^4 log(n)) time, and the location of a word in O(n^5 log(n)) time. Together, these functions allow us to check the validity of any potential toggle point, which provides a method for efficiently generating a recursive specification. Each successful step takes O(k^5 log(k)), for k from 3 to n.
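The definition of a de Bruijn sequence given above can be checked directly on small cases. The sketch below uses the standard Lyndon-word concatenation construction, which is a swapped-in textbook technique and not the recursive joining method of the thesis, and then verifies the defining property. Function names are hypothetical.

```python
def de_bruijn(q: int, n: int) -> list[int]:
    """Return one q-ary de Bruijn sequence of order n (standard Lyndon-word construction)."""
    a = [0] * (q * n)
    seq: list[int] = []

    def db(t: int, p: int) -> None:
        if t > n:
            # Emit the current Lyndon word when its length divides n.
            if n % p == 0:
                seq.extend(a[1 : p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, q):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return seq


def is_de_bruijn(seq: list[int], q: int, n: int) -> bool:
    """Check that every q-ary word of length n occurs exactly once in seq, read cyclically."""
    if len(seq) != q**n:
        return False
    words = {
        tuple(seq[(i + j) % len(seq)] for j in range(n)) for i in range(len(seq))
    }
    return len(words) == q**n


if __name__ == "__main__":
    s = de_bruijn(2, 3)
    print(s, is_de_bruijn(s, 2, 3))
```

The checker runs in time proportional to n times the sequence length, so it is only practical for small orders; the word-location routine described in the abstract addresses exactly this kind of query without enumerating the sequence.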

    Universal Source Coding in the Non-Asymptotic Regime

    Fundamental limits of fixed-to-variable (F-V) and variable-to-fixed (V-F) length universal source coding at short blocklengths are characterized. For F-V length coding, the Type Size (TS) code has previously been shown to be optimal up to the third-order rate for universal compression of all memoryless sources over finite alphabets. The TS code assigns sequences ordered based on their type class sizes to binary strings ordered lexicographically. The universal F-V coding problem for the class of first-order stationary, irreducible and aperiodic Markov sources is first considered. The third-order coding rate of the TS code for the Markov class is derived. A converse on the third-order coding rate for the general class of F-V codes is presented, which shows the optimality of the TS code for such Markov sources. This type class approach is then generalized for compression of parametric sources. A natural scheme is to define two sequences to be in the same type class if and only if they are equiprobable under any model in the parametric class. This natural approach, however, is shown to be suboptimal. A variation of the Type Size code is introduced, where type classes are defined based on neighborhoods of minimal sufficient statistics. Asymptotics of the overflow rate of this variation are derived, and a converse result establishes its optimality up to the third-order term. These results are derived for parametric families of i.i.d. sources as well as Markov sources. Finally, universal V-F length coding of the class of parametric sources is considered in the short blocklengths regime. The proposed dictionary, which is used to parse the source output stream, consists of sequences in the boundaries of transition from low to high quantized type complexity, hence the name Type Complexity (TC) code. For a large enough dictionary, the $\epsilon$-coding rate of the TC code is derived and a converse result is derived showing its optimality up to the third-order term. Dissertation/Thesis. Doctoral Dissertation, Electrical Engineering, 201
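The Type Size assignment described above (sequences sorted by type class size, mapped to binary strings in order) can be sketched for a toy memoryless setting. This is an illustrative reading of that one sentence, not the dissertation's implementation: it assumes binary strings are enumerated shortest first and lexicographically within each length, starting from the empty string, and it breaks ties between sequences of equal type class size lexicographically. All names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import factorial


def type_class_size(seq) -> int:
    """Number of sequences sharing seq's empirical symbol counts (a multinomial coefficient)."""
    size = factorial(len(seq))
    for count in Counter(seq).values():
        size //= factorial(count)
    return size


def binary_strings():
    """Yield "", "0", "1", "00", "01", ... (shortest first, lexicographic within a length)."""
    length = 0
    while True:
        for bits in product("01", repeat=length):
            yield "".join(bits)
        length += 1


def type_size_code(alphabet: str, n: int) -> dict:
    """Map each length-n sequence to a binary string, smallest type classes first."""
    seqs = sorted(
        product(alphabet, repeat=n), key=lambda s: (type_class_size(s), s)
    )
    return dict(zip(seqs, binary_strings()))


if __name__ == "__main__":
    for seq, word in type_size_code("ab", 3).items():
        print("".join(seq), repr(word))
```

Sequences from small type classes, such as the constant sequences, receive the shortest codewords, which is the intuition behind ordering by type class size.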

    Decoding the Past

    The human genome is continuously evolving, hence the sequenced genome is a snapshot in time of this evolving entity. Over time, the genome accumulates mutations that can be associated with different phenotypes, such as physical traits and diseases. Underlying mutation accumulation is an evolution channel (the term channel is motivated by the notion of a communication channel, introduced by Shannon [1] in 1948, which started the field of information theory), which is controlled by hereditary, environmental, and stochastic factors. The premise of this thesis is to understand the human genome using an information-theoretic framework. In particular, it focuses on: (i) the analysis and characterization of the evolution channel using measures of capacity, expressiveness, evolution distance, and uniqueness of ancestry and uses these insights for (ii) the design of error correcting codes for DNA storage, (iii) inversion symmetry in the genome and (iv) cancer classification. The mutational events characterizing this evolution channel can be divided into two categories, namely point mutations and duplications. While evolution through point mutations is unconstrained, giving rise to combinatorially many possibilities of what could have happened in the past, evolution through duplications adds constraints limiting the number of those possibilities. Further, more than 50% of the genome has been observed to consist of repeated sequences. We focus on the highly constrained form of duplications known as tandem duplications in order to understand the limits of evolution by duplication. Our sequence evolution model consists of a starting sequence called the seed and a set of tandem duplication rules. We find limits on the diversity of sequences that can be generated by tandem duplications using measures of capacity and expressiveness. Additionally, we calculate bounds on the duplication distance, which is used to measure the timing of generation by these duplications. 
We also ask questions about the uniqueness of seed for a given sequence and completely characterize the duplication length sets where the seed is unique or non-unique. These insights also led us to design error correcting codes for any number of tandem duplication errors that are useful for DNA-storage based applications. For uniform duplication length and duplication length bounded by 2, our designed codes achieve channel capacity. We also define and measure uncertainty in decoding when the duplication channel is misinformed. Moreover, we add substitutions to our tandem duplication model and calculate sequence generation diversity for a given budget of substitutions. We also use our duplication model to explain the inversion symmetry observed in the genome of many species. The inversion symmetry is popularly known as the 2nd Chargaff Rule, according to which in a single strand DNA, the frequency of a k-mer is almost the same as the frequency of its reverse complement. The insights gained by these problems led us to investigate the tandem repeat regions in the genome. Tandem repeat regions in the genome can be traced back in time algorithmically to make inference about the effect of the hereditary, environmental and stochastic factors on the mutation rate of the genome. By inferring the evolutionary history of the tandem repeat regions, we show how this knowledge can be used to make predictions about the risk of incurring a mutation based disease, specifically cancer. More precisely, we introduce the concept of mutation profiles that are computed without any comparative analysis, but instead by analyzing the short tandem repeat regions in a single healthy genome and capturing information about the individual's evolution channel. Using gradient boosting on data from more than 5,000 TCGA (The Cancer Genome Atlas) cancer patients, we demonstrate that these mutation profiles can accurately distinguish between patients with various types of cancer. 
For example, the pairwise validation accuracy of the classifier between PAAD (pancreas) patients and GBM (brain) patients is 93%. Our results show that healthy unaffected cells still contain a cancer-specific signal, which opens the possibility of cancer prediction from a healthy genome.
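The seed-plus-tandem-duplication model mentioned in this abstract can be illustrated with a small Python sketch. It assumes a tandem duplication of length k copies a substring of length k and inserts the copy immediately after the original, and it enumerates every sequence reachable from a seed up to a length cap. The function names and the length cap are illustrative assumptions, not part of the thesis.

```python
def tandem_duplications(seq: str, k: int) -> set[str]:
    """All strings obtained from seq by one tandem duplication of length k (k >= 1).

    Writing seq = u v w with len(v) == k, each result has the form u v v w.
    """
    return {seq[: i + k] + seq[i:] for i in range(len(seq) - k + 1)}


def descendants(seed: str, lengths: list[int], max_len: int) -> set[str]:
    """Sequences reachable from seed using duplication lengths in `lengths`, up to max_len."""
    seen = {seed}
    frontier = [seed]
    while frontier:
        next_frontier = []
        for s in frontier:
            for k in lengths:
                if len(s) + k > max_len:
                    continue  # this duplication would exceed the length cap
                for child in tandem_duplications(s, k):
                    if child not in seen:
                        seen.add(child)
                        next_frontier.append(child)
        frontier = next_frontier
    return seen


if __name__ == "__main__":
    print(sorted(tandem_duplications("ACG", 1)))
    print(len(descendants("ACG", [1, 2], 6)))
```

Counting how the number of reachable sequences grows with the length cap gives an empirical feel for the capacity measure discussed in the abstract.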

    Densities of Codes of Various Linearity Degrees in Translation-Invariant Metric Spaces

    We investigate the asymptotic density of error-correcting codes with good distance properties and prescribed linearity degree, including sublinear and nonlinear codes. We focus on the general setting of finite translation-invariant metric spaces, and then specialize our results to the Hamming metric, to the rank metric, and to the sum-rank metric. Our results show that the asymptotic density of codes heavily depends on the imposed linearity degree and the chosen metric.