6 research outputs found

    Rates of DNA Sequence Profiles for Practical Values of Read Lengths

    A recent study by one of the authors has demonstrated the importance of profile vectors in DNA-based data storage. We provide exact values and lower bounds on the number of profile vectors for finite values of alphabet size $q$, read length $\ell$, and word length $n$. Consequently, we demonstrate that for $q \ge 2$ and $n \le q^{\ell/2-1}$, the number of profile vectors is at least $q^{\kappa n}$ with $\kappa$ very close to one. In addition to enumeration results, we provide a set of efficient encoding and decoding algorithms for each of two particular families of profile vectors.
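    As an illustration of the quantity being counted, the following is a minimal sketch that assumes the usual definition of a profile vector: for a word over an alphabet of size q, the vector that records how often each of the q^ℓ possible length-ℓ substrings occurs in it. The function names and the brute-force enumeration are hypothetical and are not the paper's algorithms; the enumeration is only feasible for very small q, ℓ, and n.

```python
from collections import Counter
from itertools import product


def profile_vector(word, alphabet, ell):
    """Count occurrences of every length-ell string over `alphabet` in `word`.

    Assumed definition: entry i is the number of positions at which the
    i-th ell-gram (in lexicographic order of `alphabet`) occurs in `word`.
    """
    counts = Counter(word[i:i + ell] for i in range(len(word) - ell + 1))
    return [counts["".join(gram)] for gram in product(alphabet, repeat=ell)]


def count_profile_vectors(alphabet, ell, n):
    """Brute-force count of distinct profile vectors over all length-n words."""
    profiles = {
        tuple(profile_vector("".join(w), alphabet, ell))
        for w in product(alphabet, repeat=n)
    }
    return len(profiles)


if __name__ == "__main__":
    # ell-grams in order: AA, AC, CA, CC
    print(profile_vector("ACA", "AC", 2))
    print(count_profile_vectors("AC", 2, 3))
```

    Distinct words can share a profile vector (for example "ACA" and "CAC" with ℓ = 2), which is why the number of profile vectors can fall below the number of words.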

    On Codes for the Noisy Substring Channel

    We consider the problem of coding for the substring channel, in which information strings are observed only through their (multisets of) substrings. Because DNA sequencing techniques make this channel relevant to DNA-based data storage, interest in it has been renewed in recent years. In contrast to existing literature, we consider a noisy channel model, where information is subject to noise before its substrings are sampled, motivated by in-vivo storage. We study two separate noise models: substitutions and deletions. In both cases, we examine families of codes which may be utilized for error correction and present combinatorial bounds. Through a generalization of the concept of repeat-free strings, we show that the added redundancy required by this imperfect-observation assumption is sublinear, either when the fraction of errors in the observed substring length is sufficiently small, or when that length is sufficiently long. This suggests that no asymptotic cost in rate is incurred by this channel model in these cases. Comment: ISIT 2021 version (including all proofs)

    Repeat-Free Codes

    In this paper we consider the problem of encoding data into repeat-free sequences, in which each sequence is required to contain any $k$-tuple at most once (for a predefined $k$). First, the capacity and redundancy of the repeat-free constraint are calculated. Then, an efficient algorithm, which uses a single bit of redundancy, is presented to encode length-$n$ sequences for $k = 2 + 2\log(n)$. This algorithm is then improved to support any value of $k$ of the form $k = a\log(n)$, for $1 < a$, while its redundancy is $o(n)$. We also calculate the capacity of repeat-free sequences when combined with local constraints which are given by a constrained system, and the capacity of multi-dimensional repeat-free codes. Comment: 18 pages
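    The repeat-free constraint described in this abstract can be checked directly. The following is a minimal sketch, assuming that a k-tuple means a window of k consecutive symbols; the function name is hypothetical, and this is only a membership test, not the encoding algorithm from the paper.

```python
def is_repeat_free(seq, k):
    """Return True if every length-k window of `seq` occurs at most once.

    Assumption: a "k-tuple" is a contiguous substring of length k.
    """
    seen = set()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in seen:
            return False  # this k-tuple already appeared at an earlier position
        seen.add(window)
    return True


if __name__ == "__main__":
    print(is_repeat_free("0010111", 3))  # windows: 001, 010, 101, 011, 111
    print(is_repeat_free("010101", 3))   # window 010 occurs twice
```

    The check runs in time proportional to n·k for a length-n sequence, since each of the n − k + 1 windows is hashed once.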

    Rates of DNA sequence profiles for practical values of read lengths

    A recent study by one of the authors has demonstrated the importance of profile vectors in DNA-based data storage. We provide exact values and lower bounds on the number of profile vectors for finite values of alphabet size q, read length ℓ, and word length n. Consequently, we demonstrate that for q ≥ 2 and n ≤ q^{ℓ/2-1}, the number of profile vectors is at least q^{κn} with κ very close to 1. In addition to enumeration results, we provide a set of efficient encoding and decoding algorithms for certain families of profile vectors. Accepted version