4 research outputs found

    Sets Represented as the Length-n Factors of a Word

    In this paper we consider the following problems: how many different subsets of Sigma^n can occur as the set of all length-n factors of a finite word? If a subset is representable, how long a word do we need to represent it? How many such subsets are represented by words of length t? For the first problem, we give upper and lower bounds of the form alpha^(2^n) in the binary case. For the second problem, we give a weak upper bound and some experimental data. For the third problem, we give a closed-form formula in the case where n <= t < 2n. Algorithmic variants of these problems have previously been studied under the name "shortest common superstring".

    Rates of DNA Sequence Profiles for Practical Values of Read Lengths

    A recent study by one of the authors has demonstrated the importance of profile vectors in DNA-based data storage. We provide exact values and lower bounds on the number of profile vectors for finite values of alphabet size $q$, read length $\ell$, and word length $n$. Consequently, we demonstrate that for $q \ge 2$ and $n \le q^{\ell/2-1}$, the number of profile vectors is at least $q^{\kappa n}$ with $\kappa$ very close to one. In addition to enumeration results, we provide a set of efficient encoding and decoding algorithms for each of two particular families of profile vectors.

    Codes for DNA Storage Channels

    We consider the problem of assembling a sequence based on a collection of its substrings observed through a noisy channel. The mathematical basis of the problem is the construction and design of sequences that may be discriminated based on a collection of their substrings observed through a noisy channel. We explain the connection between the sequence reconstruction problem and the problem of DNA synthesis and sequencing, and introduce the notion of a DNA storage channel. We analyze the number of sequence equivalence classes under the channel mapping and propose new asymmetric coding techniques to combat the effects of synthesis and sequencing noise. In our analysis, we make use of restricted de Bruijn graphs and Ehrhart theory for rational polytopes. Comment: 32 pages, 5 figures

    Two results on words

    The study of combinatorial patterns of words has raised great interest since the early 20th century. In this master's thesis presentation we study two combinatorial patterns. The first pattern is “abelian k-th power free” and the second one is “representability of sets of words of equal length”. For the first pattern we study the context-freeness of non-abelian k-th powers. A word is a non-abelian k-th power if it cannot be factorized in the form w1w2...wk where the wi are permutations of w1 for 2 ≤ i ≤ k. We show that neither the language of non-abelian squares nor the language of non-abelian cubes is context-free. For the second pattern we study the representability of a set of words of fixed length. A set S of words of length n is representable if there exists some word w such that the set of length-n factors of w equals S. We give lower and upper bounds for the number of such representable sets. Furthermore, we study a variation of the problem: we fix a length t and try to evaluate the number of sets S of words of length n such that there exists some word w of length t whose set of length-n factors equals S. We give a closed-form formula in the case where n ≤ t < 2n. In particular, we give a characterization of two distinct words having the same subset of length-n factors.
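The definition of representability above can be illustrated with a minimal brute-force sketch. It is not taken from any of the listed works; the function names `factors`, `represented_sets`, and `find_representing_word` are hypothetical, and the exhaustive search is only practical for very small n and t.

```python
from itertools import product


def factors(word, n):
    """Return the set of all length-n factors (contiguous substrings) of word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def represented_sets(alphabet, n, t):
    """Collect every set of length-n factors that arises from some word of length t."""
    return {
        frozenset(factors("".join(letters), n))
        for letters in product(alphabet, repeat=t)
    }


def find_representing_word(target, alphabet, n, max_len):
    """Search words of length n..max_len for one whose factor set equals target.

    Returns the first such word found, or None if there is none up to max_len
    (which does not by itself prove the set is unrepresentable).
    """
    target = set(target)
    for t in range(n, max_len + 1):
        for letters in product(alphabet, repeat=t):
            word = "".join(letters)
            if factors(word, n) == target:
                return word
    return None


if __name__ == "__main__":
    print(factors("0110", 2))
    print(len(represented_sets("01", 2, 3)))
    print(find_representing_word({"01", "10"}, "01", 2, 4))
    print(find_representing_word({"00", "11"}, "01", 2, 6))
```

Over the binary alphabet with n = 2, the set {00, 11} has no representing word, because any word containing both 00 and 11 must also contain 01 or 10. The words 010 and 101 are two distinct words with the same set of length-2 factors.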