24 research outputs found

    The Capacity of String-Duplication Systems

    It is known that the majority of the human genome consists of duplicated sequences. Furthermore, it is believed that a significant part of the rest of the genome also originated from duplicated sequences and has mutated to its current form. In this paper, we investigate the possibility of constructing an exponentially large number of sequences from a short initial sequence using simple duplication rules, including those resembling genomic-duplication processes. In other words, our goal is to find the capacity, or the expressive power, of these string-duplication systems. Our results include exact capacities, and bounds on the capacities, of four fundamental string-duplication systems. The study of these fundamental biologically inspired systems is an important step toward modeling and analyzing more complex biological processes

    The capacity of string-duplication systems

    It is known that the majority of the human genome consists of repeated sequences. Furthermore, it is believed that a significant part of the rest of the genome also originated from repeated sequences and has mutated to its current form. In this paper, we investigate the possibility of constructing an exponentially large number of sequences from a short initial sequence and simple duplication rules, including those resembling genomic duplication processes. In other words, our goal is to find out the capacity, or the expressive power, of these string-duplication systems. Our results include the exact capacities, and bounds on the capacities, of four fundamental string-duplication systems

    Noise and Uncertainty in String-Duplication Systems

    Duplication mutations play a critical role in the generation of biological sequences. Simultaneously, they have a deleterious effect on data stored using in-vivo DNA data storage. While duplications have been studied both as a sequence-generation mechanism and in the context of error correction, for simplicity these studies have not taken into account the presence of other types of mutations. In this work, we consider the capacity of duplication mutations in the presence of point-mutation noise, and so quantify the generation power of these mutations. We show that if the number of point mutations is vanishingly small compared to the number of duplication mutations of a constant length, the generation capacity of these mutations is zero. However, if the number of point mutations increases to a constant fraction of the number of duplications, then the capacity is nonzero. Lower and upper bounds for this capacity are also presented. Another problem that we study is concerned with the mismatch between code design and channel in data storage in the DNA of living organisms with respect to duplication mutations. In this context, we consider the uncertainty of such a mismatched coding scheme measured as the maximum number of input codewords that can lead to the same output

    The Tandem Duplication Distance Is NP-Hard

    In computational biology, tandem duplication is an important biological phenomenon which can occur either at the genome or at the DNA level. A tandem duplication takes a copy of a genome segment and inserts it right after the segment - this can be represented as the string operation AXB → AXXB. Tandem exon duplications have been found in many species such as human, fly or worm, and have been largely studied in computational biology. The Tandem Duplication (TD) distance problem we investigate in this paper is defined as follows: given two strings S and T over the same alphabet, compute the smallest sequence of tandem duplications required to convert S to T. The natural question of whether the TD distance can be computed in polynomial time was posed in 2004 by Leupold et al. and had remained open, despite the fact that tandem duplications have received much attention ever since. In this paper, we prove that this problem is NP-hard, settling the 16-year-old open problem. We further show that this hardness holds even if all characters of S are distinct. This is known as the exemplar TD distance, which is of special relevance in bioinformatics. One of the tools we develop for the reduction is a new problem called the Cost-Effective Subgraph, for which we obtain W[1]-hardness results that might be of independent interest. We finally show that computing the exemplar TD distance between S and T is fixed-parameter tractable. Our results open the door to many other questions, and we conclude with several open problems
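
The operation AXB → AXXB and the TD distance described in this abstract can be illustrated with a minimal Python sketch. It is not the paper's method: it is a brute-force breadth-first search that takes exponential time, which is consistent with the NP-hardness result, and is usable only on very short strings. The function names `tandem_duplications` and `td_distance` are hypothetical, chosen for illustration.

```python
from collections import deque


def tandem_duplications(s):
    """Yield every string obtained from s by one tandem duplication AXB -> AXXB."""
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n + 1):
            # X = s[i:j] is copied and inserted right after itself.
            yield s[:j] + s[i:j] + s[j:]


def td_distance(source, target):
    """Brute-force TD distance (illustrative only, exponential time).

    Returns the minimum number of tandem duplications turning source into
    target, or None if target cannot be reached.
    """
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        current, dist = queue.popleft()
        for nxt in tandem_duplications(current):
            # Duplications only lengthen a string, so prune anything too long.
            if len(nxt) > len(target) or nxt in seen:
                continue
            if nxt == target:
                return dist + 1
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None
```

For example, `td_distance("ab", "abab")` needs a single duplication of the whole string, while `td_distance("ab", "ba")` returns `None` because duplications never shorten or reorder a string.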

    Reconstruction Codes for DNA Sequences with Uniform Tandem-Duplication Errors

    DNA as a data storage medium has several advantages, including far greater data density compared to electronic media. We propose that schemes for data storage in the DNA of living organisms may benefit from studying the reconstruction problem, which is applicable whenever multiple reads of noisy data are available. This strategy is uniquely suited to the medium, which inherently replicates stored data in multiple distinct ways, caused by mutations. We consider noise introduced solely by uniform tandem-duplication, and utilize the relation to constant-weight integer codes in the Manhattan metric. By bounding the intersection of the cross-polytope with hyperplanes, we prove the existence of reconstruction codes with greater capacity than known error-correcting codes, which we can determine analytically for any set of parameters. Comment: 11 pages, 2 figures, LaTeX; version accepted for publicatio

    The Capacity of Some Pólya String Models

    We study random string-duplication systems, which we call Pólya string models. These are motivated by DNA storage in living organisms, and certain random mutation processes that affect their genome. Unlike previous works that study the combinatorial capacity of string-duplication systems, or various string statistics, this work provides exact capacity or bounds on it, for several probabilistic models. In particular, we study the capacity of noisy string-duplication systems, including the tandem-duplication, end-duplication, and interspersed-duplication systems. Interesting connections are drawn between some systems and the signature of random permutations, as well as to the beta distribution common in population genetics

    A Stochastic Model for Genomic Interspersed Duplication

    Mutation processes such as point mutation, insertion, deletion, and duplication (including tandem and interspersed duplication) have an important role in evolution, as they lead to genomic diversity, and thus to phenotypic variation. In this work, we study the expressive power of interspersed duplication, i.e., its ability to generate diversity, via a simple but fundamental stochastic model, where the length and the location of the subsequence that is duplicated and the point of insertion of the copy are chosen randomly. In contrast to combinatorial models, where the goal is to determine the set of possible outcomes regardless of their likelihood, in stochastic systems, we investigate the properties of the set of high-probability sequences. In particular, we provide results regarding the asymptotic behavior of frequencies of symbols and short words in a sequence evolving through interspersed duplication. The study of such systems is an important step towards the design and analysis of more realistic and sophisticated models of genomic mutation processes

    Codes Correcting All Patterns of Tandem-Duplication Errors of Maximum Length 3

    The set of all $q$-ary strings that do not contain repeated substrings of length $\leq \ell$ forms a code correcting all patterns of tandem-duplication errors of length $\leq \ell$, when $\ell \in \{1, 2, 3\}$. For $\ell \in \{1, 2\}$, this code is also known to be optimal in terms of asymptotic rate. The purpose of this paper is to demonstrate asymptotic optimality for the case $\ell = 3$ as well, and to give the corresponding characterization of the zero-error capacity of the $(\leq 3)$-tandem-duplication channel. This settles the zero-error problem for $(\leq \ell)$-tandem-duplication channels in all cases where duplication roots of strings are unique. Comment: 5 pages (double-column format
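
The code described in this abstract can be enumerated for small parameters with a short Python sketch. It rests on an assumption about the abstract's wording: "repeated substrings of length ≤ ℓ" is read here as adjacent repeats XX with 1 ≤ |X| ≤ ℓ, i.e. the strings that could be the output of a tandem duplication of length at most ℓ. The function names are hypothetical, and the enumeration is brute force, not the paper's construction or proof technique.

```python
from itertools import product


def has_short_tandem_repeat(s, ell):
    """Return True if s contains a substring XX with 1 <= len(X) <= ell."""
    for k in range(1, ell + 1):
        # Compare each length-k window with the window that follows it.
        for i in range(len(s) - 2 * k + 1):
            if s[i:i + k] == s[i + k:i + 2 * k]:
                return True
    return False


def irreducible_strings(alphabet, n, ell):
    """List all length-n strings over alphabet with no tandem repeat of length <= ell."""
    words = ("".join(w) for w in product(alphabet, repeat=n))
    return [w for w in words if not has_short_tandem_repeat(w, ell)]
```

For instance, `has_short_tandem_repeat("abcabc", 3)` is true because the whole string is a repeat of `abc`, while the same string with `ell=2` contains no repeat of length 1 or 2.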

    Attaining the 2nd Chargaff Rule by Tandem Duplications

    Erwin Chargaff in 1950 made an experimental observation that the count of A is equal to the count of T and the count of C is equal to the count of G in DNA. This observation played a crucial role in the discovery of the double-stranded helix structure by Watson and Crick. However, this symmetry was also observed in single-stranded DNA. This phenomenon was termed the 2nd Chargaff Rule. This symmetry has been verified experimentally in genomes of several different species, not only for mononucleotides but also for reverse-complement pairs of larger lengths, up to a small error. While the symmetry in double-stranded DNA is related to base pairing and replication mechanisms, the symmetry in single-stranded DNA is still a mystery in its function and source. In this work, we define a sequence generation model based on reverse-complement tandem duplications. We show that this model generates sequences that satisfy the 2nd Chargaff Rule even when the duplication lengths are very small compared to the length of the sequences. We also provide estimates of the number of generations needed by this model to generate sequences that satisfy the 2nd Chargaff Rule. We provide theoretical bounds on the disruption in symmetry for different values of duplication lengths under this model. Moreover, we experimentally compare the disruption in the symmetry incurred by our model with what is observed in human genome data
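
A reverse-complement tandem duplication can be sketched in Python as follows. The abstract does not spell out the model, so this sketch assumes that one step picks a segment uniformly at random and inserts its reverse complement immediately after it. The helper names, the fixed duplication length, and the `skew` measure are all hypothetical choices made for illustration, not the paper's definitions.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(s):
    """Reverse the string and swap A<->T, C<->G."""
    return "".join(COMPLEMENT[c] for c in reversed(s))


def rc_tandem_duplicate(s, start, length):
    """Insert the reverse complement of s[start:start+length] right after that segment."""
    end = start + length
    return s[:end] + reverse_complement(s[start:end]) + s[end:]


def simulate(seed_seq, steps, length, rng):
    """Apply `steps` duplications of fixed `length` at uniformly random positions (assumed model)."""
    s = seed_seq
    for _ in range(steps):
        start = rng.randrange(len(s) - length + 1)
        s = rc_tandem_duplicate(s, start, length)
    return s


def skew(s):
    """Normalized mononucleotide asymmetry: (|#A - #T| + |#C - #G|) / len(s)."""
    return (abs(s.count("A") - s.count("T")) + abs(s.count("C") - s.count("G"))) / len(s)
```

Calling `skew(simulate("AACG", 1000, 2, random.Random(0)))` gives one way to watch how the asymmetry behaves as the sequence grows. Duplicating the whole of `AACG` yields `AACGCGTT`, whose skew is exactly zero.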