
24 research outputs found

The Capacity of String-Duplication Systems

Author: Bruck Jehoshua
Farnoud Farzad
Schwartz Moshe
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2016

It is known that the majority of the human genome consists of duplicated sequences. Furthermore, it is believed that a significant part of the rest of the genome also originated from duplicated sequences and has mutated to its current form. In this paper, we investigate the possibility of constructing an exponentially large number of sequences from a short initial sequence using simple duplication rules, including those resembling genomic-duplication processes. In other words, our goal is to find the capacity, or the expressive power, of these string-duplication systems. Our results include exact capacities, and bounds on the capacities, of four fundamental string-duplication systems. The study of these fundamental biologically inspired systems is an important step toward modeling and analyzing more complex biological processes
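The notion of capacity used in the abstract above can be written down as a growth rate. The following is a common formalization and is an assumption here, not a quotation from the paper: $\Sigma$ is the alphabet, $S$ is the set of all strings reachable from the initial sequence under the duplication rules, and $\Sigma^n$ is the set of all strings of length $n$.

```latex
\[
  \operatorname{cap}(S) \;=\; \limsup_{n \to \infty} \frac{1}{n} \log_{|\Sigma|} \bigl| S \cap \Sigma^{n} \bigr|
\]
```

Under this reading, a capacity strictly greater than zero means that the number of reachable strings of length $n$ grows exponentially in $n$, which matches the "exponentially large number of sequences" phrasing of the abstract.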

Caltech Authors

The capacity of string-duplication systems

Author: Bruck Jehoshua
Farnoud Farzad
Schwartz Moshe
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/07/2014

It is known that the majority of the human genome consists of repeated sequences. Furthermore, it is believed that a significant part of the rest of the genome also originated from repeated sequences and has mutated to its current form. In this paper, we investigate the possibility of constructing an exponentially large number of sequences from a short initial sequence and simple duplication rules, including those resembling genomic duplication processes. In other words, our goal is to find out the capacity, or the expressive power, of these string-duplication systems. Our results include the exact capacities, and bounds on the capacities, of four fundamental string-duplication systems

Crossref

Caltech Authors

Noise and Uncertainty in String-Duplication Systems

Author: Bruck Jehoshua
Farnoud (Hassanzadeh) Farzad
Jain Siddharth
Schwartz Moshe
Publication venue: 'California Institute of Technology Library'
Publication date: 01/01/2017

Duplication mutations play a critical role in the generation of biological sequences. Simultaneously, they have a deleterious effect on data stored using in-vivo DNA data storage. While duplications have been studied both as a sequence-generation mechanism and in the context of error correction, for simplicity these studies have not taken into account the presence of other types of mutations. In this work, we consider the capacity of duplication mutations in the presence of point-mutation noise, and so quantify the generation power of these mutations. We show that if the number of point mutations is vanishingly small compared to the number of duplication mutations of a constant length, the generation capacity of these mutations is zero. However, if the number of point mutations increases to a constant fraction of the number of duplications, then the capacity is nonzero. Lower and upper bounds for this capacity are also presented. Another problem that we study is concerned with the mismatch between code design and channel in data storage in the DNA of living organisms with respect to duplication mutations. In this context, we consider the uncertainty of such a mismatched coding scheme measured as the maximum number of input codewords that can lead to the same output

Crossref

Caltech Authors

The Tandem Duplication Distance Is NP-Hard

Author: Lafond Manuel
Zhu Binhai
Zou Peng
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 37th International Symposium on Theoretical Aspects of Computer Science (STACS 2020)
Publication date: 12/06/2019

In computational biology, tandem duplication is an important biological phenomenon which can occur either at the genome or at the DNA level. A tandem duplication takes a copy of a genome segment and inserts it right after the segment - this can be represented as the string operation AXB → AXXB. Tandem exon duplications have been found in many species such as human, fly or worm, and have been largely studied in computational biology. The Tandem Duplication (TD) distance problem we investigate in this paper is defined as follows: given two strings S and T over the same alphabet, compute the smallest sequence of tandem duplications required to convert S to T. The natural question of whether the TD distance can be computed in polynomial time was posed in 2004 by Leupold et al. and had remained open, despite the fact that tandem duplications have received much attention ever since. In this paper, we prove that this problem is NP-hard, settling the 16-year old open problem. We further show that this hardness holds even if all characters of S are distinct. This is known as the exemplar TD distance, which is of special relevance in bioinformatics. One of the tools we develop for the reduction is a new problem called the Cost-Effective Subgraph, for which we obtain W[1]-hardness results that might be of independent interest. We finally show that computing the exemplar TD distance between S and T is fixed-parameter tractable. Our results open the door to many other questions, and we conclude with several open problems
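The tandem-duplication operation and the TD distance described in the abstract above can be illustrated with a minimal Python sketch. The function names `tandem_duplicate` and `td_distance` are hypothetical, and the search is a plain breadth-first enumeration rather than anything from the paper; since the problem is NP-hard, it is only practical for very short strings.

```python
from collections import deque


def tandem_duplicate(s: str, start: int, length: int) -> str:
    """Apply AXB -> AXXB, where X = s[start:start + length]."""
    if length <= 0 or start < 0 or start + length > len(s):
        raise ValueError("segment out of range")
    end = start + length
    return s[:end] + s[start:end] + s[end:]


def td_distance(source: str, target: str):
    """Fewest tandem duplications turning source into target.

    Brute-force breadth-first search (exponential time).
    Returns None when target cannot be reached from source.
    """
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        current, steps = queue.popleft()
        # A duplication of length k grows the string by k, so k is
        # bounded by the remaining length gap to the target.
        for length in range(1, len(target) - len(current) + 1):
            for start in range(len(current) - length + 1):
                nxt = tandem_duplicate(current, start, length)
                if nxt == target:
                    return steps + 1
                if len(nxt) < len(target) and nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + 1))
    return None
```

For example, `td_distance("ab", "aabb")` needs two duplications (first `a`, then `b`), while `td_distance("ab", "ba")` returns `None` because duplications never shorten or reorder a string.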

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Reconstruction Codes for DNA Sequences with Uniform Tandem-Duplication Errors

Author: Schwartz Moshe
Yehezkeally Yonatan
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 05/09/2019

DNA as a data storage medium has several advantages, including far greater data density compared to electronic media. We propose that schemes for data storage in the DNA of living organisms may benefit from studying the reconstruction problem, which is applicable whenever multiple reads of noisy data are available. This strategy is uniquely suited to the medium, which inherently replicates stored data in multiple distinct ways, caused by mutations. We consider noise introduced solely by uniform tandem-duplication, and utilize the relation to constant-weight integer codes in the Manhattan metric. By bounding the intersection of the cross-polytope with hyperplanes, we prove the existence of reconstruction codes with greater capacity than known error-correcting codes, which we can determine analytically for any set of parameters. Comment: 11 pages, 2 figures, LaTeX; version accepted for publicatio

arXiv.org e-Print Archive

Crossref

The Capacity of Some Pólya String Models

Author: Bruck Jehoshua
Elishco Ohad
Farnoud Farzad
Schwartz Moshe
Publication date: 01/08/2018

We study random string-duplication systems, which we call Pólya string models. These are motivated by DNA storage in living organisms, and certain random mutation processes that affect their genome. Unlike previous works that study the combinatorial capacity of string-duplication systems, or various string statistics, this work provides exact capacity or bounds on it, for several probabilistic models. In particular, we study the capacity of noisy string-duplication systems, including the tandem-duplication, end-duplication, and interspersed-duplication systems. Interesting connections are drawn between some systems and the signature of random permutations, as well as to the beta distribution common in population genetics

arXiv.org e-Print Archive

Caltech Authors

A Stochastic Model for Genomic Interspersed Duplication

Author: Bruck Jehoshua
Farnoud Farzad
Schwartz Moshe
Publication venue: 'California Institute of Technology Library'
Publication date: 01/01/2015

Mutation processes such as point mutation, insertion, deletion, and duplication (including tandem and interspersed duplication) have an important role in evolution, as they lead to genomic diversity, and thus to phenotypic variation. In this work, we study the expressive power of interspersed duplication, i.e., its ability to generate diversity, via a simple but fundamental stochastic model, where the length and the location of the subsequence that is duplicated and the point of insertion of the copy are chosen randomly. In contrast to combinatorial models, where the goal is to determine the set of possible outcomes regardless of their likelihood, in stochastic systems, we investigate the properties of the set of high-probability sequences. In particular, we provide results regarding the asymptotic behavior of frequencies of symbols and short words in a sequence evolving through interspersed duplication. The study of such systems is an important step towards the design and analysis of more realistic and sophisticated models of genomic mutation processes
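The stochastic model described in the abstract above can be illustrated with a small simulation sketch. The abstract says only that the length, the location, and the insertion point are chosen randomly; the uniform choices below, the optional `max_len` cap, and all function names are assumptions made for illustration, not the paper's definitions.

```python
import random
from collections import Counter


def interspersed_duplication_step(seq: str, rng: random.Random, max_len=None) -> str:
    """Copy a random substring and insert the copy at a random position.

    Assumption: length, start and insertion point are drawn uniformly.
    """
    n = len(seq)
    limit = n if max_len is None else min(max_len, n)
    length = rng.randint(1, limit)       # length of the duplicated segment
    start = rng.randint(0, n - length)   # where the segment begins
    insert_at = rng.randint(0, n)        # where the copy is inserted
    segment = seq[start:start + length]
    return seq[:insert_at] + segment + seq[insert_at:]


def evolve(seed: str, steps: int, rng_seed: int = 0, max_len=None) -> str:
    """Apply the duplication step repeatedly, starting from seed."""
    rng = random.Random(rng_seed)
    seq = seed
    for _ in range(steps):
        seq = interspersed_duplication_step(seq, rng, max_len)
    return seq


def symbol_frequencies(seq: str) -> dict:
    """Relative frequency of each symbol in seq."""
    counts = Counter(seq)
    return {sym: c / len(seq) for sym, c in counts.items()}
```

Calling `symbol_frequencies(evolve("ACGT", 1000, max_len=3))` gives one sample of the symbol frequencies after 1000 duplications; repeating with different `rng_seed` values gives a rough empirical picture of how those frequencies behave as the sequence grows.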

CiteSeerX

Crossref

Caltech Authors

Codes Correcting All Patterns of Tandem-Duplication Errors of Maximum Length 3

Author: Kovačević Mladen
Publication date: 15/11/2019

The set of all $q$-ary strings that do not contain repeated substrings of length $\leq\ell$ forms a code correcting all patterns of tandem-duplication errors of length $\leq\ell$, when $\ell \in \{1, 2, 3\}$. For $\ell \in \{1, 2\}$, this code is also known to be optimal in terms of asymptotic rate. The purpose of this paper is to demonstrate asymptotic optimality for the case $\ell = 3$ as well, and to give the corresponding characterization of the zero-error capacity of the $(\leq 3)$-tandem-duplication channel. This settles the zero-error problem for $(\leq\ell)$-tandem-duplication channels in all cases where duplication roots of strings are unique. Comment: 5 pages (double-column format

arXiv.org e-Print Archive

Attaining the 2nd Chargaff Rule by Tandem Duplications

Author: Bruck Jehoshua
Jain Siddharth
Raviv Netanel
Publication venue: 'California Institute of Technology Library'
Publication date: 04/01/2018

In 1950, Erwin Chargaff observed experimentally that in DNA the count of A equals the count of T and the count of C equals the count of G. This observation played a crucial role in the discovery of the double-stranded helix structure by Watson and Crick. However, this symmetry was also observed in single-stranded DNA, a phenomenon termed the 2nd Chargaff Rule. This symmetry has been verified experimentally, up to a small error, in the genomes of several different species, not only for mononucleotides but also for reverse-complement pairs of larger lengths. While the symmetry in double-stranded DNA is related to base pairing and replication mechanisms, the function and source of the symmetry in single-stranded DNA remain a mystery. In this work, we define a sequence-generation model based on reverse-complement tandem duplications. We show that this model generates sequences that satisfy the 2nd Chargaff Rule even when the duplication lengths are very small compared to the length of the sequences. We also provide estimates of the number of generations needed by this model to generate sequences that satisfy the 2nd Chargaff Rule. We provide theoretical bounds on the disruption in symmetry for different values of duplication lengths under this model. Moreover, we experimentally compare the disruption in symmetry incurred by our model with what is observed in human genome data
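One way to picture the model in the abstract above is the following sketch, which reads a "reverse complement tandem duplication" as inserting the reverse complement of a segment immediately after that segment. This reading, the uniform random choices, the mononucleotide imbalance measure, and all names are assumptions for illustration; the paper's exact model may differ.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(segment: str) -> str:
    """Reverse the segment and swap A<->T, C<->G."""
    return "".join(COMPLEMENT[base] for base in reversed(segment))


def rc_tandem_duplicate(seq: str, start: int, length: int) -> str:
    """Insert the reverse complement of seq[start:start+length] right after it."""
    end = start + length
    return seq[:end] + reverse_complement(seq[start:end]) + seq[end:]


def chargaff_imbalance(seq: str) -> float:
    """Normalized mononucleotide asymmetry: (|#A - #T| + |#C - #G|) / len(seq)."""
    a_t = abs(seq.count("A") - seq.count("T"))
    c_g = abs(seq.count("C") - seq.count("G"))
    return (a_t + c_g) / len(seq)


def simulate(seed: str, steps: int, max_len: int, rng_seed: int = 0) -> str:
    """Apply random reverse-complement tandem duplications to seed."""
    rng = random.Random(rng_seed)
    seq = seed
    for _ in range(steps):
        length = rng.randint(1, min(max_len, len(seq)))
        start = rng.randint(0, len(seq) - length)
        seq = rc_tandem_duplicate(seq, start, length)
    return seq
```

A single step already shows the balancing effect on the duplicated segment: `rc_tandem_duplicate("AAC", 0, 3)` yields `"AACGTT"`, whose imbalance is zero. Tracking `chargaff_imbalance(simulate(seed, steps, max_len))` for growing `steps` gives an informal look at how the symmetry evolves under these assumptions.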

Crossref

Caltech Authors