Search CORE

1,355 research outputs found

Reconstruction Codes for DNA Sequences with Uniform Tandem-Duplication Errors

Author: Schwartz Moshe
Yehezkeally Yonatan
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 05/09/2019
Field of study

DNA as a data storage medium has several advantages, including far greater data density compared to electronic media. We propose that schemes for data storage in the DNA of living organisms may benefit from studying the reconstruction problem, which is applicable whenever multiple reads of noisy data are available. This strategy is uniquely suited to the medium, which inherently replicates stored data in multiple distinct ways, caused by mutations. We consider noise introduced solely by uniform tandem-duplication, and utilize the relation to constant-weight integer codes in the Manhattan metric. By bounding the intersection of the cross-polytope with hyperplanes, we prove the existence of reconstruction codes with greater capacity than known error-correcting codes, which we can determine analytically for any set of parameters.Comment: 11 pages, 2 figures, Latex; version accepted for publicatio

arXiv.org e-Print Archive

Crossref

The Capacity of Some P\'olya String Models

Author: Bruck Jehoshua
Elishco Ohad
Farnoud Farzad
Schwartz Moshe
Publication venue
Publication date: 01/08/2018
Field of study

We study random string-duplication systems, which we call P\'olya string models. These are motivated by DNA storage in living organisms, and certain random mutation processes that affect their genome. Unlike previous works that study the combinatorial capacity of string-duplication systems, or various string statistics, this work provides exact capacity or bounds on it, for several probabilistic models. In particular, we study the capacity of noisy string-duplication systems, including the tandem-duplication, end-duplication, and interspersed-duplication systems. Interesting connections are drawn between some systems and the signature of random permutations, as well as to the beta distribution common in population genetics

arXiv.org e-Print Archive

Caltech Authors

Low-redundancy codes for correcting multiple short-duplication and edit errors

Author: Farnoud Farzad
Gabrys Ryan
Lou Hao
Tang Yuanyuan
Wang Shuche
Publication venue
Publication date: 03/08/2022
Field of study

Due to its higher data density, longevity, energy efficiency, and ease of generating copies, DNA is considered a promising storage technology for satisfying future needs. However, a diverse set of errors including deletions, insertions, duplications, and substitutions may arise in DNA at different stages of data storage and retrieval. The current paper constructs error-correcting codes for simultaneously correcting short (tandem) duplications and at most

p

edits, where a short duplication generates a copy of a substring with length

\leq 3

and inserts the copy following the original substring, and an edit is a substitution, deletion, or insertion. Compared to the state-of-the-art codes for duplications only, the proposed codes correct up to

p

edits (in addition to duplications) at the additional cost of roughly

8p(\log_q n)(1+o(1))

symbols of redundancy, thus achieving the same asymptotic rate, where

q\ge 4

is the alphabet size and

p

is a constant. Furthermore, the time complexities of both the encoding and decoding processes are polynomial when

p

is a constant with respect to the code length.Comment: 21 pages. The paper has been submitted to IEEE Transaction on Information Theory. Furthermore, the paper was presented in part at the ISIT2021 and ISIT202

arXiv.org e-Print Archive

The capacity of some Pólya string models

Author: Bruck Jehoshua
Elishco Ohad
Farnoud Farzad
Schwartz Moshe
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/07/2016
Field of study

We study random string-duplication systems, called Pólya string models, motivated by certain random mutation processes in the genome of living organisms. Unlike previous works that study the combinatorial capacity of string-duplication systems, or peripheral properties such as symbol frequency, this work provides exact capacity or bounds on it, for several probabilistic models. In particular, we give the exact capacity of the random tandem-duplication system, and the end-duplication system, and bound the capacity of the complement tandem-duplication system. Interesting connections are drawn between the former and the beta distribution common to population genetics, as well as between the latter system and signatures of random permutations

Caltech Authors

Coding for Optimized Writing Rate in DNA Storage

Author: Bruck Jehoshua
Farnoud Farzad
Jain Siddharth
Schwartz Moshe
Publication venue
Publication date: 01/06/2020
Field of study

A method for encoding information in DNA sequences is described. The method is based on the precisionresolution framework, and is aimed to work in conjunction with a recently suggested terminator-free template independent DNA synthesis method. The suggested method optimizes the amount of information bits per synthesis time unit, namely, the writing rate. Additionally, the encoding scheme studied here takes into account the existence of multiple copies of the DNA sequence, which are independently distorted. Finally, quantizers for various run-length distributions are designed

Crossref

Caltech Authors