Search CORE

27 research outputs found

The Capacity of Some P\'olya String Models

Author: Bruck Jehoshua
Elishco Ohad
Farnoud Farzad
Schwartz Moshe
Publication venue
Publication date: 01/08/2018
Field of study

We study random string-duplication systems, which we call P\'olya string models. These are motivated by DNA storage in living organisms, and certain random mutation processes that affect their genome. Unlike previous works that study the combinatorial capacity of string-duplication systems, or various string statistics, this work provides exact capacity or bounds on it, for several probabilistic models. In particular, we study the capacity of noisy string-duplication systems, including the tandem-duplication, end-duplication, and interspersed-duplication systems. Interesting connections are drawn between some systems and the signature of random permutations, as well as to the beta distribution common in population genetics

arXiv.org e-Print Archive

Caltech Authors

Codes Correcting All Patterns of Tandem-Duplication Errors of Maximum Length 3

Author: Kovačević Mladen
Publication venue
Publication date: 15/11/2019
Field of study

The set of all

q

-ary strings that do not contain repeated substrings of length

\leq\ell

forms a code correcting all patterns of tandem-duplication errors of length

\leq\ell

, when

\ell \in \{1, 2, 3\}

. For

\ell \in \{1, 2\}

, this code is also known to be optimal in terms of asymptotic rate. The purpose of this paper is to demonstrate asymptotic optimality for the case

\ell = 3

as well, and to give the corresponding characterization of the zero-error capacity of the

(\leq 3)

-tandem-duplication channel. This settles the zero-error problem for

(\leq\ell)

-tandem-duplication channels in all cases where duplication roots of strings are unique.Comment: 5 pages (double-column format

arXiv.org e-Print Archive

The capacity of some Pólya string models

Author: Bruck Jehoshua
Elishco Ohad
Farnoud Farzad
Schwartz Moshe
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/07/2016
Field of study

We study random string-duplication systems, called Pólya string models, motivated by certain random mutation processes in the genome of living organisms. Unlike previous works that study the combinatorial capacity of string-duplication systems, or peripheral properties such as symbol frequency, this work provides exact capacity or bounds on it, for several probabilistic models. In particular, we give the exact capacity of the random tandem-duplication system, and the end-duplication system, and bound the capacity of the complement tandem-duplication system. Interesting connections are drawn between the former and the beta distribution common to population genetics, as well as between the latter system and signatures of random permutations

Caltech Authors

Evolution of k-mer Frequencies and Entropy in Duplication and Substitution Mutation Systems

Author: Bruck Jehoshua
Farnoud (Hassanzadeh) Farzad
Lou Hao
Schwartz Moshe
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/05/2020
Field of study

Genomic evolution can be viewed as string-editing processes driven by mutations. An understanding of the statistical properties resulting from these mutation processes is of value in a variety of tasks related to biological sequence data, e.g., estimation of model parameters and compression. At the same time, due to the complexity of these processes, designing tractable stochastic models and analyzing them are challenging. In this paper, we study two kinds of systems, each representing a set of mutations. In the first system, tandem duplications and substitution mutations are allowed and in the other, interspersed duplications. We provide stochastic models and, via stochastic approximation, study the evolution of substring frequencies for these two systems separately. Specifically, we show that k-mer frequencies converge almost surely and determine the limit set. Furthermore, we present a method for finding upper bounds on entropy for such systems

Caltech Authors

The Tandem Duplication Distance Problem is hard over bounded alphabets

Author: Cicalese Ferdinando
Pilati Nicolò
Publication venue
Publication date: 01/01/2021
Field of study

A tandem duplication denotes the process of inserting a copy of a segment of DNA adjacent to its original position. More formally, a tandem duplication can be thought of as an operation that converts a string

S = AXB

into a string

T = AXXB.

As they appear to be involved in genetic disorders, tandem duplications are widely studied in computational biology. Also, tandem duplication mechanisms have been recently studied in different contexts, from formal languages, to information theory, to error-correcting codes for DNA storage systems. The problem of determining the complexity of computing the tandem duplication distance between two given strings was proposed by [Leupold et al., 2004] and, very recently, it was shown to be NP-hard for the case of unbounded alphabets [Lafond et al., STACS2020]. In this paper, we significantly improve this result and show that the tandem duplication distance problem is NP-hard already for the case of strings over an alphabet of size

\leq 5.

We also study some special classes of strings were it is possible to give linear time solutions to the existence problem: given strings

S

and

T

over the same alphabet, decide whether there exists a sequence of duplications converting

S

into

T

. A polynomial time algorithm that solves the existence problem was only known for the case of the binary alphabet

arXiv.org e-Print Archive

Catalogo dei prodotti della ricerca

Low-redundancy codes for correcting multiple short-duplication and edit errors

Author: Farnoud Farzad
Gabrys Ryan
Lou Hao
Tang Yuanyuan
Wang Shuche
Publication venue
Publication date: 03/08/2022
Field of study

Due to its higher data density, longevity, energy efficiency, and ease of generating copies, DNA is considered a promising storage technology for satisfying future needs. However, a diverse set of errors including deletions, insertions, duplications, and substitutions may arise in DNA at different stages of data storage and retrieval. The current paper constructs error-correcting codes for simultaneously correcting short (tandem) duplications and at most

p

edits, where a short duplication generates a copy of a substring with length

\leq 3

and inserts the copy following the original substring, and an edit is a substitution, deletion, or insertion. Compared to the state-of-the-art codes for duplications only, the proposed codes correct up to

p

edits (in addition to duplications) at the additional cost of roughly

8p(\log_q n)(1+o(1))

symbols of redundancy, thus achieving the same asymptotic rate, where

q\ge 4

is the alphabet size and

p

is a constant. Furthermore, the time complexities of both the encoding and decoding processes are polynomial when

p

is a constant with respect to the code length.Comment: 21 pages. The paper has been submitted to IEEE Transaction on Information Theory. Furthermore, the paper was presented in part at the ISIT2021 and ISIT202

arXiv.org e-Print Archive

Evolution of k-mer Frequencies and Entropy in Duplication and Substitution Mutation Systems

Author: Bruck Jehoshua
Farnoud (Hassanzadeh) Farzad
Lou Hao
Schwartz Moshe
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/05/2020
Field of study

Decoding the Past

Author: Jain Siddharth
Publication venue
Publication date: 01/01/2019
Field of study

The human genome is continuously evolving, hence the sequenced genome is a snapshot in time of this evolving entity. Over time, the genome accumulates mutations that can be associated with different phenotypes - like physical traits, diseases, etc. Underlying mutation accumulation is an evolution channel (the term channel is motivated by the notion of communication channel introduced by Shannon [1] in 1948 and started the area of Information Theory), which is controlled by hereditary, environmental, and stochastic factors. The premise of this thesis is to understand the human genome using information theory framework. In particular, it focuses on: (i) the analysis and characterization of the evolution channel using measures of capacity, expressiveness, evolution distance, and uniqueness of ancestry and uses these insights for (ii) the design of error correcting codes for DNA storage, (iii) inversion symmetry in the genome and (iv) cancer classification. The mutational events characterizing this evolution channel can be divided into two categories, namely point mutations and duplications. While evolution through point mutations is unconstrained, giving rise to combinatorially many possibilities of what could have happened in the past, evolution through duplications adds constraints limiting the number of those possibilities. Further, more than 50% of the genome has been observed to consist of repeated sequences. We focus on the much constrained form of duplications known as tandem duplications in order to understand the limits of evolution by duplication. Our sequence evolution model consists of a starting sequence called seed and a set of tandem duplication rules. We find limits on the diversity of sequences that can be generated by tandem duplications using measures of capacity and expressiveness. Additionally, we calculate bounds on the duplication distance which is used to measure the timing of generation by these duplications. We also ask questions about the uniqueness of seed for a given sequence and completely characterize the duplication length sets where the seed is unique or non-unique. These insights also led us to design error correcting codes for any number of tandem duplication errors that are useful for DNA-storage based applications. For uniform duplication length and duplication length bounded by 2, our designed codes achieve channel capacity. We also define and measure uncertainty in decoding when the duplication channel is misinformed. Moreover, we add substitutions to our tandem duplication model and calculate sequence generation diversity for a given budget of substitutions. We also use our duplication model to explain the inversion symmetry observed in the genome of many species. The inversion symmetry is popularly known as the 2nd Chargaff Rule, according to which in a single strand DNA, the frequency of a k-mer is almost the same as the frequency of its reverse complement. The insights gained by these problems led us to investigate the tandem repeat regions in the genome. Tandem repeat regions in the genome can be traced back in time algorithmically to make inference about the effect of the hereditary, environmental and stochastic factors on the mutation rate of the genome. By inferring the evolutionary history of the tandem repeat regions, we show how this knowledge can be used to make predictions about the risk of incurring a mutation based disease, specifically cancer. More precisely, we introduce the concept of mutation profiles that are computed without any comparative analysis, but instead by analyzing the short tandem repeat regions in a single healthy genome and capturing information about the individual's evolution channel. Using gradient boosting on data from more than 5,000 TCGA (The Cancer Genome Atlas) cancer patients, we demonstrate that these mutation profiles can accurately distinguish between patients with various types of cancer. For example, the pairwise validation accuracy of the classifier between PAAD (pancreas) patients and GBM (brain) patients is 93%. Our results show that healthy unaffected cells still contain a cancer-specific signal, which opens the possibility of cancer prediction from a healthy genome.</p

Caltech Theses and Dissertations