The Capacity of Some Pólya String Models
We study random string-duplication systems, which we call Pólya string
models. These are motivated by DNA storage in living organisms, and certain
random mutation processes that affect their genome. Unlike previous works that
study the combinatorial capacity of string-duplication systems, or various
string statistics, this work provides exact capacity or bounds on it, for
several probabilistic models. In particular, we study the capacity of noisy
string-duplication systems, including the tandem-duplication, end-duplication,
and interspersed-duplication systems. Interesting connections are drawn between
some systems and the signature of random permutations, as well as to the beta
distribution common in population genetics.
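The duplication operations named in this abstract can be illustrated with a short simulation. This is only a sketch: the function names, the fixed duplication length `k`, and the choice of a uniformly random position at each step are assumptions made for illustration, not the probabilistic models defined in the paper.

```python
import random


def tandem_duplicate(s, i, k):
    """Insert a copy of s[i:i+k] immediately after the original segment."""
    return s[:i + k] + s[i:i + k] + s[i + k:]


def end_duplicate(s, i, k):
    """Append a copy of s[i:i+k] to the end of the string."""
    return s + s[i:i + k]


def simulate(seed, k, steps, op, rng):
    """Apply `op` `steps` times, each time at a uniformly random position.

    The uniform choice of position is an assumption of this sketch.
    """
    s = seed
    for _ in range(steps):
        i = rng.randrange(len(s) - k + 1)
        s = op(s, i, k)
    return s


if __name__ == "__main__":
    rng = random.Random(0)
    print(simulate("ACGT", 2, 5, tandem_duplicate, rng))
    print(simulate("ACGT", 2, 5, end_duplicate, rng))
```

Each step lengthens the string by exactly `k` symbols and never introduces a symbol that is absent from the seed.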
Codes Correcting All Patterns of Tandem-Duplication Errors of Maximum Length 3
The set of all q-ary strings that do not contain repeated substrings of
length at most 3 forms a code correcting all patterns of tandem-duplication
errors of length at most 3, when . For , this code is also known to be optimal in terms of asymptotic rate.
The purpose of this paper is to demonstrate asymptotic optimality for the case
as well, and to give the corresponding characterization of the
zero-error capacity of the (≤3)-tandem-duplication channel. This
settles the zero-error problem for -tandem-duplication channels
in all cases where duplication roots of strings are unique. Comment: 5 pages (double-column format)
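The code described in this abstract consists of strings with no repeated substring of bounded length. A minimal sketch of that membership test follows, taking the bound 3 from the paper's title; the function names and the brute-force enumeration are assumptions for illustration and are not taken from the paper.

```python
from itertools import product


def has_tandem_repeat(s, max_len=3):
    """Return True if s contains a substring xx with 1 <= len(x) <= max_len."""
    for k in range(1, max_len + 1):
        for i in range(len(s) - 2 * k + 1):
            if s[i:i + k] == s[i + k:i + 2 * k]:
                return True
    return False


def irreducible_strings(alphabet, n, max_len=3):
    """Enumerate all length-n strings over `alphabet` with no such repeat.

    Brute force over all |alphabet|**n strings, so only usable for small n.
    """
    words = ("".join(w) for w in product(alphabet, repeat=n))
    return [w for w in words if not has_tandem_repeat(w, max_len)]


if __name__ == "__main__":
    print(has_tandem_repeat("abcabc"))
    print(len(irreducible_strings("abc", 3)))
```

Counting the strings returned by `irreducible_strings` for growing `n` gives a rough empirical view of the code's size, although the paper's rate results are asymptotic and analytic.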
The capacity of some Pólya string models
We study random string-duplication systems, called Pólya string models, motivated by certain random mutation processes in the genome of living organisms. Unlike previous works that study the combinatorial capacity of string-duplication systems, or peripheral properties such as symbol frequency, this work provides exact capacity or bounds on it, for several probabilistic models. In particular, we give the exact capacity of the random tandem-duplication system and of the end-duplication system, and bound the capacity of the complement tandem-duplication system. Interesting connections are drawn between the former and the beta distribution common to population genetics, as well as between the latter system and signatures of random permutations.
Evolution of k-mer Frequencies and Entropy in Duplication and Substitution Mutation Systems
Genomic evolution can be viewed as a string-editing process driven by mutations. An understanding of the statistical properties resulting from these mutation processes is of value in a variety of tasks related to biological sequence data, e.g., estimation of model parameters and compression. At the same time, due to the complexity of these processes, designing tractable stochastic models and analyzing them are challenging. In this paper, we study two kinds of systems, each representing a set of mutations. In the first system, tandem duplications and substitution mutations are allowed; in the second, interspersed duplications are allowed. We provide stochastic models and, via stochastic approximation, study the evolution of substring frequencies for these two systems separately. Specifically, we show that k-mer frequencies converge almost surely and determine the limit set. Furthermore, we present a method for finding upper bounds on entropy for such systems.
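The k-mer frequencies discussed in this abstract are the relative frequencies of length-k substrings. A minimal sketch of how they might be computed for one observed string is given below; the function names are illustrative, and the empirical entropy shown is the plain Shannon entropy of the observed k-mer distribution, not the entropy bound derived in the paper.

```python
import math
from collections import Counter


def kmer_frequencies(s, k):
    """Return the relative frequency of every k-mer occurring in s."""
    total = len(s) - k + 1
    if total <= 0:
        return {}
    counts = Counter(s[i:i + k] for i in range(total))
    return {kmer: c / total for kmer, c in counts.items()}


def empirical_entropy(freqs):
    """Shannon entropy (in bits) of a frequency dictionary."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)


if __name__ == "__main__":
    freqs = kmer_frequencies("ACGACG", 3)
    print(freqs)
    print(empirical_entropy(freqs))
```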
The Tandem Duplication Distance Problem is hard over bounded alphabets
A tandem duplication denotes the process of inserting a copy of a segment of
DNA adjacent to its original position. More formally, a tandem duplication can
be thought of as an operation that converts a string S = AXB into a string T = AXXB. As they appear to be involved in genetic disorders, tandem
duplications are widely studied in computational biology. Also, tandem
duplication mechanisms have been recently studied in different contexts, from
formal languages, to information theory, to error-correcting codes for DNA
storage systems.
The problem of determining the complexity of computing the tandem duplication
distance between two given strings was proposed by [Leupold et al., 2004] and,
very recently, it was shown to be NP-hard for the case of unbounded alphabets
[Lafond et al., STACS2020]. In this paper, we significantly improve this result
and show that the tandem duplication distance problem is NP-hard already for
the case of strings over an alphabet of size We also study some
special classes of strings where it is possible to give linear-time solutions to
the existence problem: given strings S and T over the same alphabet, decide
whether there exists a sequence of duplications converting S into T. A
polynomial time algorithm that solves the existence problem was only known for
the case of the binary alphabet.
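The tandem duplication operation and the distance and existence problems from this abstract can be made concrete with a brute-force search. The sketch below is not the paper's method: it is an exponential-time breadth-first search, usable only on very short strings, and the function names are assumptions. It relies on the fact that a tandem duplication always lengthens the string, so candidates longer than the target can be discarded.

```python
from collections import deque


def tandem_duplications(s):
    """Yield every string obtained from s by one tandem duplication.

    Each result copies a non-empty segment s[i:j] right after itself.
    """
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n + 1):
            yield s[:j] + s[i:j] + s[j:]


def duplication_distance(source, target):
    """Minimum number of tandem duplications turning source into target.

    Returns None when no sequence of duplications exists.
    """
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        current, dist = queue.popleft()
        for nxt in tandem_duplications(current):
            if len(nxt) > len(target) or nxt in seen:
                continue
            if nxt == target:
                return dist + 1
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None


if __name__ == "__main__":
    print(duplication_distance("ab", "aabb"))
    print(duplication_distance("ab", "ba"))
```

A return value other than `None` answers the existence problem affirmatively; the hardness result in the abstract explains why no efficient general replacement for this search is expected.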
Low-redundancy codes for correcting multiple short-duplication and edit errors
Due to its higher data density, longevity, energy efficiency, and ease of
generating copies, DNA is considered a promising storage technology for
satisfying future needs. However, a diverse set of errors including deletions,
insertions, duplications, and substitutions may arise in DNA at different
stages of data storage and retrieval. The current paper constructs
error-correcting codes for simultaneously correcting short (tandem)
duplications and at most edits, where a short duplication generates a copy
of a substring with length and inserts the copy following the original
substring, and an edit is a substitution, deletion, or insertion. Compared to
the state-of-the-art codes for duplications only, the proposed codes correct up
to edits (in addition to duplications) at the additional cost of roughly
symbols of redundancy, thus achieving the same
asymptotic rate, where is the alphabet size and is a constant.
Furthermore, the time complexities of both the encoding and decoding processes
are polynomial when is a constant with respect to the code length. Comment: 21 pages. The paper has been submitted to IEEE Transactions on
Information Theory. Furthermore, the paper was presented in part at
ISIT2021 and ISIT202
Decoding the Past
The human genome is continuously evolving; hence a sequenced genome is a snapshot in time of this evolving entity. Over time, the genome accumulates mutations that can be associated with different phenotypes, such as physical traits and diseases. Underlying mutation accumulation is an evolution channel (the term is motivated by the notion of a communication channel, introduced by Shannon [1] in 1948, which started the field of information theory), which is controlled by hereditary, environmental, and stochastic factors. The premise of this thesis is to understand the human genome using an information-theoretic framework. In particular, it focuses on (i) the analysis and characterization of the evolution channel using measures of capacity, expressiveness, evolution distance, and uniqueness of ancestry, and uses these insights for (ii) the design of error-correcting codes for DNA storage, (iii) explaining inversion symmetry in the genome, and (iv) cancer classification.
The mutational events characterizing this evolution channel can be divided into two categories, namely point mutations and duplications. While evolution through point mutations is unconstrained, giving rise to combinatorially many possibilities of what could have happened in the past, evolution through duplications adds constraints that limit the number of those possibilities. Further, more than 50% of the genome has been observed to consist of repeated sequences. We focus on the highly constrained form of duplications known as tandem duplications in order to understand the limits of evolution by duplication. Our sequence evolution model consists of a starting sequence called the seed and a set of tandem duplication rules. We find limits on the diversity of sequences that can be generated by tandem duplications using measures of capacity and expressiveness. Additionally, we calculate bounds on the duplication distance, which is used to measure the timing of generation by these duplications. We also ask questions about the uniqueness of the seed for a given sequence and completely characterize the duplication length sets for which the seed is unique or non-unique. These insights also led us to design error-correcting codes for any number of tandem duplication errors, which are useful for DNA-storage-based applications. For uniform duplication length and for duplication length bounded by 2, our designed codes achieve channel capacity. We also define and measure uncertainty in decoding when the duplication channel is misinformed. Moreover, we add substitutions to our tandem duplication model and calculate sequence generation diversity for a given budget of substitutions.
We also use our duplication model to explain the inversion symmetry observed in the genome of many species. The inversion symmetry is popularly known as the 2nd Chargaff Rule, according to which, in single-stranded DNA, the frequency of a k-mer is almost the same as the frequency of its reverse complement. The insights gained from these problems led us to investigate the tandem repeat regions in the genome. Tandem repeat regions in the genome can be traced back in time algorithmically to make inferences about the effect of hereditary, environmental, and stochastic factors on the mutation rate of the genome. By inferring the evolutionary history of the tandem repeat regions, we show how this knowledge can be used to make predictions about the risk of incurring a mutation-based disease, specifically cancer. More precisely, we introduce the concept of mutation profiles that are computed without any comparative analysis, but instead by analyzing the short tandem repeat regions in a single healthy genome and capturing information about the individual's evolution channel. Using gradient boosting on data from more than 5,000 TCGA (The Cancer Genome Atlas) cancer patients, we demonstrate that these mutation profiles can accurately distinguish between patients with various types of cancer. For example, the pairwise validation accuracy of the classifier between PAAD (pancreas) patients and GBM (brain) patients is 93%. Our results show that healthy unaffected cells still contain a cancer-specific signal, which opens the possibility of cancer prediction from a healthy genome.
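The 2nd Chargaff Rule mentioned in this abstract compares the count of each k-mer with the count of its reverse complement on a single strand. A minimal sketch of that comparison follows; the function names and the absolute-difference measure are assumptions for illustration and are not the thesis's analysis.

```python
from collections import Counter

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Complement each base, then reverse the result."""
    return kmer.translate(COMPLEMENT)[::-1]


def kmer_counts(seq, k):
    """Count every k-mer occurring in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def chargaff_gaps(seq, k):
    """Map each observed k-mer to |count(k-mer) - count(reverse complement)|.

    Small gaps relative to the counts are consistent with inversion symmetry.
    """
    counts = kmer_counts(seq, k)
    return {
        kmer: abs(c - counts.get(reverse_complement(kmer), 0))
        for kmer, c in counts.items()
    }


if __name__ == "__main__":
    print(reverse_complement("AAC"))
    print(chargaff_gaps("ACGTACGT", 2))
```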