
    Duplication Distance to the Root for Binary Sequences

    We study the tandem duplication distance between binary sequences and their roots. In other words, the quantity of interest is the number of tandem duplication operations of the form x = abc → y = abbc, where x and y are sequences and a, b, and c are their substrings, needed to generate a binary sequence of length n starting from a square-free sequence from the set {0, 1, 01, 10, 010, 101}. This problem is a restricted case of finding the duplication/deduplication distance between two sequences, defined as the minimum number of duplication and deduplication operations required to transform one sequence into the other. We consider both exact and approximate tandem duplications. For exact duplication, denoting the maximum distance to the root of a sequence of length n by f(n), we prove that f(n) = Θ(n). For the case of approximate duplication, where a β-fraction of symbols may be duplicated incorrectly, we show that the maximum distance has a sharp transition from linear in n to logarithmic at β = 1/2. We also study the duplication distance to the root for the set of sequences arising from a given root and for special classes of sequences, namely, the De Bruijn sequences, the Thue-Morse sequence, and the Fibonacci words. The problem is motivated by genomic tandem duplication mutations and the smallest number of tandem duplication events required to generate a given biological sequence.
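
The duplication operation x = abc → y = abbc and its inverse can be illustrated with a minimal Python sketch. The function names (`tandem_duplicate`, `deduplicate_once`, `root_and_steps`) are hypothetical and not taken from the paper. The greedy loop removes one copy of the first tandem repeat it finds until the sequence is square-free, so the step count it returns is only an upper bound on the duplication distance to the root, not the minimum the abstract studies.

```python
def tandem_duplicate(x: str, start: int, length: int) -> str:
    """Return y = abbc for x = abc, where b = x[start:start + length]."""
    if length <= 0 or start < 0 or start + length > len(x):
        raise ValueError("b must be a non-empty substring of x")
    b = x[start:start + length]
    return x[:start + length] + b + x[start + length:]


def deduplicate_once(y: str):
    """Remove one copy of the first tandem repeat bb found (shortest b first).

    Returns the shortened sequence, or None if y is square-free.
    """
    n = len(y)
    for length in range(1, n // 2 + 1):
        for start in range(0, n - 2 * length + 1):
            if y[start:start + length] == y[start + length:start + 2 * length]:
                return y[:start + length] + y[start + 2 * length:]
    return None


def root_and_steps(y: str):
    """Greedily deduplicate y down to a square-free sequence.

    Returns (root, steps). The step count is an upper bound on the
    distance to the root, because the greedy order is not optimal.
    """
    steps = 0
    while True:
        shorter = deduplicate_once(y)
        if shorter is None:
            return y, steps
        y = shorter
        steps += 1
```

For example, `root_and_steps("0000")` reaches the root `"0"` in 3 greedy steps, whereas two duplications suffice in the forward direction (0 → 00 → 0000), which shows why the greedy count only bounds the distance from above.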

    Reconstruction Codes for DNA Sequences with Uniform Tandem-Duplication Errors

    DNA as a data storage medium has several advantages, including far greater data density compared to electronic media. We propose that schemes for data storage in the DNA of living organisms may benefit from studying the reconstruction problem, which is applicable whenever multiple reads of noisy data are available. This strategy is uniquely suited to the medium, which inherently replicates stored data in multiple distinct ways, caused by mutations. We consider noise introduced solely by uniform tandem-duplication, and utilize the relation to constant-weight integer codes in the Manhattan metric. By bounding the intersection of the cross-polytope with hyperplanes, we prove the existence of reconstruction codes with greater capacity than known error-correcting codes, which we can determine analytically for any set of parameters. Comment: 11 pages, 2 figures, LaTeX; version accepted for publication.

    Inferring Species Trees from Incongruent Multi-Copy Gene Trees Using the Robinson-Foulds Distance

    We present a new method for inferring species trees from multi-copy gene trees. Our method is based on a generalization of the Robinson-Foulds (RF) distance to multi-labeled trees (mul-trees), i.e., gene trees in which multiple leaves can have the same label. Unlike most previous phylogenetic methods using gene trees, this method does not assume that gene tree incongruence is caused by a single, specific biological process, such as gene duplication and loss, deep coalescence, or lateral gene transfer. We prove that it is NP-hard to compute the RF distance between two mul-trees, but it is easy to calculate the generalized RF distance between a mul-tree and a singly-labeled tree. Motivated by this observation, we formulate the RF supertree problem for mul-trees (MulRF), which takes a collection of mul-trees and constructs a species tree that minimizes the total RF distance from the input mul-trees. We present a fast heuristic algorithm for the MulRF supertree problem. Simulation experiments demonstrate that the MulRF method produces more accurate species trees than gene tree parsimony methods when incongruence is caused by gene tree error, duplications and losses, and/or lateral gene transfer. Furthermore, the MulRF heuristic runs quickly on data sets containing hundreds of trees with up to a hundred taxa. Comment: 16 pages, 11 figures.
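
As background for the distance this abstract generalizes, the following Python sketch computes the standard RF distance between two rooted, singly-labeled trees on the same leaf set. It is not the MulRF generalization to mul-trees, whose definition the abstract does not give. The nested-tuple tree encoding and the names `clusters` and `rf_distance` are assumptions made for illustration.

```python
def clusters(tree):
    """Return the non-trivial leaf clusters of a rooted tree.

    A tree is a nested tuple of children; a leaf is any non-tuple label.
    Each internal node contributes the set of leaf labels below it.
    """
    found = set()

    def leaves(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    all_leaves = leaves(tree)
    # The root cluster is shared by every tree on the same leaf set.
    found.discard(all_leaves)
    return found


def rf_distance(t1, t2):
    """Rooted RF distance: number of clusters found in exactly one tree."""
    return len(clusters(t1) ^ clusters(t2))
```

With `t1 = ((("A", "B"), "C"), "D")` and `t2 = ((("A", "C"), "B"), "D")`, the trees share the cluster {A, B, C} and differ in {A, B} versus {A, C}, so `rf_distance(t1, t2)` is 2.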

    Genetic Correlations in Mutation Processes

    We study the role of phylogenetic trees on correlations in mutation processes. Generally, correlations decay exponentially with the generation number. We find that two distinct regimes of behavior exist. For mutation rates smaller than a critical rate, the underlying tree morphology is almost irrelevant, while mutation rates higher than this critical rate lead to strong tree-dependent correlations. We show analytically that identical critical behavior underlies all multiple point correlations. This behavior generally characterizes branching processes undergoing mutation. Comment: revtex, 8 pages, 2 fig.

    Modeling the evolution space of breakage fusion bridge cycles with a stochastic folding process

    Breakage-Fusion-Bridge cycles in cancer arise when a broken segment of DNA is duplicated and an end from each copy is joined together. This structure then 'unfolds' into a new piece of palindromic DNA. This is one mechanism responsible for the localised amplicons observed in cancer genome data. The process has parallels with paper folding sequences that arise when a piece of paper is folded several times and then unfolded. Here we adapt such methods to study the breakage-fusion-bridge structures in detail. First, we consider discrete representations of this space with 2-d trees to demonstrate that there are 2^(n(n-1)/2) qualitatively distinct evolutions involving n breakage-fusion-bridge cycles. Second, we consider the stochastic nature of the fold positions, to determine evolution likelihoods, and also describe how amplicons become localised. Finally, we highlight these methods by inferring the evolution of breakage-fusion-bridge cycles with data from primary tissue cancer samples.