    The Tandem Duplication Distance Is NP-Hard

    In computational biology, tandem duplication is an important biological phenomenon which can occur either at the genome or at the DNA level. A tandem duplication takes a copy of a genome segment and inserts it right after the segment; this can be represented as the string operation AXB → AXXB. Tandem exon duplications have been found in many species such as human, fly or worm, and have been widely studied in computational biology. The Tandem Duplication (TD) distance problem we investigate in this paper is defined as follows: given two strings S and T over the same alphabet, compute the minimum number of tandem duplications required to convert S to T. The natural question of whether the TD distance can be computed in polynomial time was posed in 2004 by Leupold et al. and had remained open, despite the fact that tandem duplications have received much attention ever since. In this paper, we prove that this problem is NP-hard, settling the 16-year-old open problem. We further show that this hardness holds even if all characters of S are distinct. This is known as the exemplar TD distance, which is of special relevance in bioinformatics. One of the tools we develop for the reduction is a new problem called the Cost-Effective Subgraph, for which we obtain W[1]-hardness results that might be of independent interest. We finally show that computing the exemplar TD distance between S and T is fixed-parameter tractable. Our results open the door to many other questions, and we conclude with several open problems.
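
The operation AXB → AXXB and the TD distance defined above can be illustrated with a small brute-force sketch. This is not the method of the paper: it is an exhaustive breadth-first search, exponential in general (consistent with the NP-hardness result), and the function names `tandem_duplications` and `td_distance` are hypothetical names chosen for this illustration.

```python
from collections import deque


def tandem_duplications(s):
    """Yield every string obtained from s by one tandem duplication AXB -> AXXB."""
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n + 1):
            # X = s[i:j] is copied and inserted right after itself.
            yield s[:j] + s[i:j] + s[j:]


def td_distance(source, target):
    """Return the minimum number of tandem duplications turning source into
    target, or None if no such sequence exists (brute-force BFS)."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        current, steps = queue.popleft()
        for nxt in tandem_duplications(current):
            # Every duplication lengthens the string, so longer strings
            # than the target can never lead to it.
            if len(nxt) > len(target) or nxt in seen:
                continue
            if nxt == target:
                return steps + 1
            seen.add(nxt)
            queue.append((nxt, steps + 1))
    return None
```

For instance, `td_distance("ab", "aabb")` explores "aab", "abb" and "abab" after one step and reaches "aabb" after a second duplication. The search is only practical for very short strings.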

    The Tandem Duplication Distance Problem is hard over bounded alphabets

    A tandem duplication denotes the process of inserting a copy of a segment of DNA adjacent to its original position. More formally, a tandem duplication can be thought of as an operation that converts a string $S = AXB$ into a string $T = AXXB$. As they appear to be involved in genetic disorders, tandem duplications are widely studied in computational biology. Also, tandem duplication mechanisms have been recently studied in different contexts, from formal languages, to information theory, to error-correcting codes for DNA storage systems. The problem of determining the complexity of computing the tandem duplication distance between two given strings was proposed by [Leupold et al., 2004] and, very recently, it was shown to be NP-hard for the case of unbounded alphabets [Lafond et al., STACS2020]. In this paper, we significantly improve this result and show that the tandem duplication distance problem is NP-hard already for the case of strings over an alphabet of size $\leq 5$. We also study some special classes of strings where it is possible to give linear time solutions to the existence problem: given strings $S$ and $T$ over the same alphabet, decide whether there exists a sequence of duplications converting $S$ into $T$. A polynomial time algorithm that solves the existence problem was only known for the case of the binary alphabet.

    The Distance and Median Problems in the Single-Cut-Or-Join Model with Single-Gene Duplications

    Background. In the field of genome rearrangement algorithms, models accounting for gene duplication often lead to hard problems. For example, while computing the pairwise distance is tractable in most duplication-free models, the problem is NP-complete for most extensions of these models accounting for duplicated genes. Moreover, problems involving more than two genomes, such as the genome median and the Small Parsimony problem, are intractable for most duplication-free models, with some exceptions, for example the Single-Cut-or-Join (SCJ) model. Results. We introduce a variant of the SCJ distance that accounts for duplicated genes, in the context of directed evolution from an ancestral genome to a descendant genome where orthology relations between ancestral genes and their descendants are known. Our model includes two duplication mechanisms: single-gene tandem duplication and the creation of single-gene circular chromosomes. We prove that in this model, computing the directed distance and a parsimonious evolutionary scenario in terms of SCJ and single-gene duplication events can be done in linear time. We also show that the directed median problem is tractable for this distance, while the rooted median problem, where we assume that one of the given genomes is ancestral to the median, is NP-complete. We also describe an Integer Linear Program for solving this problem. We evaluate the directed distance and rooted median algorithms on simulated data. Conclusion. Our results provide a simple genome rearrangement model, extending the SCJ model to account for single-gene duplications, for which we prove a mix of tractability and hardness results. For the NP-complete rooted median problem, we design a simple Integer Linear Program. Our publicly available implementation of these algorithms for the directed distance and median problems allows these problems to be solved efficiently on large instances.

    Genomic Problems Involving Copy Number Profiles: Complexity and Algorithms

    Recently, due to the genomic sequence analysis in several types of cancer, genomic data based on copy number profiles (CNP for short) are becoming more and more popular. A CNP is a vector where each component is a non-negative integer representing the number of copies of a specific gene or segment of interest. In this paper, we present two streams of results. The first consists of negative results on two open problems regarding the computational complexity of the Minimum Copy Number Generation (MCNG) problem posed by Qingge et al. in 2018. It was shown by Qingge et al. that the problem is NP-hard if the duplications are tandem, and they left open the question of whether the problem remains NP-hard if arbitrary duplications are used. We answer this question affirmatively in this paper; in fact, we prove that it is NP-hard to even obtain a constant factor approximation. We also prove that the parameterized version is W[1]-hard, answering another open question by Qingge et al. The other result is positive and is based on a new (and more general) problem regarding CNPs. The Copy Number Profile Conforming (CNPC) problem is formally defined as follows: given two CNPs $C_1$ and $C_2$, compute two strings $S_1$ and $S_2$ with $cnp(S_1)=C_1$ and $cnp(S_2)=C_2$ such that the distance between $S_1$ and $S_2$, $d(S_1,S_2)$, is minimized. Here, $d(S_1,S_2)$ is a very general term, which means it could be any genome rearrangement distance (such as reversal, transposition, or tandem duplication). We make the first step by showing that if $d(S_1,S_2)$ is measured by the breakpoint distance then the problem is polynomially solvable. Comment: 16 pages, 3 figures
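
As a small illustration of the definition above, the following sketch computes the copy number profile of a string. It assumes a genome is given as a string with one character per gene and that the gene order of the vector is supplied explicitly through the hypothetical parameter `genes`; neither convention is taken from the paper.

```python
from collections import Counter


def cnp(genome, genes):
    """Return the copy number profile of genome: one non-negative integer
    per gene in genes, counting how often that gene occurs."""
    counts = Counter(genome)
    # Counter returns 0 for genes that do not occur in the genome.
    return [counts[g] for g in genes]
```

Under these assumptions, `cnp("abcab", "abc")` counts two copies of `a`, two of `b` and one of `c`. Many different strings share the same profile, which is why the CNPC problem asks for the pair of strings that minimizes the chosen distance.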

    The Longest Subsequence-Repeated Subsequence Problem

    Motivated by computing duplication patterns in sequences, a new fundamental problem called the longest subsequence-repeated subsequence (LSRS) is proposed. Given a sequence $S$ of length $n$, a letter-repeated subsequence is a subsequence of $S$ in the form of $x_1^{d_1}x_2^{d_2}\cdots x_k^{d_k}$ with $x_i$ a subsequence of $S$, $x_j\neq x_{j+1}$ and $d_i\geq 2$ for all $i$ in $[k]$ and $j$ in $[k-1]$. We first present an $O(n^6)$ time algorithm to compute the longest cubic subsequences of all the $O(n^2)$ substrings of $S$, improving the trivial $O(n^7)$ bound. Then, an $O(n^6)$ time algorithm for computing the longest subsequence-repeated subsequence (LSRS) of $S$ is obtained. Finally we focus on two variants of this problem. We first consider the constrained version when $\Sigma$ is unbounded, each letter appears in $S$ at most $d$ times and all the letters in $\Sigma$ must appear in the solution. We show that the problem is NP-hard for $d=4$, via a reduction from a special version of SAT (which is obtained from 3-COLORING). We then show that when each letter appears in $S$ at most $d=3$ times, then the problem is solvable in $O(n^5)$ time. Comment: 16 pages, 1 figure
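
The form $x_1^{d_1}x_2^{d_2}\cdots x_k^{d_k}$ can be made concrete with a small checker. This sketch only verifies a proposed decomposition against the stated conditions; it is not the $O(n^6)$ algorithm of the paper. The function names and the representation of a candidate as a list of `(x_i, d_i)` pairs are assumptions made for illustration, as is the requirement that each $x_i$ be non-empty.

```python
def is_subsequence(candidate, s):
    """Return True if candidate can be obtained from s by deleting characters."""
    it = iter(s)
    # 'ch in it' advances the iterator, so matches must occur left to right.
    return all(ch in it for ch in candidate)


def is_repeated_form_subsequence(blocks, s):
    """Check whether blocks = [(x_1, d_1), ..., (x_k, d_k)] describes a
    subsequence of s of the form x_1^{d_1} ... x_k^{d_k} with d_i >= 2
    and x_j != x_{j+1}."""
    if not blocks:
        return False
    # Assumption: every x_i is non-empty; the source requires d_i >= 2.
    if any(d < 2 or not x for x, d in blocks):
        return False
    # Adjacent blocks must differ.
    if any(blocks[j][0] == blocks[j + 1][0] for j in range(len(blocks) - 1)):
        return False
    candidate = "".join(x * d for x, d in blocks)
    return is_subsequence(candidate, s)
```

With `s = "abcabcdd"`, the blocks `[("ab", 2), ("d", 2)]` spell out `ababdd`, which is a subsequence of `s` and therefore passes the check.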

    The Capacity of Some Pólya String Models

    We study random string-duplication systems, which we call Pólya string models. These are motivated by DNA storage in living organisms, and certain random mutation processes that affect their genome. Unlike previous works that study the combinatorial capacity of string-duplication systems, or various string statistics, this work provides exact capacity or bounds on it, for several probabilistic models. In particular, we study the capacity of noisy string-duplication systems, including the tandem-duplication, end-duplication, and interspersed-duplication systems. Interesting connections are drawn between some systems and the signature of random permutations, as well as to the beta distribution common in population genetics.

    Integrated multiple sequence alignment

    Sammeth M. Integrated multiple sequence alignment. Bielefeld (Germany): Bielefeld University; 2005. The thesis presents enhancements for automated and manual multiple sequence alignment: existing alignment algorithms are made more easily accessible and new algorithms are designed for difficult cases. Firstly, we introduce the QAlign framework, a graphical user interface for multiple sequence alignment. It comprises several state-of-the-art algorithms and exposes their parameters through convenient dialogs. An alignment viewer with guided editing functionality can also highlight or print regions of the alignment. Phylogenetic features are also provided, e.g., distance-based tree reconstruction methods, corrections for multiple substitutions and a tree viewer. The modular concept and the platform-independent implementation make the framework easy to extend. Further, we develop a constrained version of the divide-and-conquer alignment such that it can be restricted by anchors found earlier with local alignments. It can be shown that this method shares attributes of both local and global aligners, in the quality of results as well as in the computation time. We further modify the local alignment step to work on bipartite (or even multipartite) sets for sequences where repeats overshadow valuable sequence information. In the end, a technique is established that can accurately align sequences containing possibly repeated motifs. Finally, another algorithm is presented that allows tandem repeat sequences to be compared by aligning them with respect to their possible repeat histories. We describe an evolutionary model including tandem duplications and excisions, and give an exact algorithm to compare two sequences under this model.

    The Rooted SCJ Median with Single Gene Duplications

    The median problem is a classical problem in genome rearrangements. It aims to compute a gene order that minimizes the sum of the genomic distances to $k \geq 3$ given gene orders. This problem is intractable except in the related Single-Cut-or-Join and breakpoint rearrangement models. Here we consider the rooted median problem, where we assume one of the given genomes to be ancestral to the median, which is itself ancestral to the other genomes. We show that in the Single-Cut-or-Join model with single gene duplications, the rooted median problem is NP-hard. We also describe an Integer Linear Program for solving this problem, which we apply to simulated data, showing high accuracy of the reconstructed medians.