The Tandem Duplication Distance Is NP-Hard
In computational biology, tandem duplication is an important biological phenomenon which can occur either at the genome or at the DNA level. A tandem duplication takes a copy of a genome segment and inserts it right after the segment; this can be represented as the string operation AXB → AXXB. Tandem exon duplications have been found in many species such as human, fly or worm, and have been largely studied in computational biology.
The Tandem Duplication (TD) distance problem we investigate in this paper is defined as follows: given two strings S and T over the same alphabet, compute the length of a shortest sequence of tandem duplications required to convert S to T. The natural question of whether the TD distance can be computed in polynomial time was posed in 2004 by Leupold et al. and had remained open, despite the fact that tandem duplications have received much attention ever since. In this paper, we prove that this problem is NP-hard, settling the 16-year-old open problem. We further show that this hardness holds even if all characters of S are distinct. This is known as the exemplar TD distance, which is of special relevance in bioinformatics. One of the tools we develop for the reduction is a new problem called the Cost-Effective Subgraph, for which we obtain W[1]-hardness results that might be of independent interest. We finally show that computing the exemplar TD distance between S and T is fixed-parameter tractable. Our results open the door to many other questions, and we conclude with several open problems.
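To make the operation AXB → AXXB and the TD distance concrete, the following is a minimal brute-force sketch in Python. It is not the method of the paper: it simply explores all strings reachable from S by breadth-first search, which takes exponential time in general (consistent with the NP-hardness result) and is only usable on very short strings. The function names `tandem_duplications` and `td_distance` are illustrative assumptions.

```python
from collections import deque


def tandem_duplications(s):
    """Yield every string obtained from s by one tandem duplication AXB -> AXXB."""
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n + 1):
            # X = s[i:j] is copied and inserted right after itself.
            yield s[:j] + s[i:j] + s[j:]


def td_distance(s, t):
    """Brute-force TD distance from s to t (illustrative only, exponential time).

    Returns the minimum number of tandem duplications turning s into t,
    or None if t cannot be obtained from s.
    """
    if s == t:
        return 0
    seen = {s}
    queue = deque([(s, 0)])
    while queue:
        current, dist = queue.popleft()
        for nxt in tandem_duplications(current):
            # Duplications only lengthen a string, so longer strings are dead ends.
            if len(nxt) > len(t) or nxt in seen:
                continue
            if nxt == t:
                return dist + 1
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None
```

For example, `td_distance("ab", "abab")` needs a single duplication of the whole string, while `td_distance("ab", "ba")` returns `None` because duplications can never change the first character.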
The Tandem Duplication Distance Problem is hard over bounded alphabets
A tandem duplication denotes the process of inserting a copy of a segment of
DNA adjacent to its original position. More formally, a tandem duplication can
be thought of as an operation that converts a string AXB into a string AXXB.
As they appear to be involved in genetic disorders, tandem
duplications are widely studied in computational biology. Also, tandem
duplication mechanisms have been recently studied in different contexts, from
formal languages, to information theory, to error-correcting codes for DNA
storage systems.
The problem of determining the complexity of computing the tandem duplication
distance between two given strings was proposed by [Leupold et al., 2004] and,
very recently, it was shown to be NP-hard for the case of unbounded alphabets
[Lafond et al., STACS2020]. In this paper, we significantly improve this result
and show that the tandem duplication distance problem is NP-hard already for
the case of strings over an alphabet of bounded size. We also study some
special classes of strings where it is possible to give linear time solutions to
the existence problem: given strings S and T over the same alphabet, decide
whether there exists a sequence of duplications converting S into T. A
polynomial time algorithm that solves the existence problem was only known for
the case of the binary alphabet.
The Distance and Median Problems in the Single-Cut-Or-Join Model with Single-Gene Duplications
Background.
In the field of genome rearrangement algorithms, models accounting for gene duplication often lead to hard problems. For example, while computing the pairwise distance is tractable in most duplication-free models, the problem is NP-complete for most extensions of these models accounting for duplicated genes. Moreover, problems involving more than two genomes, such as the genome median and the Small Parsimony problem, are intractable for most duplication-free models, with some exceptions, for example the Single-Cut-or-Join (SCJ) model.
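As background for the duplication-free case mentioned above, the classical SCJ distance between two genomes on the same gene set can be computed as the size of the symmetric difference of their adjacency sets. The sketch below illustrates only that classical, duplication-free distance, not the duplication-aware variant introduced in this paper. The genome encoding (a list of linear chromosomes, each a list of signed integers) and the names `adjacencies` and `scj_distance` are assumptions made for the example.

```python
def adjacencies(genome):
    """Return the set of adjacencies of a genome.

    genome: list of linear chromosomes, each a list of signed integers
    (assumed encoding). Gene g has a tail (g, 't') and a head (g, 'h');
    a negative sign means the gene is read in reverse orientation.
    """
    result = set()
    for chromosome in genome:
        for a, b in zip(chromosome, chromosome[1:]):
            right_of_a = (abs(a), "h" if a > 0 else "t")
            left_of_b = (abs(b), "t" if b > 0 else "h")
            result.add(frozenset((right_of_a, left_of_b)))
    return result


def scj_distance(genome_a, genome_b):
    """Classical duplication-free SCJ distance: cuts plus joins."""
    adj_a = adjacencies(genome_a)
    adj_b = adjacencies(genome_b)
    # Adjacencies only in A must be cut; adjacencies only in B must be joined.
    return len(adj_a ^ adj_b)
```

Reading a chromosome from the other end gives the same adjacencies, so `scj_distance([[1, 2, 3]], [[-3, -2, -1]])` is 0, whereas inverting the middle gene breaks both adjacencies and requires two cuts and two joins.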
Results.
We introduce a variant of the SCJ distance that accounts for duplicated genes, in the context of directed evolution from an ancestral genome to a descendant genome where orthology relations between ancestral genes and their descendants are known. Our model includes two duplication mechanisms: single-gene tandem duplication and the creation of single-gene circular chromosomes. We prove that in this model, computing the directed distance and a parsimonious evolutionary scenario in terms of SCJ and single-gene duplication events can be done in linear time. We also show that the directed median problem is tractable for this distance, while the rooted median problem, where we assume that one of the given genomes is ancestral to the median, is NP-complete. We also describe an Integer Linear Program for solving this problem. We evaluate the directed distance and rooted median algorithms on simulated data.
Conclusion.
Our results provide a simple genome rearrangement model, extending the SCJ model to account for single-gene duplications, for which we prove a mix of tractability and hardness results. For the NP-complete rooted median problem, we design a simple Integer Linear Program. Our publicly available implementation of these algorithms for the directed distance and median problems solves these problems efficiently on large instances.
Genomic Problems Involving Copy Number Profiles: Complexity and Algorithms
Recently, owing to genomic sequence analysis in several types of cancer,
genomic data based on copy number profiles (CNP for short) are
becoming increasingly popular. A CNP is a vector where each component is a
non-negative integer representing the number of copies of a specific gene or
segment of interest.
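As a small illustration of this definition, the sketch below computes the CNP of a genome given as a string, with one character per gene. The function name `copy_number_profile` and the convention that the caller supplies the ordered list of genes are assumptions made for the example, not notation from the paper.

```python
from collections import Counter


def copy_number_profile(genome, genes):
    """Return the CNP of genome as a list of non-negative integers.

    genome: string with one character per gene occurrence (assumed encoding).
    genes:  ordered collection of genes; component i of the result is the
            number of copies of genes[i] in genome.
    """
    counts = Counter(genome)
    # Counter returns 0 for genes that do not occur at all.
    return [counts[g] for g in genes]
```

For instance, `copy_number_profile("abcab", "abcd")` gives `[2, 2, 1, 0]`: two copies each of `a` and `b`, one of `c`, and none of `d`.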
In this paper, we present two streams of results. The first consists of negative
results on two open problems regarding the computational complexity of the
Minimum Copy Number Generation (MCNG) problem posed by Qingge et al. in 2018.
It was shown by Qingge et al. that the problem is NP-hard if the duplications
are tandem, and they left open the question of whether the problem remains
NP-hard if arbitrary duplications are used. We answer this question
affirmatively in this paper; in fact, we prove that it is NP-hard to even
obtain a constant factor approximation. We also prove that the parameterized
version is W[1]-hard, answering another open question by Qingge et al.
The other result is positive and is based on a new (and more general) problem
regarding CNPs. The Copy Number Profile Conforming (CNPC) problem is
formally defined as follows: given two CNPs $C_1$ and $C_2$, compute two
strings $S_1$ and $S_2$ with $cnp(S_1)=C_1$ and $cnp(S_2)=C_2$ such that the
distance between $S_1$ and $S_2$, $d(S_1,S_2)$, is minimized. Here,
$d(S_1,S_2)$ is a very general term, which means it could be any genome
rearrangement distance (such as reversal, transposition, or tandem
duplication). We make the first step by showing that if $d(S_1,S_2)$ is measured by the
breakpoint distance then the problem is polynomially solvable.
Comment: 16 pages, 3 figures
The Longest Subsequence-Repeated Subsequence Problem
Motivated by computing duplication patterns in sequences, a new fundamental
problem called the longest subsequence-repeated subsequence (LSRS) is proposed.
Given a sequence of length , a letter-repeated subsequence is a
subsequence of in the form of with
a subsequence of , and for all in
and in . We first present an time algorithm to
compute the longest cubic subsequences of all the substrings of ,
improving the trivial bound. Then, an time algorithm for
computing the longest subsequence-repeated subsequence (LSRS) of is
obtained. Finally we focus on two variants of this problem. We first consider
the constrained version when is unbounded, each letter appears in
at most times and all the letters in must appear in the solution.
We show that the problem is NP-hard for , via a reduction from a special
version of SAT (which is obtained from 3-COLORING). We then show that when each
letter appears in at most times, then the problem is solvable in
time.
Comment: 16 pages, 1 figure
The Capacity of Some Pólya String Models
We study random string-duplication systems, which we call Pólya string
models. These are motivated by DNA storage in living organisms, and certain
random mutation processes that affect their genome. Unlike previous works that
study the combinatorial capacity of string-duplication systems, or various
string statistics, this work provides exact capacity or bounds on it, for
several probabilistic models. In particular, we study the capacity of noisy
string-duplication systems, including the tandem-duplication, end-duplication,
and interspersed-duplication systems. Interesting connections are drawn between
some systems and the signature of random permutations, as well as to the beta
distribution common in population genetics.
Integrated multiple sequence alignment
Sammeth M. Integrated multiple sequence alignment. Bielefeld (Germany): Bielefeld University; 2005.
The thesis presents enhancements for automated and manual multiple sequence alignment: existing alignment algorithms are made more easily accessible and new algorithms are designed for difficult cases.
Firstly, we introduce the QAlign framework, a graphical user interface for multiple sequence alignment. It comprises several state-of-the-art algorithms and exposes their parameters through convenient dialogs. An alignment viewer with guided editing functionality can also highlight or print regions of the alignment. Phylogenetic features are also provided, e.g., distance-based tree reconstruction methods, corrections for multiple substitutions and a tree viewer. The modular concept and the platform-independent implementation make the framework easy to extend.
Further, we develop a constrained version of the divide-and-conquer alignment such that it can be restricted by anchors found earlier with local alignments. It can be shown that this method shares attributes of both local and global aligners, in the quality of results as well as in the computation time. We further modify the local alignment step to work on bipartite (or even multipartite) sets for sequences where repeats overshadow valuable sequence information. In the end, a technique is established that can accurately align sequences containing possibly repeated motifs.
Finally, another algorithm is presented that compares tandem repeat sequences by aligning them with respect to their possible repeat histories. We describe an evolutionary model including tandem duplications and excisions, and give an exact algorithm to compare two sequences under this model.
The Rooted SCJ Median with Single Gene Duplications
The median problem is a classical problem in genome rearrangements. It aims to compute a gene order that minimizes the sum of the genomic distances to k ≥ 3 given gene orders. This problem is intractable except in the related Single-Cut-or-Join and breakpoint rearrangement models. Here we consider the rooted median problem, where we assume one of the given genomes to be ancestral to the median, which is itself ancestral to the other genomes. We show that in the Single-Cut-or-Join model with single gene duplications, the rooted median problem is NP-hard. We also describe an Integer Linear Program for solving this problem, which we apply to simulated data, showing high accuracy of the reconstructed medians.