3,310 research outputs found

The Tandem Duplication Distance Is NP-Hard

Author: Lafond Manuel
Zhu Binhai
Zou Peng
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 37th International Symposium on Theoretical Aspects of Computer Science (STACS 2020)
Publication date: 12/06/2019

In computational biology, tandem duplication is an important biological phenomenon which can occur either at the genome or at the DNA level. A tandem duplication takes a copy of a genome segment and inserts it right after the segment; this can be represented as the string operation AXB → AXXB. Tandem exon duplications have been found in many species such as human, fly or worm, and have been largely studied in computational biology. The Tandem Duplication (TD) distance problem we investigate in this paper is defined as follows: given two strings S and T over the same alphabet, compute the smallest sequence of tandem duplications required to convert S to T. The natural question of whether the TD distance can be computed in polynomial time was posed in 2004 by Leupold et al. and had remained open, despite the fact that tandem duplications have received much attention ever since. In this paper, we prove that this problem is NP-hard, settling the 16-year-old open problem. We further show that this hardness holds even if all characters of S are distinct. This is known as the exemplar TD distance, which is of special relevance in bioinformatics. One of the tools we develop for the reduction is a new problem called the Cost-Effective Subgraph, for which we obtain W[1]-hardness results that might be of independent interest. We finally show that computing the exemplar TD distance between S and T is fixed-parameter tractable. Our results open the door to many other questions, and we conclude with several open problems.
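To make the operation AXB → AXXB and the TD distance concrete, the following is a minimal Python sketch. It is not the paper's method: it is a plain breadth-first search over all strings reachable by duplications, and the function names `tandem_duplications` and `td_distance` are hypothetical. Its running time is exponential in general, which is consistent with the NP-hardness result, so it is only usable on very short strings.

```python
from collections import deque


def tandem_duplications(s):
    """Yield each distinct string obtained from s by one tandem duplication.

    A duplication picks a non-empty segment X = s[i:j] and inserts a copy
    right after it, i.e. A X B -> A X X B.
    """
    seen = set()
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n + 1):
            t = s[:j] + s[i:j] + s[j:]
            if t not in seen:
                seen.add(t)
                yield t


def td_distance(s, t):
    """Brute-force TD distance from s to t, or None if t is unreachable.

    Illustrative only: explores every string of length <= len(t).
    """
    if s == t:
        return 0
    # Duplications never shorten a string and never change the set of letters.
    if len(s) > len(t) or set(s) != set(t):
        return None
    visited = {s}
    frontier = deque([(s, 0)])
    while frontier:
        cur, dist = frontier.popleft()
        for nxt in tandem_duplications(cur):
            if len(nxt) > len(t) or nxt in visited:
                continue
            if nxt == t:
                return dist + 1
            visited.add(nxt)
            frontier.append((nxt, dist + 1))
    return None
```

For example, `td_distance("ab", "aabb")` needs two duplications (first `a`, then `b`), while `td_distance("ab", "ba")` returns `None` because no sequence of duplications reorders the string.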

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

The Tandem Duplication Distance Problem is hard over bounded alphabets

Author: Cicalese Ferdinando
Pilati Nicolò
Publication date: 01/01/2021

A tandem duplication denotes the process of inserting a copy of a segment of DNA adjacent to its original position. More formally, a tandem duplication can be thought of as an operation that converts a string $S = AXB$ into a string $T = AXXB$. As they appear to be involved in genetic disorders, tandem duplications are widely studied in computational biology. Also, tandem duplication mechanisms have been recently studied in different contexts, from formal languages, to information theory, to error-correcting codes for DNA storage systems. The problem of determining the complexity of computing the tandem duplication distance between two given strings was proposed by [Leupold et al., 2004] and, very recently, it was shown to be NP-hard for the case of unbounded alphabets [Lafond et al., STACS2020]. In this paper, we significantly improve this result and show that the tandem duplication distance problem is NP-hard already for the case of strings over an alphabet of size $\leq 5$. We also study some special classes of strings where it is possible to give linear time solutions to the existence problem: given strings $S$ and $T$ over the same alphabet, decide whether there exists a sequence of duplications converting $S$ into $T$. A polynomial time algorithm that solves the existence problem was only known for the case of the binary alphabet.

arXiv.org e-Print Archive

Catalogo dei prodotti della ricerca

The Distance and Median Problems in the Single-Cut-Or-Join Model with Single-Gene Duplications

Author: Chauve Cedric
Feijao Pedro C.
Lafond Manuel
Mane Aniket C.
Publication venue: Springer Science and Business Media LLC
Publication date: 04/05/2020

Background. In the field of genome rearrangement algorithms, models accounting for gene duplication often lead to hard problems. For example, while computing the pairwise distance is tractable in most duplication-free models, the problem is NP-complete for most extensions of these models accounting for duplicated genes. Moreover, problems involving more than two genomes, such as the genome median and the Small Parsimony problem, are intractable for most duplication-free models, with some exceptions, for example the Single-Cut-or-Join (SCJ) model. Results. We introduce a variant of the SCJ distance that accounts for duplicated genes, in the context of directed evolution from an ancestral genome to a descendant genome where orthology relations between ancestral genes and their descendants are known. Our model includes two duplication mechanisms: single-gene tandem duplication and the creation of single-gene circular chromosomes. We prove that in this model, computing the directed distance and a parsimonious evolutionary scenario in terms of SCJ and single-gene duplication events can be done in linear time. We also show that the directed median problem is tractable for this distance, while the rooted median problem, where we assume that one of the given genomes is ancestral to the median, is NP-complete. We also describe an Integer Linear Program for solving this problem. We evaluate the directed distance and rooted median algorithms on simulated data. Conclusion. Our results provide a simple genome rearrangement model, extending the SCJ model to account for single-gene duplications, for which we prove a mix of tractability and hardness results. For the NP-complete rooted median problem, we design a simple Integer Linear Program. Our publicly available implementation of these algorithms for the directed distance and median problems allows these problems to be solved efficiently on large instances.

Simon Fraser University Institutional Repository

Genomic Problems Involving Copy Number Profiles: Complexity and Algorithms

Author: Lafond Manuel
Zhu Binhai
Zou Peng
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)
Publication date: 01/01/2020

Recently, due to the genomic sequence analysis in several types of cancer, the genomic data based on copy number profiles (CNP for short) are getting more and more popular. A CNP is a vector where each component is a non-negative integer representing the number of copies of a specific gene or segment of interest. In this paper, we present two streams of results. The first is the negative results on two open problems regarding the computational complexity of the Minimum Copy Number Generation (MCNG) problem posed by Qingge et al. in 2018. It was shown by Qingge et al. that the problem is NP-hard if the duplications are tandem and they left the open question of whether the problem remains NP-hard if arbitrary duplications are used. We answer this question affirmatively in this paper; in fact, we prove that it is NP-hard to even obtain a constant factor approximation. We also prove that the parameterized version is W[1]-hard, answering another open question by Qingge et al. The other result is positive and is based on a new (and more general) problem regarding CNPs. The Copy Number Profile Conforming (CNPC) problem is formally defined as follows: given two CNPs $C_1$ and $C_2$, compute two strings $S_1$ and $S_2$ with $cnp(S_1)=C_1$ and $cnp(S_2)=C_2$ such that the distance between $S_1$ and $S_2$, $d(S_1,S_2)$, is minimized. Here, $d(S_1,S_2)$ is a very general term, which means it could be any genome rearrangement distance (like reversal, transposition, and tandem duplication, etc). We make the first step by showing that if $d(S_1,S_2)$ is measured by the breakpoint distance then the problem is polynomially solvable. Comment: 16 pages, 3 figures
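As a small illustration of the definition of a CNP used above, the sketch below computes $cnp(S)$ for a string and enumerates the strings that share a given profile. It assumes each gene is a single character and that the gene order of the vector is passed in explicitly; the function names `cnp` and `strings_with_profile` are hypothetical and not taken from the paper. It does not implement the paper's polynomial-time algorithm; it only shows why CNPC has to choose among many candidate strings for each profile.

```python
from collections import Counter
from itertools import permutations


def cnp(s, genes):
    """Copy number profile of s: one count per gene, in the order given by genes."""
    counts = Counter(s)
    return tuple(counts[g] for g in genes)


def strings_with_profile(profile, genes):
    """All strings whose copy number profile equals profile (tiny inputs only)."""
    # Build one representative string, then take all distinct rearrangements.
    multiset = "".join(g * c for g, c in zip(genes, profile))
    return {"".join(p) for p in permutations(multiset)}
```

For instance, `cnp("abcab", "abc")` is `(2, 2, 1)`, and the profile `(2, 1)` over genes `"ab"` is realized by the three strings `aab`, `aba` and `baa`.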

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

The Longest Subsequence-Repeated Subsequence Problem

Author: Lafond Manuel
Lai Wenfeng
Liyanage Adiesha
Zhu Binhai
Publication date: 13/04/2023

Motivated by computing duplication patterns in sequences, a new fundamental problem called the longest subsequence-repeated subsequence (LSRS) is proposed. Given a sequence $S$ of length $n$, a letter-repeated subsequence is a subsequence of $S$ in the form of $x_1^{d_1}x_2^{d_2}\cdots x_k^{d_k}$ with $x_i$ a subsequence of $S$, $x_j\neq x_{j+1}$ and $d_i\geq 2$ for all $i$ in $[k]$ and $j$ in $[k-1]$. We first present an $O(n^6)$ time algorithm to compute the longest cubic subsequences of all the $O(n^2)$ substrings of $S$, improving the trivial $O(n^7)$ bound. Then, an $O(n^6)$ time algorithm for computing the longest subsequence-repeated subsequence (LSRS) of $S$ is obtained. Finally we focus on two variants of this problem. We first consider the constrained version when $\Sigma$ is unbounded, each letter appears in $S$ at most $d$ times and all the letters in $\Sigma$ must appear in the solution. We show that the problem is NP-hard for $d=4$, via a reduction from a special version of SAT (which is obtained from 3-COLORING). We then show that when each letter appears in $S$ at most $d=3$ times, then the problem is solvable in $O(n^5)$ time. Comment: 16 pages, 1 figure
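The definition above can be checked by brute force on very short inputs. The sketch below is not the $O(n^6)$ algorithm from the abstract: it enumerates subsequences from longest to shortest and tests whether one can be cut into blocks of the form $x^d$ with $d \geq 2$. It relies on the observation (an assumption of this sketch, not a statement from the paper) that the condition $x_j \neq x_{j+1}$ does not affect feasibility, since two adjacent blocks with the same $x$ can be merged into one. All function names are hypothetical.

```python
from itertools import combinations


def is_proper_power(u):
    """True if u = x^d for some string x and integer d >= 2."""
    n = len(u)
    return any(n % p == 0 and u == u[:p] * (n // p) for p in range(1, n // 2 + 1))


def is_subsequence_repeated(t):
    """True if t can be cut into consecutive blocks, each a power x^d with d >= 2."""
    n = len(t)
    ok = [True] + [False] * n  # ok[i]: prefix t[:i] can be cut into valid blocks
    for i in range(2, n + 1):
        # The last block t[j:i] must have length at least 2.
        ok[i] = any(ok[j] and is_proper_power(t[j:i]) for j in range(i - 1))
    return n > 0 and ok[n]


def lsrs_bruteforce(s):
    """Return one longest subsequence-repeated subsequence of s ("" if none).

    Exponential in len(s); intended only to illustrate the definition.
    """
    n = len(s)
    for length in range(n, 1, -1):
        for idx in combinations(range(n), length):
            t = "".join(s[i] for i in idx)
            if is_subsequence_repeated(t):
                return t
    return ""
```

For `S = "abcab"` the sketch returns `"abab"`, which is $(ab)^2$, and for `S = "abc"` it returns the empty string because no letter occurs twice.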

arXiv.org e-Print Archive

The Capacity of Some Pólya String Models

Author: Bruck Jehoshua
Elishco Ohad
Farnoud Farzad
Schwartz Moshe
Publication date: 01/08/2018

We study random string-duplication systems, which we call Pólya string models. These are motivated by DNA storage in living organisms, and certain random mutation processes that affect their genome. Unlike previous works that study the combinatorial capacity of string-duplication systems, or various string statistics, this work provides exact capacity or bounds on it, for several probabilistic models. In particular, we study the capacity of noisy string-duplication systems, including the tandem-duplication, end-duplication, and interspersed-duplication systems. Interesting connections are drawn between some systems and the signature of random permutations, as well as to the beta distribution common in population genetics.

arXiv.org e-Print Archive

Caltech Authors

The Rooted SCJ Median with Single Gene Duplications

Author: Chauve Cedric
Feijão Pedro
Lafond Manuel
Mane Aniket C
Publication date: 01/10/2018

The median problem is a classical problem in genome rearrangements. It aims to compute a gene order that minimizes the sum of the genomic distances to $k \geq 3$ given gene orders. This problem is intractable except in the related Single-Cut-or-Join and breakpoint rearrangement models. Here we consider the rooted median problem, where we assume one of the given genomes to be ancestral to the median, which is itself ancestral to the other genomes. We show that in the Single-Cut-or-Join model with single gene duplications, the rooted median problem is NP-hard. We also describe an Integer Linear Program for solving this problem, which we apply to simulated data, showing high accuracy of the reconstructed medians.

Simon Fraser University Institutional Repository

Integrated multiple sequence alignment

Author: Sammeth Michael
Publication venue: Bielefeld University
Publication date: 01/01/2005

Sammeth M. Integrated multiple sequence alignment. Bielefeld (Germany): Bielefeld University; 2005. The thesis presents enhancements for automated and manual multiple sequence alignment: existing alignment algorithms are made more easily accessible and new algorithms are designed for difficult cases. Firstly, we introduce the QAlign framework, a graphical user interface for multiple sequence alignment. It comprises several state-of-the-art algorithms and exposes their parameters through convenient dialogs. An alignment viewer with guided editing functionality can also highlight or print regions of the alignment. Phylogenetic features are also provided, e.g., distance-based tree reconstruction methods, corrections for multiple substitutions and a tree viewer. The modular concept and the platform-independent implementation guarantee easy extensibility. Further, we develop a constrained version of the divide-and-conquer alignment such that it can be restricted by anchors found earlier with local alignments. It can be shown that this method shares attributes of both local and global aligners, in the quality of results as well as in the computation time. We further modify the local alignment step to work on bipartite (or even multipartite) sets for sequences where repeats overshadow valuable sequence information. In the end a technique is established that can accurately align sequences containing possibly repeated motifs. Finally, another algorithm is presented that allows tandem repeat sequences to be compared by aligning them with respect to their possible repeat histories. We describe an evolutionary model including tandem duplications and excisions, and give an exact algorithm to compare two sequences under this model.

Publications at Bielefeld University

Analysis of Gene Order Evolution Beyond Single-Copy Genes

Author: A Bergeron
A Siepel
A Xu
B Arden
B Ma
B Moret
B Vernot
C Chauve
C Zheng
C. Chauve
CM Zmasek
D Bader
D Bertrand
D Durand
D Fulkerson
D Sankoff
D Soltis
E Eichler
E Lyons
F Murat
F. Murat
G Blanc
G Blin
G Bourque
G Fertin
G Glusman
G Landau
G Shi
G Tesler
G Watterson
H Gavranovic
H Gavranović
I Wapinski
J Bowers
J Cotton
J Demuth
J Gordon
J Mixtacki
J Nadeau
J Salse
J-P Doyon
K Chen
K O’Brien
K Wolfe
L Zhang
M Alekseyev
M Goodman
M Hahn
M Lajoie
M Lynch
M Muffato
M Sanderson
M Shannon
N El-Mabrouk
O Elemento
O Eulenstein
O Tremblay-Savard
P Bonizzoni
P Gorecki
P Pevzner
Q Zhu
R Guigó
R Hoberman
R LaRue
R Page
R Tatusov
R Warren
S Angibaud
S Hannenhalli
S Pham
S Schwartz
S Yancopoulos
T Blomme
T Uno
T Vinař
V Bafna
V Shoja
W Fitch
W Li
WJ Kent
Z Adam
Z Fu
Z Yang
Publication venue: Springer Science and Business Media LLC

Crossref