The Longest Common Exemplar Subsequence Problem
In this paper, we propose to find order-conserved subsequences of genomes by finding longest common exemplar subsequences of the genomes. The longest common exemplar subsequence problem, given two genomes, asks for a common exemplar subsequence of maximum length. We focus on genomes whose genes of the same gene family occur in at most s spans. We propose a dynamic programming algorithm with time complexity O(s·4^s·mn) to find a longest common exemplar subsequence of two genomes when one genome admits s-span genes of the same gene family, where m and n denote the numbers of genes in the two given genomes. Our algorithm can be extended to find longest common exemplar subsequences of more than two genomes.
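As a point of comparison, the exemplar constraint (each gene family used at most once in the common subsequence) can be captured by an exponential brute-force baseline. This sketch is illustrative only and is not the paper's O(s·4^s·mn) dynamic program; genomes are assumed to be given as strings of gene symbols:

```python
from itertools import combinations

def is_subsequence(s, t):
    # standard subsequence test via a shared iterator
    it = iter(t)
    return all(c in it for c in s)

def lces_brute(a, b):
    """Longest common exemplar subsequence of genomes a and b by brute
    force: each gene family may appear at most once. Exponential; a toy
    baseline, not the paper's dynamic program."""
    for r in range(len(a), 0, -1):
        for idx in combinations(range(len(a)), r):
            cand = "".join(a[i] for i in idx)
            if len(set(cand)) == r and is_subsequence(cand, b):
                return cand  # first hit at the largest r is optimal in length
    return ""
```

Enumerating subsequences of the shorter genome in decreasing length guarantees the first repetition-free hit is optimal.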
Exemplar Longest Common Subsequence (extended abstract)
In this paper we investigate the computational and approximation complexity of the Exemplar Longest Common Subsequence of a set of sequences (the ELCS problem), a generalization of the Longest Common Subsequence problem in which the input sequences are over the union of two disjoint sets of symbols, a set of mandatory symbols and a set of optional symbols. We show that different versions of the problem are APX-hard even for instances with two sequences. Moreover, we show that the related problem of determining the existence of a feasible solution of the Exemplar Longest Common Subsequence of two sequences is NP-hard. On the positive side, we give efficient algorithms for the ELCS problem on instances of two sequences where each mandatory symbol appears in total at most three times, or where the number of mandatory symbols is bounded by a constant.
The zero exemplar distance problem
Given two genomes with duplicate genes, \textsc{Zero Exemplar Distance} is
the problem of deciding whether the two genomes can be reduced to the same
genome without duplicate genes by deleting all but one copy of each gene in
each genome. Blin, Fertin, Sikora, and Vialette recently proved that
\textsc{Zero Exemplar Distance} for monochromosomal genomes is NP-hard even if
each gene appears at most two times in each genome, thereby settling an
important open question on genome rearrangement in the exemplar model. In this
paper, we give a very simple alternative proof of this result. We also study
the problem \textsc{Zero Exemplar Distance} for multichromosomal genomes
without gene order, and prove the analogous result that it is also NP-hard even
if each gene appears at most two times in each genome. For the positive
direction, we show that both variants of \textsc{Zero Exemplar Distance} admit
polynomial-time algorithms if each gene appears exactly once in one genome and
at least once in the other genome. In addition, we present a polynomial-time
algorithm for the related problem \textsc{Exemplar Longest Common Subsequence}
in the special case that each mandatory symbol appears exactly once in one
input sequence and at least once in the other input sequence. This answers an
open question of Bonizzoni et al. We also show that \textsc{Zero Exemplar
Distance} for multichromosomal genomes without gene order is fixed-parameter
tractable if the parameter is the maximum number of chromosomes in each genome.
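For intuition, the decision problem itself is easy to state as code. A minimal exponential sketch (an illustration, not one of the paper's algorithms), assuming monochromosomal genomes given as strings of gene symbols:

```python
from itertools import product

def zero_exemplar_distance(g1, g2):
    """Brute-force Zero Exemplar Distance: keep exactly one copy of each
    gene in each genome in every possible way and test whether the two
    reductions can be made equal. Exponential; illustration only."""
    def exemplars(g):
        # map each gene to the positions of its copies
        positions = {x: [i for i, c in enumerate(g) if c == x] for x in set(g)}
        # every way of keeping one position per gene
        for choice in product(*positions.values()):
            keep = set(choice)
            yield "".join(c for i, c in enumerate(g) if i in keep)

    reductions2 = set(exemplars(g2))
    return any(e in reductions2 for e in exemplars(g1))
```

The NP-hardness results above mean no polynomial-time algorithm is expected in general, even with at most two copies per gene.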
Variants of Constrained Longest Common Subsequence
In this work, we consider a variant of the classical Longest Common
Subsequence problem called Doubly-Constrained Longest Common Subsequence
(DC-LCS). Given two strings s1 and s2 over an alphabet A, a set Cs of strings,
and a function Co from A to N, the DC-LCS problem consists in finding the
longest subsequence s of s1 and s2 such that s is a supersequence of all the
strings in Cs and such that the number of occurrences in s of each symbol a in
A is upper bounded by Co(a). The DC-LCS problem provides a clear mathematical
formulation of a sequence comparison problem in Computational Biology and
generalizes two other constrained variants of the LCS problem: the Constrained
LCS and the Repetition-Free LCS. We present two results for the DC-LCS problem.
First, we describe a fixed-parameter algorithm where the parameter is the
length of the solution. Second, we prove a parameterized hardness result for
the Constrained LCS problem when the parameters are the number of constraint
strings and the size of the alphabet A. This hardness result also implies the
parameterized hardness of the DC-LCS problem (with the same parameters) and its
NP-hardness when the size of the alphabet is constant
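The interplay of the two constraints can be seen in a brute-force sketch (illustrative only; the paper's contributions are the fixed-parameter algorithm and the hardness results). The occurrence bounds Co are assumed to be given as a dict, with symbols absent from it disallowed:

```python
from itertools import combinations
from collections import Counter

def is_subseq(s, t):
    # standard subsequence test via a shared iterator
    it = iter(t)
    return all(c in it for c in s)

def dc_lcs(s1, s2, constraints, co):
    """Brute-force DC-LCS: longest common subsequence s of s1 and s2 that is
    a supersequence of every string in constraints and respects the per-symbol
    occurrence bounds co. Exponential; toy illustration only."""
    for r in range(len(s1), 0, -1):
        for idx in combinations(range(len(s1)), r):
            s = "".join(s1[i] for i in idx)
            counts = Counter(s)
            if (is_subseq(s, s2)
                    and all(counts[a] <= co.get(a, 0) for a in counts)
                    and all(is_subseq(c, s) for c in constraints)):
                return s
    return ""
```

Setting every bound to 1 recovers the Repetition-Free LCS; an empty bound-free setting with one constraint string recovers the Constrained LCS, matching the generalization claim above.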
Heuristic algorithms for the Longest Filled Common Subsequence Problem
At CPM 2017, Castelli et al. defined and studied a new variant of the Longest
Common Subsequence Problem, termed the Longest Filled Common Subsequence
Problem (LFCS). For the LFCS problem, the input consists of two strings A and B
and a multiset M of characters. The goal is to insert the characters from M
into the string B, thus obtaining a new string B*, such that the Longest Common
Subsequence (LCS) between A and B* is maximized. Castelli et al. show that the
problem is NP-hard and provide a 3/5-approximation algorithm for the problem.
In this paper we study the problem from the experimental point of view. We
introduce, implement, and test new heuristic algorithms and compare them with
the approximation algorithm of Castelli et al. Moreover, we introduce an
Integer Linear Program (ILP) model for the problem and use the state-of-the-art
ILP solver Gurobi to obtain exact solutions for moderately sized instances.
Comment: Accepted and presented as a proceedings paper at SYNASC 201
Repetition-free longest common subsequence of random sequences
A repetition-free Longest Common Subsequence (LCS) of two sequences x and y
is an LCS of x and y in which each symbol appears at most once. Let R denote
the length of a repetition-free LCS of two sequences of n symbols each,
chosen randomly, uniformly, and independently over a k-ary alphabet. We study
the asymptotic, in n and k, behavior of R and establish that there are three
distinct regimes, depending on the relative speed of growth of n and k. For
each regime we establish the limiting behavior of R. In fact, we do more, since
we actually establish tail bounds for large deviations of R from its limiting
behavior.
Our study is motivated by the so-called exemplar model proposed by Sankoff
(1999) and the related similarity measure introduced by Adi et al. (2007). A
natural question that arises in this context, which, as we show, is related to
long-standing open problems in the area of probabilistic combinatorics, is to
understand the asymptotic, in n and k, behavior of the parameter R.
The Extended Edit Distance Metric
Similarity search is an important problem in information retrieval. This
similarity is based on a distance measure. Symbolic representation of time
series has attracted many researchers recently, since it reduces the
dimensionality of these high-dimensional data objects. We propose a new
distance metric that is applied to symbolic data objects, and we test it on
time series databases in a classification task. We compare it to other
distances that are well known in the literature for symbolic data objects. We
also prove, mathematically, that our distance is a metric.
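For reference, the classic edit (Levenshtein) distance that such symbolic measures build on can be computed with the standard dynamic program. This sketch shows only that baseline, not the paper's extended metric:

```python
def edit_distance(s, t):
    """Classic Levenshtein distance via DP with a rolling row:
    O(len(s) * len(t)) time, O(len(t)) space."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))          # distances from "" to prefixes of t
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1,       # deletion
                         cur[j - 1] + 1,    # insertion
                         prev[j - 1] + cost)  # substitution / match
        prev = cur
    return prev[n]
```

Being a metric requires non-negativity, identity of indiscernibles, symmetry, and the triangle inequality; the classic distance satisfies all four, which is the property the paper proves for its extension.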
The Longest Filled Common Subsequence Problem
Inspired by a recent approach for genome reconstruction from incomplete data, we consider a variant of the longest common subsequence problem for the comparison of two sequences, one of which is incomplete, i.e. it has some missing elements. The new combinatorial problem, called Longest Filled Common Subsequence, given two sequences A and B, and a multiset M of symbols missing in B, asks for a sequence B* obtained by inserting the symbols of M into B so that B* induces a common subsequence with A of maximum length. First, we investigate the computational and approximation complexity of the problem and show that it is NP-hard and APX-hard when A contains at most two occurrences of each symbol. Then, we give a 3/5-approximation algorithm for the problem. Finally, we present a fixed-parameter algorithm, when the problem is parameterized by the number of symbols inserted in B that "match" symbols of A.
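A brute-force reference implementation makes the objective concrete (illustrative only; exponential, unlike the paper's approximation and fixed-parameter algorithms):

```python
from itertools import permutations

def lcs_len(a, b):
    # standard quadratic LCS-length DP
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if a[i - 1] == b[j - 1]
                        else max(dp[i - 1][j], dp[i][j - 1]))
    return dp[-1][-1]

def interleavings(x, y):
    # every string obtained by interleaving x and y, preserving both orders
    if not x or not y:
        yield x + y
        return
    for rest in interleavings(x[1:], y):
        yield x[0] + rest
    for rest in interleavings(x, y[1:]):
        yield y[0] + rest

def lfcs_brute(a, b, m_chars):
    """Brute-force LFCS: try every ordering of the multiset M and every way
    to interleave it into B, keeping the best LCS with A. Exponential."""
    best = lcs_len(a, b)  # inserting nothing is always allowed
    for order in set(permutations(m_chars)):
        for b_star in interleavings(b, "".join(order)):
            best = max(best, lcs_len(a, b_star))
    return best
```

Every candidate B* contains B as a subsequence, so the filled optimum is never worse than the plain LCS of A and B.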
Efficient Tools for Computing the Number of Breakpoints and the Number of Adjacencies between two Genomes with Duplicate Genes
Comparing genomes of different species is a fundamental problem in comparative genomics. Recent research has resulted in the introduction of different measures between pairs of genomes: reversal distance, number of breakpoints, number of common or conserved intervals, etc. However, classical methods used for computing such measures are seriously compromised when genomes have several copies of the same gene scattered across them. Most approaches to overcome this difficulty are based either on the exemplar model, which keeps exactly one copy in each genome of each duplicated gene, or on the maximum matching model, which keeps as many copies as possible of each duplicated gene. The goal is to find an exemplar matching, respectively a maximum matching, that optimizes the studied measure. Unfortunately, it turns out that, in the presence of duplications, this problem is NP-hard for each of the above-mentioned measures. In this paper, we propose to compute the minimum number of breakpoints and the maximum number of adjacencies between two genomes in the presence of duplications using two different approaches. The first is an exact, generic 0–1 linear programming approach, while the second is a collection of three heuristics. Each of these approaches is applied to each problem and to each of the following models: exemplar, maximum matching, and an intermediate model that we introduce here. All these programs are run on a well-known public benchmark dataset of γ-Proteobacteria, and their performances are discussed.
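As background, the breakpoint count is straightforward once genomes are duplication-free and unsigned. A minimal sketch of that classical case (the paper's setting, with duplicates, first requires choosing an exemplar or maximum matching):

```python
def breakpoints(g1, g2):
    """Number of breakpoints of g1 relative to g2 for duplication-free,
    unsigned genomes given as sequences of genes: consecutive pairs in g1
    that are adjacent (in either orientation) nowhere in g2."""
    adj = set()
    for a, b in zip(g2, g2[1:]):
        adj.add((a, b))
        adj.add((b, a))  # unsigned: adjacency holds in both directions
    return sum(1 for a, b in zip(g1, g1[1:]) if (a, b) not in adj)
```

With duplicates, each matching of gene copies induces a different breakpoint count, which is what the ILP and heuristics above optimize over.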
Master Texture Space: An Efficient Encoding for Projectively Mapped Objects
Projectively textured models are used in an increasingly large number of applications that dynamically combine images with a simple geometric surface in a viewpoint-dependent way. These models can provide visual fidelity while retaining the effects afforded by geometric approximation, such as shadow casting and accurate perspective distortion. However, the number of stored views can be quite large, and novel views must be synthesized during the rendering process because no single view may correctly texture the entire object surface. This work introduces the Master Texture encoding and demonstrates that the encoding increases the utility of projectively textured objects by reducing render-time operations. Encoding involves three steps: 1) all image regions that correspond to the same geometric mesh element are extracted and warped to a facet of uniform size and shape, 2) an efficient packing of these facets into a new Master Texture image is computed, and 3) the visibility of each pixel in the new Master Texture data is guaranteed using a simple algorithm to discard occluded pixels in each view. Because the encoding implicitly represents the multi-view geometry of the multiple images, a single texture mesh is sufficient to render the view-dependent model. More importantly, every Master Texture image can correctly texture the entire surface of the object, removing expensive computations such as visibility analysis from the rendering algorithm. A benefit of this encoding is the support for pixel-wise view synthesis. The utility of pixel-wise view synthesis is demonstrated with a real-time Master Texture encoded VDTM application. Pixel-wise synthesis is also demonstrated with an algorithm that distills a set of Master Texture images to a single view-independent Master Texture image.