628 research outputs found
Approximating Weighted Duo-Preservation in Comparative Genomics
Motivated by comparative genomics, Chen et al. [9] introduced the Maximum
Duo-preservation String Mapping (MDSM) problem in which we are given two
strings and from the same alphabet and the goal is to find a
mapping between them so as to maximize the number of duos preserved. A
duo is any two consecutive characters in a string and it is preserved in the
mapping if its two consecutive characters in are mapped to same two
consecutive characters in . The MDSM problem is known to be NP-hard and
there are approximation algorithms for this problem [3, 5, 13], but all of them
consider only the "unweighted" version of the problem in the sense that a duo
from is preserved by mapping to any same duo in regardless of their
positions in the respective strings. However, it is well-desired in comparative
genomics to find mappings that consider preserving duos that are "closer" to
each other under some distance measure [19]. In this paper, we introduce a
generalized version of the problem, called the Maximum-Weight Duo-preservation
String Mapping (MWDSM) problem that captures both duos-preservation and
duos-distance measures in the sense that mapping a duo from to each
preserved duo in has a weight, indicating the "closeness" of the two
duos. The objective of the MWDSM problem is to find a mapping so as to maximize
the total weight of preserved duos. In this paper, we give a polynomial-time
6-approximation algorithm for this problem.Comment: Appeared in proceedings of the 23rd International Computing and
Combinatorics Conference (COCOON 2017
Beyond Adjacency Maximization: Scaffold Filling for New String Distances
International audienceIn Genomic Scaffold Filling, one aims at polishing in silico a draft genome, called scaffold. The scaffold is given in the form of an ordered set of gene sequences, called contigs. This is done by confronting the scaffold to an already complete reference genome from a close species. More precisely, given a scaffold S, a reference genome G and a score function f () between two genomes, the aim is to complete S by adding the missing genes from G so that the obtained complete genome S * optimizes f (S * , G). In this paper, we extend a model of Jiang et al. [CPM 2016] (i) by allowing the insertions of strings instead of single characters (i.e., some groups of genes may be forced to be inserted together) and (ii) by considering two alternative score functions: the first generalizes the notion of common adjacencies by maximizing the number of common k-mers between S * and G (k-Mer Scaffold Filling), the second aims at minimizing the number of breakpoints between S * and G (Min-Breakpoint Scaffold Filling). We study these problems from the parameterized complexity point of view, providing fixed-parameter (FPT) algorithms for both problems. In particular, we show that k-Mer Scaffold Filling is FPT wrt. parameter , the number of additional k-mers realized by the completion of S—this answers an open question of Jiang et al. [CPM 2016]. We also show that Min-Breakpoint Scaffold Filling is FPT wrt. a parameter combining the number of missing genes, the number of gene repetitions and the target distance
Reconstructing the History of Syntenies Through Super-Reconciliation
Classical gene and species tree reconciliation, used to infer the history of gene gain and loss explaining the evolution of gene families, assumes an independent evolution for each family. While this assumption is reasonable for genes that are far apart in the genome, it is clearly not suited for genes grouped in syntenic blocks, which are more plausibly the result of a concerted evolution. Here, we introduce the Super-Reconciliation model, that extends the traditional Duplication-Loss model to the reconciliation of a set of trees, accounting for segmental duplications and losses. From a complexity point of view, we show that the associated decision problem is NP-hard. We then give an exact exponential-time algorithm for this problem, assess its time efficiency on simulated datasets, and give a proof of concept on the opioid receptor genes
Minimum Common String Partition: Exact Algorithms
In the minimum common string partition problem (MCSP), one gets two strings and is asked to find the minimum number of cuts in the first string such that the second string can be obtained by rearranging the resulting pieces. It is a difficult algorithmic problem having applications in computational biology, text processing, and data compression. MCSP has been studied extensively from various algorithmic angles: there are many papers studying approximation, heuristic, and parameterized algorithms. At the same time, almost nothing is known about its exact complexity. In this paper, we present new results in this direction. We improve the known 2? upper bound (where n is the length of input strings) to ?? where ? ? 1.618... is the golden ratio. The algorithm uses Fibonacci numbers to encode subsets as monomials of a certain implicit polynomial and extracts one of its coefficients using the fast Fourier transform. Then, we show that the case of constant size alphabet can be solved in subexponential time 2^{O(nlog log n/log n)} by a hybrid strategy: enumerate all long pieces and use dynamic programming over histograms of short pieces. Finally, we prove almost matching lower bounds assuming the Exponential Time Hypothesis
Evolution through segmental duplications and losses : A Super-Reconciliation approach
The classical gene and species tree reconciliation, used to infer the history of gene gain and loss explaining the evolution of gene families, assumes an independent evolution for each family. While this assumption is reasonable for genes that are far apart in the genome, it is not appropriate for genes grouped into syntenic blocks, which are more plausibly the result of a concerted evolution. Here, we introduce the Super-Reconciliation problem which consists in inferring a history of segmental duplication and loss events (involving a set of neighboring genes) leading to a set of present-day syntenies from a single ancestral one. In other words, we extend the traditional Duplication-Loss reconciliation problem of a single gene tree, to a set of trees, accounting for segmental duplications and losses. Existency of a Super-Reconciliation depends on individual gene tree consistency. In addition, ignoring rearrangements implies that existency also depends on gene order consistency. We first show that the problem of reconstructing a most parsimonious Super-Reconciliation, if any, is NP-hard and give an exact exponential-time algorithm to solve it. Alternatively, we show that accounting for rearrangements in the evolutionary model, but still only minimizing segmental duplication and loss events, leads to an exact polynomial-time algorithm. We finally assess time efficiency of the former exponential time algorithm for the Duplication-Loss model on simulated datasets, and give a proof of concept on the opioid receptor genes
Design and Evaluation of Packet Classification Systems, Doctoral Dissertation, December 2006
Although many algorithms and architectures have been proposed, the design of efficient packet classification systems remains a challenging problem. The diversity of filter specifications, the scale of filter sets, and the throughput requirements of high speed networks all contribute to the difficulty. We need to review the algorithms from a high-level point-of-view in order to advance the study. This level of understanding can lead to significant performance improvements. In this dissertation, we evaluate several existing algorithms and present several new algorithms as well. The previous evaluation results for existing algorithms are not convincing because they have not been done in a consistent way. To resolve this issue, an objective evaluation platform needs to be developed. We implement and evaluate several representative algorithms with uniform criteria. The source code and the evaluation results are both published on a web-site to provide the research community a benchmark for impartial and thorough algorithm evaluations. We propose several new algorithms to deal with the different variations of the packet classification problem. They are: (1) the Shape Shifting Trie algorithm for longest prefix matching, used in IP lookups or as a building block for general packet classification algorithms; (2) the Fast Hash Table lookup algorithm used for exact flow match; (3) the longest prefix matching algorithm using hash tables and tries, used in IP lookups or packet classification algorithms;(4) the 2D coarse-grained tuple-space search algorithm with controlled filter expansion, used for two-dimensional packet classification or as a building block for general packet classification algorithms; (5) the Adaptive Binary Cutting algorithm used for general multi-dimensional packet classification. In addition to the algorithmic solutions, we also consider the TCAM hardware solution. In particular, we address the TCAM filter update problem for general packet classification and provide an efficient algorithm. Building upon the previous work, these algorithms significantly improve the performance of packet classification systems and set a solid foundation for further study
The Maximum Duo-Preservation String Mapping Problem with Bounded Alphabet
Given two strings A and B such that B is a permutation of A, the max duo-preservation string mapping (MPSM) problem asks to find a mapping ? between them so as to preserve a maximum number of duos. A duo is any pair of consecutive characters in a string and it is preserved by ? if its two consecutive characters in A are mapped to same two consecutive characters in B. This problem has received a growing attention in recent years, partly as an alternative way to produce approximation algorithms for its minimization counterpart, min common string partition, a widely studied problem due its applications in comparative genomics. Considering this favored field of application with short alphabet, it is surprising that MPSM^?, the variant of MPSM with bounded alphabet, has received so little attention, with a single yet impressive work that provides a 2.67-approximation achieved in O(n) [Brubach, 2018], where n = |A| = |B|. Our work focuses on MPSM^?, and our main contribution is the demonstration that this problem admits a Polynomial Time Approximation Scheme (PTAS) when ? = O(1). We also provide an alternate, somewhat simpler, proof of NP-hardness for this problem compared with the NP-hardness proof presented in [Haitao Jiang et al., 2012]
A Family of Approximation Algorithms for the Maximum Duo-Preservation String Mapping Problem
In the Maximum Duo-Preservation String Mapping problem we are given two strings and wish to map the letters of the former to the letters of the latter as to maximise the number of duos. A duo is a pair of consecutive letters that is mapped to a pair of consecutive letters in the same order. This is complementary to the well-studied Minimum Common String Partition problem, where the goal is to partition the former string into blocks that can be permuted and concatenated to obtain the latter string.
Maximum Duo-Preservation String Mapping is APX-hard. After a series of improvements, Brubach [WABI 2016] showed a polynomial-time 3.25-approximation algorithm. Our main contribution is that, for any eps>0, there exists a polynomial-time (2+eps)-approximation algorithm. Similarly to a previous solution by Boria et al. [CPM 2016], our algorithm uses the local search technique. However, this is used only after a certain preliminary greedy procedure, which gives us more structure and makes a more general local search possible. We complement this with a specialised version of the algorithm that achieves 2.67-approximation in quadratic time
Revisiting the Parameterized Complexity of Maximum-Duo Preservation String Mapping
In the Maximum-Duo Preservation String Mapping (Max-Duo PSM) problem, the input consists of two related strings A and B of length n and a nonnegative integer k. The objective is to determine whether there exists a mapping m from the set of positions of A to the set of positions of B that maps only to positions with the same character and preserves at least k duos, which are pairs of adjacent positions. We develop a randomized algorithm that solves Max-Duo PSM in time 4^k * n^{O(1)}, and a deterministic algorithm that solves this problem in time 6.855^k * n^{O(1)}. The previous best known (deterministic) algorithm for this problem has running time (8e)^{2k+o(k)} * n^{O(1)} [Beretta et al., Theor. Comput. Sci. 2016]. We also show that Max-Duo PSM admits a problem kernel of size O(k^3), improving upon the previous best known problem kernel of size O(k^6)
- …