38 research outputs found

Approximating Weighted Duo-Preservation in Comparative Genomics

Author: AR Mushegian
B Brubach
G Cormode
H Jiang
KM Swenson
L Bulteau
M Chrobak
N Boria
NH Mustafa
RC Hardison
S Beretta
TM Chan
W Chen
X Chen
Publication date: 30/08/2017

Motivated by comparative genomics, Chen et al. [9] introduced the Maximum Duo-preservation String Mapping (MDSM) problem, in which we are given two strings $s_1$ and $s_2$ from the same alphabet and the goal is to find a mapping $\pi$ between them so as to maximize the number of duos preserved. A duo is any two consecutive characters in a string, and it is preserved in the mapping if its two consecutive characters in $s_1$ are mapped to the same two consecutive characters in $s_2$. The MDSM problem is known to be NP-hard and there are approximation algorithms for this problem [3, 5, 13], but all of them consider only the "unweighted" version of the problem, in the sense that a duo from $s_1$ is preserved by mapping to any same duo in $s_2$ regardless of their positions in the respective strings. However, in comparative genomics it is desirable to find mappings that favor preserving duos that are "closer" to each other under some distance measure [19]. In this paper, we introduce a generalized version of the problem, called the Maximum-Weight Duo-preservation String Mapping (MWDSM) problem, that captures both duos-preservation and duos-distance measures, in the sense that mapping a duo from $s_1$ to each preserved duo in $s_2$ has a weight, indicating the "closeness" of the two duos. The objective of the MWDSM problem is to find a mapping so as to maximize the total weight of preserved duos. In this paper, we give a polynomial-time 6-approximation algorithm for this problem.

Comment: Appeared in proceedings of the 23rd International Computing and Combinatorics Conference (COCOON 2017)
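To make the objective concrete, the following minimal sketch evaluates a given mapping rather than searching for an optimal one. It assumes the mapping is stored as a list `pi` in which `pi[i]` is the position in the second string that position `i` of the first string maps to. The function name `preserved_weight` and the `weight` callback are illustrative assumptions, not the paper's notation; with the default constant weight the result is the unweighted MDSM count.

```python
def preserved_weight(s1, s2, pi, weight=lambda i, j: 1.0):
    """Total weight of the duos of s1 preserved by the mapping pi.

    pi[i] is the position in s2 that position i of s1 is mapped to.
    weight(i, j) is a hypothetical closeness score for mapping the duo
    starting at position i of s1 to the duo starting at position j of s2.
    """
    n = len(s1)
    # pi must be a bijection between positions that preserves characters.
    if len(s2) != n or sorted(pi) != list(range(n)):
        raise ValueError("pi must be a bijection between positions")
    if any(s1[i] != s2[pi[i]] for i in range(n)):
        raise ValueError("pi must map each character to an equal character")

    total = 0.0
    for i in range(n - 1):
        # The duo (i, i+1) is preserved when its images are consecutive in s2.
        if pi[i + 1] == pi[i] + 1:
            total += weight(i, pi[i])
    return total


# Mapping "abc|ab" onto "ab|abc": three of the four duos are preserved.
print(preserved_weight("abcab", "ababc", [2, 3, 4, 0, 1]))
```

Passing a distance-sensitive callback such as `lambda i, j: 1 / (1 + abs(i - j))` turns the same routine into an evaluator for one possible weighted objective; that particular weight is only an example.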

arXiv.org e-Print Archive

Crossref

An Integer Programming Formulation of the Minimum Common String Partition problem

Author: Ferdous S. M.
Rahman M. Sohel
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 23/05/2014

We consider the problem of finding a minimum common partition of two strings (MCSP). The problem has applications in genome comparison. The MCSP problem has been proved to be NP-hard. In this paper, we develop an Integer Programming (IP) formulation for the problem and implement it. The experimental results are compared with the previous state-of-the-art algorithms and are found to be promising.

Comment: arXiv admin note: text overlap with arXiv:1401.453

arXiv.org e-Print Archive

CiteSeerX

Directory of Open Access Journals

PubMed Central

FigShare

Computational Performance Evaluation of Two Integer Linear Programming Models for the Minimum Common String Partition Problem

Author: Blum Christian
Raidl Günther R.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 26/03/2015

In the minimum common string partition (MCSP) problem, two related input strings are given. "Related" refers to the property that both strings consist of the same set of letters appearing the same number of times in each of the two strings. The MCSP seeks a minimum cardinality partitioning of one string into non-overlapping substrings that is also a valid partitioning for the second string. This problem has applications in bioinformatics, e.g., in analyzing related DNA or protein sequences. For strings with lengths less than about 1000 letters, a previously published integer linear programming (ILP) formulation yields satisfactory results when solved with a state-of-the-art solver such as CPLEX. In this work, we propose a new, alternative ILP model that is compared to the former one. While a polyhedral study shows the linear programming relaxations of the two models to be equally strong, a comprehensive experimental comparison using real-world as well as artificially created benchmark instances indicates substantial computational advantages of the new formulation.

Comment: arXiv admin note: text overlap with arXiv:1405.5646. This paper version replaces the one submitted on January 10, 2015, due to a detected error in the calculation of the variables involved in the ILP model
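The two definitions in this abstract ("related" strings and a common partition) can be checked mechanically. The sketch below is only a feasibility checker under the assumption that a partition is given as a list of substrings in left-to-right order; it does not implement either ILP model, and the function names are hypothetical.

```python
from collections import Counter


def are_related(s1, s2):
    """True if both strings contain the same letters with the same counts."""
    return Counter(s1) == Counter(s2)


def is_common_partition(s1, s2, blocks1, blocks2):
    """True if blocks1 partitions s1, blocks2 partitions s2, and both
    partitions consist of the same multiset of non-empty substrings."""
    return (
        "".join(blocks1) == s1
        and "".join(blocks2) == s2
        and all(blocks1)  # no empty blocks
        and all(blocks2)
        and Counter(blocks1) == Counter(blocks2)
    )


# "abcab" and "ababc" admit a common partition with two blocks.
print(is_common_partition("abcab", "ababc", ["abc", "ab"], ["ab", "abc"]))
```

The MCSP objective is then the smallest number of blocks for which such a pair of partitions exists; splitting both strings into single letters is always feasible for related strings, which gives a trivial upper bound.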

arXiv.org e-Print Archive

Digital.CSIC

Gene Order Phylogeny and the Evolution of Methanogens

Author: Arndt William
Friedman Robert
Luo Haiwei
Shi Jian
Sun Zhiyi
Tang Jijun
Publication venue: Scholar Commons
Publication date: 01/06/2009

Methanogens are a phylogenetically diverse group belonging to Euryarchaeota. Previously, phylogenetic approaches using large datasets revealed that methanogens can be grouped into two classes, “Class I” and “Class II”. However, some deep relationships were not resolved. For instance, the monophyly of “Class I” methanogens, which consist of Methanopyrales, Methanobacteriales and Methanococcales, is disputable due to weak statistical support. In this study, we use MSOAR to identify common orthologous genes from eight methanogen species and a Thermococcales species (outgroup), and apply GRAPPA and FastME to compute distance-based gene order phylogeny. The gene order phylogeny supports two classes of methanogens, but it differs from the original classification of methanogens by placing Methanopyrales and Methanobacteriales together with Methanosarcinales in Class II rather than with Methanococcales. This study suggests a new classification scheme for methanogens. In addition, it indicates that gene order phylogeny can complement traditional sequence-based methods in addressing taxonomic questions for deep relationships.

Directory of Open Access Journals

Scholar Commons - Institutional Repository of the University of South Carolina

PubMed Central

Genomes containing Duplicates are Hard to compare

Author: Chauve Cedric
Fertin Guillaume
Rizzi Romeo
Vialette Stéphane
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2006

In this paper, we are interested in the algorithmic complexity of computing (dis)similarity measures between two genomes when they contain duplicated genes. In that case, there are usually two main ways to compute a given (dis)similarity measure M between two genomes G1 and G2. The first model, which we will call the matching model, consists in making a one-to-one correspondence between genes of G1 and genes of G2, in such a way that M is optimized. The second model, called the exemplar model, consists in keeping in G1 (resp. G2) exactly one copy of each gene, thus deleting all the other copies, in such a way that M is optimized. We present here different results concerning the algorithmic complexity of computing three different similarity measures (number of common intervals, MAD number and SAD number) in those two models, basically showing that the problem becomes NP-complete for each of them as soon as genomes contain duplicates. We show indeed that for common intervals, MAD and SAD, the problem is NP-complete when genes are duplicated in genomes, in both the exemplar and matching models. In the case of MAD and SAD, we actually prove that, under both models, both MAD and SAD problems are APX-hard
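As an illustration of the exemplar model only, the sketch below enumerates every genome obtainable by keeping exactly one copy of each gene. It assumes a genome is a string with one letter per gene and ignores gene signs; the function name `exemplarizations` is hypothetical, and the measure M to be optimized over these candidates is deliberately left out.

```python
from itertools import product


def exemplarizations(genome):
    """Set of all genomes obtained by keeping exactly one copy of each gene.

    The genome is modelled as a string with one letter per gene (an
    assumption of this sketch). The number of candidates is the product of
    the gene multiplicities, so this brute force only suits tiny inputs.
    """
    positions = {}
    for idx, gene in enumerate(genome):
        positions.setdefault(gene, []).append(idx)

    results = set()
    # Choose one occurrence per gene, then keep the chosen genes in order.
    for choice in product(*positions.values()):
        results.add("".join(genome[i] for i in sorted(choice)))
    return results


print(sorted(exemplarizations("abab")))  # ['ab', 'ba']
```

The exponential size of this candidate space is consistent with the hardness results stated in the abstract, although the brute force itself proves nothing about complexity.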

Reversal Distances for Strings with Few Blocks or Small Alphabets

Author: A. Radcliffe
C.A.J. Hurkens
D.A. Christie
G. Watterson
J. Fischer
L. Bulteau
P. Berman
T. Jiang
V. Bafna
X. Chen
Z. Fu
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2014

We study the String Reversal Distance problem, an extension of the well-known Sorting by Reversals problem. String Reversal Distance takes two strings S and T as input, and asks for a minimum number of reversals to obtain T from S. We consider four variants: String Reversal Distance, String Prefix Reversal Distance (in which any reversal must include the first letter of the string), and the signed variants of these problems, namely Signed String Reversal Distance and Signed String Prefix Reversal Distance. We study algorithmic properties of these four problems, in connection with two parameters of the input strings: the number of blocks they contain (a block being a maximal substring such that all letters in the substring are equal), and the alphabet size Σ. For instance, we show that Signed String Reversal Distance and Signed String Prefix Reversal Distance are NP-hard even if the input strings have only one letter.
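The two ingredients named in this abstract, the block count of a string and a prefix reversal, are easy to state in code. The sketch below covers unsigned strings only and uses hypothetical function names; it computes the parameter and applies one operation, not the reversal distance itself.

```python
from itertools import groupby


def count_blocks(s):
    """Number of maximal runs of equal letters in s."""
    return sum(1 for _ in groupby(s))


def prefix_reversal(s, j):
    """Reverse the first j letters of s (a reversal that includes the
    first letter of the string)."""
    if not 1 <= j <= len(s):
        raise ValueError("j must satisfy 1 <= j <= len(s)")
    return s[:j][::-1] + s[j:]


print(count_blocks("aabbbac"))        # 4 blocks: aa, bbb, a, c
print(prefix_reversal("abcde", 3))    # cbade
```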

Crossref

Draft genome sequence of marine alphaproteobacterial strain HIMB11, the first cultivated representative of a unique lineage within the Roseobacter clade possessing an unusually small genome

Author: Bender Sara J.
Brown Julia M.
Casey John F.
Church Matthew J.
DeLong Edward F.
Dron Antony
Durham Bryndan P.
Eppley John
Florez-Leiva Lennis
Grim Sharon L.
Grote Jana
Karl David M.
Krupke Andreas
Kyrpides Nikos C.
Luo Haiwei
Luria Catherine M.
Mine Aric
Nigro Olivia D.
Pather Santhiska
Rappe Michael S.
Schuster Stephan
Steward Grieg F.
Talarmin Agathe
Wear Emma K.
Weber Thomas S.
Whittaker Kerry A.
Wilson Jesse M.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013

© The Author(s), 2014. This article is distributed under the terms of the Creative Commons Attribution License. The definitive version was published in Standards in Genomic Sciences 9 (2014): 632-645, doi:10.4056/sigs.4998989.

Strain HIMB11 is a planktonic marine bacterium isolated from coastal seawater in Kaneohe Bay, Oahu, Hawaii, belonging to the ubiquitous and versatile Roseobacter clade of the alphaproteobacterial family Rhodobacteraceae. Here we describe the preliminary characteristics of strain HIMB11, including annotation of the draft genome sequence and comparative genomic analysis with other members of the Roseobacter lineage. The 3,098,747 bp draft genome is arranged in 34 contigs and contains 3,183 protein-coding genes and 54 RNA genes. Phylogenomic and 16S rRNA gene analyses indicate that HIMB11 represents a unique sublineage within the Roseobacter clade. Comparison with other publicly available genome sequences from members of the Roseobacter lineage reveals that strain HIMB11 has the genomic potential to utilize a wide variety of energy sources (e.g. organic matter, reduced inorganic sulfur, light, carbon monoxide), while possessing a reduced number of substrate transporters.

We gratefully acknowledge the support of the Gordon and Betty Moore Foundation, which funded the sequencing of this genome. Annotation was performed as part of the 2011 C-MORE Summer Course in Microbial Oceanography (http://cmore.soest.hawaii.edu/summercourse/2011/index.htm), with support by the Agouron Institute, the Gordon and Betty Moore Foundation, the University of Hawaii and Manoa School of Ocean and Earth Science and Technology (SOEST), and the Center for Microbial Oceanography: Research and Education (C-MORE), a National Science Foundation-funded Science and Technology Center (award No. EF0424599).

DSpace@MIT

Crossref

Woods Hole Open Access Server

Springer - Publisher Connector

PubMed Central

eScholarship - University of California

DigitalCommons@URI

On the minimum common integer partition problem

Author: Chen X.
Chrobak M.
Lan Liu
Sankoff D.
Tao Jiang
Xin Chen
Zheng Liu
Publication venue: 'Association for Computing Machinery (ACM)'

Crossref

Approximating reversal distance for strings with bounded number of duplicates

Author: Kolman Petr
Waleń Tomasz
Publication venue: Elsevier B.V.
Publication date: 01/02/2007
Field of study

For a string A = a_1 … a_n, a reversal ρ(i, j), 1 ≤ i ≤ j ≤ n, transforms the string A into the string A′ = a_1 … a_{i-1} a_j a_{j-1} … a_i a_{j+1} … a_n; that is, the reversal ρ(i, j) reverses the order of symbols in the substring a_i … a_j of A. In the case of signed strings, where each symbol is given a sign + or -, the reversal operation also flips the sign of each symbol in the reversed substring. Given two strings, A and B, signed or unsigned, sorting by reversals (SBR) is the problem of finding the minimum number of reversals that transform the string A into the string B. Traditionally, the problem was studied for permutations, that is, for strings in which every symbol appears exactly once. We consider a generalization of the problem, k-SBR, and allow each symbol to appear at most k times in each string, for some k ≥ 1. The main result of the paper is an O(k^2)-approximation algorithm running in time O(n). For instances with 3 < k ≤ O(log n log* n), this is the best known approximation algorithm for k-SBR and, moreover, it is faster than the previous best approximation algorithm.
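The reversal operation defined at the start of this abstract can be written down directly. The sketch below keeps the abstract's 1-indexed, inclusive positions i and j. Representing a signed string as a list of nonzero integers whose sign is the symbol's sign is an assumption of this example, as are the function names.

```python
def reversal(a, i, j):
    """Apply rho(i, j) to an unsigned string: reverse a_i ... a_j (1-indexed)."""
    if not 1 <= i <= j <= len(a):
        raise ValueError("need 1 <= i <= j <= len(a)")
    return a[:i - 1] + a[i - 1:j][::-1] + a[j:]


def signed_reversal(a, i, j):
    """Apply rho(i, j) to a signed string given as a list of nonzero ints:
    reverse the segment and flip the sign of every symbol in it."""
    if not 1 <= i <= j <= len(a):
        raise ValueError("need 1 <= i <= j <= len(a)")
    return a[:i - 1] + [-x for x in reversed(a[i - 1:j])] + a[j:]


print(reversal("abcde", 2, 4))                # adcbe
print(signed_reversal([1, -2, 3, 4], 2, 3))   # [1, -3, 2, 4]
```

Sorting by reversals then asks for the shortest sequence of such operations turning A into B; this sketch applies a single operation and makes no attempt at the approximation algorithm.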

Elsevier - Publisher Connector