
Cold Spring Harbor Laboratory Institutional Repository

Texas ScholarWorks

Reconstructing the modular recombination history of Staphylococcus aureus phages

Author: Anne Bergeron
D Botstein
E Ukkonen
GM Rousseau
HF Chambers
Hugo Deschênes
J Kahankova
JD Kececioglu
JT Martinsohn
Krister M Swenson
M Krupovic
Paul Guertin
Y Wu
Publication venue: Springer Science and Business Media LLC

Improvement in accuracy of multiple sequence alignment using novel group-to-group sequence alignment algorithm with piecewise linear gap cost

Author: A Bahr
AR Subramanian
C Grasso
C Notredame
CB Do
Hayato Yamana
J Kececioglu
JD Thompson
K Karplus
K Katoh
MA McClure
O Gotoh
O O'Sullivan
Osamu Gotoh
RC Edgar
Shinsuke Yamada
T Jiang
W Miller
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Multiple sequence alignment (MSA) is a useful tool in bioinformatics. Although many MSA algorithms have been developed, there is still room for improvement in accuracy and speed. In the alignment of a family of protein sequences, global MSA algorithms perform better than local ones in many cases, while local ones perform better than global ones when some sequences have long insertions or deletions (indels) relative to others. Many recent leading MSA algorithms have incorporated pairwise alignment information obtained from a mixture of sources into their scoring system to improve accuracy of alignment containing long indels. RESULTS: We propose a novel group-to-group sequence alignment algorithm that uses a piecewise linear gap cost. We developed a program called PRIME, which employs our proposed algorithm to optimize the well-defined sum-of-pairs score. PRIME stands for Profile-based Randomized Iteration MEthod. We evaluated PRIME and some recent MSA programs using BAliBASE version 3.0 and PREFAB version 4.0 benchmarks. The results of benchmark tests showed that PRIME can construct accurate alignments comparable to the most accurate programs currently available, including L-INS-i of MAFFT, ProbCons, and T-Coffee. CONCLUSION: PRIME enables users to construct accurate alignments without having to employ pairwise alignment information. PRIME is available at
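The abstract names a piecewise linear gap cost without defining it. One common formulation, shown here as a minimal sketch and not as PRIME's actual implementation, takes the cost of a gap as the minimum over several affine (open plus extend) pieces, so that long gaps are charged at a lower per-residue rate. The function name and the parameter values below are hypothetical.

```python
from typing import Sequence, Tuple


def piecewise_linear_gap_cost(
    length: int, pieces: Sequence[Tuple[float, float]]
) -> float:
    """Cost of a gap of `length` residues.

    Each piece is an (open_cost, extend_cost) pair describing one affine
    function open_cost + extend_cost * length. Taking the minimum over the
    pieces yields a concave, piecewise linear cost (an assumed formulation).
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0.0
    return min(open_cost + extend_cost * length for open_cost, extend_cost in pieces)


# Hypothetical parameters: the first piece is cheaper for short gaps,
# the second for long gaps.
HYPOTHETICAL_PIECES = [(10.0, 2.0), (20.0, 1.0)]

if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(n, piecewise_linear_gap_cost(n, HYPOTHETICAL_PIECES))
```

With a single piece this reduces to the ordinary affine gap cost; adding pieces with larger opening and smaller extension costs is what lets long indels be penalized less steeply.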

Approximating the double-cut-and-join distance between unsigned genomes

Author: A Bergeron
A Caprara
CM Papadimitriou
G Lin
H Jiang
JD Kececioglu
Jiadong Yu
MM Halldórsson
R Sun
Ruimin Sun
S Hannenhalli
S Yancopoulos
V Bafna
X Chen
Xin Chen
Publication venue: BioMed Central
Publication date: 01/01/2011

In this paper we study the problem of sorting unsigned genomes by double-cut-and-join operations, where genomes allow a mix of linear and circular chromosomes to be present. First, we formulate an equivalent optimization problem, called maximum cycle/path decomposition, which is aimed at finding a largest collection of edge-disjoint cycles/AA-paths/AB-paths in a breakpoint graph. Then, we show that the problem of finding a largest collection of edge-disjoint cycles/AA-paths/AB-paths of length no more than l can be reduced to the well-known degree-bounded k-set packing problem with k = 2l. Finally, a polynomial-time approximation algorithm for the problem of sorting unsigned genomes by double-cut-and-join operations is devised, which achieves the approximation ratio for any positive ε. For the restricted variation where each genome contains only one linear chromosome, the approximation ratio can be further improved t

DR-NTU (Digital Repository of NTU)

Safe and complete contig assembly via omnitigs

Author: A Bankevich
A Guénoche
AR Rubinov
AS Motahari
C Kingsford
D Haussler
DR Zerbino
E Kapun
ES Lander
G Bresler
G Narzisi
I Lysov
JD Kececioglu
JR Miller
JT Simpson
K Lam
K Sahlin
L Salmela
M Boetzer
N Nagarajan
N Vyahhi
P Medvedev
PA Pevzner
R Chikhi
R Luo
R Uricaru
RM Idury
SL Salzberg
Publication date: 16/08/2016

Contig assembly is the first stage that most assemblers solve when reconstructing a genome from a set of reads. Its output consists of contigs -- a set of strings that are promised to appear in any genome that could have generated the reads. From the introduction of contigs 20 years ago, assemblers have tried to obtain longer and longer contigs, but the following question was never solved: given a genome graph G (e.g. a de Bruijn, or a string graph), what are all the strings that can be safely reported from G as contigs? In this paper we finally answer this question, and also give a polynomial time algorithm to find them. Our experiments show that these strings, which we call omnitigs, are 66% to 82% longer on average than the popular unitigs, and 29% of dbSNP locations have more neighbors in omnitigs than in unitigs. Comment: Full version of the paper in the proceedings of RECOMB 201
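For context on the unitigs used as the baseline above, here is a minimal sketch of extracting unitigs (maximal non-branching paths) from a de Bruijn graph. It is not the omnitig algorithm from the paper. It assumes error-free, single-stranded reads, ignores reverse complements, and skips isolated cycles; the function names are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Set, Tuple


def de_bruijn_edges(reads: Iterable[str], k: int) -> Set[Tuple[str, str]]:
    """Each distinct k-mer becomes an edge from its (k-1)-prefix to its (k-1)-suffix."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges.add((kmer[:-1], kmer[1:]))
    return edges


def unitigs(reads: Iterable[str], k: int) -> List[str]:
    """Spell every maximal non-branching path of the de Bruijn graph."""
    out_edges = defaultdict(list)
    in_degree = defaultdict(int)
    for u, v in sorted(de_bruijn_edges(reads, k)):
        out_edges[u].append(v)
        in_degree[v] += 1
    nodes = set(out_edges) | set(in_degree)

    def is_internal(node: str) -> bool:
        # A node with exactly one way in and one way out cannot start or end a unitig.
        return in_degree[node] == 1 and len(out_edges[node]) == 1

    paths = []
    for start in sorted(nodes):
        if is_internal(start):
            continue
        for nxt in out_edges[start]:
            path = start + nxt[-1]
            # Extend through internal nodes until a branch or a dead end is reached.
            while is_internal(nxt):
                nxt = out_edges[nxt][0]
                path += nxt[-1]
            paths.append(path)
    return paths


if __name__ == "__main__":
    # Two toy reads that share a prefix and then diverge.
    print(unitigs(["ACGTT", "ACGTA"], k=3))
```

In this toy input the shared prefix forms one unitig that stops at the branching node, and each divergent suffix forms its own short unitig, which illustrates why unitigs tend to be short near variation.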

arXiv.org e-Print Archive

Optimizing substitution matrix choice and gap parameters for sequence alignment

Author: CB Do
CN Dewey
D Gusfield
DT Jones
E Kim
G Blackshields
GA Price
GH Gonnet
I Van Walle
J Flannick
J Kececioglu
J Pei
JD Thompson
JG Henikoff
K Katoh
M Box
MA Larkin
MO Dayhoff
MP Styczynski
MS Waterman
O Chapelle
RC Edgar
Robert C Edgar
S Henikoff
T Lassmann
T Muller
TM Phuong
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: While substitution matrices can readily be computed from reference alignments, it is challenging to compute optimal or approximately optimal gap penalties. It is also not well understood which substitution matrices are the most effective when alignment accuracy is the goal rather than homolog recognition. Here a new parameter optimization procedure, POP, is described and applied to the problems of optimizing gap penalties and selecting substitution matrices for pair-wise global protein alignments. Results: POP is compared to a recent method due to Kim and Kececioglu and found to achieve from 0.2% to 1.3% higher accuracies on pair-wise benchmarks extracted from BALIBASE. The VTML matrix series is shown to be the most accurate on several global pair-wise alignment benchmarks, with VTML200 giving best or close to the best performance in all tests. BLOSUM matrices are found to be slightly inferior, even with the marginal improvements in the bug-fixed RBLOSUM series. The PAM series is significantly worse, giving accuracies typically 2% less than VTML. Integer rounding is found to cause slight degradations in accuracy. No evidence is found that selecting a matrix based on sequence divergence improves accuracy, suggesting that the use of this heuristic in CLUSTALW may be ineffective. Using VTML200 is found to improve the accuracy of CLUSTALW by 8% on BALIBASE and 5% on PREFAB. Conclusion: The hypothesis that more accurate alignments of distantly related sequences may be achieved using low-identity matrices is shown to be false for commonly used matrix types. Source code and test data are freely available from the author's web site at http://www.drive5.com/pop.
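To make concrete what the gap penalties and substitution scores control, the following is a minimal sketch of pair-wise global alignment scoring with affine gap penalties (the standard three-state dynamic program). It is not POP and does not use a VTML, BLOSUM, or PAM matrix: the `toy_score` function and the penalty values are hypothetical stand-ins, and a gap of length L is assumed to cost `gap_open + gap_extend * L`.

```python
from typing import Callable


def global_affine_score(
    a: str,
    b: str,
    score: Callable[[str, str], float],
    gap_open: float,
    gap_extend: float,
) -> float:
    """Optimal global alignment score of a and b under affine gap penalties."""
    neg = float("-inf")
    n, m = len(a), len(b)
    # M: last column aligns a[i-1] with b[j-1]
    # X: last column aligns a[i-1] with a gap; Y: aligns a gap with b[j-1]
    M = [[neg] * (m + 1) for _ in range(n + 1)]
    X = [[neg] * (m + 1) for _ in range(n + 1)]
    Y = [[neg] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + gap_extend * i)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + gap_extend * j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            M[i][j] = best_prev + score(a[i - 1], b[j - 1])
            # Opening a new gap pays gap_open plus one extension; continuing pays one extension.
            X[i][j] = max(
                M[i - 1][j] - gap_open - gap_extend,
                Y[i - 1][j] - gap_open - gap_extend,
                X[i - 1][j] - gap_extend,
            )
            Y[i][j] = max(
                M[i][j - 1] - gap_open - gap_extend,
                X[i][j - 1] - gap_open - gap_extend,
                Y[i][j - 1] - gap_extend,
            )
    return max(M[n][m], X[n][m], Y[n][m])


def toy_score(x: str, y: str) -> float:
    """Hypothetical stand-in for a substitution matrix: +1 match, -1 mismatch."""
    return 1.0 if x == y else -1.0


if __name__ == "__main__":
    print(global_affine_score("ACGT", "AGT", toy_score, gap_open=2.0, gap_extend=1.0))
```

Swapping `toy_score` for a lookup into a real substitution matrix and varying `gap_open` and `gap_extend` changes which alignment is optimal, which is the parameter space that a procedure such as the one described above searches.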

Public Library of Science (PLOS)

MACSE: Multiple Alignment of Coding SEquences Accounting for Frameshifts and Stop Codons

Author: A Löytynoja
B Chevreux
B Morgenstern
C Notredame
CNS Pedersen
D Huchon
D Przybylski
D Sankoff
D Zheng
DG Higgins
E Dermitzakis
Emmanuel J. P. Douzery
F Abascal
F Delsuc
Frédéric Delsuc
H Philippe
H Zhao
J Hein
J Kececioglu
J Raes
JD Thompson
K Katoh
KM Wong
L Arvestad
L Salmela
M Dayhoff
M Gouy
M Kircher
M Margulies
M Suyama
MT Gilbert
N Galtier
OR Bininda-Emonds
P Sneath
PJ Farabaugh
R Wernersson
RC Edgar
RK Bradley
RR Stocsits
RW Meredith
S Henikoff
S Needleman
SF Altschul
SS Steiger
Sébastien Harispe
T Smith
TA Demere
TJ Hubbard
TJ Wheeler
V Ranwez
Vincent Ranwez
William J. Murphy
X Guan
X Huang
Y Van de Peer
Publication venue: Public Library of Science
Publication date: 01/01/2011

Until now the most efficient solution to align nucleotide sequences containing open reading frames was to use indirect procedures that align amino acid translation before reporting the inferred gap positions at the codon level. There are two important pitfalls with this approach. Firstly, any premature stop codon impedes using such a strategy. Secondly, each sequence is translated with the same reading frame from beginning to end, so that the presence of a single additional nucleotide leads to both aberrant translation and alignment

Public Library of Science (PLOS)

LOCAS – A Low Coverage Assembly Tool for Resequencing Projects

Author: A Doring
AR Quinlan
B Langmead
C Nusbaum
D Hernandez
D Weigel
Daniel H. Huson
DC Richter
Detlef Weigel
DR Zerbino
EW Myers
H Li
I Birol
JD Kececioglu
JO Korbel
JT Simpson
Juliane D. Klein
K Schneeberger
Korbinian Schneeberger
LE Palmer
M Pop
MC Wendl
MJ Chaisson
PA Pevzner
R Li
RM Durbin
S Ossowski
SL Salzberg
SM Rumble
SQ Le
Stephan Ossowski
T Rausch
Ying Xu
Publication venue: Public Library of Science
Publication date: 01/01/2011

Motivation: Next Generation Sequencing (NGS) is a frequently applied approach to detect sequence variations between highly related genomes. Recent large-scale re-sequencing studies such as the Human 1000 Genomes Project utilize NGS data of low coverage to afford sequencing of hundreds of individuals. Here, SNPs and micro-indels can be detected by applying an alignment-consensus approach. However, computational methods capable of discovering other variations such as novel insertions or highly diverged sequence from low coverage NGS data are still lacking. Results: We present LOCAS, a new NGS assembler particularly designed for low coverage assembly of eukaryotic genomes using a mismatch sensitive overlap-layout-consensus approach. LOCAS assembles homologous regions in a homology-guided manner while it performs de novo assemblies of insertions and highly polymorphic target regions subsequent to an alignment-consensus approach. LOCAS has been evaluated in homology-guided assembly scenarios with low sequence coverage of Arabidopsis thaliana strains sequenced as part of the Arabidopsis 1001 Genomes Project. While assembling the same amount of long insertions as state-of-the-art NGS assemblers, LOCAS showed best results regarding contig size, error rate and runtime. Conclusion: LOCAS produces excellent results for homology-guided assembly of eukaryotic genomes with short reads and low sequencing depth, and therefore appears to be the assembly tool of choice for the detection of novel sequenc

CiteSeerX