Safe and complete contig assembly via omnitigs

A Bankevich; A Guénoche; AR Rubinov; AS Motahari; C Kingsford; D Haussler; DR Zerbino; E Kapun; ES Lander; G Bresler; G Narzisi; I Lysov; JD Kececioglu; JR Miller; JT Simpson; K Lam; K Sahlin; L Salmela; M Boetzer; N Nagarajan; N Vyahhi; P Medvedev; PA Pevzner; R Chikhi; R Luo; R Uricaru; RM Idury; SL Salzberg

Publication date: 16 August 2016

Abstract

Contig assembly is the first stage that most assemblers solve when reconstructing a genome from a set of reads. Its output consists of contigs -- a set of strings that are promised to appear in any genome that could have generated the reads. From the introduction of contigs 20 years ago, assemblers have tried to obtain longer and longer contigs, but the following question was never solved: given a genome graph G (e.g. a de Bruijn, or a string graph), what are all the strings that can be safely reported from G as contigs? In this paper we finally answer this question, and also give a polynomial time algorithm to find them. Our experiments show that these strings, which we call omnitigs, are 66% to 82% longer on average than the popular unitigs, and 29% of dbSNP locations have more neighbors in omnitigs than in unitigs.

Comment: Full version of the paper in the proceedings of RECOMB 201
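For readers unfamiliar with the unitig baseline that the abstract compares against, the following is a minimal sketch of extracting unitigs (maximal non-branching paths) from a de Bruijn graph. It is not the omnitig algorithm of the paper, which is not described on this page. The function names `de_bruijn` and `unitigs`, the edge-centric graph convention (nodes are (k-1)-mers, edges are k-mers), and the toy reads are all assumptions made for illustration.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build an edge-centric de Bruijn graph (hypothetical helper).

    Nodes are (k-1)-mers; every distinct k-mer in the reads is one edge
    from its (k-1)-prefix to its (k-1)-suffix.
    """
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])

    out_edges = defaultdict(list)  # node -> list of successor nodes
    in_deg = defaultdict(int)      # node -> number of incoming edges
    for kmer in sorted(kmers):     # sorted for deterministic output
        out_edges[kmer[:-1]].append(kmer[1:])
        in_deg[kmer[1:]] += 1
    return out_edges, in_deg


def unitigs(reads, k):
    """Return the strings spelled by maximal non-branching paths.

    Isolated cycles (components in which every node has exactly one
    incoming and one outgoing edge) are skipped in this sketch.
    """
    out_edges, in_deg = de_bruijn(reads, k)
    nodes = set(out_edges) | set(in_deg)

    def is_internal(v):
        # A node can sit inside a unitig only if it has one in-edge
        # and one out-edge.
        return in_deg.get(v, 0) == 1 and len(out_edges.get(v, [])) == 1

    result = []
    for v in sorted(nodes):
        if is_internal(v):
            continue  # paths start only at branching or end nodes
        for w in out_edges.get(v, []):
            path = v + w[-1]
            while is_internal(w):
                w = out_edges[w][0]
                path += w[-1]
            result.append(path)
    return sorted(result)
```

With the toy input `unitigs(["AAGCT", "AAGCG"], 3)`, the shared prefix is reported once and the path splits at the branching node `GC`, giving `["AAGC", "GCG", "GCT"]`.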

DOI: 10.1007/978-3-319-3...