
    Safe and complete contig assembly via omnitigs

    Contig assembly is the first stage that most assemblers solve when reconstructing a genome from a set of reads. Its output consists of contigs: a set of strings that are promised to appear in any genome that could have generated the reads. Since the introduction of contigs 20 years ago, assemblers have tried to obtain longer and longer contigs, but the following question was never solved: given a genome graph G (e.g. a de Bruijn or a string graph), what are all the strings that can be safely reported from G as contigs? In this paper we finally answer this question, and also give a polynomial-time algorithm to find them. Our experiments show that these strings, which we call omnitigs, are 66% to 82% longer on average than the popular unitigs, and that 29% of dbSNP locations have more neighbors in omnitigs than in unitigs. (Full version of the paper in the proceedings of RECOMB 2016.)
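
    Omnitig enumeration itself is beyond a short example, but the unitigs the paper benchmarks against are simply the maximal non-branching paths of the genome graph. A minimal sketch, assuming the graph is given as a successor map (isolated cycles are omitted for brevity):

```python
from collections import defaultdict

def unitigs(graph):
    """Maximal non-branching paths of a directed graph.

    graph: dict mapping each node to a list of successor nodes.
    Unitigs are the baseline safe strings that omnitigs generalize.
    """
    indeg = defaultdict(int)
    for u, vs in graph.items():
        for v in vs:
            indeg[v] += 1
    outdeg = {u: len(vs) for u, vs in graph.items()}

    def branching(v):
        # a node is non-branching iff it has exactly one in- and out-edge
        return indeg[v] != 1 or outdeg.get(v, 0) != 1

    paths = []
    for u in graph:
        if branching(u):                 # every unitig starts at a branching node
            for v in graph[u]:
                path = [u, v]
                while not branching(path[-1]):
                    path.append(graph[path[-1]][0])
                paths.append(path)
    return paths
```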

    Clustering exact matches of pairwise sequence alignments by weighted linear regression

    BACKGROUND: At intermediate stages of genome assembly projects, when a number of contigs have been generated and their validity needs to be verified, it is desirable to align these contigs to a reference genome when one is available. The interest is not in analyzing a detailed base-level alignment between a contig and the reference genome, but rather in obtaining a rough estimate of where the contig aligns to the reference, specifically by identifying the starting and ending positions of such a region. This information is very useful for ordering the contigs and facilitates post-assembly analysis such as gap closure and repeat resolution. Programs such as BLAST and MUMmer can quickly align and identify high-similarity segments between two sequences, which, when seen in a dot plot, tend to agglomerate along a diagonal but can also be disrupted by gaps or shifted away from the main diagonal by mismatches between the contig and the reference. Visually inspecting the dot plot to identify the regions covered by the large number of contigs from a sequence assembly project is tedious and practically impossible. A forced global alignment between a contig and the reference is not only time consuming but often meaningless. RESULTS: We have developed an algorithm that takes the coordinates of all the exact matches or high-similarity local alignments, clusters them with respect to the main diagonal in the dot plot using a weighted linear regression technique, and identifies the starting and ending coordinates of the region of interest. CONCLUSION: This algorithm complements existing pairwise sequence alignment packages by replacing the time-consuming seed-extension phase with a weighted linear regression over the alignment seeds. Experiments show that the gain in execution time can be substantial without compromising accuracy. This method should be of great utility to sequence assembly and genome comparison projects.
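
    The summary above does not spell out the exact weighting scheme; the core idea, fitting a weighted diagonal through the alignment seeds and trimming outliers, can be sketched as follows, where the seed format and residual threshold are assumptions:

```python
def estimate_region(seeds, max_residual=500.0):
    """Estimate where a contig lands on the reference from alignment seeds.

    seeds: list of (ref_pos, contig_pos, length) exact matches.
    Fits contig_pos ~= a*ref_pos + b by weighted least squares (weight =
    match length), drops seeds far from the fitted diagonal, and reports
    the reference interval spanned by the surviving seeds.
    """
    def wls(points):
        sw   = sum(w for _, _, w in points)
        swx  = sum(w * x for x, _, w in points)
        swy  = sum(w * y for _, y, w in points)
        swxx = sum(w * x * x for x, _, w in points)
        swxy = sum(w * x * y for x, y, w in points)
        denom = sw * swxx - swx * swx
        if denom == 0:                      # degenerate: seeds share one x
            return 1.0, (swy - swx) / sw    # fall back to a unit-slope diagonal
        a = (sw * swxy - swx * swy) / denom
        return a, (swy - a * swx) / sw

    a, b = wls(seeds)
    kept = [(x, y, w) for x, y, w in seeds
            if abs(y - (a * x + b)) <= max_residual]
    if kept:
        seeds = kept                        # refit only on the inliers
    start = min(x for x, _, _ in seeds)
    end   = max(x + w for x, _, w in seeds) # seed length extends the span
    return start, end
```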

    Space-efficient merging of succinct de Bruijn graphs

    We propose a new algorithm for merging succinct representations of de Bruijn graphs introduced in [Bowe et al., WABI 2012]. Our algorithm is based on the lightweight BWT merging approach of Holt and McMillan [Bioinformatics 2014, ACM-BCB 2014]. It has the same asymptotic cost as the state-of-the-art tool for the same problem by Muggli et al. [bioRxiv 2017, Bioinformatics 2019], but it uses less than half of its working space. An important novel feature of our algorithm, not found in any existing tool, is that it can compute the variable-order succinct representation of the union graph within the same asymptotic time/space bounds. (Accepted to SPIRE'19.)
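
    The variable-order construction is too involved to sketch here, but the Holt-McMillan merging idea the algorithm builds on can be illustrated on plain, non-succinct BWTs: refine an interleaving of the two BWTs by stable, character-driven passes until it stabilizes. This assumes each input string ends with its own sentinel that sorts before every other character:

```python
def merge_bwts(bwt1, bwt2):
    """Merge two BWTs by iterated refinement of their interleaving
    (Holt-McMillan style), on plain strings rather than succinct structures.
    """
    bwts = (bwt1, bwt2)
    inter = [0] * len(bwt1) + [1] * len(bwt2)   # source of each merged row
    while True:
        # Read the BWT characters in the current merged order.
        pos, chars = [0, 0], []
        for src in inter:
            chars.append(bwts[src][pos[src]])
            pos[src] += 1
        # One stable bucketing pass: rows preceded by a smaller character
        # move ahead (this is the LF-mapping step in disguise).
        buckets = {}
        for src, c in zip(inter, chars):
            buckets.setdefault(c, []).append(src)
        refined = [s for c in sorted(buckets) for s in buckets[c]]
        if refined == inter:                    # fixed point: order is final
            return "".join(chars)
        inter = refined
```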

    SEARCHPATTOOL: a new method for mining the most specific frequent patterns for binding sites with application to prokaryotic DNA sequences

    BACKGROUND: Computational methods to predict transcription factor binding sites (TFBS) based on exhaustive algorithms are guaranteed to find the best patterns but are often limited to short ones or impose constraints on the pattern type. Many binding-site patterns in prokaryotic species are not well characterized but are known to be large, between 16 and 30 base pairs (bp), and to contain at least 2 conserved bases. The length of prokaryotic promoters (about 400 bp) and our interest in studying small sets of genes that could form clusters of co-regulated genes from microarray experiments led to the development of a new exhaustive algorithm targeting these large patterns. RESULTS: We present Searchpattool, a new method to search for and select the most specific (conservative) frequent patterns. This method imposes no restrictions on the density or structure of the pattern. The best patterns (motifs) are selected using several statistics, including a new application of a z-score based on the number of matching sequences. We compared Searchpattool against other well-known algorithms on a Bacillus subtilis group of 14 input sequences, and in our experiments Searchpattool always performed best in terms of performance scores. CONCLUSION: Searchpattool is a new method for discovering patterns related to transcription factor binding sites in species or genes with short promoters. It outputs the most specific significant patterns and helps the biologist choose the best candidates.
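
    The paper's exact statistic is not reproduced above; a plausible z-score on the number of input sequences matching a pattern, under a simple binomial null model (the background probability p is an assumed input, estimated separately), looks like:

```python
from math import sqrt

def sequence_count_zscore(matching, total, p):
    """z-score for the number of sequences containing a pattern.

    matching: sequences in which the pattern occurs at least once
    total:    number of input sequences
    p:        background probability that a random sequence of the same
              length contains the pattern
    Under a binomial null model the count has mean total*p and
    variance total*p*(1-p).
    """
    mean = total * p
    var = total * p * (1.0 - p)
    return (matching - mean) / sqrt(var) if var > 0 else float("inf")
```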

    Reconstructing cancer genomes from paired-end sequencing data

    BACKGROUND: A cancer genome is derived from the germline genome through a series of somatic mutations. Somatic structural variants - including duplications, deletions, inversions, translocations, and other rearrangements - result in a cancer genome that is a scrambling of intervals, or "blocks", of the germline genome sequence. We present an efficient algorithm for reconstructing the block organization of a cancer genome from paired-end DNA sequencing data. RESULTS: By aligning paired reads from a cancer genome - and a matched germline genome, if available - to the human reference genome, we derive: (i) a partition of the reference genome into intervals; (ii) adjacencies between these intervals in the cancer genome; (iii) an estimated copy number for each interval. We formulate the Copy Number and Adjacency Genome Reconstruction Problem of determining the cancer genome as a sequence of the derived intervals that is consistent with the measured adjacencies and copy numbers. We design an efficient algorithm, called Paired-end Reconstruction of Genome Organization (PREGO), to solve this problem by reducing it to an optimization problem on an interval-adjacency graph constructed from the data. The solution to the optimization problem results in an Eulerian graph, containing an alternating Eulerian tour that corresponds to a cancer genome consistent with the sequencing data. We apply our algorithm to five ovarian cancer genomes that were sequenced as part of The Cancer Genome Atlas. We identify numerous rearrangements, or structural variants, in these genomes, analyze reciprocal vs. non-reciprocal rearrangements, and identify rearrangements consistent with known mechanisms of duplication such as tandem duplications and breakage/fusion/bridge (B/F/B) cycles. CONCLUSIONS: We demonstrate that PREGO efficiently identifies complex and biologically relevant rearrangements in cancer genome sequencing data. An implementation of the PREGO algorithm is available at http://compbio.cs.brown.edu/software/.
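
    PREGO's optimization is more involved than can be shown here, but the interval-adjacency graph it operates on, and the degree balance an alternating Eulerian tour requires, can be sketched as follows (names and input encoding are illustrative; telomeres are ignored for brevity):

```python
from collections import defaultdict

def interval_adjacency_graph(intervals, adjacencies, copy_number):
    """Build degree counts for a multigraph over interval endpoints.

    intervals:   list of (start_id, end_id) endpoint pairs, one per
                 reference interval
    adjacencies: list of (endpoint, endpoint) joins measured from
                 paired-end reads
    copy_number: dict interval index -> estimated copies
    Interval edges are replicated by copy number; a genome traversal
    alternating interval and adjacency edges requires every endpoint
    to have matching interval-degree and adjacency-degree.
    """
    ideg, adeg = defaultdict(int), defaultdict(int)
    for i, (s, e) in enumerate(intervals):
        ideg[s] += copy_number[i]
        ideg[e] += copy_number[i]
    for u, v in adjacencies:
        adeg[u] += 1
        adeg[v] += 1
    # Endpoints where the two degrees disagree cannot lie on an
    # alternating Eulerian tour; the optimization adjusts counts there.
    unbalanced = [x for x in set(ideg) | set(adeg) if ideg[x] != adeg[x]]
    return ideg, adeg, unbalanced
```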

    Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs

    BACKGROUND: Assembling genomic sequences from a set of overlapping reads is one of the most fundamental problems in computational biology. Algorithms addressing the assembly problem fall into two broad categories based on the data structures they employ: the first class uses an overlap/string graph and the second uses a de Bruijn graph. With the recent advances in short-read sequencing technology, however, de Bruijn graph based algorithms seem to play a vital role in practice. Efficient algorithms for building these massive de Bruijn graphs are essential in large sequencing projects based on short reads. In an earlier work, an O(n/p) time parallel algorithm was given for this problem, where n is the size of the input and p is the number of processors. That algorithm enumerates all possible bi-directed edges which can overlap with a node and ends up generating Θ(nΣ) messages (Σ being the size of the alphabet). RESULTS: In this paper we present a Θ(n/p) time parallel algorithm with a communication complexity equal to that of parallel sorting and not sensitive to Σ. The generality of our algorithm makes it easy to extend even to the out-of-core model, where it has an optimal I/O complexity of Θ((n log(n/B)) / (B log(M/B))) (M being the main memory size and B the disk block size). We demonstrate the scalability of our parallel algorithm on an SGI/Altix computer. A comparison of our algorithm with previous approaches reveals that it is faster, both asymptotically and practically. We demonstrate the scalability of our sequential out-of-core algorithm by comparing it with the algorithm used by VELVET to build the bi-directed de Bruijn graph. Our experiments reveal that our algorithm can build the graph with a constant amount of memory, clearly outperforming VELVET. We also provide efficient algorithms for the bi-directed chain compaction problem. CONCLUSIONS: The bi-directed de Bruijn graph is a fundamental data structure for any sequence assembly program based on the Eulerian approach. Our algorithms for constructing bi-directed de Bruijn graphs are efficient in parallel and out-of-core settings and can be used to build large-scale bi-directed de Bruijn graphs. Furthermore, our algorithms do not employ any all-to-all communication in the parallel setting and perform better than prior algorithms. Finally, our out-of-core algorithm is extremely memory efficient and can replace the existing graph construction algorithm in VELVET.
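
    Setting the parallel and out-of-core machinery aside, the reduction to sorting at the heart of this approach can be shown sequentially: emit one canonical (k+1)-mer per edge, then sort and deduplicate. The sort is the step the paper performs with a parallel or external-memory sorter:

```python
def bidirected_edges(reads, k):
    """Edges of a bi-directed de Bruijn graph via sorting.

    Each (k+1)-mer of a read is an edge between its two k-mers; taking
    the lexicographically smaller of the (k+1)-mer and its reverse
    complement identifies the two strands, so sorting followed by
    deduplication yields each bi-directed edge exactly once.
    """
    comp = str.maketrans("ACGT", "TGCA")

    def canonical(s):
        rc = s.translate(comp)[::-1]
        return min(s, rc)

    edges = []
    for read in reads:
        for i in range(len(read) - k):
            edges.append(canonical(read[i : i + k + 1]))
    edges.sort()                 # the step done by the parallel/external sort
    return [e for j, e in enumerate(edges) if j == 0 or e != edges[j - 1]]
```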

    Bioinformatics : indispensable, yet hidden in plain sight?

    BACKGROUND: Bioinformatics has multitudinous identities, organisational alignments and disciplinary links. This variety allows bioinformaticians and bioinformatic work to contribute to much (if not most) of life science research in profound ways. The multitude of bioinformatic work also translates into a multitude of credit-distribution arrangements, apparently dismissing that work. RESULTS: We report on the epistemic and social arrangements that characterise the relationship between bioinformatics and life science. We describe, in sociological terms, the character, power and future of bioinformatic work. The character of bioinformatic work is such that its cultural, institutional and technical structures allow it to be black-boxed easily. The result is that bioinformatic expertise and contributions travel easily and quickly, yet remain largely uncredited. The power of bioinformatic work is shaped by its dependency on life science work, which, combined with the black-boxed character of bioinformatic expertise, further contributes to situating bioinformatics on the periphery of the life sciences. Finally, the imagined futures of bioinformatic work suggest that bioinformatics will become ever more indispensable without necessarily becoming more visible, forcing bioinformaticians into difficult professional and career choices. CONCLUSIONS: Bioinformatic expertise and labour are epistemically central but often institutionally peripheral. In part, this is a result of the ways in which the character, power distribution and potential futures of bioinformatics are constituted. However, alternative paths can be imagined.

    Assembly complexity of prokaryotic genomes using short reads

    BACKGROUND: De Bruijn graphs are a theoretical framework underlying several modern genome assembly programs, especially those that deal with very short reads. We describe an application of de Bruijn graphs to analyze the global repeat structure of prokaryotic genomes. RESULTS: We provide the first survey of the repeat structure of a large number of genomes. The analysis gives an upper bound on the performance of genome assemblers for de novo reconstruction of genomes across a wide range of read lengths. Further, we demonstrate that the majority of genes in prokaryotic genomes can be reconstructed uniquely using very short reads even if the genomes themselves cannot. The non-reconstructible genes are overwhelmingly related to mobile elements (transposons, IS elements, and prophages). CONCLUSIONS: Our results improve upon previous studies of the feasibility of assembly with short reads and provide a comprehensive benchmark against which to compare the performance of the short-read assemblers currently being developed.

    Cinteny: flexible analysis and visualization of synteny and genome rearrangements in multiple organisms

    BACKGROUND: Identifying syntenic regions, i.e., blocks of genes or other markers with evolutionarily conserved order, and quantifying evolutionary relatedness between genomes in terms of chromosomal rearrangements is one of the central goals in comparative genomics. However, the analysis of synteny and the resulting assessment of genome rearrangements are sensitive to the choice of a number of arbitrary parameters that affect the detection of synteny blocks. In particular, the choice of a set of markers and the effect of different aggregation strategies, which enable coarse graining of synteny blocks and exclusion of micro-rearrangements, need to be assessed. Therefore, existing tools and resources that facilitate identification, visualization and analysis of synteny need to be further improved to provide a flexible platform for such analysis, especially in the context of multiple genomes. RESULTS: We present a new tool, Cinteny, for fast identification and analysis of synteny with different sets of markers and various levels of coarse graining of syntenic blocks. Using the Hannenhalli-Pevzner approach and its extensions, Cinteny also enables interactive determination of evolutionary relationships between genomes in terms of the number of rearrangements (the reversal distance). In particular, Cinteny provides: i) integration of synteny browsing with assessment of evolutionary distances for multiple genomes; ii) flexibility to adjust the parameters and re-compute the results on the fly; iii) the ability to work with user-provided data, such as orthologous genes, sequence tags or other conserved markers. In addition, Cinteny provides many annotated mammalian, invertebrate and fungal genomes that are pre-loaded and available for analysis. CONCLUSION: Cinteny allows one to automatically compare multiple genomes and perform sensitivity analysis for synteny block detection and for the subsequent computation of reversal distances. Cinteny can also be used to interactively browse syntenic blocks conserved in multiple genomes, to facilitate genome annotation and validation of assemblies for newly sequenced genomes, and to construct and assess phylogenomic trees.
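
    The full Hannenhalli-Pevzner computation is lengthy, but its central quantity, the number of cycles c in the breakpoint graph of a signed permutation of n blocks, already yields the classical bound d >= n + 1 - c on the reversal distance (tight in the absence of hurdles). A self-contained sketch:

```python
def breakpoint_cycles(perm):
    """Count cycles in the breakpoint graph of a signed permutation.

    perm: signed permutation of 1..n, e.g. [3, -1, 2].
    Returns (n, c); the reversal distance satisfies d >= n + 1 - c.
    """
    n = len(perm)
    seq = [0]
    for x in perm:                      # +x -> 2x-1,2x ; -x -> 2x,2x-1
        seq += [2 * x - 1, 2 * x] if x > 0 else [-2 * x, -2 * x - 1]
    seq.append(2 * n + 1)

    black = {}                          # reality edges: adjacent in seq
    for i in range(0, len(seq), 2):
        black[seq[i]], black[seq[i + 1]] = seq[i + 1], seq[i]
    gray = {}                           # desire edges: consecutive values
    for v in range(0, 2 * n + 2, 2):
        gray[v], gray[v + 1] = v + 1, v

    seen, cycles = set(), 0
    for start in seq:
        if start in seen:
            continue
        cycles += 1
        v, use_black = start, True      # alternate black and gray edges
        while True:
            seen.add(v)
            v = black[v] if use_black else gray[v]
            use_black = not use_black
            if v == start and use_black:
                break
    return n, cycles
```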

    Sequencing by Hybridization of Long Targets

    Sequencing by Hybridization (SBH) reconstructs an n-long target DNA sequence from its biochemically determined l-long subsequences. In the standard approach, the length of a uniformly random sequence that can be unambiguously reconstructed is sharply limited, because repetitive subsequences cause reconstruction degeneracies. We present a modified sequencing method that overcomes this limitation without the need for different types of biochemical assays and is robust to errors.
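
    The paper's modified method is not detailed above; the classical reconstruction it improves on treats the l-mer spectrum as edges of a de Bruijn graph over (l-1)-mers and spells the target from an Eulerian path, e.g.:

```python
from collections import defaultdict

def sbh_reconstruct(spectrum):
    """Classical SBH: rebuild a sequence from its l-mer spectrum.

    Each l-mer is an edge from its (l-1)-prefix to its (l-1)-suffix;
    an Eulerian path spells a sequence with exactly that spectrum
    (unique only when repeats create no alternative paths).
    """
    graph = defaultdict(list)
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for mer in spectrum:
        u, v = mer[:-1], mer[1:]
        graph[u].append(v)
        out_deg[u] += 1
        in_deg[v] += 1

    # Start where out-degree exceeds in-degree, if such a node exists.
    start = next((u for u in graph if out_deg[u] - in_deg[u] == 1),
                 next(iter(graph)))

    # Hierholzer's algorithm for an Eulerian path.
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(v[-1] for v in path[1:])
```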