Cold Spring Harbor Laboratory Institutional Repository

Digital Repository at the University of Maryland

Improving Phrap-Based Assembly of the Rat Using “Reliable” Overlaps

Author: AL Delcher
Aleksey V. Zimin
B Ewing
Brian R. Hunt
Cevat Ustun
EW Myers
GG Sutton
James R. White
James Yorke
JC Mullikin
M Roberts
Michael Roberts
Neil Hall
P Green
P Havlak
Paul Havlak
S Aparicio
S Batzoglou
S Schwartz
SL Salzberg
Wayne Hayes
X Huang
Publication venue: Public Library of Science (PLoS)
Publication date: 01/01/2008
The assembly methods used for whole-genome shotgun (WGS) data have a major impact on the quality of resulting draft genomes. We present a novel algorithm to generate a set of “reliable” overlaps based on identifying repeat k-mers. To demonstrate the benefits of using reliable overlaps, we have created a version of the Phrap assembly program that uses only overlaps from a specific list. We call this version PhrapUMD. Integrating PhrapUMD and our “reliable-overlap” algorithm with the Baylor College of Medicine assembler, Atlas, we assemble the BACs from the Rattus norvegicus genome project. Starting with the same data as the Nov. 2002 Atlas assembly, we compare our results and the Atlas assembly to the 4.3 Mb of rat sequence in the 21 BACs that have been finished. Our version of the draft assembly of the 21 BACs increases the coverage of finished sequence from 93.4% to 96.3%, while simultaneously reducing the base error rate from 4.5 to 1.1 errors per 10,000 bases. There are a number of ways of assessing the relative merits of assemblies when the finished sequence is available. If one views the overall quality of an assembly as proportional to the inverse of the product of the error rate and sequence missed, then the assembly presented here is seven times better. The UMD Overlapper with options for reliable overlaps is available from the authors at http://www.genome.umd.edu. We also provide the changes to the Phrap source code enabling it to use only the reliable overlaps
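The "seven times better" figure can be checked from the numbers quoted in the abstract. The sketch below assumes that "sequence missed" means one minus the coverage of finished sequence; the function name `quality_ratio` is hypothetical and is not part of the authors' software.

```python
def quality_ratio(cov_old, err_old, cov_new, err_new):
    """Return new quality / old quality, with quality = 1 / (error rate * fraction missed).

    Assumption: "sequence missed" is taken to be 1 - coverage.
    Error rates are in errors per 10,000 bases; the unit cancels in the ratio.
    """
    missed_old = 1.0 - cov_old
    missed_new = 1.0 - cov_new
    return (err_old * missed_old) / (err_new * missed_new)


# Figures from the abstract: coverage 93.4% -> 96.3%, errors 4.5 -> 1.1 per 10,000 bases.
ratio = quality_ratio(0.934, 4.5, 0.963, 1.1)
print(f"{ratio:.1f}")  # about 7.3
```

Under this reading the ratio is (4.5 × 6.6) / (1.1 × 3.7), roughly 7.3, which is consistent with the abstract's rounded claim.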

CiteSeerX

Public Library of Science (PLOS)

eScholarship - University of California

Caltech Authors

Genome re-annotation: a wiki solution?

Author: AL Delcher
AV Lukashin
JC Venter
JD Peterson
O White
RD Fleischmann
SF Altschul
SR Eddy
Steven L Salzberg
The International Human Genome Sequencing Consortium
Publication venue: BioMed Central
Publication date: 01/02/2007
The annotation of most genomes becomes outdated over time, owing in part to our ever-improving knowledge of genomes and in part to improvements in bioinformatics software. Unfortunately, annotation is rarely if ever updated and resources to support routine reannotation are scarce. Wiki software, which would allow many scientists to edit each genome's annotation, offers one possible solution

Digital Repository at the University of Maryland

Multiple organism algorithm for finding ultraconserved elements

Author: A Sandelin
A Siepel
A Woolfe
AL Delcher
B Ma
CF Cheung
D Gusfield
D Lawson
EA Glazov
EH Margulies
G Bejerano
Greg Madey
HW Mewes
JC Venter
JZ Ni
LD Stein
M Brudno
MI Abouelhoda
N Bray
Neil F Lobo
P Ferragina
RA Holt
S Kurtz
S Schwartz
Scott Christley
SF Altschul
T Tran
TJP Hubbard
U Manber
WJ Kent
Publication venue: BioMed Central
Publication date: 01/01/2008
Background: Ultraconserved elements are nucleotide or protein sequences with 100% identity (no mismatches, insertions, or deletions) in the same organism or between two or more organisms. Studies indicate that these conserved regions are associated with micro RNAs, mRNA processing, development and transcription regulation. The identification and characterization of these elements among genomes is necessary for the further understanding of their functionality. Results: We describe an algorithm and provide freely available software which can find all of the ultraconserved sequences between genomes of multiple organisms. Our algorithm takes a combinatorial approach that finds all sequences without requiring the genomes to be aligned. The algorithm is significantly faster than BLAST and is designed to handle very large genomes efficiently. We ran our algorithm on several large comparative analyses to evaluate its effectiveness; one compared 17 vertebrate genomes where we find 123 ultraconserved elements longer than 40 bps shared by all of the organisms, and another compared the human body louse, Pediculus humanus humanus, against itself and select insects to find thousands of non-coding, potentially functional sequences. Conclusion: Whole genome comparative analysis for multiple organisms is both feasible and desirable in our search for biological knowledge. We argue that bioinformatic programs should be forward thinking by assuming analysis on multiple (and possibly large) genomes in the design and implementation of algorithms. Our algorithm shows how a compromise design with a trade-off of disk space versus memory space allows for efficient computation while only requiring modest computer resources, and at the same time providing benefits not available with other software.
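To make the idea of alignment-free, exact-match search concrete, here is a minimal sketch that finds the k-mers present in every input sequence by set intersection. This is not the authors' algorithm, which the abstract describes only as a combinatorial approach with a disk-versus-memory trade-off; the function names and toy sequences are hypothetical, and reverse complements are ignored for brevity.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_exact_kmers(sequences, k):
    """Return the k-mers that occur, with 100% identity, in every sequence."""
    shared = kmers(sequences[0], k)
    for seq in sequences[1:]:
        shared &= kmers(seq, k)  # keep only k-mers also present in this sequence
    return shared


# Toy data: the three strings share the 7-mer ACGTGAC and no other 7-mer.
toy = ["TTACGTGACCA", "GGACGTGACTT", "CACGTGACGG"]
result = sorted(shared_exact_kmers(toy, 7))
print(result)
```

Holding every k-mer set in memory would not scale to vertebrate genomes, which is presumably why the abstract stresses a design that trades disk space for memory.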

CGAT: a comparative genome analysis tool for visualizing alignments in the analysis of complex evolutionary changes between closely related genomes

Author: A Chinen
A Nobusato
A van Belkum
AL Delcher
B Gottgens
B Ma
C Josenhans
D Gusfield
D Romero
DA Nix
DA Pollard
E Gilson
F Kunst
FR Blattner
G Levinson
H Takami
I Uchiyama
Ichizo Kobayashi
Ikuo Uchiyama
J Parkhill
J Yang
JF Tomb
JH Choi
JM Claverie
K Ishikawa
KA Frazer
M Brudno
M Kawai
MY Leung
N Bray
N Jareborg
NA Moran
NJ Saunders
P Siguier
RA Alm
S Karlin
S Schwartz
SB Needleman
SF Altschul
T Hayashi
T Tsuru
TJ Carver
Toshio Higuchi
U Dobrindt
W Huang
WJ Kent
WR Pearson
Z Ning
Z Zhang
Publication venue: BioMed Central
Publication date: 01/01/2006
BACKGROUND: The recent accumulation of closely related genomic sequences provides a valuable resource for the elucidation of the evolutionary histories of various organisms. However, although numerous alignment calculation and visualization tools have been developed to date, the analysis of complex genomic changes, such as large insertions, deletions, inversions, translocations and duplications, still presents certain difficulties. RESULTS: We have developed a comparative genome analysis tool, named CGAT, which allows detailed comparisons of closely related bacteria-sized genomes mainly through visualizing middle-to-large-scale changes to infer underlying mechanisms. CGAT displays precomputed pairwise genome alignments on both dotplot and alignment viewers with scrolling and zooming functions, and allows users to move along the pre-identified orthologous alignments. Users can place several types of information on this alignment, such as the presence of tandem repeats or interspersed repetitive sequences and changes in G+C contents or codon usage bias, thereby facilitating the interpretation of the observed genomic changes. In addition to displaying precomputed alignments, the viewer can dynamically calculate the alignments between specified regions; this feature is especially useful for examining the alignment boundaries, as these boundaries are often obscure and can vary between programs. Besides the alignment browser functionalities, CGAT also contains an alignment data construction module, which contains various procedures that are commonly used for pre- and post-processing for large-scale alignment calculation, such as the split-and-merge protocol for calculating long alignments, chaining adjacent alignments, and ortholog identification. Indeed, CGAT provides a general framework for the calculation of genome-scale alignments using various existing programs as alignment engines, which allows users to compare the outputs of different alignment programs. 
Earlier versions of this program have been used successfully in our research to infer the evolutionary history of apparently complex genome changes between closely related eubacteria and archaea. CONCLUSION: CGAT is a practical tool for analyzing complex genomic changes between closely related genomes using existing alignment programs and other sequence analysis tools combined with extensive manual inspection

Context-driven discovery of gene cassettes in mobile integrons using a computational grammar

Author: A Moura
ACE Darling
AL Delcher
CJ van Rijsbergen
D Frishman
DA Rowe-Magnus
DB Searls
E Rivas
Enrico Coiera
F Baquero
F Meyer
Guy Tsafnat
H Quesneville
HW Stokes
IT Paulsen
J Fleiss
J Landis
Jaron Schaeffer
Jon R Iredell
K Rutherford
L Stein
M Ashburner
M Kanehisa
MA Andrade
MJ Joss
R Overbeek
RM Hall
RS Levings
S Ji
S Leung
Sally R Partridge
SF Altschul
SR Partridge
U Bohnebeck
WR Pearson
Y Boucher
Publication venue: BioMed Central
Publication date: 01/01/2009
Background: Gene discovery algorithms typically examine sequence data for low level patterns. A novel method to computationally discover higher order DNA structures is presented, using a context sensitive grammar. The algorithm was applied to the discovery of gene cassettes associated with integrons. The discovery and annotation of antibiotic resistance genes in such cassettes is essential for effective monitoring of antibiotic resistance patterns and formulation of public health antibiotic prescription policies. Results: We discovered two new putative gene cassettes using the method, from 276 integron features and 978 GenBank sequences. The system achieved κ = 0.972 annotation agreement with an expert gold standard of 300 sequences. In rediscovery experiments, we deleted 789,196 cassette instances over 2030 experiments and correctly relabelled 85.6% (α ≥ 95%, E ≤ 1%, mean sensitivity = 0.86, specificity = 1, F-score = 0.93), with no false positives. Error analysis demonstrated that for 72,338 missed deletions, two adjacent deleted cassettes were labeled as a single cassette, increasing performance to 94.8% (mean sensitivity = 0.92, specificity = 1, F-score = 0.96). Conclusion: Using grammars we were able to represent heuristic background knowledge about large and complex structures in DNA. Importantly, we were also able to use the context embedded in the model to discover new putative antibiotic resistance gene cassettes. The method is complementary to existing automatic annotation systems which operate at the sequence level.

Macquarie University ResearchOnline

The genome and transcriptome of Trichormus sp NMC-1: insights into adaptation to extreme environments on the Qinghai-Tibet Plateau

Author: A Stamatakis
A Zorina
AL Delcher
B Langmead
BA Methé
C Xie
DA Los
DJ Wright
EP Balskus
G Blanc
G Norsang
HÄ Suh
J Qi
J Zhang
JF Hess
JI Carreto
JM Shick
JP Zehr
K Mavromatis
KS Siddiqui
L Li
L R
L Ran
M Borodovsky
M Dassanayake
M Li
M Suyama
N Myers
P Pereira
P Puigbò
P Rajaniemi
PH Sudmant
PM Shih
Q Qiu
Q Tang
R Cavicchioli
RC Edgar
RL Tatusov
S Richter
SP Singh
T De Bie
T Kaneko
T Kogej
T Shi
U Consortium
U Nübel
WM Fitch
Z Xu
Z Yang
ZA Cheviron
Publication venue: Springer Science and Business Media LLC
Publication date: 06/07/2016
The Qinghai-Tibet Plateau (QTP) has the highest biodiversity for an extreme environment worldwide, and provides an ideal natural laboratory to study adaptive evolution. In this study, we generated a draft genome sequence of cyanobacteria Trichormus sp. NMC-1 in the QTP and performed whole transcriptome sequencing under low temperature to investigate the genetic mechanism by which T. sp. NMC-1 adapted to the specific environment. Its genome sequence was 5.9 Mb with a G+C content of 39.2% and encompassed a total of 5362 CDS. A phylogenomic tree indicated that this strain belongs to the Trichormus and Anabaena cluster. Genome comparison between T. sp. NMC-1 and six relatives showed that functionally unknown genes occupied a much higher proportion (28.12%) of the T. sp. NMC-1 genome. In addition, functions of specific, significant positively selected, expanded orthogroups, and differentially expressed genes involved in signal transduction, cell wall/membrane biogenesis, secondary metabolite biosynthesis, and energy production and conversion were analyzed to elucidate specific adaptation traits. Further analyses showed that the CheY-like genes, extracellular polysaccharide and mycosporine-like amino acids might play major roles in adaptation to harsh environments. Our findings indicate that sophisticated genetic mechanisms are involved in cyanobacterial adaptation to the extreme environment of the QTP
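For readers unfamiliar with the G+C statistic quoted above, it is the fraction of bases in a sequence that are G or C. The following is a minimal sketch on a toy string; the function name `gc_content` is hypothetical and the example sequence is not taken from the Trichormus sp. NMC-1 genome.

```python
def gc_content(seq):
    """Return the fraction of bases in seq that are G or C (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    gc = sum(1 for base in seq if base in "GC")
    return gc / len(seq)


# Toy example: 4 of the 6 bases are G or C.
toy_gc = gc_content("ATGCGC")
print(f"{toy_gc:.1%}")
```

A whole-genome value such as the 39.2% reported here would be obtained by applying the same count across all assembled contigs.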

Institute of Hydrobiology, Chinese Academy Of Sciences

University of Bedfordshire Repository

Longest Increasing Subsequence under Persistent Comparison Errors

Author: AL Delcher
Barbara Geissmann
CN Potts
D Aldous
E Bachmat
H Zhang
I Yang
J Baik
L Alonso
M Ajtai
M Crochemore
ML Fredman
P Hadjicostas
Peter Damaschke
R Klein
S Bespamyatnikh
S Funke
U Feige
WJ Masek
Publication date: 01/01/2018
We study the problem of computing a longest increasing subsequence in a sequence $S$ of $n$ distinct elements in the presence of persistent comparison errors. In this model, every comparison between two elements can return the wrong result with some fixed (small) probability $p$, and comparisons cannot be repeated. Computing the longest increasing subsequence exactly is impossible in this model; therefore, the objective is to identify a subsequence that (i) is indeed increasing and (ii) has a length that approximates the length of the longest increasing subsequence. We present asymptotically tight upper and lower bounds on both the approximation factor and the running time. In particular, we present an algorithm that computes an $O(\log n)$-approximation in time $O(n \log n)$, with high probability. This approximation relies on the fact that we can approximately sort $n$ elements in $O(n \log n)$ time such that the maximum dislocation of an element is at most $O(\log n)$. For the lower bounds, we prove that (i) there is a set of sequences, such that on a sequence picked randomly from this set every algorithm must return an $\Omega(\log n)$-approximation with high probability, and (ii) any $O(\log n)$-approximation algorithm for longest increasing subsequence requires $\Omega(n \log n)$ comparisons, even in the absence of errors.
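As a point of reference for the bounds above, the classical error-free problem can be solved exactly in $O(n \log n)$ time with binary search over subsequence tails. The sketch below shows that standard technique only; it assumes every comparison is correct and is not the error-tolerant algorithm of the paper. The function name is hypothetical.

```python
from bisect import bisect_left


def longest_increasing_subsequence(seq):
    """Return one longest strictly increasing subsequence of seq in O(n log n) time."""
    tails = []      # tails[k]: smallest last value of any increasing subsequence of length k + 1
    tails_idx = []  # tails_idx[k]: position in seq of tails[k]
    prev = [-1] * len(seq)  # prev[i]: position of the element preceding seq[i] in its subsequence
    for i, x in enumerate(seq):
        k = bisect_left(tails, x)  # first length slot whose tail is >= x
        if k == len(tails):
            tails.append(x)
            tails_idx.append(i)
        else:
            tails[k] = x
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else -1
    # Walk the predecessor links back from the tail of the longest subsequence.
    out = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        out.append(seq[i])
        i = prev[i]
    return out[::-1]


lis = longest_increasing_subsequence([5, 2, 8, 6, 3, 7, 9, 1])
print(lis)
```

Each element triggers one binary search, which matches the $\Omega(n \log n)$ comparison lower bound the abstract states for the error-free setting.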

arXiv.org e-Print Archive

Repository for Publications and Research Data

GenePRIMP: a gene prediction improvement pipeline for prokaryotic genomes

Author: A Nagy
AL Delcher
Amrita Pati
Athanasios Lykidis
DA Benson
Galina Ovchinnikova
GX Yu
HQ Zhu
J Besemer
KL Smollett
M Tech
Natalia Mikhailova
Natalia N Ivanova
NC Kyrpides
NE Castellana
Nikos C Kyrpides
RK Aziz
S Bocs
Sean D Hooper
VM Markowitz
Y Ishino
Publication venue: Springer Science and Business Media LLC
Publication date: 01/04/2010
We present 'gene prediction improvement pipeline' (GenePRIMP; http://geneprimp.jgi-psf.org/), a computational process that performs evidence-based evaluation of gene models in prokaryotic genomes and reports anomalies including inconsistent start sites, missed genes and split genes. We found that manual curation of gene models using the anomaly reports generated by GenePRIMP improved their quality, and demonstrate the applicability of GenePRIMP in improving finishing quality and comparing different genome-sequencing and annotation technologies

UNT Digital Library

Sequence and annotation of the Wizard007 mycobacterium phage genome

Author: AL Delcher
Anthony Falcone
Benjamin Howard
Brittney Howard
Claire Rinehart
Courtney Howard
Cynthia Tope
D Gordon
Ejike Anyanwu
Elizabeth Farnsworth
Heidi Sayre
J Besemer
Jordan Olberding
Kaitlyn Cole
Karlee Driver
LD Stein
Mackenzie Perkins
Prasanna Tamarapu Parthasarathy
Rodney King
Sarah Schrader
SE Lewis
SF Altschul
TM Lowe
Tyler Scaff
Publication venue: BioMed Central
Publication date: 01/07/2010