eScholarship - University of California

Caltech Authors

Local alignment of two-base encoded DNA sequence

Author: A Izmailov
A Izmailov
B Ewing
B Ewing
B Ma
Barry Merriman
DR Powell
DR Smith
DS Hirschberg
EW Myers
H Li
N Jones
Nils Homer
O Gotoh
R Hamming
R Li
S Levy
SB Needleman
SF Altschul
SM Rumble
ST Sherry
Stanley F Nelson
TF Smith
VI Levenshtein
W Ewans
WJ Kent
X Huang
Z Ning
Publication venue: BioMed Central
Publication date: 01/01/2009

Abstract. Background: DNA sequence comparison is based on optimal local alignment of two sequences using a similarity score. However, some new DNA sequencing technologies do not directly measure the base sequence, but rather an encoded form, such as the two-base encoding considered here. In order to compare such data to a reference sequence, the data must be decoded into sequence. The decoding is deterministic, but the possibility of measurement errors requires searching among all possible error modes and resulting alignments to achieve an optimal balance of fewer errors versus greater sequence similarity. Results: We present an extension of the standard dynamic programming method for local alignment, which simultaneously decodes the data and performs the alignment, maximizing a similarity score based on a weighted combination of errors and edits, and allowing an affine gap penalty. We also present simulations that demonstrate the performance characteristics of our two-base encoded alignment method and contrast those with standard DNA sequence alignment under the same conditions. Conclusion: The new local alignment algorithm for two-base encoded data has substantial power to properly detect and correct measurement errors while identifying underlying sequence variants, thereby facilitating genome re-sequencing efforts based on this form of sequence data.
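The abstract's point that decoding is deterministic but sensitive to measurement errors can be illustrated with a minimal sketch. It assumes the commonly used color-space convention in which bases are numbered A=0, C=1, G=2, T=3 and each color is the XOR of two adjacent base codes; the abstract does not spell out the encoding table, so this mapping and the function names `encode` and `decode` are illustrative assumptions, not the authors' implementation.

```python
# Sketch of two-base (color-space) encoding under an ASSUMED convention:
# A=0, C=1, G=2, T=3, and color = code(base_i) XOR code(base_i+1).
BASES = "ACGT"
CODE = {base: index for index, base in enumerate(BASES)}


def encode(seq):
    """Return the list of colors for each adjacent pair of bases."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]


def decode(first_base, colors):
    """Deterministically rebuild the base sequence from a known first base."""
    out = [first_base]
    for color in colors:
        # Each base is fixed by the previous base and the current color.
        out.append(BASES[CODE[out[-1]] ^ color])
    return "".join(out)


colors = encode("ACGT")
clean = decode("A", colors)

# Simulate one measurement error in the second color.
bad_colors = list(colors)
bad_colors[1] ^= 1
corrupted = decode("A", bad_colors)
```

Under this convention a single wrong color changes every base decoded after it, which is why naive decoding followed by ordinary alignment is fragile and why the abstract describes decoding and aligning in one dynamic programming pass.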

Springer - Publisher Connector

Springer

Meraculous: De Novo Genome Assembly with Short Paired-End Reads

Author: A Edwards
A Edwards
B Ewing
D Hernandez
DA Wheeler
Daniel S. Rokhsar
DR Bentley
DR Bentley
DR Smith
DR Zerbino
DR Zerbino
ES Lander
EW Myers
EW Myers
EW Myers
Gary P. Schroth
GG Sutton
I Maccallum
Isaac Ho
J Butler
Jarrod A. Chapman
JC Roach
JL Weber
JT Simpson
K Hayashi
M Chaisson
M Margulies
M Pop
M Pop
MJ Chaisson
MJ Chaisson
ML Metzker
P Flicek
PA Pevzner
R Li
R Li
RL Warren
RM Idury
SC Schuster
SF Altschul
Shujun Luo
Sirisha Sunkara
Steven L. Salzberg
TW Jeffries
TW Jeffries
Publication venue: Public Library of Science
Publication date: 01/08/2011

We describe a new algorithm, meraculous, for whole genome assembly of deep paired-end short reads, and apply it to the assembly of a dataset of paired 75-bp Illumina reads derived from the 15.4 megabase genome of the haploid yeast Pichia stipitis. More than 95% of the genome is recovered, with no errors; half the assembled sequence is in contigs longer than 101 kilobases and in scaffolds longer than 269 kilobases. Incorporating fosmid ends recovers entire chromosomes. Meraculous relies on an efficient and conservative traversal of the subgraph of the k-mer (de Bruijn) graph of oligonucleotides with unique high quality extensions in the dataset, avoiding an explicit error correction step as used in other short-read assemblers. A novel memory-efficient hashing scheme is introduced. The resulting contigs are ordered and oriented using paired reads separated by ∼280 bp or ∼3.2 kbp, and many gaps between contigs can be closed using paired-end placements. Practical issues with the dataset are described, and prospects for assembling larger genomes are discussed.
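The idea of traversing only k-mers with a unique extension can be sketched as follows. This is a simplified illustration, not the meraculous implementation: it ignores base quality scores, reverse complements, and the memory-efficient hashing scheme mentioned in the abstract, and the names `right_extensions` and `extend_contig` are hypothetical.

```python
from collections import defaultdict


def right_extensions(reads, k):
    """Map each k-mer to the set of bases observed immediately after it."""
    ext = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            ext[read[i:i + k]].add(read[i + k])
    return ext


def extend_contig(seed, ext, k):
    """Extend a seed k-mer to the right while the next base is unambiguous."""
    contig = seed
    seen = {seed}
    while True:
        kmer = contig[-k:]
        candidates = ext.get(kmer, set())
        if len(candidates) != 1:
            break  # dead end or branch: stop conservatively
        base = next(iter(candidates))
        new_kmer = kmer[1:] + base
        if new_kmer in seen:
            break  # avoid looping forever on a cycle
        seen.add(new_kmer)
        contig += base
    return contig


reads = ["ACGTTG", "GTTGCA"]
unambiguous = extend_contig("ACG", right_extensions(reads, 3), 3)

# A third read introduces a second possible base after "CGT".
branching = extend_contig("ACG", right_extensions(reads + ["CGTAAA"], 3), 3)
```

Stopping at every ambiguous k-mer trades contig length for correctness, which is in the spirit of the conservative traversal the abstract describes.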

UNT Digital Library

Sequences, Annotation and Single Nucleotide Polymorphism of the Major Histocompatibility Complex in the Domestic Cat

Author: AJ Pearks Wilkerson
AL Roca
AL Roca
B Ewing
B Ewing
BH Koller
C Burge
CA Stewart
D Gordon
D Gordon
E Birney
EW Brown
Hans Ellegren
HL Niman
J Klein
J Loconto
J Pontius
JA Traherne
James C. Mullikin
JC Mullikin
JL Troyer
JL Troyer
K Okita
K Takahashi
M Krawczyk
MA Carpenter
N Yuhki
N Yuhki
N Yuhki
N Yuhki
N Yuhki
Naoya Yuhki
PJ van den Elsen
R Horton
R Horton
RJ Allcock
RM Younger
Robert Stephens
S Schwartz
S Schwartz
SF Altschul
SJ Wheelan
Stephen J. O'Brien
T Beck
TA Tatusova
Thomas Beck
TW Beck
V Klement
WA Nelson-Rees
WD Hardy
Publication venue: Public Library of Science
Publication date: 01/06/2008

Two sequences of major histocompatibility complex (MHC) regions in the domestic cat, 2.976 and 0.362 Mbp, which were separated by an ancient chromosome break (55–80 MYA) followed by a chromosomal inversion, were annotated in detail. Gene annotation of this MHC was completed and identified 183 possible coding regions (147 human homologues, possible functional genes, and 36 pseudo/unidentified genes) using the GENSCAN, BLASTN, BLASTP and RepeatMasker programs. The first region spans 2.976 Mbp of sequence, which encodes six classical class II antigens (three DRA and three DRB antigens) lacking the functional DP and DQ regions, nine antigen processing molecules (DOA/DOB, DMA/DMB, TAPASIN, and LMP2/LMP7, TAP1/TAP2), 52 class III genes, and nineteen class I genes/gene fragments (FLAI-A to FLAI-S). Three class I genes (FLAI-H, I-K, I-E) may encode functional classical class I antigens based on deduced amino acid sequence and promoter structure. The second region spans 0.362 Mbp of sequence encoding no class I genes and 18 cross-species conserved genes, excluding class I, II and their functionally related/associated genes, namely framework genes, including three olfactory receptor genes. One previously identified feline endogenous retrovirus, a baboon retrovirus derived sequence (ECE1), and two new endogenous retrovirus sequences similar to brown bat endogenous retrovirus (FERVmlu1, FERVmlu2) were found within a 140 Kbp interval in the middle of the class I region. MHC SNPs were examined by comparing this BAC sequence with MHC-homozygous 1.9× WGS sequences, which identified 11,654 SNPs in 2.84 Mbp (0.00411 SNP per bp), a rate 2.4 times higher than that of the average heterozygous region in the WGS (0.0017 SNP per bp of genome) and slightly higher than the SNP rate observed in human MHC (0.00337 SNP per bp).
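The SNP rate comparison in this abstract is simple arithmetic and can be checked with a short sketch. The variable names are illustrative. The recomputed MHC rate comes out at roughly 0.0041 SNP per bp, which agrees with the reported 0.00411 to two significant figures; the small difference is presumably due to rounding of the 2.84 Mbp region length, though that is an assumption.

```python
def snp_rate(snp_count, region_bp):
    """SNPs per base pair over a region of the given length."""
    return snp_count / region_bp


# Figures taken from the abstract.
mhc_rate = snp_rate(11_654, 2_840_000)

# Fold differences computed from the rates as reported in the abstract.
fold_vs_wgs = 0.00411 / 0.0017
fold_vs_human_mhc = 0.00411 / 0.00337
```

The first ratio rounds to the 2.4-fold figure quoted in the abstract, and the second (about 1.2) matches the description of the feline rate as slightly higher than the human MHC rate.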

NSU Works

A comparative evaluation of various invasion assays testing colon carcinoma cell lines

Author: A Albini
A Fabra
AB Sparks
AC Noel
AM Sieuwerts
B Boyer
C Chao
C Gilles
C Sommers
CM Ewing
EW Thompson
F T Bosman
G Webb
H Sato
I Sunitha
JE de Vries
K Morikawa
K Vleminckx
KW Kinzler
LA Kunz-Schughart
M Jeffers
M Polette
M Vermey
MM Mareel
MM Mareel
N J de Both
NJ de Both
R Kath
RA Morton
RS Bresalier
S Meinders
SJ Vermeulen
TH Corbett
W N Dinjens
Publication venue: Nature Publishing Group
Publication date: 01/01/1999

Various colon carcinoma cell lines were tested in different invasion assays, i.e. invasion into Matrigel, into confluent fibroblast layers and into chicken heart tissue. Furthermore, invasive capacity and metastatic potential were determined in nude mice. The colon carcinoma cells used were the human cell lines Caco-2, SW-480, SW-620 and HT-29, and the murine lines Colon-26 and -38. None of the human colon carcinoma cells migrated through porous membranes coated with Matrigel; of the murine lines, only Colon-26 did. When incubated in a mixture of Matrigel and culture medium, non-invading cells formed spheroid cultures, whereas invading cells showed a stellate outgrowth. Only the heterogeneously shaped (epithelioid and stellate) cells of SW-480 and SW-620 and the spindle-shaped cells of Colon-26 invaded clearly confluent skin and colon fibroblasts as well as chicken heart tissue. However, when transplanted into the caecum of nude and syngeneic mice, all the lines tested were invasive with the exception of Caco-2 cells. We conclude that the outcome of in vitro tests measuring the invasive capacity of neoplastic cells is largely dependent on the test system used. Invasive capacity in vitro is strongly correlated with cells having a spindle cell shape, vimentin expression and E-cadherin down-regulation. In contrast, HT-29 and Colon-38 cells having an epithelioid phenotype were clearly invasive and metastatic in vivo, but not in vitro. © 1999 Cancer Research Campaign

EUR Research Repository

Physical activity and depressive symptoms in community-dwelling elders from southern Brazil

Assessment of Metagenomic Assembly Using Simulated Next Generation Sequencing Data

Author: AH Singh
Aino I. Järvelin
Alison S. Waller
B Ewing
B Ewing
CB Abulencia
D Chivian
D Wu
Daniel R. Mende
DC Richter
DR Zerbino
ED Harrington
ES Lander
EW Myers
F Meyer
FE Angly
FE Angly
GW Tyson
H García Martín
H-H Chou
J Goecks
J Goll
J Handelsman
J Muller
J Peterson
J Qin
J Raes
J Raes
JC Venter
Jeroen Raes
John Parkinson
JR Miller
JR Miller
K Kurokawa
K Mavromatis
M Arumugam
M Arumugam
M Pignatelli
M Pop
Manimozhiyan Arumugam
Michelle M. Chan
MP Cox
Peer Bork
PJ Turnbaugh
PJA Cock
R Li
R Li
R Schmieder
RA Edwards
RL Warren
S Aparicio
SG Tringe
Shinichi Sunagawa
SR Gill
T Schoenfeld
TA Gianoulis
TC Glenn
VM Markowitz
W Zhu
Publication venue: Public Library of Science
Publication date: 01/01/2012

Due to the complexity of the protocols and a limited knowledge of the nature of microbial communities, simulating metagenomic sequences plays an important role in testing the performance of existing tools and data analysis methods with metagenomic data. We developed metagenomic read simulators with platform-specific (Sanger, pyrosequencing, Illumina) base-error models, and simulated metagenomes of differing community complexities. We first evaluated the effect of rigorous quality control on Illumina data. Although quality filtering removed a large proportion of the data, it greatly improved the accuracy and contig lengths of resulting assemblies. We then compared the quality-trimmed Illumina assemblies to those from Sanger and pyrosequencing. For the simple community (10 genomes), all sequencing technologies assembled a similar amount and accurately represented the expected functional composition. For the more complex community (100 genomes), Illumina produced the best assemblies and more closely resembled the expected functional composition. For the most complex community (400 genomes), there was very little assembly of reads from any sequencing technology. However, due to the longer read length, the Sanger reads still represented the overall functional composition reasonably well. We further examined the effect of scaffolding of contigs using paired-end Illumina reads. It dramatically increased contig lengths of the simple community and yielded minor improvements to the more complex communities. Although the increase in contig length was accompanied by increased chimericity, it resulted in more complete genes and a better characterization of the functional repertoire. The metagenomic simulators developed for this research are freely available.

CiteSeerX

Copenhagen University Research Information System

MDC Repository

FigShare

ReRep: Computational detection of repetitive sequences in genome survey sequences (GSS)

Author: AL Pedrosa
Antonio B de Miranda
B Clift
B Ewing
B Wickstead
CS Peacock
D Gordon
E Arner
EC Laurentino
ES Lander
EW Myers
G Fu
G Fu
IHGS Consortium
J Healy
J Jurka
J Wang
JD Thompson
K Reinert
K Swaminathan
Leonardo HF Gomes
M Margulies
Marcelo Alves-Ferreira
N Rodriguez
N Volfovsky
NM El-Sayed
P Rice
PA Pevzner
R Szklarczyk
RA Hoskins
S Kurtz
S Kurtz
SF Altschul
SM Sunkin
TD Otto
Thomas D Otto
Wim M Degrave
X Huang
Z Bao
Publication venue: BioMed Central
Publication date: 01/01/2008

Abstract. Background: Genome survey sequences (GSS) offer a preliminary global view of a genome since, unlike ESTs, they cover coding as well as non-coding DNA and include repetitive regions of the genome. A more precise estimation of the nature, quantity and variability of repetitive sequences very early in a genome sequencing project is of considerable importance, as such data strongly influence the estimation of genome coverage, library quality and progress in scaffold construction. Also, the elimination of repetitive sequences from the initial assembly process is important to avoid errors and unnecessary complexity. Repetitive sequences are also of interest in a variety of other studies, for instance as molecular markers. Results: We designed and implemented a straightforward pipeline called ReRep, which combines bioinformatics tools for identifying repetitive structures in a GSS dataset. In a case study, we first applied the pipeline to a set of 970 GSSs, sequenced in our laboratory from the human pathogen Leishmania braziliensis, the causative agent of leishmaniosis, an important public health problem in Brazil. We also verified the applicability of ReRep to new sequencing technologies using a set of 454 reads from Escherichia coli. The behaviour of several parameters in the algorithm is evaluated and suggestions are made for tuning of the analysis. Conclusion: The ReRep approach for identification of repetitive elements in GSS datasets proved to be straightforward and efficient. Several potential repetitive sequences were found in an L. braziliensis GSS dataset generated in our laboratory, and further validated by the analysis of a more complete genomic dataset from the EMBL and Sanger Centre databases. ReRep also identified most of the E. coli K12 repeats prior to assembly in an example dataset obtained by automated sequencing using 454 technology. The parameters controlling the algorithm behaved consistently and may be tuned to the properties of the dataset, in particular to the length of sequencing reads and the genome coverage. ReRep is freely available for academic use at http://bioinfo.pdtis.fiocruz.br/ReRep/.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Springer - Publisher Connector

Enlighten

Discovering cancer genes by integrating network and functional properties

Author: A Ergun
A Hamosh
A Shearn
A Yokoyama
A-L Barabasi
AW Whitehurst
C Alfarano
C Cortes
C Greenman
D Maglott
David P Davis
DW Litchfield
EW Sayers
F Natt
G Joshi-Tope
GS Stewart
HB Fraser
HY Chuang
J Luscher-Firzlaff
JA Hanley
James Lee
JS Kaminker
K Lage
Kangyu Zhang
L Franke
Li Li
M Kanehisa
M Yu
MA Harris
O Kim
P Aza-Blanc
PA Futreal
PF Jonsson
Q Cui
R Bergholdt
RD Finn
RK Thomas
RM Ewing
S Forbes
S Pan
S Peri
S Wachi
Shaun Cordes
SJ Furney
T Sjoblom
W-H Li
Y Ohta
Z Tu
Zhijun Tang
Publication venue: BioMed Central
Publication date: 01/09/2009

Abstract. Background: Identification of novel cancer-causing genes is one of the main goals in cancer research. The rapid accumulation of genome-wide protein-protein interaction (PPI) data in humans has provided a new basis for studying the topological features of cancer genes in cellular networks. It is important to integrate multiple genomic data sources, including PPI networks, protein domains and Gene Ontology (GO) annotations, to facilitate the identification of cancer genes. Methods: Topological features of the PPI network, as well as protein domain compositions, enrichment of gene ontology categories, sequence and evolutionary conservation features were extracted and compared between cancer genes and other genes. The predictive power of various classifiers for identification of cancer genes was evaluated by cross validation. Experimental validation of a subset of the prediction results was conducted using siRNA knockdown and viability assays in human colon cancer cell line DLD-1. Results: Cross validation demonstrated advantageous performance of classifiers based on support vector machines (SVMs) with the inclusion of the topological features from the PPI network, protein domain compositions and GO annotations. We then applied the trained SVM classifier to human genes to prioritize putative cancer genes. siRNA knockdown of several SVM-predicted cancer genes greatly reduced cell viability in human colon cancer cell line DLD-1. Conclusion: Topological features of PPI networks, protein domain compositions and GO annotations are good predictors of cancer genes. The SVM classifier integrates multiple features and as such is useful for prioritizing candidate cancer genes for experimental validations.

Springer - Publisher Connector

A Novel Framework for the Comparative Analysis of Biological Networks

Author: A Califano
A Chatr-aryamontri
A Ruepp
A-L Barabási
AP Cootes
BP Kelley
C-S Liao
D Juan
DA Altomare
E Hirsh
ES Lander
EW Dijkstra
F Pazos
G Cesareni
GA Cope
H Häcker
H Yu
HB Fraser
J Flannick
J-F Rual
JC Venter
K Venkatesan
KM Nicholson
L Kiemer
L Zhenping
M Foiani
M Kalaev
M Kanehisa
M Koyutürk
M Narayanan
ME Smoot
N Wei
NV Grishin
P Beltrao
P Flicek
Patrick Aloy
Philip M. Kim
R Sharan
R Sharan
R Singh
RA Pache
RM Ewing
Roland A. Pache
S Bandyopadhyay
S Kerrien
S Zhang
SA Teichmann
SF Altschul
T Pawson
TS Keshava Prasad
U Consortium
U Güldener
U Stelzl
Y Chen
Y Cheng
YI Pavlov
Publication venue: Public Library of Science
Publication date: 01/01/2012

Genome sequencing projects provide nearly complete lists of the individual components present in an organism, but reveal little about how they work together. Follow-up initiatives have deciphered thousands of dynamic and context-dependent interrelationships between gene products that need to be analyzed with novel bioinformatics approaches able to capture their complex emerging properties. Here, we present a novel framework for the alignment and comparative analysis of biological networks of arbitrary topology. Our strategy includes the prediction of likely conserved interactions, based on evolutionary distances, to counter the high number of missing interactions in the current interactome networks, and a fast assessment of the statistical significance of individual alignment solutions, which vastly increases its performance with respect to existing tools. Finally, we illustrate the biological significance of the results through the identification of novel complex components and potential cases of cross-talk between pathways and alternative signaling routes.

CiteSeerX