University of Birmingham Research Portal

LSHTM Research Online

Oxford University Research Archive

Explore Bristol Research

FigShare

Analysis of a viral metagenomic library from 200 m depth in Monterey Bay, California constructed by direct shotgun cloning

Author: AI Culley
AS Lang
C Andrews-Pfannkoch
C Desnues
CA Suttle
Christina M Preston
DJ Lane
DM Kristensen
DR Garza
EF DeLong
EF DeLong
EF DeLong
F Meyer
FE Angly
G Bratbak
GF Steward
Grieg F Steward
J Sambrook
JA Fuhrman
JE Lawrence
JH Paul
JR Brum
KE Wommack
KE Wommack
KL Marhaver
L McDaniel
LP Villareal
M Breitbart
M Breitbart
M Breitbart
M Kanehisa
M Könneke
M Suzuki
MJ Allen
MJ Roossinck
MM Gowing
MW Silver
RA Edwards
RD Finn
RR Helton
SE Wilson
SF Altschul
SJ Williamson
SR Bench
TB Stanton
TJ Mincer
WP Cochlan
Publication venue: BioMed Central
Publication date: 01/01/2011

Abstract
Background Viruses have a profound influence on both the ecology and evolution of marine plankton, but the genetic diversity of viral assemblages, particularly those in deeper ocean waters, remains poorly described. Here we report on the construction and analysis of a viral metagenome prepared from below the euphotic zone in a temperate, eutrophic bay of coastal California.
Methods We purified viruses from approximately one cubic meter of seawater collected from 200 m depth in Monterey Bay, CA. DNA was extracted from the virus fraction, sheared, cloned with no prior amplification into a plasmid vector, and propagated in E. coli to produce the MBv200m library. Random clones were sequenced by the Sanger method. Sequences were assembled, then compared to sequences in GenBank and to other viral metagenomic libraries using BLAST analyses.
Results Only 26% of the 881 sequences remaining after assembly had significant (E ≤ 0.001) BLAST hits to sequences in the GenBank nr database, with most being matches to bacteria (15%) and viruses (8%). When BLAST analysis included environmental sequences, 74% of sequences in the MBv200m library had a significant match. Most of these hits (70%) were to microbial metagenome sequences and only 0.7% were to sequences from viral metagenomes. Of the 121 sequences with a significant hit to a known virus, 94% matched bacteriophages (Families Podo-, Sipho-, and Myoviridae) and 6% matched viruses of eukaryotes in the Family Phycodnaviridae (5 sequences) or the Mimivirus (2 sequences). The largest percentages of hits to viral genes of known function were to those involved in DNA modification (25%) or structural genes (17%). Based on reciprocal BLAST analyses, the MBv200m library appeared to be most similar to viral metagenomes from two other bays and least similar to a viral metagenome from the Arctic Ocean.
Conclusions Direct cloning of DNA from diverse marine viruses was feasible and resulted in a distribution of virus types and functional genes at depth that differed in detail from, but was broadly similar to, those found in surface marine waters. Targeted viral analyses are useful for identifying those components of the greater marine metagenome that circulate in the subcellular size fraction.
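
The significance cutoff quoted in the Results (E ≤ 0.001) can be illustrated with a minimal sketch. It assumes the hits are available as tab-separated lines in the common 12-column BLAST tabular layout, with the E-value in the 11th column. The function name `significant_queries` and the sample rows are hypothetical and are not taken from the study.

```python
def significant_queries(tabular_lines, max_evalue=1e-3):
    """Return the set of query IDs with at least one hit at or below max_evalue.

    Assumes 12-column tab-separated BLAST output: query ID in column 1,
    E-value in column 11 (both assumptions about the input format).
    """
    keep = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query_id = fields[0]
        evalue = float(fields[10])
        if evalue <= max_evalue:
            keep.add(query_id)
    return keep


# Made-up rows for illustration only (not data from the MBv200m library).
rows = [
    "clone_001\tphage_A\t85.0\t200\t30\t0\t1\t200\t5\t204\t1e-50\t180",
    "clone_002\tbact_B\t40.0\t90\t54\t0\t1\t90\t10\t99\t0.5\t30",
    "clone_003\tphage_C\t60.0\t120\t48\t0\t1\t120\t1\t120\t0.001\t55",
]
hits = significant_queries(rows)
total_queries = 3
fraction_with_hit = len(hits) / total_queries
```

Counting queries rather than hit lines matters here, because a single clone can have many hits while the reported percentages are per sequence.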

Analysis and comparison of very large metagenomes with fast clustering and functional annotation

Author: AC McHardy
AR Quinlan
B Rodriguez-Brito
D Sheskin
DB Rusch
DC Richter
DH Huson
E Portugaly
EA Dinsdale
EF DeLong
FE Angly
GW Tyson
H Noguchi
H Noguchi
H Teeling
H Teeling
J Shendure
JC Venter
K Mavromatis
KJ Hoff
L Krause
PD Schloss
R Seshadri
RK Aziz
S Yooseph
S Yooseph
SF Altschul
SG Tringe
SR Eddy
SR Gill
W Li
W Li
W Li
W Li
Weizhong Li
Publication venue: BioMed Central
Publication date: 01/01/2009

Abstract
Background The remarkable advance of metagenomics presents significant new challenges in data analysis. Metagenomic datasets (metagenomes) are large collections of sequencing reads from anonymous species within particular environments. Computational analyses for very large metagenomes are extremely time-consuming, and there are often many novel sequences in these metagenomes that are not fully utilized. The number of available metagenomes is rapidly increasing, so fast and efficient metagenome comparison methods are in great demand.
Results The new metagenomic data analysis method Rapid Analysis of Multiple Metagenomes with a Clustering and Annotation Pipeline (RAMMCAP) was developed using an ultra-fast sequence clustering algorithm, fast protein family annotation tools, and a novel statistical metagenome comparison method that employs a unique graphic interface. RAMMCAP processes extremely large datasets with only moderate computational effort. It identifies raw read clusters and protein clusters that may include novel gene families, and compares metagenomes using clusters or functional annotations calculated by RAMMCAP. In this study, RAMMCAP was applied to the two largest available metagenomic collections, the "Global Ocean Sampling" and the "Metagenomic Profiling of Nine Biomes".
Conclusion RAMMCAP is a very fast method that can cluster and annotate one million metagenomic reads in only hundreds of CPU hours. It is available from http://tools.camera.calit2.net/camera/rammcap/.

Public Library of Science (PLOS)

Predicting Quantitative Genetic Interactions by Means of Sequential Matrix Approximation

Author: A Beyer
A Hintze
AH Tong
Aki P. Järvinen
B Lehner
BL Drees
C Boone
D Segrè
ER DeLong
GD Bader
I Ulitsky
I Ulitsky
J De Leeuw
JL Badano
JL Hartman
Jukka Hiissa
L Decourty
Laura L. Elo
M Schuldiner
O Dror
P Pudil
P Ye
R Kelley
R Mani
RJ Taylor
RP St Onge
S Axler
S Bandyopadhyay
Shin-Han Shiu
SL Ooi
SL Wong
SR Collins
SR Collins
Tero Aittokallio
X Pan
Publication venue: Public Library of Science
Publication date: 26/09/2008

Despite the emerging experimental techniques for perturbing multiple genes and measuring their quantitative phenotypic effects, genetic interactions have remained extremely difficult to predict on a large scale. Using a recent high-resolution screen of genetic interactions in yeast as a case study, we investigated whether the extraction of pertinent information encoded in the quantitative phenotypic measurements could be improved by computational means. By taking advantage of the observation that most gene pairs in the genetic interaction screens have no significant interactions with each other, we developed a sequential approximation procedure which ranks the mutation pairs in order of evidence for a genetic interaction. The sequential approximations can efficiently remove background variation in the double-mutation screens and give increasingly accurate estimates of the single-mutant fitness measurements. Interestingly, these estimates not only provide predictions for genetic interactions which are consistent with those obtained using the measured fitness, but they can even significantly improve the accuracy with which one can distinguish functionally related gene pairs from the non-interacting pairs. The computational approach, in general, enables an efficient exploration and classification of genetic interactions in other studies and systems as well.

Gene identification and protein classification in microbial metagenomic sequence data via incremental clustering

Author: A Bateman
A Nekrutenko
AC McHardy
AG Murzin
CA Orengo
D Fischer
DB Rusch
DH Haft
DL Wheeler
E Birney
ED Harrington
EF DeLong
EF DeLong
F Corpet
F Sanger
FMDL Vega
Granger Sutton
GW Tyson
H Noguchi
H Ochman
J Besemer
J Quackenbush
JA Eisen
JC Venter
K Chen
K Mavromatis
L Krause
L Rychlewski
M Margulies
M Sait
N Siew
R Seshadri
R Unger
RC Edgar
S Yooseph
SF Altschul
SF Altschul
SG Tringe
Shibu Yooseph
SJ Giovannoni
SR Gill
W Li
W Li
W Li
Weizhong Li
Z Yang
Z Yang
Publication venue: BioMed Central
Publication date: 01/04/2008

Abstract
Background The identification and study of proteins from metagenomic datasets can shed light on the roles and interactions of the source organisms in their communities. However, metagenomic datasets are characterized by the presence of organisms with varying GC composition, codon usage biases, etc., and consequently gene identification is challenging. The vast amount of sequence data also requires faster protein family classification tools.
Results We present a computational improvement to a sequence clustering approach that we developed previously to identify and classify protein-coding genes in large microbial metagenomic datasets. The clustering approach can be used to identify protein-coding genes in prokaryotes, viruses, and intron-less eukaryotes. The computational improvement is based on an incremental clustering method that does not require the expensive all-against-all computation that was required by the original approach, while still preserving the remote homology detection capabilities. We present evaluations of the clustering approach in protein-coding gene identification and classification, and also present the results of updating the protein clusters from our previous work with recent genomic and metagenomic sequences. The clustering results are available via CAMERA (http://camera.calit2.net).
Conclusion The clustering paradigm is shown to be a very useful tool in the analysis of microbial metagenomic data. The incremental clustering method is shown to be much faster than the original approach in identifying genes, grouping sequences into existing protein families, and also identifying novel families that have multiple members in a metagenomic dataset. These clusters provide a basis for further studies of protein families.

Public Library of Science (PLOS)

Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets

Author: A Krogh
A Lupas
A Sali
AC McHardy
Adam Godzik
AJ Enright
AJ Enright
B Rodriguez-Brito
BE Suzek
David Jones
DB Rusch
DH Huson
EF DeLong
FE Angly
G Yona
GW Tyson
J Park
JA Cuff
JC Venter
JD Bendtsen
JD Thompson
John C. Wooley
K Mavromatis
L Holm
L Krause
L Rychlewski
ML Tress
O Sasson
P Pipenbacher
PD Schloss
R Apweiler
RL Tatusov
S Mika
S Yooseph
SF Altschul
SG Tringe
SR Eddy
SR Gill
U Hobohm
W Li
W Li
W Li
W Li
Weizhong Li
Publication venue: Public Library of Science
Publication date: 01/01/2008

BACKGROUND: The scale and diversity of metagenomic sequencing projects challenge both our technical and conceptual approaches in gene and genome annotations. The recent Sorcerer II Global Ocean Sampling (GOS) expedition yielded millions of predicted protein sequences, which significantly altered the landscape of known protein space by more than doubling its size and adding thousands of new families (Yooseph et al., 2007 PLoS Biol 5, e16). Such datasets, not only by their sheer size, but also by many other features, defy conventional analysis and annotation methods.
METHODOLOGY/PRINCIPAL FINDINGS: In this study, we describe an approach for rapid analysis of the sequence diversity and the internal structure of such very large datasets by advanced clustering strategies using the newly modified CD-HIT algorithm. We performed a hierarchical clustering analysis on the 17.4 million Open Reading Frames (ORFs) identified from the GOS study and found over 33 thousand large predicted protein clusters comprising nearly 6 million sequences. Twenty percent of these clusters did not match known protein families by sequence similarity search and might represent novel protein families. Distributions of the large clusters were illustrated on organism composition, functional class, and sample locations.
CONCLUSION/SIGNIFICANCE: Our clustering took about two orders of magnitude less computational effort than the similar protein family analysis of the original GOS study. This approach will help to analyze other large metagenomic datasets in the future. A Web server with our clustering results and annotations of predicted protein clusters is available online at http://tools.camera.calit2.net/gos under the CAMERA project.
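
CD-HIT is generally described as a greedy incremental clustering method: sequences are processed longest first, and each one either joins an existing cluster or becomes the representative of a new one. The sketch below shows only that general idea. It is not the modified algorithm used in the study, and `difflib.SequenceMatcher` is a crude stand-in for a real alignment-based identity measure. The function names, the 0.9 threshold, and the toy sequences are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Crude stand-in for alignment identity: a similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(seqs, threshold=0.9):
    """Cluster sequences greedily, longest first.

    Each sequence joins the first representative it matches at or above
    `threshold`; otherwise it founds a new cluster as its representative.
    Returns a list of (representative, members) pairs.
    """
    clusters = []
    for seq in sorted(seqs, key=len, reverse=True):
        for rep, members in clusters:
            if identity(rep, seq) >= threshold:
                members.append(seq)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters.append((seq, [seq]))
    return clusters


# Toy protein-like strings, invented for this example.
toy = ["MKTAYIAKQRQISFVKSHFSRQ", "GGGGGGGGGG", "MKTAYIAKQRQISFVKSHFSR"]
result = greedy_cluster(toy)
```

Because each sequence is compared only against cluster representatives rather than against every other sequence, the number of comparisons grows with the number of clusters instead of with the square of the dataset size.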

eScholarship - University of California

A Nitrile Hydratase in the Eukaryote Monosiga brevicollis

Author: A Banerjee
AJ Blakeya
BE Suzek
BR Langdahl
CJ Creevey
DB Rusch
EF DeLong
GW Tyson
HG Martin
I Endo
I Letunic
J Felsenstein
JA Colquhoun
JA Kovacs
Jean Muller
Jeroen Raes
K Narayanasamy
K Rutherford
Konrad U. Foerstner
KR Buck
M Kobayashi
M Nakasako
MA Larkin
MJ DiGeronimo
N King
N Layh
NM Shaw
Peer Bork
PFB Brandao
RA Pereira
RC Edgar
RD Finn
S Mitra
SG Tringe
Sp Guindon
SR Eddy
SR Gill
Sridhar Hannenhalli
T Arakawa
T Nagasawa
TC Harrop
Tobias Doerks
W Huang
Y Toshifumi
Publication venue: Public Library of Science
Publication date: 01/01/2008

Bacterial nitrile hydratases (NHases) are important industrial catalysts and waste water remediation tools. In a global computational screening of conventional and metagenomic sequence data for NHases, we detected the two usually separated NHase subunits fused in one protein of the choanoflagellate Monosiga brevicollis, a recently sequenced unicellular model organism from the closest sister group of Metazoa. This is the first time that an NHase has been found in eukaryotes and the first time it has been observed as a fusion protein. The presence of an intron, subunit fusion and expressed sequence tags covering parts of the gene exclude contamination and suggest a functional gene. Phylogenetic analyses and genomic context imply a probable ancient horizontal gene transfer (HGT) from proteobacteria. The newly discovered NHase might open biotechnological routes due to its unconventional structure, its new type of host and its apparent integration into eukaryotic protein networks.

MDC Repository

A statistical toolbox for metagenomics: assessing functional diversity in microbial communities

Author: A Chao
A Chao
A Chao
A Chao
A Chao
AC McHardy
AE Magurran
AP Martin
B Rodriguez-Brito
C von Mering
CS Riesenfeld
DA Rasko
DB Rusch
DH Huson
DR Singleton
E Lerat
EF DeLong
EF Delong
GW Tyson
H Garcia Martin
H Teeling
H Teeling
JC Venter
JC Yue
JL Stein
Jo Handelsman
JP Wang
K Mavromatis
KP Burnham
KU Foerstner
L Excoffier
M Breitbart
M Breitbart
M Margulies
M Strous
MJ Anderson
MR Rondon
P Legendre
Patrick D Schloss
PD Schloss
PD Schloss
PD Schloss
PD Schloss
PD Schloss
PL Johnson
S Yooseph
SG Tringe
SJ Hallam
SJ Hallam
SR Gill
T Woyke
TD Read
TM Schmidt
VM Markowitz
W Ludwig
Publication venue: BioMed Central
Publication date: 01/01/2008

Abstract
Background The 99% of bacteria in the environment that are recalcitrant to culturing have spurred the development of metagenomics, a culture-independent approach to sample and characterize microbial genomes. Massive datasets of metagenomic sequences have been accumulated, but analysis of these sequences has focused primarily on the descriptive comparison of the relative abundance of proteins that belong to specific functional categories. More robust statistical methods are needed to make inferences from metagenomic data. In this study, we developed and applied a suite of tools to describe and compare the richness, membership, and structure of microbial communities using peptide fragment sequences extracted from metagenomic sequence data.
Results Application of these tools to acid mine drainage, soil, and whale fall metagenomic sequence collections revealed groups of peptide fragments with a relatively high abundance and no known function. When combined with analysis of 16S rRNA gene fragments from the same communities, these tools enabled us to demonstrate that although there was no overlap in the types of 16S rRNA gene sequence observed, there was a core collection of operational protein families that was shared among the three environments.
Conclusion The results of comparisons between the three habitats were surprising considering the relatively low overlap of membership and the distinctively different characteristics of the three habitats. These tools will facilitate the use of metagenomics to pursue statistically sound genome-based ecological analyses.
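
The abstract does not say which estimators the toolbox uses to describe richness. As one example of the kind of statistic involved, the sketch below computes the bias-corrected Chao1 estimator, a widely used nonparametric richness estimate, from per-family abundance counts. Applying it to operational protein families is an assumption made here for illustration, and the function name and sample counts are hypothetical.

```python
from collections import Counter


def chao1(abundances):
    """Bias-corrected Chao1 richness estimate.

    S_chao1 = S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)), where S_obs is the
    number of observed families, F1 the number seen exactly once
    (singletons) and F2 the number seen exactly twice (doubletons).
    """
    counts = [n for n in abundances if n > 0]
    s_obs = len(counts)
    freq = Counter(counts)
    f1, f2 = freq[1], freq[2]
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Invented counts: 7 observed families, 3 singletons, 2 doubletons.
sample_counts = [10, 4, 2, 2, 1, 1, 1]
estimate = chao1(sample_counts)
```

The estimate exceeds the observed count whenever there are at least two singletons, which reflects the intuition that many rarely seen families imply further unseen ones.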