5 research outputs found

Gene identification and protein classification in microbial metagenomic sequence data via incremental clustering

Author: A Bateman
A Nekrutenko
AC McHardy
AG Murzin
CA Orengo
D Fischer
DB Rusch
DH Haft
DL Wheeler
E Birney
ED Harrington
EF DeLong
F Corpet
F Sanger
FMDL Vega
Granger Sutton
GW Tyson
H Noguchi
H Ochman
J Besemer
J Quackenbush
JA Eisen
JC Venter
K Chen
K Mavromatis
L Krause
L Rychlewski
M Margulies
M Sait
N Siew
R Seshadri
R Unger
RC Edgar
S Yooseph
SF Altschul
SG Tringe
Shibu Yooseph
SJ Giovannoni
SR Gill
W Li
Weizhong Li
Z Yang
Publication venue: BioMed Central
Publication date: 01/04/2008

Abstract. Background: The identification and study of proteins from metagenomic datasets can shed light on the roles and interactions of the source organisms in their communities. However, metagenomic datasets are characterized by the presence of organisms with varying GC composition, codon usage biases, etc., and consequently gene identification is challenging. The vast amount of sequence data also requires faster protein family classification tools. Results: We present a computational improvement to a sequence clustering approach that we developed previously to identify and classify protein coding genes in large microbial metagenomic datasets. The clustering approach can be used to identify protein coding genes in prokaryotes, viruses, and intron-less eukaryotes. The computational improvement is based on an incremental clustering method that does not require the expensive all-against-all compute that was required by the original approach, while still preserving the remote homology detection capabilities. We present evaluations of the clustering approach in protein-coding gene identification and classification, and also present the results of updating the protein clusters from our previous work with recent genomic and metagenomic sequences. The clustering results are available via CAMERA (http://camera.calit2.net). Conclusion: The clustering paradigm is shown to be a very useful tool in the analysis of microbial metagenomic data. The incremental clustering method is shown to be much faster than the original approach in identifying genes, grouping sequences into existing protein families, and also identifying novel families that have multiple members in a metagenomic dataset. These clusters provide a basis for further studies of protein families.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

A Primer on Metagenomics

Author: Bourne Philip E.
Friedberg Iddo
Godzik Adam
Wooley John C.
Publication venue: Public Library of Science
Publication date: 01/01/2010

Metagenomics is a discipline that enables the genomic study of uncultured microorganisms. Faster, cheaper sequencing technologies and the ability to sequence uncultured microbes sampled directly from their habitats are expanding and transforming our view of the microbial world. Distilling meaningful information from the millions of new genomic sequences presents a serious challenge to bioinformaticians. In cultured microbes, the genomic data come from a single clone, making sequence assembly and annotation tractable. In metagenomics, the data come from heterogeneous microbial communities, sometimes containing more than 10,000 species, with the sequence data being noisy and partial. From sampling, to assembly, to gene calling and function prediction, bioinformatics faces new demands in interpreting voluminous, noisy, and often partial sequence data. Although metagenomics is a relative newcomer to science, the past few years have seen an explosion in computational methods applied to metagenomic-based research. It is therefore not within the scope of this article to provide an exhaustive review. Rather, we provide here a concise yet comprehensive introduction to the current computational requirements presented by metagenomics, and review the recent progress made. We also note whether there is software that implements any of the methods presented here, and briefly review its utility. Nevertheless, it would be useful if readers of this article would avail themselves of the comment section provided by this journal, and relate their own experiences. Finally, the last section of this article provides a few representative studies illustrating different facets of recent scientific discoveries made using metagenomics.

Public Library of Science (PLOS)

Directory of Open Access Journals

PubMed Central

Queensland University of Technology ePrints Archive

eScholarship - University of California

Peptide-based functional annotation of carbohydrate-active enzymes by conserved unique peptide patterns (CUPP)

Author: Barrett Kristian
Lange Lene
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2019

Additional file 1: Figure S1. Selection of c_clust and peptide parameters. Figure S2. N-fold cross validation of GH30. Figure S3. CUPP flowchart. Table S1. Relative RAM requirements as a function of peptide length and number of ambiguous positions. Table S2. GH30 CUPP group validation. Table S3. N-fold cross validation of GH30 using ten partitions.

Online Research Database In Technology

FigShare

Recommended from our members

Reconstruction of a Bacterial Genome from DNA Cassettes

Author: Allen Andrew
Allen Lisa Zeigler
Dupont Christopher
Friedman Robert
Glass John
Sheahan Laura
Thiagarajan Mathangi
Venter J. Craig
Yooseph Shibu
Publication venue: Office of Scientific and Technical Information (OSTI)
Publication date

This basic research program comprised two major areas: (1) acquisition and analysis of marine microbial metagenomic data and development of genomic analysis tools for broad, external community use; (2) development of a minimal bacterial genome. Our Marine Metagenomic Diversity effort generated and analyzed shotgun sequencing data from microbial communities sampled from over 250 sites around the world. About 40% of the 26 Gbp of sequence data has been made publicly available to date, with a complete release anticipated in six months. Our results, and those of others mining the deposited data, have revealed a vast diversity of genes coding for critical metabolic processes whose phylogenetic and geographic distributions will enable a deeper understanding of carbon and nutrient cycling, microbial ecology, and rapid evolutionary processes such as horizontal gene transfer by viruses and plasmids. A global assembly of the generated dataset resulted in a massive set (5 Gbp) of genome fragments that provide context to the majority of the generated data that originated from uncultivated organisms. Our Synthetic Biology team has made significant progress towards the goal of synthesizing a minimal mycoplasma genome that will have all of the machinery for independent life. This project, once completed, will provide fundamentally new knowledge about requirements for microbial life and help to lay a basic research foundation for developing microbiological approaches to bioenergy.

UNT Digital Library