55 research outputs found
A phylogenomic gene cluster resource: the Phylogenetically Inferred Groups (PhIGs) database
BACKGROUND: We present here the PhIGs database, a phylogenomic resource for sequenced genomes. Although many methods exist for clustering gene families, very few attempt to create truly orthologous clusters sharing descent from a single ancestral gene across a range of evolutionary depths. Although these non-phylogenetic gene family clusters have been used broadly for gene annotation, errors are known to be introduced by the artifactual association of slowly evolving paralogs and lack of annotation for those more rapidly evolving. A full phylogenetic framework is necessary for accurate inference of function and for many studies that address pattern and mechanism of the evolution of the genome. The automated generation of evolutionary gene clusters, creation of gene trees, determination of orthology and paralogy relationships, and the correlation of this information with gene annotations, expression information, and genomic context is an important resource to the scientific community. DISCUSSION: The PhIGs database currently contains 23 completely sequenced genomes of fungi and metazoans, containing 409,653 genes that have been grouped into 42,645 gene clusters. Each gene cluster is built such that the gene sequence distances are consistent with the known organismal relationships, thereby maximizing the likelihood that the clusters represent truly orthologous genes. The PhIGs website contains tools that allow the study of genes within their phylogenetic framework through keyword searches on annotations, such as GO and InterPro assignments, and sequence similarity searches by BLAST and HMM. In addition to displaying the evolutionary relationships of the genes in each cluster, the website also allows users to view the relative physical positions of homologous genes in specified sets of genomes. SUMMARY: Accurate analyses of genes and genomes can only be done within their full phylogenetic context. The PhIGs database and corresponding website address this problem for the scientific community. Our goal is to expand the content as more genomes are sequenced and use this framework to incorporate more analyses
FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments
Background: We recently described FastTree, a tool for inferring phylogenies for alignments with up to hundreds of thousands of sequences. Here, we describe improvements to FastTree that improve its accuracy without sacrificing scalability. Methodology/Principal Findings: Where FastTree 1 used nearest-neighbor interchanges (NNIs) and the minimum-evolution criterion to improve the tree, FastTree 2 adds minimum-evolution subtree-pruning-regrafting (SPRs) and maximum-likelihood NNIs. FastTree 2 uses heuristics to restrict the search for better trees and estimates a rate of evolution for each site (the "CAT" approximation). Nevertheless, for both simulated and genuine alignments, FastTree 2 is slightly more accurate than a standard implementation of maximum-likelihood NNIs (PhyML 3 with default settings). Although FastTree 2 is not quite as accurate as methods that use maximum-likelihood SPRs, most of the splits that disagree are poorly supported, and for large alignments, FastTree 2 is 100–1,000 times faster. FastTree 2 inferred a topology and likelihood-based local support values for 237,882 distinct 16S ribosomal RNAs on a desktop computer in 22 hours and 5.8 gigabytes of memory. Conclusions/Significance: FastTree 2 allows the inference of maximum-likelihood phylogenies for huge alignments
Horizontal gene transfer and the evolution of transcriptional regulation in Escherichia coli
Most Escherichia coli transcription factors have paralogs, but these usually arose by horizontal gene transfer rather than by duplication within the E. coli lineage, as previously believed
Orthologous Transcription Factors in Bacteria Have Different Functions and Regulate Different Genes
Transcription factors (TFs) form large paralogous gene families and have complex evolutionary histories. Here, we ask whether putative orthologs of TFs, from bidirectional best BLAST hits (BBHs), are evolutionary orthologs with conserved functions. We show that BBHs of TFs from distantly related bacteria are usually not evolutionary orthologs. Furthermore, the false orthologs usually respond to different signals and regulate distinct pathways, while the few BBHs that are evolutionary orthologs do have conserved functions. To test the conservation of regulatory interactions, we analyze expression patterns. We find that regulatory relationships between TFs and their regulated genes are usually not conserved for BBHs in Escherichia coli K12 and Bacillus subtilis. Even in the much more closely related bacteria Vibrio cholerae and Shewanella oneidensis MR-1, predicting regulation from E. coli BBHs has high error rates. Using gene–regulon correlations, we identify genes whose expression pattern differs between E. coli and S. oneidensis. Using literature searches and sequence analysis, we show that these changes in expression patterns reflect changes in gene regulation, even for evolutionary orthologs. We conclude that the evolution of bacterial regulation should be analyzed with phylogenetic trees, rather than BBHs, and that bacterial regulatory networks evolve more rapidly than previously thought
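The bidirectional best hit criterion discussed in this abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `(query, subject, score)` tuple layout, the function name, and the gene identifiers are all hypothetical, and the scores stand in for BLAST bit scores. As the abstract stresses, pairs found this way are only putative orthologs.

```python
def bidirectional_best_hits(hits_a_to_b, hits_b_to_a):
    """Return sorted (a, b) pairs in which each gene is the other's top-scoring hit.

    Each argument is an iterable of (query, subject, score) tuples.
    This tuple layout is an assumption made for illustration.
    """
    def best_subject_per_query(hits):
        top = {}
        for query, subject, score in hits:
            # Keep only the highest-scoring subject seen for each query.
            if query not in top or score > top[query][1]:
                top[query] = (subject, score)
        return {query: subject for query, (subject, _) in top.items()}

    best_ab = best_subject_per_query(hits_a_to_b)
    best_ba = best_subject_per_query(hits_b_to_a)
    # A pair is a BBH only if the best hit is reciprocal.
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical hits between genome A (a1..a3) and genome B (b1, b2).
hits_ab = [("a1", "b1", 200.0), ("a1", "b2", 150.0),
           ("a2", "b2", 90.0), ("a3", "b1", 80.0)]
hits_ba = [("b1", "a1", 198.0), ("b2", "a1", 140.0), ("b2", "a2", 95.0)]

print(bidirectional_best_hits(hits_ab, hits_ba))
```

In this toy input, a2's best hit is b2, but b2's best hit is a1, so that pair is not reciprocal and is dropped. The reciprocity test looks only at top scores and says nothing about the gene tree, which is why the abstract recommends phylogenetic trees over BBHs.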
RegPrecise web services interface: programmatic access to the transcriptional regulatory interactions in bacteria reconstructed by comparative genomics.
A web services application programming interface (API) was developed to provide programmatic access to the regulatory interactions accumulated in the RegPrecise database (http://regprecise.lbl.gov), a core resource on transcriptional regulation for the microbial domain of the Department of Energy (DOE) Systems Biology Knowledgebase. RegPrecise captures and visualizes regulogs, sets of genes controlled by orthologous regulators in several closely related bacterial genomes, that were reconstructed by comparative genomics. The current release of RegPrecise 2.0 includes >1400 regulogs controlled either by protein transcription factors or by conserved ribonucleic acid regulatory motifs in >250 genomes from 24 taxonomic groups of bacteria. The reference regulons accumulated in RegPrecise can serve as a basis for automatic annotation of regulatory interactions in newly sequenced genomes. The developed API provides efficient access to the RegPrecise data through a comprehensive set of 14 web service resources. The RegPrecise web services API is freely accessible at http://regprecise.lbl.gov/RegPrecise/services.jsp with no login requirements
FastTree: Computing Large Minimum-Evolution Trees with Profiles instead of a Distance Matrix
Gene families are growing rapidly, but standard methods for inferring phylogenies do not scale to alignments with over 10,000 sequences. We present FastTree, a method for constructing large phylogenies and for estimating their reliability. Instead of storing a distance matrix, FastTree stores sequence profiles of internal nodes in the tree. FastTree uses these profiles to implement neighbor-joining and uses heuristics to quickly identify candidate joins. FastTree then uses nearest-neighbor interchanges to reduce the length of the tree. For an alignment with N sequences, L sites, and an alphabet of a distinct characters, a distance matrix requires O(N^2) space and O(N^2 L) time, but FastTree requires just O(NLa + N sqrt(N)) memory and O(N sqrt(N) log(N) L a) time. To estimate the tree's reliability, FastTree uses local bootstrapping, which gives another 100-fold speedup over a distance matrix. For example, FastTree computed a tree and support values for 158,022 distinct 16S ribosomal RNAs in 17 hours and 2.4 gigabytes of memory. Just computing pairwise Jukes-Cantor distances and storing them, without inferring a tree or bootstrapping, would require 17 hours and 50 gigabytes of memory. In simulations, FastTree was slightly more accurate than neighbor joining, BIONJ, or FastME; on genuine alignments, FastTree's topologies had higher likelihoods. FastTree is available at http://microbesonline.org/fasttree
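The 50-gigabyte figure for a stored distance matrix can be checked with simple arithmetic. The sketch below is a rough consistency check, not the authors' calculation: it assumes only the upper triangle is stored and that each distance is a 4-byte float, neither of which the abstract states. It also includes the standard Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), where p is the fraction of differing sites. The function names are illustrative.

```python
import math


def distance_matrix_bytes(n_seqs, bytes_per_entry=4):
    """Bytes needed to store the upper triangle of an N x N distance matrix.

    bytes_per_entry=4 (single-precision float) is an assumption.
    """
    n_pairs = n_seqs * (n_seqs - 1) // 2
    return n_pairs * bytes_per_entry


def jukes_cantor(p):
    """Jukes-Cantor distance for a fraction p of differing nucleotide sites."""
    if p >= 0.75:
        # The correction is undefined at or beyond saturation.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


n = 158022  # number of distinct 16S rRNA sequences quoted in the abstract
print(f"{distance_matrix_bytes(n) / 1e9:.1f} GB for {n} sequences")
print(f"JC distance at p = 0.10: {jukes_cantor(0.10):.4f}")
```

Under these assumptions the matrix needs roughly 12.5 billion entries, which is close to the 50 gigabytes quoted and about twenty times the 2.4 gigabytes FastTree used for the same alignment.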
Systematic mapping of two component response regulators to gene targets in a model sulfate reducing bacterium
Background: Two component regulatory systems are the primary form of signal transduction in bacteria. Although genomic binding sites have been determined for several eukaryotic and bacterial transcription factors, comprehensive identification of gene targets of two component response regulators remains challenging due to the lack of knowledge of the signals required for their activation. We focused our study on Desulfovibrio vulgaris Hildenborough, a sulfate reducing bacterium that encodes unusually diverse and largely uncharacterized two component signal transduction systems. Results: We report the first systematic mapping of the genes regulated by all transcriptionally acting response regulators in a single bacterium. Our results enabled functional predictions for several response regulators and include key processes of carbon, nitrogen and energy metabolism, cell motility and biofilm formation, and responses to stresses such as nitrite, low potassium and phosphate starvation. Our study also led to the prediction of new genes and regulatory networks, which found corroboration in a compendium of transcriptome data available for D. vulgaris. For several regulators we predicted and experimentally verified the binding site motifs, most of which were discovered as part of this study. Conclusions: The gene targets identified for the response regulators allowed strong functional predictions to be made for the corresponding two component systems. By tracking the D. vulgaris regulators and their motifs outside the Desulfovibrio spp. we provide testable hypotheses regarding the functions of orthologous regulators in other organisms. The in vitro array based method optimized here is generally applicable for the study of such systems in all organisms
FastBLAST: Homology Relationships for Millions of Proteins
Background: All-versus-all BLAST, which searches for homologous pairs of sequences in a database of proteins, is used to identify potential orthologs, to find new protein families, and to provide rapid access to these homology relationships. As DNA sequencing accelerates and data sets grow, all-versus-all BLAST has become computationally demanding. Methodology/Principal Findings: We present FastBLAST, a heuristic replacement for all-versus-all BLAST that relies on alignments of proteins to known families, obtained from tools such as PSI-BLAST and HMMer. FastBLAST avoids most of the work of all-versus-all BLAST by taking advantage of these alignments and by clustering similar sequences. FastBLAST runs in two stages: the first stage identifies additional families and aligns them, and the second stage quickly identifies the homologs of a query sequence, based on the alignments of the families, before generating pairwise alignments. On 6.53 million proteins from the non-redundant Genbank database ("NR"), FastBLAST identifies new families 25 times faster than all-versus-all BLAST. Once the first stage is completed, FastBLAST identifies homologs for the average query in less than 5 seconds (8.6 times faster than BLAST) and gives nearly identical results. For hits above 70 bits, FastBLAST identifies 98% of the top 3,250 hits per query. Conclusions/Significance: FastBLAST enables research groups that do not have supercomputers to analyze large protein sequence data sets. FastBLAST is open source software and is available at http://microbesonline.org/fastblast
High-Quality Draft Genome Sequence of Desulfovibrio carbinoliphilus FW-101-2B, an Organic Acid-Oxidizing Sulfate-Reducing Bacterium Isolated from Uranium(VI)-Contaminated Groundwater.
Desulfovibrio carbinoliphilus subsp. oakridgensis FW-101-2B is an anaerobic, organic acid/alcohol-oxidizing, sulfate-reducing δ-proteobacterium. FW-101-2B was isolated from contaminated groundwater at The Field Research Center at Oak Ridge National Lab after in situ stimulation for heavy metal-reducing conditions. The genome will help elucidate the metabolic potential of sulfate-reducing bacteria during uranium reduction
Mechanisms of Thermal Adaptation Revealed From the Genomes of the Antarctic Archaea Methanogenium frigidum and Methanococcoides burtonii
We generated draft genome sequences for two cold-adapted Archaea, Methanogenium frigidum and Methanococcoides burtonii, to identify genotypic characteristics that distinguish them from Archaea with a higher optimal growth temperature (OGT). Comparative genomics revealed trends in amino acid and tRNA composition, and structural features of proteins. Proteins from the cold-adapted Archaea are characterized by a higher content of noncharged polar amino acids, particularly Gln and Thr, and a lower content of hydrophobic amino acids, particularly Leu. Sequence data from nine methanogen genomes (OGT 15°-98°C) were used to generate 1111 modeled protein structures. Analysis of the models from the cold-adapted Archaea showed a strong tendency in the solvent-accessible area for more Gln, Thr, and hydrophobic residues and fewer charged residues. A cold shock domain (CSD) protein (CspA homolog) was identified in M. frigidum, two hypothetical proteins with CSD-folds in M. burtonii, and a unique winged helix DNA-binding domain protein in M. burtonii. This suggests that these types of nucleic acid binding proteins have a critical role in cold-adapted Archaea. Structural analysis of tRNA sequences from the Archaea indicated that GC content is the major factor influencing tRNA stability in hyperthermophiles, but not in the psychrophiles, mesophiles or moderate thermophiles. Below an OGT of 60°C, the GC content in tRNA was largely unchanged, indicating that any requirement for flexibility of tRNA in psychrophiles is mediated by other means. This is the first time that comparisons have been performed with genome data from Archaea spanning the growth temperature extremes from psychrophiles to hyperthermophiles
- …