93 research outputs found
Microbial co-habitation and lateral gene transfer: what transposases can tell us
Interactions between microbial communities are revealed using a network of lateral gene transfer events
Gene Context Analysis in the Integrated Microbial Genomes (IMG) Data Management System
Computational methods for determining the function of genes in newly sequenced genomes have traditionally been based on sequence similarity to genes whose function has been identified experimentally. Function prediction methods can be extended using gene context analysis approaches such as examining the conservation of chromosomal gene clusters, gene fusion events and co-occurrence profiles across genomes. Context analysis is based on the observation that functionally related genes often share a similar gene context, and it relies on the identification of such events across a phylogenetically diverse collection of genomes. We have used the data management system of the Integrated Microbial Genomes (IMG) as the framework to implement and explore the power of gene context analysis methods because it provides one of the largest available genome integrations. Visualization and search tools to facilitate gene context analysis have been developed and applied across all publicly available archaeal and bacterial genomes in IMG. These computations are now maintained as part of IMG's regular genome content update cycle. IMG is available at: http://img.jgi.doe.gov
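As a rough illustration of the co-occurrence profiles mentioned in this abstract, the sketch below scores two genes by the overlap of the genomes in which each occurs. It is not IMG's implementation: the function name `profile_similarity`, the genome identifiers, and the choice of the Jaccard index are all assumptions made for the example.

```python
def profile_similarity(genomes_with_gene_a, genomes_with_gene_b):
    """Jaccard index of two presence/absence profiles.

    Each argument is an iterable of genome identifiers in which the gene
    (or its ortholog) was found. Hypothetical helper, not an IMG API.
    """
    a = set(genomes_with_gene_a)
    b = set(genomes_with_gene_b)
    union = a | b
    if not union:
        # Neither gene was observed anywhere; treat as no evidence.
        return 0.0
    return len(a & b) / len(union)


# Illustrative profiles over four made-up genome identifiers.
gene_x = {"genome1", "genome2", "genome3"}
gene_y = {"genome2", "genome3", "genome4"}
score = profile_similarity(gene_x, gene_y)
```

A high score only suggests that two genes tend to be gained and lost together; on its own it says nothing about the kind of functional relationship.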
Systematic Association of Genes to Phenotypes by Genome and Literature Mining
One of the major challenges of functional genomics is to unravel the connection between genotype and phenotype. So far no global analysis has attempted to explore those connections in the light of the large phenotypic variability seen in nature. Here, we use an unsupervised, systematic approach for associating genes and phenotypic characteristics that combines literature mining with comparative genome analysis. We first mine the MEDLINE literature database for terms that reflect phenotypic similarities of species. Subsequently we predict the likely genomic determinants: genes specifically present in the respective genomes. In a global analysis involving 92 prokaryotic genomes we retrieve 323 clusters containing a total of 2,700 significant gene–phenotype associations. Some clusters contain mostly known relationships, such as genes involved in motility or plant degradation, often with additional hypothetical proteins associated with those phenotypes. Other clusters comprise unexpected associations; for example, a group of terms related to food and spoilage is linked to genes predicted to be involved in bacterial food poisoning. Among the clusters, we observe an enrichment of pathogenicity-related associations, suggesting that the approach reveals many novel genes likely to play a role in infectious diseases
STRING: known and predicted protein–protein associations, integrated and transferred across organisms
A full description of a protein's function requires knowledge of all partner proteins with which it specifically associates. From a functional perspective, ‘association’ can mean direct physical binding, but can also mean indirect interaction such as participation in the same metabolic pathway or cellular process. Currently, information about protein association is scattered over a wide variety of resources and model organisms. STRING aims to simplify access to this information by providing a comprehensive, yet quality-controlled collection of protein–protein associations for a large number of organisms. The associations are derived from high-throughput experimental data, from the mining of databases and literature, and from predictions based on genomic context analysis. STRING integrates and ranks these associations by benchmarking them against a common reference set, and presents evidence in a consistent and intuitive web interface. Importantly, the associations are extended beyond the organism in which they were originally described, by automatic transfer to orthologous protein pairs in other organisms, where applicable. STRING currently holds 730 000 proteins in 180 fully sequenced organisms, and is available at http://string.embl.de/
Identification of tightly regulated groups of genes during Drosophila melanogaster embryogenesis
Time-series analysis of whole-genome expression data during Drosophila melanogaster development indicates that up to 86% of its genes change their relative transcript level during embryogenesis. By applying conservative filtering criteria and requiring ‘sharp’ transcript changes, we identified 1534 maternal genes, 792 transient zygotic genes, and 1053 genes whose transcript levels increase during embryogenesis. Each of these three categories is dominated by groups of genes where all transcript levels increase and/or decrease at similar times, suggesting a common mode of regulation. For example, 34% of the transiently expressed genes fall into three groups, with increased transcript levels between 2.5–12, 11–20, and 15–20 h of development, respectively. We highlight common and distinctive functional features of these expression groups and identify a coupling between downregulation of transcript levels and targeted protein degradation. By mapping the groups to the protein network, we also predict and experimentally confirm new functional associations.
The Complete Multipartite Genome Sequence of Cupriavidus necator JMP134, a Versatile Pollutant Degrader
BACKGROUND: Cupriavidus necator JMP134 is a Gram-negative beta-proteobacterium able to grow on a variety of aromatic and chloroaromatic compounds as its sole carbon and energy source. METHODOLOGY/PRINCIPAL FINDINGS: Its genome consists of four replicons (two chromosomes and two plasmids) containing a total of 6631 protein coding genes. Comparative analysis identified 1910 core genes common to the four genomes compared (C. necator JMP134, C. necator H16, C. metallidurans CH34, R. solanacearum GMI1000). Although secondary chromosomes found in the Cupriavidus, Ralstonia, and Burkholderia lineages are all derived from plasmids, analyses of the plasmid partition proteins located on those chromosomes indicate that different plasmids gave rise to the secondary chromosomes in each lineage. The C. necator JMP134 genome contains 300 genes putatively involved in the catabolism of aromatic compounds and encodes most of the central ring-cleavage pathways. This strain also shows additional metabolic capabilities towards alicyclic compounds and the potential for catabolism of almost all proteinogenic amino acids. This remarkable catabolic potential seems to be sustained by a high degree of genetic redundancy, most probably providing this catabolically versatile bacterium with the different levels of metabolic response and the alternative regulation needed to cope with a challenging environment. The comparison of Cupriavidus genomes indicates that broad metabolic capability is a general trait of the genus Cupriavidus; however, specialization towards a particular nutritional niche (xenobiotics degradation, chemolithoautotrophy or symbiotic nitrogen fixation) seems to be shaped mostly by the acquisition of "specialized" plasmids. CONCLUSIONS/SIGNIFICANCE: The availability of the complete genome sequence for C. necator JMP134 provides the groundwork for further elucidation of the mechanisms and regulation of chloroaromatic compound biodegradation.
Integration of phenotypic metadata and protein similarity in Archaea using a spectral bipartitioning approach
In order to simplify and meaningfully categorize large sets of protein sequence data, it is commonplace to cluster proteins based on the similarity of those sequences. However, it quickly becomes clear that the sequence flexibility allowed a given protein varies significantly among different protein families. The degree to which sequences are conserved not only differs for each protein family, but also is affected by the phylogenetic divergence of the source organisms. Clustering techniques that use similarity thresholds for protein families do not always allow for these variations and thus cannot be confidently used for applications such as automated annotation and phylogenetic profiling. In this work, we applied a spectral bipartitioning technique to all proteins from 53 archaeal genomes. Comparisons between different taxonomic levels allowed us to study the effects of phylogenetic distances on cluster structure. Likewise, by associating functional annotations and phenotypic metadata with each protein, we could compare our protein similarity clusters with both protein function and associated phenotype. Our clusters can be analyzed graphically and interactively online
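The abstract does not spell out the spectral bipartitioning procedure, so the following is only a minimal sketch of one textbook variant: split a similarity graph by the sign of the Fiedler vector (the eigenvector for the second-smallest eigenvalue of the unnormalized graph Laplacian). The function name, the toy similarity values, and the choice of the unnormalized Laplacian are assumptions, not the authors' method.

```python
import numpy as np


def spectral_bipartition(similarity):
    """Split items into two groups using the sign of the Fiedler vector.

    `similarity` is a square matrix of non-negative pairwise similarities
    (for example, normalized alignment scores). Assumes the similarity
    graph is connected. Hypothetical helper written for illustration.
    """
    w = np.asarray(similarity, dtype=float)
    w = (w + w.T) / 2.0          # enforce symmetry
    np.fill_diagonal(w, 0.0)     # ignore self-similarity
    laplacian = np.diag(w.sum(axis=1)) - w
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    _, eigenvectors = np.linalg.eigh(laplacian)
    fiedler = eigenvectors[:, 1]
    return fiedler >= 0          # boolean group label per item


# Toy data: proteins 0-2 resemble each other, as do proteins 3-5.
toy = np.full((6, 6), 0.05)
toy[:3, :3] = 0.9
toy[3:, 3:] = 0.9
labels = spectral_bipartition(toy)
```

Applying the split recursively to each resulting group yields a hierarchy of clusters without fixing a single similarity threshold in advance, which is the limitation of threshold-based clustering that the abstract points to.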
Estimating DNA coverage and abundance in metagenomes using a gamma approximation
Motivation: Shotgun sequencing generates large numbers of short DNA reads from either an isolated organism or, in the case of metagenomics projects, from the aggregate genome of a microbial community. These reads are then assembled based on overlapping sequences into larger, contiguous sequences (contigs). The feasibility of assembly and the coverage achieved (reads per nucleotide or distinct sequence of nucleotides) depend on several factors: the number of reads sequenced, the read length and the relative abundances of their source genomes in the microbial community. A low coverage suggests that most of the genomic DNA in the sample has not been sequenced, but it is often difficult to estimate either the extent of the uncaptured diversity or the amount of additional sequencing that would be most efficacious. In this work, we regard a metagenome as a population of DNA fragments (bins), each of which may be covered by one or more reads. We employ a gamma distribution to model this bin population due to its flexibility and ease of use. When a gamma approximation can be found that adequately fits the data, we may estimate the number of bins that were not sequenced and that could potentially be revealed by additional sequencing. We evaluated the performance of this model using simulated metagenomes and demonstrate its applicability on three recent metagenomic datasets
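To make the idea of estimating unsequenced bins concrete, here is a small sketch under assumptions that go beyond the abstract: reads per bin are taken to be Poisson distributed with a bin-specific rate, and those rates are taken to follow a gamma distribution whose shape and scale have already been fitted. Under that model, the probability that a bin receives no reads is (1 + scale)^(-shape). The function names and parameter values are hypothetical, and the paper's actual fitting procedure may differ.

```python
def zero_read_probability(shape, scale):
    """P(bin receives zero reads) when the per-bin read rate is
    Gamma(shape, scale) and reads are Poisson given the rate.

    This is the zero term of the resulting negative binomial distribution.
    """
    if shape <= 0 or scale <= 0:
        raise ValueError("shape and scale must be positive")
    return (1.0 + scale) ** (-shape)


def estimate_unseen_bins(observed_bins, shape, scale):
    """Estimate how many bins exist but were hit by no read.

    Observed bins make up a fraction (1 - p0) of all bins, so the unseen
    count is observed * p0 / (1 - p0).
    """
    p0 = zero_read_probability(shape, scale)
    return observed_bins * p0 / (1.0 - p0)


# Illustrative parameters only; real values would come from fitting data.
unseen = estimate_unseen_bins(observed_bins=1000, shape=1.0, scale=1.0)
```

The estimate is only as good as the fit: as the abstract notes, it applies when a gamma approximation that adequately fits the data can be found.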
Structural Alterations from Multiple Displacement Amplification of a Human Genome Revealed by Mate-Pair Sequencing
Comprehensive identification of the acquired mutations that cause common cancers will require genomic analyses of large sets of tumor samples. Typically, the tissue material available from tumor specimens is limited, which creates a demand for accurate template amplification. We therefore evaluated whether phi29-mediated whole genome amplification introduces false positive structural mutations by massive mate-pair sequencing of a normal human genome before and after such amplification. Multiple displacement amplification led to a decrease in clone coverage and an increase by two orders of magnitude in the prevalence of inversions, but did not increase the prevalence of translocations. While multiple strand displacement amplification may find uses in translocation analyses, it is likely that alternative amplification strategies need to be developed to meet the demands of cancer genomics