13 research outputs found

    Extension of the COG and arCOG databases by amino acid and nucleotide sequences

    Background: The current versions of the COG and arCOG databases, both excellent frameworks for studies in comparative and functional genomics, do not contain the nucleotide sequences corresponding to their protein or protein domain entries. Results: Using sequence information obtained from GenBank flat files covering the completely sequenced genomes of the COG and arCOG databases, we constructed NUCOCOG (nucleotide sequences containing COG databases) as an extended version including all nucleotide sequences and, in addition, the amino acid sequences originally utilized to construct the current COG and arCOG databases. We make available three comprehensive single XML files containing the complete databases including all sequence information. In addition, we provide a web interface as a utility suitable to browse the NUCOCOG database for sequence retrieval. The database is accessible at http://www.uni-wh.de/nucocog. Conclusion: NUCOCOG offers the possibility to analyze any sequence-related property in the context of the COG and arCOG framework simply by using scripting languages such as Perl applied to a large but single XML document.

    Biology of archaea from a novel family Cuniculiplasmataceae (Thermoplasmata) ubiquitous in hyperacidic environments

    The order Thermoplasmatales (Euryarchaeota) is represented by the most acidophilic organisms known so far that are poorly amenable to cultivation. Earlier culture-independent studies in Iron Mountain (California) pointed at an abundant archaeal group, dubbed 'G-plasma'. We examined the genomes and physiology of two cultured representatives of the family Cuniculiplasmataceae, recently isolated from acidic (pH 1-1.5) sites in Spain and the UK, that are 16S rRNA gene sequence-identical with 'G-plasma'. The organisms had the largest genomes among Thermoplasmatales (1.87-1.94 Mbp), which shared 98.7-98.8% average nucleotide identities between themselves and 'G-plasma' and exhibited high genome conservation even within their genomic islands, despite their remote geographical localisations. Facultatively anaerobic heterotrophs, they possess an ancestral form of A-type terminal oxygen reductase from a distinct parental clade. The lack of complete pathways for biosynthesis of histidine, valine, leucine, isoleucine, lysine and proline pre-determines the reliance on external sources of amino acids and hence the lifestyle of these organisms as scavengers of proteinaceous compounds from surrounding microbial community members. In contrast to earlier metagenomics-based assumptions, the isolates were S-layer-deficient, non-motile, non-methylotrophic and devoid of iron-oxidation despite the abundance of methylotrophy substrates and ferrous iron in situ, which underlines the essentiality of experimental validation of bioinformatic predictions.

    Clusters of orthologous genes for 41 archaeal genomes and implications for evolutionary genomics of archaea

    Background: An evolutionary classification of genes from sequenced genomes that distinguishes between orthologs and paralogs is indispensable for genome annotation and evolutionary reconstruction. Shortly after multiple genome sequences of bacteria, archaea, and unicellular eukaryotes became available, an attempt at such a classification was implemented in Clusters of Orthologous Groups of proteins (COGs). Rapid accumulation of genome sequences creates opportunities for refining COGs but also represents a challenge because of error amplification. One of the practical strategies involves construction of refined COGs for phylogenetically compact subsets of genomes. Results: New Archaeal Clusters of Orthologous Genes (arCOGs) were constructed for 41 archaeal genomes (13 Crenarchaeota, 27 Euryarchaeota and one Nanoarchaeon) using an improved procedure that employs a similarity tree between smaller, group-specific clusters, semi-automatically partitions orthology domains in multidomain proteins, and uses profile searches for identification of remote orthologs. The annotation of arCOGs is a consensus between three assignments based on the COGs, the CDD database, and the annotations of homologs in the NR database. The 7538 arCOGs, on average, cover ~88% of the genes in a genome compared to a ~76% coverage in COGs. The finer granularity of ortholog identification in the arCOGs is apparent from the fact that 4538 arCOGs correspond to 2362 COGs; ~40% of the arCOGs are new. The archaeal gene core (protein-coding genes found in all 41 genomes) consists of 166 arCOGs. The arCOGs were used to reconstruct gene loss and gene gain events during archaeal evolution and gene sets of ancestral forms. The Last Archaeal Common Ancestor (LACA) is conservatively estimated to possess 996 genes compared to 1245 and 1335 genes for the last common ancestors of Crenarchaeota and Euryarchaeota, respectively. It is inferred that LACA was a chemoautotrophic hyperthermophile that, in addition to the core archaeal functions, encoded more idiosyncratic systems, e.g., the CASS systems of antivirus defense and some toxin-antitoxin systems. Conclusion: The arCOGs provide a convenient, flexible framework for functional annotation of archaeal genomes, comparative genomics and evolutionary reconstructions. Genomic reconstructions suggest that the last common ancestor of archaea might have been (nearly) as advanced as the modern archaeal hyperthermophiles. ArCOGs and related information are available at: ftp://ftp.ncbi.nih.gov/pub/koonin/arCOGs/. Reviewers: This article was reviewed by Peer Bork, Patrick Forterre, and Purificacion Lopez-Garcia.

    Metagenomic analysis of the biodiversity and seasonal variation in the meromictic Antarctic lake, Ace Lake

    Ace Lake is a stratified lake in the Vestfold Hills, Antarctica. The presence of a thick ice-cover for ~11 months of the year and a strong salinity gradient are responsible for its permanent stratification. Taxonomy analyses showed depth-based segregation of its microbial community, including viruses. Functional potential analyses of the lake taxa highlighted their roles in nutrient cycling. In this thesis, the seasonal changes in the Ace Lake microbial community were studied using a time-series of metagenomes utilizing the Cavlab metagenome analysis pipeline. Statistical analyses of taxa abundance and environmental factors revealed the effects of the polar light cycle, with 24 hours of daylight in summer and no sunlight in winter, on the phototrophs identified in the lake, indicating the importance of light-based primary production in summer to prevail through the dark winter. Analysis of viral data generated from the metagenomes showed the presence of viruses, including a ‘huge phage’, throughout the lake, with a diverse population existing in the oxic zone. Analysis of virus-host associations of phototrophic bacteria revealed that the availability of light, rather than viral predation, was probably responsible for seasonal variations in host abundances. Genomic variation in Synechococcus and Chlorobium populations, analysed using metagenome-assembled genomes (MAGs) from Ace Lake, revealed phylotypes that highlighted their adaptation to the lake environment. Synechococcus phylotypes were linked to complex interaction with viruses, whereas some Chlorobium phylotypes were inferred to interact with Synechococcus. Some Chlorobium phylotypes were also inferred to have improved photosynthetic capacity, which might contribute to the very high abundance of this species in Ace Lake. Comparative genomic analysis of Chlorobium was performed using MAGs from Ace Lake, Ellis Fjord, and Taynaya Bay and the genome of a non-Antarctic Chlorobium phaeovibrioides. 
A single Chlorobium species, distinct from the non-Antarctic species, was prevalent in the oxycline of all three stratified systems, highlighting its endemicity to the Vestfold Hills. Potential Chlorobium viruses, representing generalist viruses, were identified in aquatic systems from the Vestfold Hills and the Rauer Islands, indicating a widespread geographic distribution. Seasonal variation in the Chlorobium population appeared to be caused by reliance on sunlight rather than the impact of viral predation, and was inferred to benefit the host by restricting the ability of specialist viruses to establish effective lifecycles. The findings in this thesis highlight the seasonal influence on Ace Lake biodiversity, the adaptations and potential interactions of the two key species Synechococcus and Chlorobium, and the endemicity of Ace Lake Chlorobium to the Vestfold Hills

    Application of Subspace Clustering in DNA Sequence Analysis

    Identification and clustering of orthologous genes play an important role in developing evolutionary models, such as validating convergent and divergent phylogeny and predicting functional proteins in newly sequenced species with unverified nucleotide-protein mappings. Here, we introduce an application of subspace clustering to orthologous gene sequences and discuss the initial results. The working hypothesis is that genetic changes between nucleotide sequences coding for proteins among selected species and groups may lie within a union of subspaces for clusters of the orthologous groups. Estimates for the subspace dimensions were computed for a small population sample. A series of experiments was performed to cluster randomly selected sequences. The experimental design allows for both false positives and false negatives, and estimates for the statistical significance are provided. The clustering results are consistent with the main hypothesis. A simple random mutation binary tree model is used to simulate speciation events that show the interdependence of the subspace rank versus time and mutation rates. The simple mutation model is found to be largely consistent with the observed subspace clustering singular value results. Our study indicates that the subspace clustering method may be applied in orthology analysis.

    ComSin: database of protein structures in bound (complex) and unbound (single) states in relation to their intrinsic disorder

    Most of the proteins in a cell assemble into complexes to carry out their function. In this work, we have created a new database (named ComSin) of protein structures in bound (complex) and unbound (single) states to provide researchers with exhaustive information on structures of the same or homologous proteins in bound and unbound states. From the complete Protein Data Bank (PDB), we selected 24 910 pairs of protein structures in bound and unbound states, and identified regions of intrinsic disorder. For 2448 pairs, the proteins in bound and unbound states are identical, while 7129 pairs have sequence identity of 90% or higher. The developed server enables one to search for proteins in bound and unbound states with several options, including sequence similarity between the corresponding proteins in bound and unbound states, and validation of interaction interfaces of protein complexes. In addition, through our web server, one can obtain the information necessary for studying disorder-to-order and order-to-disorder transitions upon complex formation, and analyze structural differences between proteins in bound and unbound states. The database is available at http://antares.protres.ru/comsin/

    Genetics of Halophilic Microorganisms

    Halophilic microorganisms are found in all domains of life and thrive in hypersaline (high salt content) environments. These unusual microbes have been a subject of study for many years due to their interesting properties and physiology. Studies of the genetics of halophilic microorganisms (from gene expression and regulation to genomics) have provided insight into the mechanisms by which life can exist at high salinity levels. Here, we highlight recent studies that advance the knowledge of biological function through examination of the genetics of halophilic microorganisms and their viruses.

    DNA replication in growth conditions that mimic the natural habitat of Haloferax volcanii

    The initial aim of the project was to assess origin-independent replication in Haloferax volcanii (Hfx. volcanii). DNA replication is initiated at specific sites on the chromosome called origins. Origins are assumed to be an essential feature of all cells, because they serve as binding sites for proteins that recruit the DNA replication machinery. In work published by Hawkins et al. (2013), it was demonstrated that mutants of Hfx. volcanii lacking all replication origins are viable; in fact, they grow faster than the wild-type and have no obvious cellular defects. By contrast, deletion of origins from Eukaryotes and Bacteria leads to cell death or profound growth defects. The question addressed in this project was whether the accelerated growth of Hfx. volcanii cells in the absence of replication origins is due to an artefact created by rich laboratory media conditions. This may explain why replication origins have not been eliminated by natural selection, as in the natural habitat of Hfx. volcanii, the wild-type strain would have an evolutionary advantage. To test this, a growth competition assay was modified to use fluorescent proteins and flow cytometry. It was predicted that in low nutrient media, the growth advantage of origin-deleted mutants would be minimised or eliminated, as these phenotypes are not witnessed in a natural environment. However, due to the outbreak of the COVID-19 pandemic, the project was altered to examine which factors are required for an organism to replicate without origins. A bioinformatic approach was chosen, adapting previously created tools to better fit a large data set and to predict the ability of 85 species to survive without origins. 
The bioinformatic pipeline involved a principal component analysis, which would take into account for any given species their respective nucleotide skew indices, spectral ratios, information gene linkage, co-orientation of core genes with DNA replication, and types of DNA polymerase genes located near origins. The results suggested several new candidate species for further experimentation and potential directions for improvement of the origin independent replication prediction tool