219 research outputs found

    GASdb: a large-scale and comparative exploration database of glycosyl hydrolysis systems

    Background: The genomes of numerous cellulolytic organisms have recently been sequenced or are in the pipeline for sequencing. Systematic analyses of these genomes, together with recently sequenced metagenomes, could lead to the discovery of novel biomass-degradation systems in nature.
    Description: By scanning all proteins in the UniProt Knowledgebase and the JGI Metagenome database, we identified 4,679 and 49,099 free-acting glycosyl hydrolases with and without carbohydrate-binding domains, respectively. Cellulosome components were observed only in bacterial genomes, and 166 cellulosome-dependent glycosyl hydrolases were identified. Our analysis revealed unexpectedly wide distributions of two less well-studied bacterial glycosyl hydrolysis systems: one in which glycosyl hydrolases may bind to the cell surface directly rather than through surface-anchoring proteins, and one in which cellulosome complexes may bind to the cell surface by novel mechanisms other than the SLH domains otherwise used. In addition, we found that animal-gut metagenomes are substantially enriched with novel glycosyl hydrolases.
    Conclusions: The biomass-degradation systems identified through our large-scale search are organized into an easy-to-use database, GASdb, at http://csbl.bmb.uga.edu/~ffzhou/GASdb/, which should be useful to both experimental and computational biofuel researchers.

    Barcodes for genomes and applications

    Background: Each genome has a stable distribution of the combined frequency of each k-mer and its reverse complement, measured in sequence fragments as short as 1000 bp across the whole genome, for 1 < k < 6. The collection of these k-mer frequency distributions is unique to each genome and is termed the genome's barcode.
    Results: We found that, for each genome, the majority of its short sequence fragments have highly similar barcodes, while fragments with different barcodes typically correspond to genes that are horizontally transferred or highly expressed. This observation has led to new and more effective ways of addressing two challenging problems: metagenome binning and the identification of horizontally transferred genes. Our barcode-based metagenome binning algorithm substantially improves on the state of the art in both binning accuracy and scope of applicability. Other attractive properties of genome barcodes are that (a) barcodes have different and identifiable characteristics for different classes of genomes, such as prokaryotes, eukaryotes, mitochondria and plastids, and (b) barcode similarities are generally proportional to the genomes' phylogenetic closeness.
    Conclusion: These and other properties of genome barcodes make them a new and effective tool for studying numerous genome and metagenome analysis problems.
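
The barcode definition above can be illustrated with a minimal sketch. It assumes uppercase DNA input, non-overlapping 1000 bp fragments, and a "canonical" key (the lexicographically smaller of a k-mer and its reverse complement) for the combined frequency. The function names are hypothetical and are not taken from the authors' software.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of an uppercase DNA string."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Give a k-mer and its reverse complement one shared key."""
    return min(kmer, reverse_complement(kmer))


def fragment_profile(fragment, k):
    """Combined frequency of each k-mer and its reverse complement in one fragment."""
    counts = Counter()
    for i in range(len(fragment) - k + 1):
        kmer = fragment[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip windows containing ambiguous bases
            counts[canonical(kmer)] += 1
    total = sum(counts.values())
    keys = sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})
    return {key: (counts[key] / total if total else 0.0) for key in keys}


def barcode(genome, k, fragment_len=1000):
    """One profile per non-overlapping fragment; together the rows form the barcode."""
    return [fragment_profile(genome[start:start + fragment_len], k)
            for start in range(0, len(genome) - fragment_len + 1, fragment_len)]
```

For example, `barcode(genome, 4)` returns one frequency profile per 1000 bp fragment. Fragments whose profiles differ strongly from the rest would be the candidates that the abstract associates with horizontally transferred or highly expressed genes.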

    Insertion Sequences show diverse recent activities in Cyanobacteria and Archaea

    Background: Mobile genetic elements (MGEs) play an essential role in genome rearrangement and evolution and are widely used as an important genetic tool.
    Results: In this article, we present genetic maps of recently active Insertion Sequence (IS) elements, the simplest form of MGEs, for all sequenced cyanobacteria and archaea, predicted from ~1,500 previously identified IS elements. Our predicted IS maps are consistent with the NCBI annotations of the IS elements. By linking the predicted IS elements to various characteristics of the organisms under study and their living conditions, we found that (a) the activities of IS elements depend heavily on the environments in which the host organisms live; (b) the number of recently active IS elements in a genome tends to increase with genome size; (c) the flanking regions of recently active IS elements are significantly enriched with genes encoding DNA-binding factors, transporters and enzymes; and (d) IS movements show no tendency to disrupt operonic structures.
    Conclusion: These are the first genome-scale maps of IS elements with detailed structural information at the sequence level. These genetic maps of recently active IS elements, together with the observations above, should help improve our understanding of how IS elements proliferate and how they are involved in the evolution of their host genomes.

    SUMOsp: a web server for sumoylation site prediction

    Systematic dissection of the sumoylation proteome is emerging as an appealing but challenging research topic because of the significant roles sumoylation plays in cellular dynamics and plasticity. Although several proteome-scale analyses have been performed to delineate potential sumoylatable proteins, the bona fide sumoylation sites remain to be identified. Previously, we carried out a genome-wide analysis of SUMO substrates in the human nucleus using the putative motif ψ-K-X-E and evolutionary conservation. However, a highly specific predictor for in silico prediction of sumoylation sites in any individual organism is still urgently needed to guide experimental design. In this work, we present a computational system, SUMOsp (SUMOylation Sites Prediction), based on a manually curated dataset and integrating the results of two methods, GPS and MotifX, which were originally designed for phosphorylation site prediction. SUMOsp offers prediction performance at least as good as that of the only available method, SUMOplot, on a very large test set. We expect that the prediction results of SUMOsp, combined with experimental verification, will propel our understanding of sumoylation mechanisms to a new level. SUMOsp has been implemented on a freely accessible web server at:

    Age Is Important for the Early-Stage Detection of Breast Cancer on Both Transcriptomic and Methylomic Biomarkers

    Patients of different ages have different rates of cell development and metabolism. As a result, age should be an essential part of how a disease diagnosis model is trained and optimized. Unfortunately, most existing studies have not taken age into account. This study demonstrated that disease diagnosis models could be improved merely by applying individual models to patients of different age groups. Both transcriptomes and methylomes of the TCGA breast cancer dataset (TCGA-BRCA) were used in the feature selection and classification analyses. Our experimental data strongly suggested that disease diagnosis modeling should integrate patient age into the whole experimental design.
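
The idea of training one model per age group, instead of a single pooled model, can be sketched as follows. Everything specific here is an assumption made for illustration: the age cutoffs (40 and 60), the single numeric feature, and the toy midpoint-threshold classifier. The study's actual feature selection and classifiers are not described in this abstract.

```python
from collections import defaultdict
from statistics import mean


def age_group(age, cutoffs=(40, 60)):
    """Map an age to a group index; the cutoffs are placeholder values."""
    return sum(age >= c for c in cutoffs)


def fit_threshold(samples):
    """Toy one-feature classifier: midpoint between the two class means.

    samples: list of (feature_value, label) with label 1 = tumor, 0 = normal.
    Each group must contain at least one sample of each class.
    """
    tumor = [x for x, label in samples if label == 1]
    normal = [x for x, label in samples if label == 0]
    tumor_mean, normal_mean = mean(tumor), mean(normal)
    return (tumor_mean + normal_mean) / 2, tumor_mean > normal_mean


def fit_by_age(records):
    """records: iterable of (age, feature_value, label); one model per age group."""
    groups = defaultdict(list)
    for age, x, label in records:
        groups[age_group(age)].append((x, label))
    return {group: fit_threshold(samples) for group, samples in groups.items()}


def predict(models, age, x):
    """Classify a sample with the model trained on its own age group."""
    threshold, tumor_above = models[age_group(age)]
    return int((x > threshold) == tumor_above)
```

With made-up records such as `[(30, 1.0, 0), (35, 3.0, 1), (70, 5.0, 0), (75, 9.0, 1)]`, the two age groups learn different thresholds (2.0 and 7.0), so a feature value of 2.5 is called tumor for a 32-year-old, whereas a single pooled threshold of 4.5 would call it normal.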

    GPS: a comprehensive www server for phosphorylation sites prediction

    Protein phosphorylation plays a fundamental role in most cellular regulatory pathways. Experimental identification of the substrates of protein kinases (PKs), together with their phosphorylation sites, is labor-intensive and often limited by the availability and optimization of enzymatic reactions. Recently, large-scale analysis of the phosphoproteome by mass spectrometry (MS) has become a popular approach, but it is still difficult to distinguish kinase-specific sites on the substrates experimentally. In this regard, in silico prediction of phosphorylation sites and their specific kinases from protein primary sequences may provide guidelines for further experimental consideration and for the interpretation of MS phosphoproteomic data. A variety of such tools exist on the Internet and provide predictions for at most 30 PK subfamilies. We downloaded verified phosphorylation sites from public databases and curated the literature extensively for recently found phosphorylation sites. Under the hypothesis that PKs in the same subfamily share similar consensus sequences, motifs or functional patterns on substrates, we clustered the 216 unique PKs into 71 PK groups according to BLAST results and protein annotations. We then applied the group-based phosphorylation scoring (GPS) method to the data set. Here, we present a comprehensive PK-specific prediction server, GPS, which can predict kinase-specific phosphorylation sites from protein primary sequences for 71 different PK groups. GPS has been implemented in PHP and is available on a www server at

    cBar: a computer program to distinguish plasmid-derived from chromosome-derived sequence fragments in metagenomics data

    Summary: A huge amount of metagenomic sequence data has been produced as a result of rapidly increasing efforts worldwide to study microbial communities as a whole. Most, if not all, sequenced metagenomes are complex mixtures of chromosomal and plasmid sequence fragments from multiple organisms, possibly from different kingdoms. Computational methods for predicting genomic elements such as genes differ significantly between chromosomes and plasmids, which raises the need to separate chromosomal from plasmid sequences in a metagenome. We present a program that classifies the sequences of a metagenome as chromosomal or plasmid, based on their distinguishing pentamer frequencies. On a large training set consisting of all sequenced prokaryotic chromosomes and plasmids, the program achieves ∼92% classification accuracy. On a large set of simulated metagenomes with sequence lengths ranging from 300 bp to 100 kbp, its classification accuracy ranges from 64.45% to 88.75%. On a large independent test set, the program achieves 88.29% classification accuracy.
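
As a rough illustration of classifying fragments by pentamer frequencies, the sketch below builds a 1024-dimensional pentamer frequency vector for each sequence and assigns it to the nearer of two class centroids. The nearest-centroid rule is a simple stand-in chosen for this example; the abstract does not say which classifier cBar uses, and all function names here are hypothetical.

```python
from itertools import product
from math import dist

PENTAMERS = ["".join(p) for p in product("ACGT", repeat=5)]
INDEX = {pentamer: i for i, pentamer in enumerate(PENTAMERS)}


def pentamer_vector(seq):
    """Relative frequency of each of the 1024 pentamers in an uppercase sequence."""
    counts = [0] * len(PENTAMERS)
    for i in range(len(seq) - 4):
        j = INDEX.get(seq[i:i + 5])  # None for windows with ambiguous bases
        if j is not None:
            counts[j] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]


def centroid(vectors):
    """Component-wise mean of equally long vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def train(chromosomal, plasmid):
    """Build one centroid per class from labelled training sequences."""
    return {
        "chromosome": centroid([pentamer_vector(s) for s in chromosomal]),
        "plasmid": centroid([pentamer_vector(s) for s in plasmid]),
    }


def classify(model, seq):
    """Label a fragment by its nearest class centroid (Euclidean distance)."""
    vector = pentamer_vector(seq)
    return min(model, key=lambda label: dist(model[label], vector))
```

Short fragments yield sparse, noisy frequency vectors, which is consistent with the lower accuracy the abstract reports at the 300 bp end of the simulated length range.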