6 research outputs found

    Data mining of enzymes using specific peptides

    Background: Predicting the function of a protein from its sequence is a long-standing challenge in bioinformatics, typically addressed using either sequence similarity or sequence motifs. We employ a novel motif method based on Specific Peptides (SPs), which are unique to specific branches of the Enzyme Commission (EC) functional classification. We devise the Data Mining of Enzymes (DME) methodology, which searches for SPs on arbitrary proteins and determines from a protein's sequence whether it is an enzyme and, if so, what its EC classification is.
    Results: We extract novel SP sets from Swiss-Prot enzyme data. Using a training set of July 2006 and test sets of July 2008, we find that the predictive power of SPs, both for true positives (enzymes) and true negatives (non-enzymes), depends on the coverage length of all SP matches (the number of amino acids matched on the protein sequence). DME is quite different from BLAST. Comparing the two on an enzyme test set of July 2008, we find that DME has lower recall. On the other hand, DME can provide predictions for proteins that BLAST regards as having low homology with known enzymes, thus supplying complementary information. We test our method on a set of proteins belonging to 10 bacteria, dated July 2008, establishing the usefulness of the coverage-length cutoff for determining true negatives. Moreover, sifting through our predictions, we find that some of them had been substantiated by Swiss-Prot annotations by July 2009. Finally, we extract, for production purposes, a novel SP set trained on all Swiss-Prot enzymes as of July 2009. This new set considerably increases the recall of DME. The new SP set is being applied to three metagenomes: Sargasso Sea, with over 1,000,000 proteins, producing predictions of over 220,000 enzymes, and two human gut metagenomes. The outcome of these analyses can be characterized by the enzymatic profile of the metagenomes, describing the relative numbers of enzymes observed for different EC categories.
    Conclusions: Employing SPs to predict the enzymatic activity of proteins works well once coverage-length criteria are applied. In our analysis, L ≥ 7 has led to highly accurate results.
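The coverage-length criterion described in the abstract above can be illustrated with a minimal sketch. It assumes that coverage length means the number of distinct sequence positions covered by at least one SP match, and that L in "L ≥ 7" refers to this quantity. The function names, the toy sequence, and the toy peptides are hypothetical and are not taken from the DME software.

```python
from typing import Iterable


def coverage_length(protein: str, peptides: Iterable[str]) -> int:
    """Count positions of `protein` covered by at least one peptide match."""
    covered = [False] * len(protein)
    for sp in peptides:
        if not sp:
            continue  # ignore empty strings
        start = protein.find(sp)
        while start != -1:
            # Mark every residue of this match; overlaps are counted once.
            for i in range(start, start + len(sp)):
                covered[i] = True
            start = protein.find(sp, start + 1)
    return sum(covered)


def passes_cutoff(protein: str, peptides: Iterable[str], min_coverage: int = 7) -> bool:
    """Assumed decision rule: accept a prediction when coverage >= min_coverage."""
    return coverage_length(protein, peptides) >= min_coverage


# Toy data, made up for illustration only.
toy_protein = "MKTAYIAKQRQISFVKSHFSRQ"
toy_sps = ["AYIAKQ", "KQRQIS"]
print(coverage_length(toy_protein, toy_sps))
print(passes_cutoff(toy_protein, toy_sps))
```

The two toy peptides overlap by two residues, so the overlapping positions contribute only once to the coverage length.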

    Deriving enzymatic and taxonomic signatures of metagenomes from short read data

    Background: We propose a method for deriving enzymatic signatures from short read metagenomic data of unknown species. Each short read is converted to six pseudo-peptide candidates, which we search for occurrences of Specific Peptides (SPs). SPs are peptides that are indicative of enzymatic function as defined by the Enzyme Commission (EC) nomenclature. The number of SP hits on an ensemble of short reads is counted and then converted to estimates of the numbers of enzymatic genes associated with different EC categories in the studied metagenome. The relative amounts of different EC categories define the enzymatic spectrum, without the need to perform genomic assemblies of short reads.
    Results: The method is developed and tested on 22 bacteria for which many EC annotations exist in UniProt. Enzymatic signatures are derived for 3 metagenomes, and their functional profiles are explored. We extend the SP methodology to taxon-specific SPs (TSPs), allowing us to estimate taxonomic features of metagenomic data from short reads. Using recent Swiss-Prot data, we obtain TSPs for different phyla of bacteria and different classes of proteobacteria. These allow us to analyze the major taxonomic content of 4 different metagenomic data sets.
    Conclusions: The SP methodology can be successfully extended to applications on short read genomic and metagenomic data. This leads to direct derivation of enzymatic signatures from raw short reads. Furthermore, by employing TSPs, one obtains valuable taxonomic information.
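The conversion of a short read into six pseudo-peptide candidates, as described in the abstract above, can be sketched as a six-frame translation followed by SP counting. The sketch assumes the standard genetic code and simple substring matching. The function names, the toy read, and the peptide-to-category mapping are hypothetical, and the step that converts hit counts into gene-number estimates is not reproduced here.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_peptides(read: str) -> list:
    """Three forward frames followed by three reverse-complement frames."""
    read = read.upper()
    strands = (read, reverse_complement(read))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


def count_hits_per_category(reads, sp_to_category) -> Counter:
    """Count reads' translated frames containing each SP, grouped by category label."""
    hits = Counter()
    for read in reads:
        for peptide in six_frame_peptides(read):
            for sp, category in sp_to_category.items():
                if sp in peptide:
                    hits[category] += 1
    return hits


# Toy data, made up for illustration only.
print(six_frame_peptides("ATGGCCAAATAG"))
print(count_hits_per_category(["ATGGCCAAATAG"], {"MAK": "category-A", "FGH": "category-B"}))
```

Stop codons are kept as `*` in the translated frames, so a peptide spanning a stop codon will not match.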

    Bioinfo_eXtrema: a bioinformatics approach to integrating environmental, biochemical and genomic information, focused on bioprospecting and the selection of microbial consortia with applications in bioremediation

    The identification of key functional components for various bioprocesses of industrial interest has made it possible to select isolates adapted to extreme environmental conditions in three species of fungi of the genus Penicillium. These isolates were evaluated in vitro to characterize their potential as components of a microbial consortium applicable to the bioremediation of industrial effluents containing lignocellulosic residues. Annotation of the genomic sequences available for one of the identified species points to the existence of genes highly similar to those found in various fungi regarded as references for lignin degradation in natural environments. The functional annotations proposed from accessible sequences (identified through the Fungal Oxidative Lignin Enzymes database) could be compared with experimental results for strains growing in different lignin-containing media that represent extreme industrial environments. This work proposes assembling Bioinfo_eXtrema as part of a bioinformatics approach centered on selecting extremophile consortia for applications in industrial biotechnology, combining various data mining techniques (integrated through the Waikato Environment for Knowledge Analysis) to facilitate the integration of available molecular information and functional indicators relevant to bioremediation applications.

    Peptide markers of aminoacyl tRNA synthetases facilitate taxa counting in metagenomic data

    Background: Taxa counting is a major problem in the analysis of metagenomic data. The most popular method relies on analysis of 16S rRNA sequences, but some studies also employ protein-based analyses. It would be advantageous to have a method that is applicable directly to short sequences of the kind extracted from samples in modern metagenomic research. This is achieved by the technique proposed here.
    Results: We employ specific peptides, deduced from aminoacyl tRNA synthetases, as markers for the occurrence of single genes in data. Sequences carrying these markers are aligned and compared with each other to provide a lower limit for taxa counts in metagenomic data. The method is compared with 16S rRNA searches on a set of known genomes. The taxa counting problem is analyzed mathematically and a heuristic algorithm is proposed. When applied to genomic contigs of a recent human gut microbiome study, the taxa counting method provides information on the numbers of different species and strains. We then apply our method to short read data and demonstrate how it can be calibrated to cope with errors. Comparison to known databases leads to estimates of the percentage of novelties and the type of phyla involved.
    Conclusions: A major advantage of our method is its simplicity: it relies on searching sequences for the occurrence of just 4000 specific peptides belonging to the S61 subgroup of aaRS enzymes. When compared to other methods, it provides additional insight into the taxonomic contents of metagenomic data. Furthermore, it can be applied directly to short read data, avoiding the need for genomic contig reconstruction and taking into account short reads that are otherwise discarded as singletons. Hence it is well suited to fast analysis of next generation sequencing data.

    Pharmacological and biological annotations enhance functional residues prediction

    Unpublished doctoral thesis defended at the Universidad Autónoma de Madrid, Facultad de Ciencias, Departamento de Biología Molecular. Date of defense: 15-09-201