22 research outputs found

    Disordered Patterns in Clustered Protein Data Bank and in Eukaryotic and Bacterial Proteomes

    We have constructed the clustered Protein Data Bank and obtained clusters of chains of different identity inside each cluster, http://bioinfo.protres.ru/st_pdb/. Using simple selection rules, we have compiled the largest database of disordered patterns (141) from the clustered PDB in which identity between chains inside a cluster is greater than or equal to 75% (version of 28 June 2010). The results of these analyses should further our understanding of the physicochemical and structural determinants of intrinsically disordered regions that serve as molecular recognition elements. We have analyzed the occurrence of the selected patterns in 97 eukaryotic and 26 bacterial proteomes. The disordered patterns appear more often in eukaryotic than in bacterial proteomes. We have calculated the matrix of correlation coefficients between the numbers of proteins in which a disordered pattern from the library of 141 patterns appears at least once, for 9 kingdoms of eukaryota and 5 phyla of bacteria. As a rule, the correlation coefficients are higher within a kingdom than between kingdoms. The patterns that occur frequently in proteomes have low complexity (PPPPP, GGGGG, EEEED, HHHH, KKKKK, SSTSS, QQQQQP), and the types of patterns vary across proteomes, http://bioinfo.protres.ru/fp/search_new_pattern.html
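
The counting step described in this abstract (the number of proteins in which a pattern appears at least once, followed by a correlation between count vectors) can be sketched as below. This is only an illustrative sketch: the toy proteome, the function names, and the use of a plain Pearson coefficient are assumptions, not the authors' code.

```python
from math import sqrt


def proteins_with_pattern(proteome, patterns):
    """For each pattern, count the proteins in which it occurs at least once."""
    return {
        pattern: sum(1 for seq in proteome.values() if pattern in seq)
        for pattern in patterns
    }


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy proteome (identifiers and sequences are made up).
toy_proteome = {
    "protA": "MKKKKKAGGGGGL",
    "protB": "MSPPPPPTHHHHQ",
    "protC": "MAGGGGGPPPPPK",
}
toy_patterns = ["PPPPP", "GGGGG", "KKKKK", "HHHH"]
counts = proteins_with_pattern(toy_proteome, toy_patterns)
```

Calling `pearson` on the count vectors of two proteomes (one value per pattern, in the same pattern order) gives one entry of a correlation matrix like the one the abstract mentions.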

    NNAlign: A Web-Based Prediction Method Allowing Non-Expert End-User Discovery of Sequence Motifs in Quantitative Peptide Data

    Recent advances in high-throughput technologies have made it possible to generate both gene and protein sequence data at an unprecedented rate and scale, thereby enabling entirely new “omics”-based approaches to the analysis of complex biological processes. However, the amount and complexity of data that even a single experiment can produce seriously challenge researchers with limited bioinformatics expertise, who need to handle, analyze and interpret the data before it can be understood in a biological context. Thus, there is an unmet need for tools allowing non-bioinformatics users to interpret large data sets. We have recently developed a method, NNAlign, which is generally applicable to any biological problem where quantitative peptide data is available. This method efficiently identifies underlying sequence patterns by simultaneously aligning peptide sequences and identifying motifs associated with quantitative readouts. Here, we provide a web-based implementation of NNAlign allowing non-expert end-users to submit their data (optionally adjusting method parameters), and in return receive a trained method (including a visual representation of the identified motif) that can subsequently be used as a prediction method and applied to unknown proteins/peptides. We have successfully applied this method to several different data sets, including peptide microarray-derived sets containing more than 100,000 data points.

    The impact of focused Gene Ontology curation of specific mammalian systems.

    The Gene Ontology (GO) resource provides dynamic controlled vocabularies that aid in the consistent description of the functional attributes and subcellular locations of gene products from all taxonomic groups (www.geneontology.org). System-focused projects, such as the Renal and Cardiovascular GO Annotation Initiatives, aim to provide detailed GO data for proteins implicated in specific organ development and function. Such projects support the rapid evaluation of new experimental data and aid in the generation of novel biological insights to help alleviate human disease. This paper describes the improvement of GO data for renal and cardiovascular research communities and demonstrates that the cardiovascular-focused GO annotations, created over the past three years, have led to an evident improvement of microarray interpretation. The reanalysis of cardiovascular microarray datasets confirms the need to continue to improve the annotation of the human proteome. Availability: GO annotation data is freely available from ftp://ftp.geneontology.org/pub/go/gene-associations

    vProtein: Identifying Optimal Amino Acid Complements from Plant-Based Foods

    Background: Indispensable amino acids (IAAs) are used by the body in different proportions. Most animal-based foods provide these IAAs in roughly the needed proportions, but many plant-based foods provide different proportions of IAAs. To explore how these plant-based foods can be better used in human nutrition, we have created the computational tool vProtein to identify optimal food complements to satisfy human protein needs. Methods: vProtein uses 1251 plant-based foods listed in the United States Department of Agriculture standard release 22 database to determine the quantity of each food or pair of foods required to satisfy human IAA needs as determined by the 2005 daily recommended intake. The quantity of food in a pair is found using a linear programming approach that minimizes total calories, total excess IAAs, or the total weight of the combination. Results: For single foods, vProtein identifies foods with particularly balanced IAA patterns such as wheat germ, quinoa, and cauliflower. vProtein also identifies foods with particularly unbalanced IAA patterns such as macadamia nuts, degermed corn products, and wakame seaweed. Although less useful alone, some unbalanced foods provide unusually good complements, such as Brazil nuts to legumes. Interestingly, vProtein finds no statistically significant bias toward grain/legume pairings for protein complementation. These analyses suggest that pairings of plant-based foods should be based on the individual foods themselves rather than on broader food group-food group pairings. Overall, the most efficien
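
For the single-food case described in this abstract, the quantity needed is set by whichever IAA the food supplies least generously relative to the requirement. The sketch below illustrates only that single-food calculation; it does not reproduce the linear program used for pairs. The function name and all numbers are hypothetical and are not USDA or recommended-intake values.

```python
def grams_to_meet_needs(content_per_100g, daily_need):
    """Return (grams of food needed, limiting amino acid) for one food.

    content_per_100g: grams of each IAA per 100 g of food.
    daily_need: grams of each IAA required per day.
    """
    grams_by_iaa = {}
    for iaa, need in daily_need.items():
        content = content_per_100g.get(iaa, 0.0)
        if content <= 0:
            # The food supplies none of this IAA, so no amount is enough.
            return float("inf"), iaa
        grams_by_iaa[iaa] = 100.0 * need / content
    # The IAA demanding the most food determines the required quantity.
    limiting = max(grams_by_iaa, key=grams_by_iaa.get)
    return grams_by_iaa[limiting], limiting


# Hypothetical values, for illustration only.
need = {"lysine": 2.0, "methionine": 1.0}   # g per day
food = {"lysine": 0.5, "methionine": 0.4}   # g per 100 g of food
grams, limiting = grams_to_meet_needs(food, need)
```

With these made-up numbers the food must be eaten in the amount dictated by lysine, so methionine is supplied in excess; a complementary food would be one that is rich in the limiting amino acid.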

    VennPlex--a novel Venn diagram program for comparing and visualizing datasets with differentially regulated datapoints.

    With the development of increasingly large and complex genomic and proteomic data sets, an enhancement in the complexity of available Venn diagram analytical programs is becoming increasingly important. Current freely available Venn diagram programs often fail to represent extra complexity among datasets, such as regulation pattern differences between different groups. Here we describe the development of VennPlex, a program that illustrates the often diverse numerical interactions among multiple, high-complexity datasets, using up to four data sets. VennPlex includes versatile output features, where grouped data points in specific regions can be easily exported into a spreadsheet. This program facilitates the analysis of two to four gene sets and their corresponding expression values in a user-friendly manner. To demonstrate its experimental utility, we applied VennPlex to a complex paradigm, i.e. a comparison of the effect of multiple oxygen tension environments (1–20% ambient oxygen) upon gene transcription of primary rat astrocytes. VennPlex reliably dissects complex data sets into easily identifiable groups for straightforward analysis and data output. This program, which is an improvement over currently available Venn diagram programs, is able to rapidly extract important datasets that represent the variety of expression patterns available within the data sets, showing potential applications in fields like genomics, proteomics, and bioinformatics.
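
The idea of a Venn region that also records regulation direction can be illustrated for two datasets as follows. This is a sketch under stated assumptions, not VennPlex itself: each dataset is assumed to map a gene name to a nonzero signed expression change (positive meaning up-regulated), and the region names are chosen here for illustration.

```python
def split_overlap(set_a, set_b):
    """Partition genes from two datasets by membership and regulation direction.

    set_a, set_b: dicts mapping gene name -> signed, nonzero expression change.
    """
    shared = set(set_a) & set(set_b)
    regions = {
        "only_a": set(set_a) - shared,
        "only_b": set(set_b) - shared,
        "up_in_both": set(),
        "down_in_both": set(),
        "contra_regulated": set(),
    }
    for gene in shared:
        a_up, b_up = set_a[gene] > 0, set_b[gene] > 0
        if a_up and b_up:
            regions["up_in_both"].add(gene)
        elif not a_up and not b_up:
            regions["down_in_both"].add(gene)
        else:
            # Shared gene that changes in opposite directions.
            regions["contra_regulated"].add(gene)
    return regions


# Hypothetical gene names and values.
low_oxygen = {"g1": 1.8, "g2": -2.1, "g3": 0.9, "g4": -1.2}
high_oxygen = {"g2": -1.5, "g3": -0.7, "g4": -2.0, "g5": 1.1}
regions = split_overlap(low_oxygen, high_oxygen)
```

A plain set intersection would report g2, g3 and g4 as one shared group; splitting by sign separates g3, which moves in opposite directions in the two conditions.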

    A Score of the Ability of a Three-Dimensional Protein Model to Retrieve Its Own Sequence as a Quantitative Measure of Its Quality and Appropriateness

    BACKGROUND: Despite the remarkable progress of bioinformatics, how the primary structure of a protein leads to a three-dimensional fold, and in turn determines its function, remains an elusive question. Alignments of sequences with known function can be used to identify proteins with the same or similar function with high success. However, identification of function-related and structure-related amino acid positions is only possible after a detailed study of every protein. Folding pattern diversity seems to be much narrower than sequence diversity, and the amino acid sequences of natural proteins have evolved under a selective pressure comprising structural and functional requirements acting in parallel. PRINCIPAL FINDINGS: The approach described in this work begins by generating a large number of amino acid sequences using ROSETTA [Dantas G et al. (2003) J Mol Biol 332:449-460], a program with notable robustness in the assignment of amino acids to a known three-dimensional structure. The resulting sequence sets showed no conservation of amino acids at active sites or protein-protein interfaces. Hidden Markov models built from the resulting sequence sets were used to search sequence databases. Surprisingly, the sequences that the models retrieved from the database belonged to proteins with the same or a very similar function. Given an appropriate cutoff, the rate of false positives was zero. According to our results, this protocol, here referred to as Rd.HMM, detects fine structural details of the folding patterns that seem to be tightly linked to the fitness of a structural framework for a specific biological function. CONCLUSION: Because the sequence of the native protein used to create the Rd.HMM model was always amongst the top hits, the procedure is a reliable tool to score, very accurately, the quality and appropriateness of computer-modeled 3D-structures, without the need for spectroscopy data. However, Rd.HMM is very sensitive to the conformational features of the models' backbone.

    NR-2L: A Two-Level Predictor for Identifying Nuclear Receptor Subfamilies Based on Sequence-Derived Features

    Nuclear receptors (NRs) are one of the most abundant classes of transcriptional regulators in animals. They regulate diverse functions, such as homeostasis, reproduction, development and metabolism. Therefore, NRs are a very important target for drug development. Nuclear receptors form a superfamily of phylogenetically related proteins and have been subdivided into different subfamilies due to their domain diversity. In this study, a two-level predictor, called NR-2L, was developed that can be used to identify whether a query protein is a nuclear receptor based on its sequence information alone; if it is, the prediction automatically continues to identify it among the following seven subfamilies: (1) thyroid hormone like (NR1), (2) HNF4-like (NR2), (3) estrogen like, (4) nerve growth factor IB-like (NR4), (5) fushi tarazu-F1 like (NR5), (6) germ cell nuclear factor like (NR6), and (7) knirps like (NR0). The identification was made by the Fuzzy K nearest neighbor (FK-NN) classifier based on the pseudo amino acid composition formed by incorporating various physicochemical and statistical features derived from the protein sequences, such as amino acid composition, dipeptide composition, complexity factor, and low-frequency Fourier spectrum components. As a demonstration, it was shown on benchmark datasets with low redundancy derived from the NucleaRDB and UniProt that the overall success rates achieved by the jackknife test were about 93% and 89% in the first and second level, respectively. The high success rates indicate that the novel two-level predictor can be a useful vehicle for identifying NRs and their subfamilies. As a user-friendly web server, NR-2L is freely accessible at either http://icpr.jci.edu.cn/bioinfo/NR2L or http://www.jci-bioinfo.cn/NR2L. Each job submitted to NR-2L can contain up to 500 query protein sequences and be finished in less than 2 minutes. The fewer the query proteins, the shorter the time will usually be. All the program codes for NR-2L are available for non-commercial purposes upon request.
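
Two of the sequence-derived features named in this abstract, amino acid composition and dipeptide composition, can be computed as below. This is an illustrative sketch only: it omits the complexity factor, the Fourier components and the FK-NN classifier, and the function names and example sequence are assumptions rather than NR-2L code.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(seq):
    """Fraction of each of the 20 standard residues (20 values)."""
    counts = dict.fromkeys(AMINO_ACIDS, 0)
    for residue in seq:
        if residue in counts:          # skip non-standard symbols
            counts[residue] += 1
    total = sum(counts.values())
    return [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each adjacent residue pair (400 values)."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[dp] / total if total else 0.0 for dp in DIPEPTIDES]


# Hypothetical toy sequence; real inputs would be full protein sequences.
features = amino_acid_composition("ACCA") + dipeptide_composition("ACCA")
```

The result is a fixed-length vector (20 + 400 values) regardless of sequence length, which is what allows a distance-based classifier such as a nearest-neighbor method to compare proteins of different sizes.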

    GPS-ARM: Computational Analysis of the APC/C Recognition Motif by Predicting D-Boxes and KEN-Boxes

    The anaphase-promoting complex/cyclosome (APC/C), an E3 ubiquitin ligase incorporated with Cdh1 and/or Cdc20, recognizes and interacts with specific substrates, and faithfully orchestrates the proper cell cycle events by targeting proteins for proteasomal degradation. Experimental identification of APC/C substrates is largely dependent on the discovery of APC/C recognition motifs, e.g., the D-box and KEN-box. Although a number of either stringent or loosely defined motifs have been proposed, these motif patterns are only of limited use due to their insufficient predictive power. We report the development of a novel GPS-ARM software package for the prediction of D-boxes and KEN-boxes in proteins. Using experimentally identified D-boxes and KEN-boxes as the training data sets, a previously developed GPS (Group-based Prediction System) algorithm was adopted. By extensive evaluation and comparison, the GPS-ARM performance was found to be much better than that of simple motifs. With this tool, we predicted 4,841 potential D-boxes in 3,832 proteins and 1,632 potential KEN-boxes in 1,403 proteins from H. sapiens, while further statistical analysis suggested that both the D-box and KEN-box proteins are involved in a broad spectrum of biological processes beyond the cell cycle. In addition, with the co-localization information, we predicted hundreds of mitosis-specific APC/C substrates with high confidence. As the first computational tool for the prediction of APC/C-mediated degradation, GPS-ARM provides useful information for further experimental investigations. GPS-ARM is freely accessible for academic researchers at: http://arm.biocuckoo.org
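
The "simple motif" baseline that this abstract compares against can be illustrated with a plain pattern scan. The sketch below is not the GPS algorithm: it assumes the commonly cited minimal consensus forms, RxxL for the D-box and the literal tripeptide KEN for the KEN-box, and the test sequence is made up.

```python
import re

# Minimal consensus patterns (an assumption for this sketch). Lookaheads
# are used so that overlapping matches are also reported.
SIMPLE_MOTIFS = {
    "D-box": re.compile(r"(?=(R..L))"),
    "KEN-box": re.compile(r"(?=(KEN))"),
}


def scan_simple_motifs(seq):
    """Return (motif name, 1-based start position, matched residues)."""
    hits = []
    for name, pattern in SIMPLE_MOTIFS.items():
        for match in pattern.finditer(seq):
            hits.append((name, match.start() + 1, match.group(1)))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical sequence, for illustration only.
hits = scan_simple_motifs("MARAALKENSRTPL")
```

Because a four-residue pattern with two free positions matches by chance in many proteins, a scan like this tends to over-predict, which is consistent with the abstract's point that simple motifs have limited predictive power.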

    Top-Level Categories of Constitutively Organized Material Entities - Suggestions for a Formal Top-Level Ontology

    Application oriented ontologies are important for reliably communicating and managing data in databases. Unfortunately, they often differ in the definitions they use and thus do not live up to their potential. This problem can be reduced by using a standardized and ontologically consistent template for the top-level categories from a top-level formal foundational ontology. This would support ontological consistency within application oriented ontologies and compatibility between them. The Basic Formal Ontology (BFO) is such a foundational ontology for the biomedical domain that has been developed following the single inheritance policy. It provides the top-level template within the Open Biological and Biomedical Ontologies Foundry. If it is to live up to its expected role, its three top-level categories of material entity (i.e., 'object', 'fiat object part', 'object aggregate') must be exhaustive, i.e. every concrete material entity must instantiate exactly one of them. By systematically evaluating all possible basic configurations of material building blocks, we show that BFO's top-level categories of material entity are not exhaustive. We provide examples from biology and everyday life that demonstrate the necessity for two additional categories: 'fiat object part aggregate' and 'object with fiat object part aggregate'. By distinguishing topological coherence, topological adherence, and metric proximity, we furthermore provide a differentiation of clusters and groups as two distinct subcategories for each of the three categories of material entity aggregates, resulting in six additional subcategories of material entity. We suggest extending BFO to incorporate two additional categories of material entity as well as two subcategories for each of the three categories of material entity aggregates. With these additions, BFO would exhaustively cover all top-level types of material entity that application oriented ontologies may use as templates. Our result, however, depends on the premise that all material entities are organized according to a constitutive granularity.

    GOPred: GO Molecular Function Prediction by Combined Classifiers

    Functional protein annotation is an important matter for in vivo and in silico biology. Several computational methods have been proposed that make use of a wide range of features such as motifs, domains, homology, structure and physicochemical properties. There is no single method that performs best in all functional classification problems because information obtained using any of these features depends on the function to be assigned to the protein. In this study, we present a novel approach that combines different methods to better represent protein function. First, we formulated the function annotation problem as a classification problem defined on 300 different Gene Ontology (GO) terms from the molecular function aspect. We presented a method to form positive and negative training examples while taking into account the directed acyclic graph (DAG) structure and evidence codes of GO. We applied three different methods and their combinations. Results show that combining different methods improves prediction accuracy in most cases. The proposed method, GOPred, is available as an online computational annotation tool (http://kinaz.fen.bilkent.edu.tr/gopred).
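
One way to form positive and negative training sets that respects the GO DAG is sketched below. The abstract does not state the exact rule GOPred uses, so the rule here is an assumption: proteins annotated to the target term or any descendant are positives, proteins annotated only to an ancestor are left out as ambiguous, and all others are negatives. Evidence codes are not modeled, and the term names and proteins are hypothetical.

```python
def reachable(start, edges):
    """All nodes reachable from `start` by following `edges` (start excluded)."""
    seen, stack = set(), [start]
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def training_sets(term, children, annotations):
    """Split proteins into (positives, negatives) for one GO term.

    children: dict mapping a term to the set of its direct child terms.
    annotations: dict mapping a protein to the set of terms annotated to it.
    """
    # Invert the child map to walk upward in the DAG.
    parents = {}
    for parent, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, set()).add(parent)

    below = reachable(term, children) | {term}
    above = reachable(term, parents)

    positives, negatives = set(), set()
    for protein, terms in annotations.items():
        if terms & below:
            positives.add(protein)      # annotated to the term or a descendant
        elif not terms & above:
            negatives.add(protein)      # unrelated to the term's lineage
        # Proteins annotated only to an ancestor are skipped (assumption).
    return positives, negatives


# Hypothetical miniature DAG and annotations.
children = {
    "binding": {"ion binding", "protein binding"},
    "ion binding": {"metal ion binding"},
}
annotations = {
    "p1": {"metal ion binding"},
    "p2": {"binding"},
    "p3": {"protein binding"},
    "p4": {"ion binding"},
}
positives, negatives = training_sets("ion binding", children, annotations)
```

Skipping ancestor-only proteins reflects the concern that a protein annotated to a general term may still have the more specific function; other treatments of such proteins are equally possible.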