863 research outputs found

    Biases in the Experimental Annotations of Protein Function and their Effect on Our Understanding of Protein Function Space

    The ongoing functional annotation of proteins relies upon the work of curators to capture experimental findings from scientific literature and apply them to protein sequence and structure data. However, with the increasing use of high-throughput experimental assays, a small number of experimental studies dominate the functional protein annotations collected in databases. Here we investigate just how prevalent is the "few articles -- many proteins" phenomenon. We examine the experimentally validated annotation of proteins provided by several groups in the GO Consortium, and show that the distribution of proteins per published study is exponential, with 0.14% of articles providing the source of annotations for 25% of the proteins in the UniProt-GOA compilation. Since each of the dominant articles describes the use of an assay that can find only one function or a small group of functions, this leads to substantial biases in what we know about the function of many proteins. Mass-spectrometry, microscopy and RNAi experiments dominate high throughput experiments. Consequently, the functional information derived from these experiments is mostly of the subcellular location of proteins, and of the participation of proteins in embryonic developmental pathways. For some organisms, the information provided by different studies overlaps by a large amount. We also show that the information provided by high throughput experiments is less specific than that provided by low throughput experiments. Given the experimental techniques available, certain biases in protein function annotation due to high-throughput experiments are unavoidable. Knowing that these biases exist and understanding their characteristics and extent is important for database curators, developers of function annotation programs, and anyone who uses protein function annotation data to plan experiments.
    Comment: Accepted to PLoS Computational Biology. Press embargo applies. v4: text corrected for style and supplementary material inserted
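
    The "few articles, many proteins" measurement described above can be illustrated with a minimal sketch. It assumes a hypothetical mapping from article identifiers to the sets of proteins they annotate; the function name, the toy data and the ranking by annotation count are illustrative assumptions, not the authors' actual procedure.

```python
def smallest_article_share(article_to_proteins, target=0.25):
    """Rank articles by how many proteins they annotate and report how many
    of the top-ranked articles are needed to cover `target` of all proteins.

    Returns (number_of_articles_used, fraction_of_all_articles).
    """
    all_proteins = set().union(*article_to_proteins.values())
    needed = target * len(all_proteins)
    # Largest studies first (hypothetical ranking rule).
    ranked = sorted(article_to_proteins.items(),
                    key=lambda item: len(item[1]), reverse=True)
    covered = set()
    for used, (_article, proteins) in enumerate(ranked, start=1):
        covered |= proteins
        if len(covered) >= needed:
            return used, used / len(ranked)
    return len(ranked), 1.0


# Toy data: one high-throughput study annotating 50 proteins,
# plus nine low-throughput studies annotating one protein each.
toy = {"HT1": {f"P{i}" for i in range(1, 51)}}
toy.update({f"LT{i}": {f"P{50 + i}"} for i in range(1, 10)})

used, share = smallest_article_share(toy)
print(used, share)
```

    In this toy corpus a single article out of ten already covers more than a quarter of the proteins, which is the kind of skew the abstract reports at a much larger scale.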

    VIRGO: computational prediction of gene functions

    Dramatic advances in sequencing technology and sophisticated experimental assays that interrogate the cell, combined with the public availability of the resulting data, herald the era of systems biology. However, the biological functions of more than 40% of the genes in sequenced genomes are unknown, posing a fundamental barrier to progress in systems biology. The large scale and diversity of available data requires the development of techniques that can automatically utilize these datasets to make quantified and robust predictions of gene function that can be experimentally verified. We present a service called the VIRtual Gene Ontology (VIRGO) that (i) constructs a functional linkage network (FLN) from gene expression and molecular interaction data, (ii) labels genes in the FLN with their functional annotations in the Gene Ontology and (iii) systematically propagates these labels across the FLN in order to precisely predict the functions of unlabelled genes. VIRGO assigns confidence estimates to predicted functions so that a biologist can prioritize predictions for further experimental study. For each prediction, VIRGO also provides an informative ‘propagation diagram’ that traces the flow of information in the FLN that led to the prediction. VIRGO is available at
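
    The idea of propagating labels across a functional linkage network can be sketched as follows. This is a generic neighbour-averaging scheme with clamped seed labels, chosen for illustration only; the abstract does not specify VIRGO's actual propagation algorithm, and the function name, edge weights and gene names are hypothetical.

```python
def propagate_labels(edges, seeds, iterations=50):
    """Spread scores for one functional term over a weighted, undirected graph.

    edges: dict mapping (gene_u, gene_v) -> positive edge weight
    seeds: dict mapping annotated gene -> known score (1.0 = has the function,
           0.0 = does not); seed scores stay fixed during propagation
    Returns a dict mapping every gene to a score in [0, 1].
    """
    neighbours = {}
    for (u, v), weight in edges.items():
        neighbours.setdefault(u, []).append((v, weight))
        neighbours.setdefault(v, []).append((u, weight))

    # Unlabelled genes start at a neutral 0.5 (an arbitrary assumption).
    scores = {gene: seeds.get(gene, 0.5) for gene in neighbours}
    for _ in range(iterations):
        updated = {}
        for gene, nbrs in neighbours.items():
            if gene in seeds:
                updated[gene] = seeds[gene]  # clamp known labels
            else:
                total = sum(weight for _, weight in nbrs)
                updated[gene] = sum(weight * scores[n] for n, weight in nbrs) / total
        scores = updated
    return scores


# Toy chain A - B - C - D with A annotated positive and D annotated negative.
toy_edges = {("A", "B"): 1.0, ("B", "C"): 1.0, ("C", "D"): 1.0}
toy_scores = propagate_labels(toy_edges, {"A": 1.0, "D": 0.0})
print(toy_scores)
```

    The resulting score for each unlabelled gene can be read as a rough confidence: in the toy chain, the gene nearer the positive seed scores higher than the gene nearer the negative seed.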

    GRIP: A web-based system for constructing Gold Standard datasets for protein-protein interaction prediction

    Background: Information about protein interaction networks is fundamental to understanding protein function and cellular processes. Interaction patterns among proteins can suggest new drug targets and aid in the design of new therapeutic interventions. Efforts have been made to map interactions on a proteome-wide scale using both experimental and computational techniques. Reference datasets that contain known interacting proteins (positive cases) and non-interacting proteins (negative cases) are essential to support computational prediction and validation of protein-protein interactions. Information on known interacting and non-interacting proteins is usually stored within databases. Extraction of these data can be both complex and time consuming. Although the automatic construction of reference datasets for classification would be a useful resource for researchers, no public resource currently exists to perform this task. Results: GRIP (Gold Reference dataset constructor from Information on Protein complexes) is a web-based system that provides researchers with the functionality to create reference datasets for protein-protein interaction prediction in Saccharomyces cerevisiae. Both positive and negative cases for a reference dataset can be extracted, organised and downloaded by the user. GRIP also provides an upload facility whereby users can submit proteins to determine protein complex membership. A search facility is provided where a user can search for protein complex information in Saccharomyces cerevisiae. Conclusion: GRIP is developed to retrieve information on protein complex, cellular localisation, and physical and genetic interactions in Saccharomyces cerevisiae. Manual construction of reference datasets can be a time consuming process requiring programming knowledge. GRIP simplifies and speeds up this process by allowing users to automatically construct reference datasets.
    GRIP is free to access at http://rosalind.infj.ulst.ac.uk/GRIP/.
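
    As a rough illustration of what constructing positive and negative cases from protein complex information can look like, here is a minimal sketch. It assumes that two proteins sharing a complex form a positive pair and that two proteins never sharing a complex form a negative pair; this rule, the function name and the protein identifiers are assumptions for illustration and are not taken from GRIP itself.

```python
from itertools import combinations


def build_reference_pairs(complexes):
    """Derive positive and negative protein pairs from complex membership.

    complexes: dict mapping complex name -> set of protein identifiers
    Returns (positives, negatives), each a set of frozenset pairs.
    """
    positives = set()
    for members in complexes.values():
        # Every pair of proteins within one complex is a positive case.
        positives.update(frozenset(pair)
                         for pair in combinations(sorted(members), 2))

    proteins = sorted(set().union(*complexes.values()))
    all_pairs = {frozenset(pair) for pair in combinations(proteins, 2)}
    # Assumed rule: pairs that never co-occur in a complex are negatives.
    negatives = all_pairs - positives
    return positives, negatives


# Hypothetical identifiers, not real yeast proteins.
toy_complexes = {"C1": {"P1", "P2"}, "C2": {"P3", "P4", "P5"}}
positives, negatives = build_reference_pairs(toy_complexes)
print(len(positives), len(negatives))
```

    A stricter negative set could additionally require different cellular localisation, since the abstract lists localisation among the information GRIP retrieves; that refinement is left out here.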

    A genome-wide screen identifies genes that suppress the accumulation of spontaneous mutations in young and aged yeast cells

    To ensure proper transmission of genetic information, cells need to preserve and faithfully replicate their genome, and failure to do so leads to genome instability, a hallmark of both cancer and aging. Defects in genes involved in guarding genome stability cause several human progeroid syndromes, and an age-dependent accumulation of mutations has been observed in different organisms, from yeast to mammals. However, it is unclear whether the spontaneous mutation rate changes during aging and whether specific pathways are important for genome maintenance in old cells. We developed a high-throughput replica-pinning approach to screen for genes important to suppress the accumulation of spontaneous mutations during yeast replicative aging. We found 13 known mutation suppression genes, and 31 genes that had no previous link to spontaneous mutagenesis, and all acted independently of age. Importantly, we identified PEX19, encoding an evolutionarily conserved peroxisome biogenesis factor, as an age-specific mutation suppression gene. While wild-type and pex19Δ young cells have similar spontaneous mutation rates, aged cells lacking PEX19 display an elevated mutation rate. This finding suggests that functional peroxisomes may be important to preserve genome integrity specifically in old cells

    Conserved noncoding sequences highlight shared components of regulatory networks in dicotyledonous plants

    Conserved noncoding sequences (CNSs) in DNA are reliable pointers to regulatory elements controlling gene expression. Using a comparative genomics approach with four dicotyledonous plant species (Arabidopsis thaliana, papaya [Carica papaya], poplar [Populus trichocarpa], and grape [Vitis vinifera]), we detected hundreds of CNSs upstream of Arabidopsis genes. Distinct positioning, length, and enrichment for transcription factor binding sites suggest these CNSs play a functional role in transcriptional regulation. The enrichment of transcription factors within the set of genes associated with CNS is consistent with the hypothesis that together they form part of a conserved transcriptional network whose function is to regulate other transcription factors and control development. We identified a set of promoters where regulatory mechanisms are likely to be shared between the model organism Arabidopsis and other dicots, providing areas of focus for further research

    Alliance of Genome Resources Portal: unified model organism research platform

    The Alliance of Genome Resources (Alliance) is a consortium of the major model organism databases and the Gene Ontology that is guided by the vision of facilitating exploration of related genes in human and well-studied model organisms by providing a highly integrated and comprehensive platform that enables researchers to leverage the extensive body of genetic and genomic studies in these organisms. Initiated in 2016, the Alliance is building a central portal (www.alliancegenome.org) for access to data for the primary model organisms along with gene ontology data and human data. All data types represented in the Alliance portal (e.g. genomic data and phenotype descriptions) have common data models and workflows for curation. All data are open and freely available via a variety of mechanisms. Long-term plans for the Alliance project include a focus on coverage of additional model organisms including those without dedicated curation communities, and the inclusion of new data types with a particular focus on providing data and tools for the non-model-organism researcher that support enhanced discovery about human health and disease. Here we review current progress and present immediate plans for this new bioinformatics resource

    Direct and Absolute Quantification of over 1800 Yeast Proteins via Selected Reaction Monitoring

    Defining intracellular protein concentration is critical in molecular systems biology. Although strategies for determining relative protein changes are available, defining robust absolute values in copies per cell has proven significantly more challenging. Here we present a reference data set quantifying over 1800 Saccharomyces cerevisiae proteins by direct means using protein-specific stable-isotope labeled internal standards and selected reaction monitoring (SRM) mass spectrometry, far exceeding any previous study. This was achieved by careful design of over 100 QconCAT recombinant proteins as standards, defining 1167 proteins in terms of copies per cell and upper limits on a further 668, with robust CVs routinely less than 20%. The selected reaction monitoring-derived proteome is compared with existing quantitative data sets, highlighting the disparities between methodologies. Coupled with a quantification of the transcriptome by RNA-seq taken from the same cells, these data support revised estimates of several fundamental molecular parameters: a total protein count of ∼100 million molecules-per-cell, a median of ∼1000 proteins-per-transcript, and a linear model of protein translation explaining 70% of the variance in translation rate. This work contributes a “gold-standard” reference yeast proteome (including 532 values based on high quality, dual peptide quantification) that can be widely used in systems models and for other comparative studies. Reliable and accurate quantification of the proteins present in a cell or tissue remains a major challenge for post-genome scientists. Proteins are the primary functional molecules in biological systems and knowledge of their abundance and dynamics is an important prerequisite to a complete understanding of natural physiological processes, or dysfunction in disease. 
Accordingly, much effort has been spent in the development of reliable, accurate and sensitive techniques to quantify the cellular proteome, the complement of proteins expressed at a given time under defined conditions (1). Moreover, the ability to model a biological system and thus characterize it in kinetic terms, requires that protein concentrations be defined in absolute numbers (2, 3). Given the high demand for accurate quantitative proteome data sets, there has been a continual drive to develop methodology to accomplish this, typically using mass spectrometry (MS) as the analytical platform. Many recent studies have highlighted the capabilities of MS to provide good coverage of the proteome at high sensitivity often using yeast as a demonstrator system (4–10), suggesting that quantitative proteomics has now “come of age” (1). However, given that MS is not inherently quantitative, most of the approaches produce relative quantitation and do not typically measure the absolute concentrations of individual molecular species by direct means. For the yeast proteome, epitope tagging studies using green fluorescent protein or tandem affinity purification tags provide an alternative to MS. Here, collections of modified strains are generated that incorporate a detectable, and therefore quantifiable, tag that supports immunoblotting or fluorescence techniques (11, 12). However, such strategies for copies per cell (cpc) quantification rely on genetic manipulation of the host organism and hence do not quantify endogenous, unmodified protein. Similarly, the tagging can alter protein levels - in some instances hindering protein expression completely (11). Even so, epitope tagging methods have been of value to the community, yielding high coverage quantitative data sets for the majority of the yeast proteome (11, 12). MS-based methods do not rely on such nonendogenous labels, and can reach genome-wide levels of coverage.
Accurate estimation of absolute concentrations, i.e. protein copy number per cell, also usually necessitates the use of (one or more) external or internal standards from which to derive absolute abundance (4). Examples include a comprehensive quantification of the Leptospira interrogans proteome that used a 19 protein subset quantified using selected reaction monitoring (SRM) to calibrate their label-free data (8, 13). It is worth noting that epitope tagging methods, although also absolute, rely on a very limited set of standards for the quantitative western blots and necessitate incorporation of a suitable immunogenic tag (11). Other recent, innovative approaches exploiting total ion signal and internal scaling to estimate protein cellular abundance (10, 14), avoid the use of internal standards, though they do rely on targeted proteomic data to validate their approach. The use of targeted SRM strategies to derive proteomic calibration standards highlights its advantages in comparison to label-free in terms of accuracy, precision, dynamic range and limit of detection and has gained currency for its reliability and sensitivity (3, 15–17). Indeed, SRM is often referred to as the “gold standard proteomic quantification method,” being particularly well-suited when the proteins to be quantified are known, when appropriate surrogate peptides for protein quantification can be selected a priori, and matched with stable isotope-labeled (SIL) standards (18–20). In combination with SIL peptide standards that can be generated through a variety of means (3, 15), SRM can be used to quantify low copy number proteins, reaching down to ∼50 cpc in yeast (5).
However, although SRM methodology has been used extensively for S. cerevisiae protein quantification by us and others (19, 21, 22), it has not been used for large protein cohorts because of the requirement to generate the large numbers of attendant SIL peptide standards; the largest published data set is only for a few tens of proteins. It remains a challenge therefore to robustly quantify an entire eukaryotic proteome in absolute terms by direct means using targeted MS and this is the focus of our present study, the Census Of the Proteome of Yeast (CoPY). We present here direct and absolute quantification of nearly 2000 endogenous proteins from S. cerevisiae grown in steady state in a chemostat culture, using the SRM-based QconCAT approach. Although arguably not quantification of the entire proteome, this represents an accurate and rigorous collection of direct yeast protein quantifications, providing a gold-standard data set of endogenous protein levels for future reference and comparative studies. The highly reproducible SIL-SRM MS data, with robust CVs typically less than 20%, is compared with other extant data sets that were obtained via alternative analytical strategies. We also report a matched high quality transcriptome from the same cells using RNA-seq, which supports additional calculations including a refined estimate of the total protein content in yeast cells, and a simple linear model of translation explaining 70% of the variance between RNA and protein levels in yeast chemostat cultures. These analyses confirm the validity of our data and approach, which we believe represents a state-of-the-art absolute quantification compendium of a significant proportion of a model eukaryotic proteome

    Validation and refinement of gene-regulatory pathways on a network of physical interactions

    As genome-scale measurements lead to increasingly complex models of gene regulation, systematic approaches are needed to validate and refine these models. Towards this goal, we describe an automated procedure for prioritizing genetic perturbations in order to discriminate optimally between alternative models of a gene-regulatory network. Using this procedure, we evaluate 38 candidate regulatory networks in yeast and perform four high-priority gene knockout experiments. The refined networks support previously unknown regulatory mechanisms downstream of SOK2 and SWI4

    Expansion of the human mitochondrial proteome by intra- and inter-compartmental protein duplication

    The human mitochondrial proteome is shown to have expanded due to duplication of protein encoding genes and re-localization of these duplicated proteins