Biases in the Experimental Annotations of Protein Function and their Effect on Our Understanding of Protein Function Space
The ongoing functional annotation of proteins relies upon the work of
curators to capture experimental findings from scientific literature and apply
them to protein sequence and structure data. However, with the increasing use
of high-throughput experimental assays, a small number of experimental studies
dominate the functional protein annotations collected in databases. Here we
investigate just how prevalent the "few articles -- many proteins"
phenomenon is. We examine the experimentally validated annotation of proteins
provided by several groups in the GO Consortium, and show that the distribution
of proteins per published study is exponential, with 0.14% of articles
providing the source of annotations for 25% of the proteins in the UniProt-GOA
compilation. Since each of the dominant articles describes the use of an assay
that can find only one function or a small group of functions, this leads to
substantial biases in what we know about the function of many proteins.
Mass-spectrometry, microscopy and RNAi experiments dominate high throughput
experiments. Consequently, the functional information derived from these
experiments is mostly of the subcellular location of proteins, and of the
participation of proteins in embryonic developmental pathways. For some
organisms, the information provided by different studies overlaps by a large
amount. We also show that the information provided by high throughput
experiments is less specific than that provided by low throughput experiments.
Given the experimental techniques available, certain biases in protein function
annotation due to high-throughput experiments are unavoidable. Knowing that
these biases exist and understanding their characteristics and extent is
important for database curators, developers of function annotation programs,
and anyone who uses protein function annotation data to plan experiments.
Comment: Accepted to PLoS Computational Biology. Press embargo applies. v4:
text corrected for style and supplementary material inserted.
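The "few articles -- many proteins" statistic above can be illustrated with a minimal sketch. It assumes a hypothetical mapping from article identifiers to the sets of proteins each article annotates, and it ranks articles by how many proteins they annotate. The function name and input layout are illustrative assumptions, not the authors' actual procedure.

```python
def fraction_of_articles_covering(article_to_proteins, target=0.25):
    """Smallest fraction of articles (largest first) whose annotations
    cover at least `target` of all annotated proteins.

    `article_to_proteins` is a hypothetical dict: article id -> set of protein ids.
    """
    if not article_to_proteins:
        raise ValueError("no articles given")
    all_proteins = set().union(*article_to_proteins.values())
    # Rank articles by the number of proteins they annotate, largest first.
    ranked = sorted(article_to_proteins.values(), key=len, reverse=True)
    covered = set()
    for n_articles, proteins in enumerate(ranked, start=1):
        covered |= proteins
        if len(covered) >= target * len(all_proteins):
            return n_articles / len(ranked)
    return 1.0


# Toy data: one high-throughput article and three small ones.
toy = {"a": {1, 2, 3, 4, 5, 6}, "b": {7}, "c": {8}, "d": {1}}
coverage_fraction = fraction_of_articles_covering(toy)
```

On the toy data, the single largest article already covers more than a quarter of the eight proteins, so one of four articles suffices.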
GAP activity, but not subcellular targeting, is required for Arabidopsis RanGAP cellular and developmental functions
The Ran GTPase activating protein (RanGAP) is important to Ran signaling involved in nucleocytoplasmic transport, spindle organization, and postmitotic nuclear assembly. Unlike vertebrate and yeast RanGAP, plant RanGAP has an N-terminal WPP domain, required for nuclear envelope association and several mitotic locations of Arabidopsis thaliana RanGAP1. A double null mutant of the two Arabidopsis RanGAP homologs is gametophyte lethal. Here, we created a series of mutants with various reductions in RanGAP levels by combining a RanGAP1 null allele with different RanGAP2 alleles. As the RanGAP level decreases, the severity of developmental phenotypes increases, but nuclear import is unaffected. To dissect whether the GAP activity and/or the subcellular localization of RanGAP is responsible for the observed phenotypes, this series of rangap mutants was transformed with RanGAP1 variants carrying point mutations abolishing the GAP activity and/or the WPP-dependent subcellular localization. The data show that plant development is differentially affected by RanGAP mutant allele combinations of increasing severity and requires the GAP activity of RanGAP, while the subcellular positioning of RanGAP is dispensable. In addition, our results indicate that nucleocytoplasmic trafficking can tolerate both partial depletion of RanGAP and delocalization of RanGAP from the nuclear envelope.
Automated data integration for developmental biological research
In an era exploding with genome-scale data, a major challenge for developmental biologists is how to extract significant clues from these publicly available data to benefit our studies of individual genes, and how to use them to improve our understanding of development at a systems level. Several studies have successfully demonstrated new approaches to classic developmental questions by computationally integrating various genome-wide data sets. Such computational approaches have shown great potential for facilitating research: instead of testing 20,000 genes, researchers might test 200 to the same effect. We discuss the nature and state of this art as it applies to developmental research.
Protein subcellular localization prediction based on compartment-specific features and structure conservation
BACKGROUND: Protein subcellular localization is crucial for genome annotation, protein function prediction, and drug discovery. Determination of subcellular localization using experimental approaches is time-consuming; thus, computational approaches are highly desirable. Extensive studies of localization prediction have led to the development of several methods including composition-based and homology-based methods. However, their performance might be significantly degraded if homologous sequences are not detected. Moreover, methods that integrate various features could suffer from the problem of low coverage in high-throughput proteomic analyses due to the lack of information to characterize unknown proteins. RESULTS: We propose a hybrid prediction method for Gram-negative bacteria that combines a one-versus-one support vector machine (SVM) model and a structural homology approach. The SVM model comprises a number of binary classifiers, in which biological features derived from Gram-negative bacteria translocation pathways are incorporated. In the structural homology approach, we employ secondary structure alignment for structural similarity comparison and assign the known localization of the top-ranked protein as the predicted localization of a query protein. The hybrid method achieves overall accuracy of 93.7% and 93.2% using ten-fold cross-validation on the benchmark data sets. In the assessment of the evaluation data sets, our method also attains a prediction accuracy of 84.0%, especially when testing on sequences with a low level of homology to the training data. A three-way data split procedure is also incorporated to prevent overestimation of the predictive performance. In addition, we show that the prediction accuracy should be approximately 85% for non-redundant data sets of sequence identity less than 30%.
CONCLUSION: Our results demonstrate that biological features derived from Gram-negative bacteria translocation pathways yield a significant improvement. The biological features are interpretable and can be applied in advanced analyses and experimental designs. Moreover, combining the SVM model with the structural homology approach further improves the overall accuracy, which suggests that structural conservation could be a useful indicator for inferring localization in addition to sequence homology. The proposed method can be used in large-scale analyses of proteomes.
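The one-versus-one scheme mentioned in the abstract above trains one binary classifier per pair of classes and lets them vote. Here is a minimal sketch of the voting step only, with the pairwise classifiers represented as hypothetical Python callables; the paper's actual SVMs, features, and tie-breaking are not reproduced, and the toy nearest-center deciders stand in purely for illustration.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(x, classes, pairwise):
    """Return the class that wins the most pairwise contests for sample x.

    `pairwise[(a, b)]` is a hypothetical binary decider that returns a or b.
    Ties go to the class that received its first vote earliest.
    """
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise[(a, b)](x)] += 1
    return votes.most_common(1)[0][0]


# Toy stand-ins for trained binary classifiers: pick the nearer class center.
classes = ["cytoplasm", "periplasm", "outer membrane"]
centers = {"cytoplasm": 0.0, "periplasm": 1.0, "outer membrane": 2.0}
pairwise = {
    (a, b): (lambda x, a=a, b=b:
             a if abs(x - centers[a]) <= abs(x - centers[b]) else b)
    for a, b in combinations(classes, 2)
}
predicted = one_vs_one_predict(1.9, classes, pairwise)
```

With k classes this needs k(k-1)/2 binary classifiers, which is why the abstract describes the model as comprising a number of binary classifiers.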
RSLpred: an integrative system for predicting subcellular localization of rice proteins combining compositional and evolutionary information
The attainment of a complete map-based sequence for rice (Oryza sativa) is clearly a major milestone for the research community. Identifying the localization of encoded proteins is the key to understanding their functional characteristics and facilitating their purification. Our proposed method, RSLpred, is an effort in this direction for genome-scale subcellular prediction of encoded rice proteins. First, the support vector machine (SVM)-based modules have been developed using traditional amino acid-, dipeptide- (i+1) and four parts-amino acid composition and achieved an overall accuracy of 81.43, 80.88 and 81.10%, respectively. Secondly, a similarity search-based module has been developed using the Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST) and achieved 68.35% accuracy. Another module developed using evolutionary information of a protein sequence extracted from position-specific scoring matrix achieved an accuracy of 87.10%. In this study, a large number of modules have been developed using various encoding schemes like higher-order dipeptide composition, N- and C-terminal, split amino acid composition and the hybrid information. In order to benchmark RSLpred, it was tested on an independent set of rice proteins where it outperformed widely used prediction methods such as TargetP, Wolf-PSORT, PA-SUB, Plant-Ploc and ESLpred. To assist the plant research community, an online web tool 'RSLpred' has been developed for subcellular prediction of query rice proteins, which is freely accessible at http://www.imtech.res.in/raghava/rslpred
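The amino acid and dipeptide (i+1) composition features named in the abstract above can be sketched as fixed-length vectors of 20 and 400 values. This is an illustrative sketch only: the function names, the percentage scaling, and the handling of non-standard residues are assumptions, not RSLpred's published implementation.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Percentage of each standard residue in seq (vector of length 20)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    # Assumption: non-standard letters count toward the length only.
    return [100.0 * counts[a] / len(seq) for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Percentage of each adjacent residue pair (i, i+1); vector of length 400."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must have at least two residues")
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = len(seq) - 1
    return [100.0 * pairs[a + b] / total
            for a in AMINO_ACIDS for b in AMINO_ACIDS]


aac = amino_acid_composition("AAC")
dpc = dipeptide_composition("AAC")
```

Vectors like these are the kind of fixed-length input an SVM needs, regardless of the length of the protein sequence.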