Human protein function prediction: application of machine learning for integration of heterogeneous data sources
Experimental characterisation of protein cellular function can be prohibitively expensive and
take years to complete. To address this problem, this thesis focuses on the development of computational
approaches to predict function from sequence. For sequences with well-characterised
close relatives, annotation is trivial; orphans or distant homologues present a greater challenge.
The use of a feature-based method employing ensemble support vector machines to predict individual
Gene Ontology classes is investigated. It is found that different combinations of feature
inputs are required to recognise different functions. Although the approach is applicable to any
human protein sequence, it is restricted to broadly descriptive functions. The method is better
suited to prioritising candidate functions for novel proteins than to making highly accurate
class assignments.
Signatures of common function can be derived from different biological characteristics: interactions
and binding events as well as expression behaviour. To investigate the hypothesis that
common function can be derived from expression information, public domain human microarray
datasets are assembled. The questions of how best to integrate these datasets and derive
features that are useful in function prediction are addressed. Both co-expression and abundance
information are represented between and within experiments and investigated for correlation with
function. It is found that features derived from expression data serve as a weak but significant
signal for recognising functions. This signal is stronger for biological process than for molecular
function categories and is independent of homology information.
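As an illustration of the kind of co-expression feature described above, the following minimal sketch computes a pairwise correlation between gene expression profiles. The choice of Pearson correlation, the gene names, and the expression values are all assumptions made for illustration; the abstract does not state which measure or datasets were used.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Flat profile: correlation is undefined, treated here as no signal.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def coexpression_features(profiles):
    """Return {(gene_i, gene_j): correlation} for every unordered gene pair."""
    genes = sorted(profiles)
    return {
        (g, h): pearson(profiles[g], profiles[h])
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
    }


# Hypothetical log-expression values for three genes across five arrays.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.8],
    "geneC": [5.0, 4.0, 3.0, 2.0, 1.0],
}

features = coexpression_features(profiles)
```

Each pair score could then serve as one input feature when testing for correlation with shared function; how scores from separate experiments are combined is left open here.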
The protein domain has historically been regarded as a modular evolutionary unit of protein function.
The occurrence of domains that can be linked by ancestral fusion events serves as a signal
for domain-domain interactions. To exploit this information for function prediction, novel domain
architecture and fused architecture scores are developed. Architecture scores correlate more
strongly with function than single-domain scores, and both architecture and fusion
scores correlate more strongly with molecular functions than biological processes.
The final study details the development of a novel heterogeneous function prediction approach
designed to target the annotation of both homologous and non-homologous proteins. Support
vector regression is used to combine pair-wise sequence features with expression scores and
domain architecture scores to rank protein pairs in terms of their functional similarities. The
target of the regression models represents the continuum of protein function space empirically
derived from the Gene Ontology molecular function and biological process graphs. The merit
and performance of the approach are demonstrated using homologous and non-homologous test
datasets, where it significantly improves upon classical nearest-neighbour annotation transfer by
sequence methods. The final model represents a method that achieves a compromise between
high specificity and sensitivity for all human proteins regardless of their homology status. It is
expected that this strategy will allow for more comprehensive and accurate annotations of the
human proteome.
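To make the idea of a continuous functional-similarity target concrete, the sketch below scores protein pairs by the overlap of their ontology annotations and ranks candidate partners for a query protein. It is a simplified stand-in, not the thesis method: the toy ontology, the term and protein identifiers, and the use of Jaccard overlap of ancestor-closed term sets are all assumptions, and in the approach described above the ranking would come from a trained regression model rather than from known annotations.

```python
# Toy is_a graph: child term -> set of parent terms (hypothetical identifiers).
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "dna_binding": {"binding"},
    "kinase": {"catalysis"},
    "protein_kinase": {"kinase"},
}


def with_ancestors(terms):
    """Return the given terms together with all of their ancestors."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(PARENTS[term])
    return closed


def functional_similarity(terms_a, terms_b):
    """Jaccard overlap of ancestor-closed term sets, in the range [0, 1]."""
    a = with_ancestors(terms_a)
    b = with_ancestors(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_partners(query, annotations):
    """Rank all other proteins by functional similarity to the query."""
    scores = [
        (other, functional_similarity(annotations[query], terms))
        for other, terms in annotations.items()
        if other != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical annotations for three proteins.
annotations = {
    "P1": {"protein_kinase"},
    "P2": {"kinase"},
    "P3": {"dna_binding"},
}

ranking = rank_partners("P1", annotations)
```

A graded score of this kind, unlike a binary same-class label, lets a regression model distinguish near matches (a kinase versus a protein kinase) from unrelated functions.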
Insights into the regulation of intrinsically disordered proteins in the human proteome by analyzing sequence and gene expression data
Background:
Disordered proteins need to be expressed to carry out specified functions; however, their accumulation in the cell can potentially cause major problems through protein misfolding and aggregation. Gene expression levels, mRNA decay rates, microRNA (miRNA) targeting and ubiquitination have critical roles in the degradation and disposal of human proteins and transcripts. Here, we describe a study examining these features to gain insights into the regulation of disordered proteins.
Results:
In comparison with ordered proteins, disordered proteins have a greater proportion of predicted ubiquitination sites. The transcripts encoding disordered proteins also have higher proportions of predicted miRNA target sites and higher mRNA decay rates, both of which are indicative of the observed lower gene expression levels. The results suggest that the disordered proteins and their transcripts are present in the cell at low levels and/or for a short time before being targeted for disposal. Surprisingly, we find that for a significant proportion of highly disordered proteins, all four of these trends are reversed. Predicted estimates for miRNA targets, ubiquitination and mRNA decay rate are low in the highly disordered proteins that are constitutively and/or highly expressed.
Conclusions:
Mechanisms are in place to protect the cell from these potentially dangerous proteins. The evidence suggests that the enrichment of signals for miRNA targeting and ubiquitination may help prevent the accumulation of disordered proteins in the cell. Our data also provide evidence for a mechanism by which a significant proportion of highly disordered proteins (with high expression levels) can escape rapid degradation to allow them to successfully carry out their function.