Human protein function prediction: application of machine learning for integration of heterogeneous data sources
Experimental characterisation of protein cellular function can be prohibitively expensive and
take years to complete. To address this problem, this thesis focuses on the development of computational
approaches to predict function from sequence. For sequences with well characterised
close relatives, annotation is trivial; orphans and distant homologues present a greater challenge.
The use of a feature based method employing ensemble support vector machines to predict individual
Gene Ontology classes is investigated. It is found that different combinations of feature
inputs are required to recognise different functions. Although the approach is applicable to any
human protein sequence, it is restricted to broadly descriptive functions. The method is well
suited to prioritising candidate functions for novel proteins rather than to making highly accurate
class assignments.
Signatures of common function can be derived from different biological characteristics: interactions
and binding events as well as expression behaviour. To investigate the hypothesis that
common function can be derived from expression information, public domain human microarray
datasets are assembled. The questions of how best to integrate these datasets and derive
features that are useful in function prediction are addressed. Both co-expression and abundance
information are represented between and within experiments and investigated for correlation with
function. It is found that features derived from expression data serve as a weak but significant
signal for recognising functions. This signal is stronger for biological processes than molecular
function categories and independent of homology information.
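
A co-expression feature of the kind described above can be illustrated with a correlation between two expression profiles. The following is a minimal sketch, assuming Pearson correlation as the co-expression measure; the abstract does not state which measure the thesis uses, and the gene names and expression values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if norm == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / norm


# Hypothetical expression values for three genes across four samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}


def coexpression(gene1, gene2):
    """Co-expression score for a pair of genes in the toy profile table."""
    return pearson(profiles[gene1], profiles[gene2])
```

Pairs whose profiles rise and fall together score close to 1, and pairs with opposite behaviour score close to -1; such pairwise scores could then serve as one input among several to a function predictor.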
The protein domain has historically been regarded as a modular evolutionary unit of protein function.
The occurrence of domains that can be linked by ancestral fusion events serves as a signal
for domain-domain interactions. To exploit this information for function prediction, novel domain
architecture and fused architecture scores are developed. Architecture scores rather than
single domain scores correlate more strongly with function, and both architecture and fusion
scores correlate more strongly with molecular functions than biological processes.
The final study details the development of a novel heterogeneous function prediction approach
designed to target the annotation of both homologous and non-homologous proteins. Support
vector regression is used to combine pair-wise sequence features with expression scores and
domain architecture scores to rank protein pairs in terms of their functional similarities. The
target of the regression models represents the continuum of protein function space empirically
derived from the Gene Ontology molecular function and biological process graphs. The merit
and performance of the approach are demonstrated using homologous and non-homologous test
datasets, on which it significantly improves upon classical nearest neighbour annotation transfer
by sequence methods. The final model represents a method that achieves a compromise between
high specificity and sensitivity for all human proteins regardless of their homology status. It is
expected that this strategy will allow for more comprehensive and accurate annotations of the
human proteome.
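
For context, the baseline mentioned above, nearest neighbour annotation transfer, can be sketched as follows. This is an illustrative sketch only, not the thesis implementation: the function name, the `min_score` threshold, the protein identifiers and the similarity scores are all assumptions made for the example.

```python
def transfer_annotations(similarities, annotations, min_score=0.0):
    """Return the GO terms of the most similar annotated protein.

    similarities: maps a candidate protein ID to its similarity score
                  against the query (hypothetical scores).
    annotations:  maps an annotated protein ID to its set of GO terms.
    min_score:    assumed cut-off; a neighbour must score above it.
    """
    best_id, best_score = None, min_score
    for protein_id, score in similarities.items():
        # Only proteins that already carry annotations can donate them.
        if protein_id in annotations and score > best_score:
            best_id, best_score = protein_id, score
    return set(annotations[best_id]) if best_id is not None else set()


# Hypothetical query: P3 is the closest hit but has no annotation,
# so the terms of P1, the best annotated neighbour, are transferred.
similarities = {"P1": 0.9, "P2": 0.4, "P3": 0.95}
annotations = {"P1": {"GO:0003677"}, "P2": {"GO:0005515"}}
predicted = transfer_annotations(similarities, annotations)
```

A query with no annotated neighbour above the threshold receives no annotation at all, which is the situation for orphans and distant homologues that the abstract's combined approach is designed to address.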
Front Matter - Soft Computing for Data Mining Applications
Efficient tools and algorithms for knowledge discovery in large data sets have been devised in recent years. These methods exploit the capability of computers to search huge amounts of data in a fast and effective manner. However, the data to be analyzed is imprecise and afflicted with uncertainty. In the case of heterogeneous data sources such as text, audio and video, the data might moreover be ambiguous and partly conflicting. Besides, patterns and relationships of interest are usually vague and approximate. Thus, to make the information mining process more robust or, say, human-like, methods for searching and learning require tolerance towards imprecision, uncertainty and exceptions. Such methods therefore have approximate reasoning capabilities and are capable of handling partial truth. Properties of the aforementioned kind are typical of soft computing. Soft computing techniques like Genetic