Applicability of semi-supervised learning assumptions for gene ontology terms prediction
Gene Ontology (GO) is one of the most important resources in bioinformatics, aiming to provide a unified framework for the biological annotation of genes and proteins across all species. Predicting GO terms is an essential task in bioinformatics, but the number of available labelled proteins is in several cases insufficient for training reliable machine learning classifiers. Semi-supervised learning methods offer a powerful solution that exploits the information contained in unlabelled data to improve on the estimates of traditional supervised approaches. However, semi-supervised learning methods must make strong assumptions about the nature of the training data, and the performance of the predictor is therefore highly dependent on these assumptions. This paper presents an analysis of the applicability of semi-supervised learning assumptions to the specific task of GO term prediction, focused on providing criteria for choosing the most suitable tools for specific GO terms. The results show that semi-supervised approaches significantly outperform traditional supervised methods and that the highest performance is reached when applying the cluster assumption. In addition, it is experimentally demonstrated that the cluster and manifold assumptions are complementary to each other, and an analysis is provided of which GO terms are more likely to be correctly predicted under each assumption.
Postprint (published version)
Multi-task Deep Neural Networks in Automated Protein Function Prediction
In recent years, deep learning algorithms have outperformed state-of-the-art
methods in several areas thanks to efficient methods for training and for
preventing overfitting, advances in computer hardware, and the availability
of vast amounts of data. The high performance of multi-task deep neural
networks in drug discovery has drawn attention to deep learning algorithms in
bioinformatics. Here, we proposed a hierarchical multi-task deep neural
network architecture based on Gene Ontology (GO) terms as a solution to the
protein function prediction problem and investigated various aspects of the proposed
architecture by performing several experiments. First, we showed that there is
a positive correlation between performance of the system and the size of
training datasets. Second, we investigated whether the level of GO terms in the
GO hierarchy is related to their performance. We showed that there is no relation
between the depth of GO terms in the GO hierarchy and their performance. In
addition, we included all annotations in the training of a set of GO terms to
investigate whether adding noisy data to the training datasets changes the
performance of the system. The results showed that including less reliable
annotations in the training of deep neural networks significantly increased the
performance of poorly performing GO terms. We evaluated the performance of the
system using a hierarchical evaluation method. The Matthews correlation coefficient
was calculated as 0.75, 0.49 and 0.63 for the molecular function, biological
process and cellular component categories, respectively. We showed that deep
learning algorithms have great potential in protein function prediction.
We plan to further improve DEEPred by including other types of annotations
from various biological data sources. We also plan to provide DEEPred as an
open-access online tool.
Comment: 19 pages, 4 figures, 4 tables
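
For readers unfamiliar with the metric reported in the abstract above, the following is a minimal sketch of the standard binary Matthews correlation coefficient, computed from confusion-matrix counts. It is not the authors' hierarchical evaluation procedure, which the abstract does not describe. The function names `confusion_counts` and `matthews_corrcoef`, the toy labels, and the convention of returning 0.0 when the denominator is zero are all assumptions made for illustration.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count (tp, tn, fp, fn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def matthews_corrcoef(tp, tn, fp, fn):
    """Standard binary MCC; ranges from -1 (inverted) to +1 (perfect)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: report 0.0 when a row or column of the
        # confusion matrix is empty and the coefficient is undefined.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical toy data: is a protein annotated with a given GO term?
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
mcc = matthews_corrcoef(*confusion_counts(y_true, y_pred))
```

Unlike accuracy, this coefficient uses all four cells of the confusion matrix, which is one reason it is often preferred when positive annotations are rare, as is typical for individual GO terms.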