25 research outputs found

    DEEPred: Automated Protein Function Prediction with Multi-task Feed-forward Deep Neural Networks

    Automated protein function prediction is critical for the annotation of uncharacterized protein sequences, where accurate prediction methods are still required. Recently, deep learning based methods have outperformed conventional algorithms in computer vision and natural language processing, owing to effective prevention of overfitting and efficient training. Here, we propose DEEPred, a hierarchical stack of multi-task feed-forward deep neural networks, as a solution to Gene Ontology (GO) based protein function prediction. DEEPred was optimized through rigorous hyper-parameter tests and benchmarked using three types of protein descriptors, training datasets of varying sizes, and GO terms from different levels. Furthermore, in order to explore how training with larger but potentially noisy data would change the performance, electronically made GO annotations were also included in the training process. The overall predictive performance of DEEPred was assessed using the CAFA2 and CAFA3 challenge datasets, in comparison with state-of-the-art protein function prediction methods. Finally, we evaluated selected novel annotations produced by DEEPred with a literature-based case study considering the 'biofilm formation process' in Pseudomonas aeruginosa. This study reports that deep learning algorithms have significant potential in protein function prediction, particularly when the source data is large. The neural network architecture of DEEPred can also be applied to the prediction of other types of ontological associations. The source code and all datasets used in this study are available at: https://github.com/cansyl/DEEPred

    Crowdsourced mapping of unexplored target space of kinase inhibitors

    Despite decades of intensive search for compounds that modulate the activity of particular protein targets, a large proportion of the human kinome remains as yet undrugged. Effective approaches are therefore required to map the massive space of unexplored compound-kinase interactions for novel and potent activities. Here, we carry out a crowdsourced benchmarking of predictive algorithms for kinase inhibitor potencies across multiple kinase families tested on unpublished bioactivity data. We find the top-performing predictions are based on various models, including kernel learning, gradient boosting and deep learning, and their ensemble leads to a predictive accuracy exceeding that of single-dose kinase activity assays. We design experiments based on the model predictions and identify unexpected activities even for under-studied kinases, thereby accelerating experimental mapping efforts. The open-source prediction algorithms together with the bioactivities between 95 compounds and 295 kinases provide a resource for benchmarking prediction algorithms and for extending the druggable kinome. The IDG-DREAM Challenge carried out crowdsourced benchmarking of predictive algorithms for kinase inhibitor activities on unpublished data. This study provides a resource to compare emerging algorithms and prioritize new kinase activities to accelerate drug discovery and repurposing efforts

    Deep learning for prediction of drug-target protein interaction space and protein functions.

    No full text
    With the advancement of sequencing and high-throughput screening technologies, large amounts of sequence and compound data have accumulated in biological and chemical databases. However, only a small number of proteins and compounds have been annotated by wet-lab experiments due to the huge compound and chemical space. Therefore, computational methods have been developed to annotate the protein and compound space. In this thesis, we describe the design and implementation of several methods for accurate drug-target interaction prediction and functional annotation of proteins within the framework of the Comprehensive Resource of Biomedical Relations with Deep Learning and Network Representations (CROssBAR) project, whose aim is to integrate biological and chemical data scattered across different sources and to create deep learning based prediction methods for drug discovery. The first method, DEEPred, is a sequence-based automated protein function prediction method that employs a stack of multi-task deep neural networks based on the Gene Ontology (GO) directed acyclic graph hierarchy. The performance of DEEPred was compared with state-of-the-art methods and its source code is available at https://github.com/cansyl/deepred. The second method, DEEPScreen, is a binary drug-target interaction prediction method. In DEEPScreen, the idea is to learn compound features automatically from compound images via convolutional neural networks. DEEPScreen was trained for 704 target proteins, and input compounds are predicted as active or inactive against the trained targets. The performance of DEEPScreen was compared with state-of-the-art methods using different benchmarking datasets. The source code is available at https://github.com/cansyl/DEEPScreen. The third method, MDeePred, is a binding affinity prediction method. MDeePred is a chemogenomic method in which both protein and compound features are fed to a hybrid pairwise deep neural network structure. The main difference between MDeePred and DEEPScreen in terms of features is that MDeePred employs compound-target feature pairs, whereas DEEPScreen uses only compound features. The main novelty of MDeePred is the proposed multi-channel featurization approach for protein sequences, where each channel represents a different property of the input protein sequences. The performance of MDeePred was measured on multiple benchmarking datasets and compared with state-of-the-art methods. The source code for MDeePred is available at https://github.com/cansyl/MDeePred. The fourth method, iBioProVis, is an online interactive visualization tool for chemical space. The main purpose of iBioProVis is to embed and visualize compound features in 2-D space. It relies on the assumption that topologically and chemically similar compounds have similar bioactivity profiles. The inputs for iBioProVis are target protein identifiers and, optionally, SMILES strings of user-input compounds. The tool generates circular fingerprints for the active compounds of the targets and for the user-input compounds, and then uses the t-Stochastic Neighbor Embedding (t-SNE) method to embed the compounds in 2-D space. The tool also provides cross-references to well-known databases for input targets and compounds. iBioProVis is available at https://ibioprovis.kansil.org/. Thesis (Ph.D.) -- Graduate School of Natural and Applied Sciences. Computer Engineering

    Extending the GOPred method to annotate Swiss-Prot and TrEMBL sequences for the entire Gene Ontology and EC numbers.

    No full text
    Traditional protein function annotation methods cannot keep up with the annotation of proteins, as the number of proteins with known sequences is increasing exponentially. For this reason, protein function prediction has become an important research area. In this thesis, the GOPred method is used, with improvements, for the protein function prediction problem. GOPred consists of the SPMap, Blast-kNN and Pepstats methods, which are subsequence, similarity and feature based methods, respectively. The previous version of GOPred was used for the functional classification of proteins based on 300 molecular function Gene Ontology (GO) terms. In this study, the improved system is trained for 514 molecular function, 2909 biological process and 438 cellular component GO terms. The system is also applied to the functional prediction of enzymes based on 851 Enzyme Commission (EC) numbers. Hierarchical evaluation of predictions is proposed to give reliable predictions for EC numbers. In addition, we used a new method to calculate an optimal decision threshold for each functional term; only predictions whose scores exceed the threshold of the corresponding term are presented. The performance of each functional term is measured separately, and the averages are calculated to evaluate the system. GO term prediction results show that the performance of our system is better for the prediction of multi-functional proteins. To the best of our knowledge, this is the best performance achieved for EC number prediction in the literature. The improved system is tested on about 58 million TrEMBL proteins to compare its predictions with those of the reference systems that provide annotations for the TrEMBL database, namely EMBL, HAMAP, PDB, PIR, PIRNR and RuleBase. Results show that most of the predictions given by our system are consistent with the predictions given by the other systems. M.S. - Master of Science
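
    The abstract above does not say how the per-term optimal thresholds are computed. As a rough illustration only, the sketch below picks, for each functional term, the score cut-off that maximizes F1 on labelled validation data. The F1 criterion, the function names and the toy scores are assumptions made for this example, not the method used in the thesis.

```python
# Hypothetical sketch: per-term decision thresholds chosen by maximizing F1.
# The selection criterion (F1) and all names here are illustrative assumptions.

def f1_at_threshold(scores, labels, threshold):
    """F1 score when every score >= threshold is treated as a positive prediction."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Try each observed score as a cut-off; on ties, prefer the higher cut-off."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: (f1_at_threshold(scores, labels, t), t))


def select_thresholds(term_scores, term_labels):
    """Compute one threshold per functional term (e.g. a GO term or EC number)."""
    return {
        term: best_threshold(term_scores[term], term_labels[term])
        for term in term_scores
    }


if __name__ == "__main__":
    # Toy validation data: classifier scores and true labels per term.
    term_scores = {"term_A": [0.9, 0.8, 0.4, 0.2], "term_B": [0.3, 0.7, 0.6]}
    term_labels = {"term_A": [1, 1, 0, 0], "term_B": [0, 1, 1]}
    print(select_thresholds(term_scores, term_labels))
```

    Keeping a separate threshold per term lets terms with few training examples use a different cut-off from well-populated ones, which a single global threshold cannot do.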

    Unsupervised identification of redundant domain entries in InterPro database using clustering techniques

    No full text
    InterPro is a widely used database that integrates functional signatures provided by different protein sequence annotation databases with manual curation, in order to present a comprehensive database of functional sequence annotation. However, the integration of the signatures causes inconsistent and/or redundant annotations in some cases. In this study, we propose an unsupervised method for the automatic detection of inconsistent and redundant entries in the InterPro database. Two clustering methods, the Markov Cluster Algorithm (MCL) and hierarchical clustering, are employed to investigate to what extent these entries can be detected. Results show that a considerable share (~75%) of the redundant entries can be identified. The future goal is to develop a system that identifies redundant and inconsistent signatures with very high performance using supervised machine learning techniques. The findings of the study may help InterPro curators to fix the problematic entries. The method may also be used by curators as a road map before the integration of new signatures.

    Methods for Large-Scale Prediction of Protein Functions via Subsequence Analysis

    No full text
    Automatic protein function annotation is one of the important and difficult problems of computational biology. In our research group's earlier work, we developed the Subsequence Profile Map (SPMap) system for the functional classification of protein sequences, using subsequence similarity based feature mapping with the help of machine learning methods. The aims of this project are to redesign and reimplement certain steps in order to improve the SPMap system, to bring to light points that had not been addressed before, and, after applying both to a large-scale dataset, to make all results publicly available.

    Multi-task Deep Neural Networks in Protein Function Prediction

    No full text
    In recent years, deep learning algorithms have outperformed state-of-the-art methods in several areas thanks to efficient methods for training and for preventing overfitting, advances in computer hardware, and the availability of vast amounts of data. The high performance of multi-task deep neural networks in drug discovery has drawn attention to deep learning algorithms in bioinformatics. Here, we propose a hierarchical multi-task deep neural network architecture based on Gene Ontology (GO) terms as a solution to the protein function prediction problem, and investigate various aspects of the proposed architecture through several experiments. First, we showed that there is a positive correlation between the performance of the system and the size of the training datasets. Second, we investigated whether the level of GO terms in the GO hierarchy is related to their performance, and found no relation between the depth of a GO term in the hierarchy and its performance. In addition, we included all annotations in the training of a set of GO terms to investigate whether adding noisy data to the training datasets changes the performance of the system. The results showed that including less reliable annotations in the training of deep neural networks significantly increased the performance of low-performing GO terms. We evaluated the performance of the system using a hierarchical evaluation method. The Matthews correlation coefficient was calculated as 0.75, 0.49 and 0.63 for the molecular function, biological process and cellular component categories, respectively. We showed that deep learning algorithms have great potential in protein function prediction. We plan to further improve DEEPred by including other types of annotations from various biological data sources, and to make DEEPred available as an open access online tool.
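
    For readers unfamiliar with the metric reported above, the following minimal sketch computes the Matthews correlation coefficient for a single binary GO term from true and predicted labels. It shows the standard formula only; it does not reproduce the hierarchical evaluation procedure described in the abstract, and the function name and toy labels are assumptions made for this example.

```python
import math

# Minimal sketch of the standard binary Matthews correlation coefficient (MCC).
# Labels: 1 = protein annotated with the GO term, 0 = not annotated.
# This is not the hierarchical evaluation used in the study, only the base metric.

def matthews_corrcoef(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Convention: MCC is taken as 0 when any marginal count is zero.
        return 0.0
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    # Toy example: four proteins, one true positive missed by the predictor.
    print(matthews_corrcoef([1, 1, 0, 0], [1, 0, 0, 0]))
```

    MCC ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), and it stays informative when positives are much rarer than negatives, as is typical for individual GO terms.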

    Investigation of Multi-task Deep Neural Networks in Automated Protein Function Prediction

    No full text
    Functional annotation of proteins is a crucial research field for understanding the molecular mechanisms of living organisms and for biomedical purposes (e.g. identifying disease-causing functional changes in genes and discovering novel drugs). Several Gene Ontology (GO) based protein function prediction methods have been proposed in the last decade to annotate proteins. However, considering the prediction performance of the proposed methods, there is still room for significant improvement in protein function prediction (1). Deep learning techniques have become popular in recent years and have turned into an industry standard in several areas such as computer vision and speech recognition. To the best of our knowledge, as of today, deep learning algorithms have not been applied to the large-scale protein function prediction problem. Here, we propose a hierarchical multi-task deep neural network architecture, DEEPred, as a solution to the protein function prediction problem. First, we investigated the potential of employing deep learning methods for protein function prediction. For this purpose, we measured the performance of our models at different parameter settings. Furthermore, we examined the relationship between the performance of the system and the size of the training datasets, since training set size has been reported in the literature to significantly affect the performance of deep learning models.