145 research outputs found

    Joint learning from multiple information sources for biological problems

    Thanks to technological advancements, more and more biological data have been generated in recent years. Data availability offers unprecedented opportunities to look at the same problem from multiple aspects. It also unveils a more global view of the problem that takes into account the intricate interplay between the involved molecules/entities. Nevertheless, biological datasets are biased, limited in quantity, and contain many false-positive samples. Such challenges often drastically degrade the performance of a predictive model on unseen data and, thus, limit its applicability in real biological studies. Human learning is a multi-stage process in which we usually start with simple things. Through knowledge accumulated over time, our cognitive ability extends to more complex concepts. Children learn to speak simple words before being able to formulate sentences. Similarly, being able to speak correct sentences supports our learning to speak correct and meaningful paragraphs, etc. Generally, knowledge acquired from related learning tasks helps boost our learning capability in the current task. Motivated by this phenomenon, in this thesis we study supervised machine learning models for bioinformatics problems that can improve their performance by exploiting multiple related knowledge sources. More specifically, we are concerned with ways to enrich the supervised models’ knowledge base with publicly available related data to enhance the computational models’ prediction performance. Our work shares commonality with existing works in multimodal learning, multi-task learning, and transfer learning. Nevertheless, there are certain differences in some cases. Besides the proposed architectures, we present large-scale experiment setups with consensus evaluation metrics, along with the creation and release of large datasets, to showcase our approaches’ superiority. Moreover, we add case studies with detailed analyses in which we make no simplifying assumptions, to demonstrate the systems’ utility in realistic application scenarios. Finally, we develop and make available an easy-to-use website for non-expert users to query the model’s generated prediction results, to facilitate field experts’ assessments and adaptation. We believe that our work serves as one of the first steps in bridging the gap between “Computer Science” and “Biology” that will open a new era of fruitful collaboration between computer scientists and biological field experts.

    Issues in performance evaluation for host–pathogen protein interaction prediction

    The study of interactions between host and pathogen proteins is important for understanding the underlying mechanisms of infectious diseases and for developing novel therapeutic solutions. Wet-lab techniques for detecting protein–protein interactions (PPIs) can benefit from computational predictions. Machine learning is one of the computational approaches that can assist biologists by predicting promising PPIs. A number of machine-learning-based methods for predicting host–pathogen interactions (HPI) have been proposed in the literature. The techniques used for assessing the accuracy of such predictors are of critical importance in this domain. In this paper, we question the effectiveness of K-fold cross-validation for estimating the generalization ability of HPI prediction for proteins with no known interactions. K-fold cross-validation does not model this scenario, and we demonstrate a sizable difference between the performance it reports and the performance reported by an alternative evaluation scheme called leave-one-pathogen-protein-out (LOPO) cross-validation. LOPO is more effective in modeling the real-world use of HPI predictors, specifically for cases in which no information about the interacting partners of a pathogen protein is available during training. We also point out that currently used metrics, such as areas under the precision-recall or receiver operating characteristic curves, are not intuitive to biologists, and we propose simpler and more directly interpretable metrics for this purpose.
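
    The LOPO scheme described above can be illustrated with a small sketch. It assumes each example is a (pathogen protein, host protein) pair and holds out every pair involving one pathogen protein per fold, so the model is always tested on a pathogen protein it has not seen. The function name `lopo_splits` and the toy identifiers are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict


def lopo_splits(pairs):
    """Yield (held_out_protein, train_idx, test_idx) for LOPO cross-validation.

    `pairs` is a list of (pathogen_protein, host_protein) tuples. Every example
    that involves the held-out pathogen protein goes into the test fold.
    """
    by_pathogen = defaultdict(list)
    for i, (pathogen, _host) in enumerate(pairs):
        by_pathogen[pathogen].append(i)

    for pathogen in sorted(by_pathogen):
        test_idx = by_pathogen[pathogen]
        held_out = set(test_idx)
        train_idx = [i for i in range(len(pairs)) if i not in held_out]
        yield pathogen, train_idx, test_idx


# Hypothetical toy data: the identifiers are placeholders, not real proteins.
pairs = [("p1", "hA"), ("p1", "hB"), ("p2", "hA"), ("p3", "hC"), ("p2", "hC")]
folds = list(lopo_splits(pairs))
for pathogen, train_idx, test_idx in folds:
    print(pathogen, train_idx, test_idx)
```

    By contrast, a random K-fold split of the same five pairs could put ("p1", "hA") in the training fold and ("p1", "hB") in the test fold, which does not model a pathogen protein with no known interactions.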

    Self-Paced Multitask Learning with Shared Knowledge

    This paper introduces self-paced task selection to multitask learning, where instances from more closely related tasks are selected in a progression of easier-to-harder tasks, emulating an effective human education strategy applied to multitask machine learning. We develop the mathematical foundation for the approach based on iterative selection of the most appropriate task, learning the task parameters, and updating the shared knowledge, optimizing a new bi-convex loss function. The proposed method applies quite generally, including to multitask feature learning, multitask learning with alternating structure optimization, etc. Results show that, in each of the above formulations, self-paced (easier-to-harder) task selection outperforms the baseline version of these methods in all the experiments.

    Semi-supervised multi-task learning for predicting interactions between HIV-1 and human proteins

    Motivation: Protein–protein interactions (PPIs) are critical for virtually every biological function. Recently, researchers suggested using supervised learning for the task of classifying pairs of proteins as interacting or not. However, its performance is largely restricted by the availability of truly interacting proteins (labeled). Meanwhile, there exist a considerable number of protein pairs where an association appears between two partners, but there is not enough experimental evidence to support it as a direct interaction (partially labeled).

    A Fused Elastic Net Logistic Regression Model for Multi-Task Binary Classification

    Multi-task learning has been shown to significantly enhance the performance of multiple related learning tasks in a variety of situations. We present the fused logistic regression, a sparse multi-task learning approach for binary classification. Specifically, we introduce sparsity-inducing penalties over parameter differences of related logistic regression models to encode similarity across related tasks. The resulting joint learning task is cast into a form that lends itself to efficient optimization with a recursive variant of the alternating direction method of multipliers. We show results on synthetic data, describe the regime of settings where our multi-task approach achieves significant improvements over the single-task learning approach, and discuss the implications of applying the fused logistic regression in different real-world settings.
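
    To make the idea of penalizing parameter differences concrete, here is a minimal sketch of one plausible objective. The exact form is an assumption based on the title and abstract: per-task logistic loss, an elastic net penalty (L1 plus squared L2) on each weight vector, and an L1 "fusion" penalty on the differences between weight vectors of related tasks. The function names, the penalty weights, and the `related` list of task pairs are hypothetical; the sketch only evaluates the objective and does not implement the ADMM solver mentioned in the abstract.

```python
import math


def logistic_loss(w, X, y):
    """Mean logistic loss for labels y in {-1, +1}."""
    total = 0.0
    for x, label in zip(X, y):
        margin = label * sum(wi * xi for wi, xi in zip(w, x))
        # log(1 + exp(-margin)), written in a numerically stable form
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(X)


def fused_objective(weights, tasks, related, lam_l1, lam_l2, lam_fuse):
    """Assumed objective: losses + elastic net + L1 penalty on weight differences.

    weights: one weight vector per task
    tasks:   one (X, y) pair per task
    related: list of (s, t) task-index pairs considered related (assumption)
    """
    loss = sum(logistic_loss(w, X, y) for w, (X, y) in zip(weights, tasks))
    l1 = sum(abs(wi) for w in weights for wi in w)
    l2 = sum(wi * wi for w in weights for wi in w)
    fuse = sum(
        abs(a - b)
        for s, t in related
        for a, b in zip(weights[s], weights[t])
    )
    return loss + lam_l1 * l1 + lam_l2 * l2 + lam_fuse * fuse


# Hypothetical toy problem with two related tasks and two features.
tasks = [
    ([[1.0, 0.0], [0.0, 1.0]], [1, -1]),
    ([[1.0, 1.0]], [1]),
]
weights = [[1.0, 0.0], [0.5, 0.0]]
value = fused_objective(weights, tasks, [(0, 1)], lam_l1=0.1, lam_l2=0.1, lam_fuse=1.0)
print(value)
```

    Because the fusion term is an L1 norm of differences, minimizing it tends to make individual coordinates of related tasks exactly equal, which is how such a penalty encodes similarity across tasks.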

    Inter-Species/Host-Parasite Protein Interaction Predictions Reviewed

    Background: Host-parasite protein interactions (HPPI) are those interactions occurring between a parasite and its host. Studying host-parasite protein interactions enhances the understanding of how a parasite can infect its host. These interactions play an important role in initiating infections, although not all host-parasite interactions result in infection. Identifying the protein-protein interactions (PPIs) that allow a parasite to infect its host has a lot to do with discovering possible drug targets. Such PPIs, when altered, would prevent the host from being infected by the parasite and, in some cases, result in the parasite's inability to complete specific stages of its life cycle, invariably leading to the death of the parasite. It therefore becomes important to understand the workings of host-parasite interactions, which are the major causes of most infectious diseases.

    Prediction of human-Bacillus anthracis protein–protein interactions using multi-layer neural network

    Triplet amino acids have successfully been included in feature selection to predict human-HPV protein-protein interactions (PPI). The utility of supervised learning methods is curtailed because experimental data are not available in sufficient quantities. Improvements in machine learning techniques and feature selection will enhance the study of PPI between host and pathogen. We present a comparison of a neural network model versus SVM for prediction of host-pathogen PPI based on a combination of features including amino acid quadruplets, pairwise sequence similarity, and human interactome properties. The neural network and SVM were implemented using the Python Sklearn library. The neural network model using quadruplet features and other network features outperforms the SVM model. The models are tested against published predictors and then applied to the human-B. anthracis case. Gene ontology term enrichment analysis identifies immune response and regulation as functions of interacting proteins. For prediction of human-viral PPI, our model (neural network) is a significant improvement in overall performance compared to a predictor using the triplets feature, and it achieves good accuracy in predicting human-B. anthracis PPI.
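
    As a rough illustration of what a quadruplet feature can look like, the sketch below counts overlapping amino acid 4-mers in a protein sequence and normalizes them to frequencies. This is an assumption about the general technique, not the paper's exact encoding: the abstract does not say whether residues are first grouped into classes to reduce dimensionality, and the function name and example sequence are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_frequencies(sequence, k=4):
    """Return normalized counts of overlapping k-mers of standard residues.

    k-mers containing any non-standard character (for example 'X') are skipped.
    The result is a sparse dict; with k=4 there are 20**4 possible keys.
    """
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(c in AMINO_ACIDS for c in kmer):
            counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


# Hypothetical short peptide, used only to show the shape of the output.
freqs = kmer_frequencies("MKVLAMKVL", k=4)
for kmer in sorted(freqs):
    print(kmer, round(freqs[kmer], 3))
```

    A feature vector for a host-pathogen pair could then be built by combining the two proteins' frequency vectors with the other features the abstract lists; how the paper combines them is not stated there.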

    Mimicry Embedding Facilitates Advanced Neural Network Training for Image-Based Pathogen Detection.

    The use of deep neural networks (DNNs) for analysis of complex biomedical images shows great promise but is hampered by a lack of large verified data sets for rapid network evolution. Here, we present a novel strategy, termed "mimicry embedding," for rapid application of neural network architecture-based analysis of pathogen imaging data sets. Embedding of a novel host-pathogen data set, such that it mimics a verified data set, enables efficient deep learning using high expressive capacity architectures and seamless architecture switching. We applied this strategy across various microbiological phenotypes, from superresolved viruses to in vitro and in vivo parasitic infections. We demonstrate that mimicry embedding enables efficient and accurate analysis of two- and three-dimensional microscopy data sets. The results suggest that transfer learning from pretrained network data may be a powerful general strategy for analysis of heterogeneous pathogen fluorescence imaging data sets.
    IMPORTANCE: In biology, the use of deep neural networks (DNNs) for analysis of pathogen infection is hampered by a lack of large verified data sets needed for rapid network evolution. Artificial neural networks detect handwritten digits with high precision thanks to large data sets, such as MNIST, that allow nearly unlimited training. Here, we developed a novel strategy we call mimicry embedding, which allows artificial intelligence (AI)-based analysis of variable pathogen-host data sets. We show that deep learning can be used to detect and classify single pathogens based on small differences

    Bradyrhizobium diazoefficiens USDA 110–glycine max interactome provides candidate proteins associated with symbiosis

    Although the legume–rhizobium symbiosis is a highly important biological process, there is limited knowledge about the protein interaction network between host and symbiont. Using interolog- and domain-based approaches, we constructed an interspecies protein interactome containing 5115 protein–protein interactions between 2291 Glycine max and 290 Bradyrhizobium diazoefficiens USDA 110 proteins. The interactome was further validated by expression pattern analysis in nodules, gene ontology term semantic similarity, co-expression analysis, and luciferase complementation image assay. In the G. max–B. diazoefficiens interactome, bacterial proteins are mainly ion channels and transporters of carbohydrates and cations, while G. max proteins are mainly involved in the processes of metabolism, signal transduction, and transport. We also identified the top 10 highly interacting proteins (hubs) for each species. Kyoto Encyclopedia of Genes and Genomes pathway analysis for each hub showed that a pair of 14-3-3 proteins (SGF14g and SGF14k) and 5 heat shock proteins in G. max are possibly involved in symbiosis, and 10 hubs in B. diazoefficiens may be important symbiotic effectors. Subnetwork analysis showed that 18 symbiosis-related soluble N-ethylmaleimide sensitive factor attachment protein receptor proteins may play roles in regulating bacterial ion channels, and SGF14g and SGF14k possibly regulate the rhizobium dicarboxylate transport protein DctA. The predicted interactome provides a valuable basis for understanding the molecular mechanism of nodulation in soybean.
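
    The interolog idea mentioned above can be sketched as follows: if two proteins interact in a template data set and each has an ortholog in one of the two species of interest, the ortholog pair is predicted to interact. This is a minimal sketch of the general technique only; the paper's actual pipeline, thresholds, and the domain-based component are not described in the abstract, and every identifier below is a hypothetical placeholder.

```python
def predict_interologs(template_ppis, host_orthologs, symbiont_orthologs):
    """Transfer template interactions to (host, symbiont) protein pairs.

    template_ppis:      iterable of (a, b) interacting template proteins
    host_orthologs:     dict mapping template protein -> list of host proteins
    symbiont_orthologs: dict mapping template protein -> list of symbiont proteins

    Interactions are treated as undirected, so both orientations are tried.
    """
    predicted = set()
    for a, b in template_ppis:
        for x, y in ((a, b), (b, a)):
            for h in host_orthologs.get(x, ()):
                for s in symbiont_orthologs.get(y, ()):
                    predicted.add((h, s))
    return predicted


# Hypothetical toy inputs; these are not real protein identifiers.
template_ppis = [("T1", "T2"), ("T3", "T4")]
host_orthologs = {"T1": ["Gm_A"], "T4": ["Gm_B", "Gm_C"]}
symbiont_orthologs = {"T2": ["Bd_X"], "T3": ["Bd_Y"]}

predicted = predict_interologs(template_ppis, host_orthologs, symbiont_orthologs)
for pair in sorted(predicted):
    print(pair)
```

    Predictions produced this way are candidates only, which is consistent with the abstract's use of additional evidence such as co-expression and experimental assays for validation.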

    Prediction of host-pathogen protein interactions using computational methods

    Made available in full text in accordance with the “Law on Amendments to the Higher Education Law and Certain Laws and Decree Laws” published in Official Gazette No. 30352 dated 06.03.2018 and the “Directive on the Collection, Organization, and Opening to Access of Graduate Theses in Electronic Form” dated 18.06.2018.
    Knowledge of inter-species pathogen-host protein interactions is vital for developing strategies for the diagnosis and treatment of infectious diseases. Because experimental methods for detecting interactions are costly and time-consuming, computational methods that model interactions between proteins play an important role in this field. Besides reducing detection time and cost, computational methods are also used to check interactions that were falsely detected by experimental methods. Data sparsity, data inadequacy, and the lack of a verified negative data set are common problems of the computational methods used for pathogen-host protein interaction prediction. The aim of this study is to increase the accuracy of pathogen-host interaction prediction and to mitigate the drawbacks caused by data inadequacy. To this end, an extended network model and location-based encoding methods are proposed. First, the extended network model was inspired by the hypothesis that, where inter-species interaction data are insufficient, integrating pathogen-host interactions with the known intra-species interactions of the pathogen and host proteins improves prediction accuracy. Second, location-based encoding is a feature extraction method that encodes the amino acid sequences of proteins. The features used are one of the factors that affect the performance of machine learning algorithms in pathogen-host interaction prediction. In biological databases, amino acid sequence information is the most abundant data available for proteins. A robust feature extraction method based solely on the amino acid sequence will therefore increase the accuracy of pathogen-host interaction prediction. Furthermore, using amino acid sequence information makes it easier to extract feature vectors for all known interactions. In this thesis, PROSES (Protein Sequence-based Encoding System), a freely accessible web-based software tool with a user-friendly interface, was developed for researchers working on protein encoding and protein interaction prediction. The software is especially useful for those who are not familiar with programming. PROSES is currently available at http://proses.yalova.edu.tr, hosted on the Yalova University web server.