13 research outputs found

    Deep Learning Models for Predicting Phenotypic Traits and Diseases from Omics Data

    Computational analysis of high-throughput omics data, such as gene expression, copy number alterations and DNA methylation (DNAm), has become popular in disease studies in recent decades because such analyses can help predict whether a patient has a certain disease or one of its subtypes. However, because the data sets are high-dimensional, with hundreds of thousands of variables and a very small number of samples, traditional machine learning approaches, such as support vector machines (SVMs) and random forests, are limited in their ability to analyze these data efficiently. In this chapter, we review the progress in applying deep learning algorithms to biological questions. The focus is on potential software tools and public data sources for these tasks. In particular, we present case studies that use deep neural network (DNN) models to classify molecular subtypes of breast cancer, and DNN-based regression models that use DNAm profiles of peripheral blood samples to account for interindividual variation in triglyceride concentrations measured at different visits. We show that integrating multi-omics profiles into DNN-based learning methods could improve the prediction of the molecular subtypes of breast cancer. We also demonstrate the superiority of our proposed DNN models over the SVM model for predicting triglyceride concentrations.

    Hierarchical representation for PPI sites prediction

    Background: Protein–protein interactions have pivotal roles in life processes, and aberrant interactions are associated with various disorders. Interaction site identification is key for understanding disease mechanisms and designing new drugs. Effective and efficient computational methods for PPI prediction are of great value given the overall cost of experimental methods. Promising results have been obtained using machine learning methods and deep learning techniques, but their effectiveness depends on protein representation and feature selection. Results: We define a new abstraction of the protein structure, called hierarchical representations, which considers and quantifies spatial and sequential neighboring among amino acids. We also investigate the effect of molecular abstractions using Graph Convolutional Networks to classify amino acids as interface or non-interface residues. Our study takes into account three abstractions (hierarchical representations, the contact map, and the residue sequence) and considers the eight functional classes of proteins extracted from the Protein–Protein Docking Benchmark 5.0. The performance of our method, evaluated using standard metrics, is compared to that of some state-of-the-art protein interface predictors. The analysis of the performance values shows that our method outperforms the considered competitors when the considered molecules are structurally similar. Conclusions: The hierarchical representation can capture the structural properties that promote the interactions and can be used to represent proteins with unknown structures by encoding only their sequential neighboring. Analyzing the results, we conclude that classes should be arranged according to their architectures rather than their functions.

    Machine learning solutions for predicting protein–protein interactions

    Proteins are social molecules. Recent experimental evidence supports the notion that large protein aggregates, known as biomolecular condensates, affect many biological processes structurally and functionally. Condensate formation may be permanent and/or time dependent, suggesting that biological processes can occur locally, depending on the needs of the cell. The question then arises as to what extent we can monitor protein-aggregate formation, both experimentally and theoretically, and then predict or simulate functional aggregate formation. The available data cover mesoscopic interaction networks at the proteome level, protein-binding affinity data, and interacting protein complexes solved at atomic resolution. Powerful algorithms based on machine learning (ML) can extract information from data sets and infer properties of never-seen-before examples. ML tools address the problem of protein–protein interactions (PPIs) using different data sets, input features, and architectures. According to recent publications, deep learning is the most successful method. However, in ML-based computational biology, convincing evidence of a success story comes from performing general benchmarks on blind datasets. Results indicate that the state-of-the-art ML approaches, based on traditional and/or deep learning, can still be improved, irrespective of the power of the method and the richness of the input features. This being the case, it is evident that powerful methods are still not trained on the whole possible spectrum of PPIs and that more investigations are necessary to complete our knowledge of PPI-functional interaction.

    Reciprocal Perspective for Improved Protein-Protein Interaction Prediction

    All protein-protein interaction (PPI) predictors require the determination of an operational decision threshold when differentiating positive PPIs from negatives. Historically, a single global threshold, typically optimized via cross-validation testing, is applied to all protein pairs. However, we here use data visualization techniques to show that no single decision threshold is suitable for all protein pairs, given the inherent diversity of protein interaction profiles. The recent development of high-throughput PPI predictors has enabled the comprehensive scoring of all possible protein-protein pairs. This, in turn, allows a PPI to be evaluated within the context of all possible predictions. Leveraging this context, we introduce a novel modeling framework called Reciprocal Perspective (RP), which estimates a localized threshold on a per-protein basis using several rank order metrics. By considering a putative PPI from the perspective of each of the proteins within the pair, RP rescores the predicted PPI and applies a cascaded Random Forest classifier, leading to improvements in recall and precision. We here validate RP using two state-of-the-art PPI predictors, the Protein-protein Interaction Prediction Engine and the Scoring PRotein INTeractions methods, over five organisms: Homo sapiens, Saccharomyces cerevisiae, Arabidopsis thaliana, Caenorhabditis elegans, and Mus musculus. Results demonstrate that applying a post hoc RP rescoring layer significantly improves classification (p < 0.001) in all cases over all organisms, and that this rescoring approach can be applied to any PPI prediction method.
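
The idea of viewing a putative pair from each partner's perspective can be illustrated with a small sketch. This is not the RP implementation: the abstract does not specify its rank order metrics, so a plain 1-based rank is used here as a stand-in, and the cascaded Random Forest step is omitted. The nested-dictionary score layout and the function names are assumptions made for illustration.

```python
# Illustrative sketch only: plain ranks stand in for the unspecified
# "rank order metrics" of the abstract; names and data layout are hypothetical.

def rank_in_profile(scores, protein, partner):
    """1-based rank of `partner` among all scored partners of `protein`,
    ordered from highest to lowest predicted score."""
    profile = scores[protein]
    ordered = sorted(profile, key=lambda p: profile[p], reverse=True)
    return ordered.index(partner) + 1


def reciprocal_ranks(scores, a, b):
    """Rank of the pair (a, b) as seen from a's profile and from b's profile."""
    return rank_in_profile(scores, a, b), rank_in_profile(scores, b, a)


# Toy all-against-all scores for three proteins (symmetric by construction).
scores = {
    "A": {"B": 0.90, "C": 0.20},
    "B": {"A": 0.90, "C": 0.95},
    "C": {"A": 0.20, "B": 0.95},
}

ranks_ab = reciprocal_ranks(scores, "A", "B")
```

In this toy example the same score of 0.90 is the top hit from A's perspective but only the second-best from B's, which is the kind of per-protein context a single global threshold cannot express.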

    A guide to machine learning for biologists

    The expanding scale and inherent complexity of biological data have encouraged a growing use of machine learning in biology to build informative and predictive models of the underlying biological processes. All machine learning techniques fit models to data; however, the specific methods are quite varied and can at first glance seem bewildering. In this Review, we aim to provide readers with a gentle introduction to a few key machine learning techniques, including the most recently developed and widely used techniques involving deep neural networks. We describe how different techniques may be suited to specific types of biological data, and also discuss some best practices and points to consider when one is embarking on experiments involving machine learning. Some emerging directions in machine learning methodology are also discussed.

    An Extended Feature Representation Technique for Predicting Sequenced-based Host-pathogen Protein-protein Interaction

    Background: The use of machine learning models in sequence-based Protein-Protein Interaction prediction typically requires the conversion of amino acid sequences into feature vectors. In the literature, two approaches have been used to achieve this transformation: the Independent Protein Feature (IPF) and Merged Protein Feature (MPF) extraction methods. Studies have predominantly adopted the IPF approach, while others have preferred the MPF method, in which host and pathogen sequences are concatenated before feature encoding. Objective: This presents the challenge of determining which approach should be adopted for improved host-pathogen PPI (HPPPI) prediction. Therefore, this work introduces the Extended Protein Feature (EPF) method. Methods: The proposed method combines the predictive capabilities of IPF and MPF, extracting essential features, handling multicollinearity, and removing features with zero importance. EPF, IPF, and MPF were tested using bacteria, parasite, virus, and plant HPPPI datasets with machine learning models including Random Forest (RF), Support Vector Machine (SVM), Multilayer Perceptron (MLP), Naïve Bayes (NB), Logistic Regression (LR), and Deep Forest (DF). Results: MPF exhibited the lowest performance overall, whereas IPF performed better with decision tree-based models, such as RF and DF. In contrast, EPF demonstrated improved performance with SVM, LR, NB, and MLP and also yielded competitive results with DF and RF. Conclusion: The EPF approach developed in this study exhibits substantial improvements in four out of the six models evaluated. This suggests that EPF is competitive with IPF and is particularly well-suited for traditional machine learning models.
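
The difference between the two extraction orders can be sketched as follows. This is a minimal illustration, not the paper's pipeline: amino acid composition is an assumed stand-in for whatever encoding the study used, the function names are hypothetical, and reading IPF as "encode each sequence separately, then join the vectors" is inferred from its name. The EPF selection steps (multicollinearity handling, removal of zero-importance features) are not shown.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature positions are stable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Fraction of each standard amino acid in `seq` (assumed encoding)."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def ipf_features(host, pathogen):
    """Independent: encode each protein separately, then join the two
    feature vectors (20 + 20 = 40 values)."""
    return composition(host) + composition(pathogen)


def mpf_features(host, pathogen):
    """Merged: concatenate the sequences first, then encode once (20 values)."""
    return composition(host + pathogen)


ipf = ipf_features("AA", "CC")
mpf = mpf_features("AA", "CC")
```

With the toy pair above, IPF keeps the host and pathogen compositions in separate positions, while MPF averages them into one vector and so loses which protein contributed which residues.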

    Machine learning approaches for solving classification problems of immune peptide types

    Peptides play an important role in all aspects of the immunological reactions to invading cancer and pathogen cells. It has been known for over 40 years that peptides play a critical role in mobilizing the immune system against foreign invaders. Since then, new knowledge about the generation and function of peptides in immunology has supported efforts to harness the immune system to treat disease. Yet, with little immunological insight, most of the highly productive treatments, including vaccines, have been developed empirically. Nonetheless, increased knowledge of the biology of antigen processing, as well as of the chemistry and pharmacological properties of antigenic and antimicrobial peptides, has now permitted the development of drugs and vaccines. Given advances in technology, it is vitally important to develop automatic computational methods for rapidly and accurately predicting immune peptides. In this thesis, the author focuses on machine learning approaches for addressing classification problems of four types of immune peptides (anti-inflammatory, proinflammatory, anti-tuberculosis, and linear B-cell peptides). Therapeutic peptides for numerous inflammatory diseases and autoimmune disorders have received substantial attention; however, the exploration of anti-inflammatory peptides via biological experiments is often a time-consuming and expensive task. Novel in silico predictors are needed to classify potential anti-inflammatory peptides prior to in vitro investigation. Herein, an accurate predictor called PreAIP (Predictor of Anti-Inflammatory Peptides) was developed by integrating multiple complementary features. We systematically investigated different types of features, including primary sequence, evolutionary, and structural information, through a random forest classifier. The final PreAIP model achieved an AUC value of 0.833 on the training dataset in a 10-fold cross-validation test, which was better than that of existing models.
Moreover, PreAIP achieved an AUC value of 0.840 on a test dataset, demonstrating that the proposed method outperformed the two existing methods. These results indicate that PreAIP is an accurate predictor for identifying anti-inflammatory peptides and contributes to the development of anti-inflammatory peptide therapeutics and to biomedical research. The curated datasets and PreAIP are freely available at http://kurata14.bio.kyutech.ac.jp/PreAIP/. Proinflammatory peptides (PIPs) are signaling molecules secreted from immune cells that contribute to the first line of defense against invading pathogens. Numerous experiments have shown that PIPs play an important role in applications such as vaccines and immunotherapeutic drugs. Because high-throughput laboratory methods are time-consuming and costly, effective computational methods are in great demand for the timely and accurate identification of PIPs. Thus, in this study, we proposed a computational model combining multiple feature representations, called ProIn-Fuse, to improve the performance of PIP identification. Specifically, a feature representation learning model was used to generate a set of informative probabilistic features by using random forest models with eight sequence encoding schemes. Finally, ProIn-Fuse was constructed by linearly combining the models of the informative probabilistic features. In an independent test of generalization capability, ProIn-Fuse yielded an accuracy of 0.746, which was over 10% higher than that obtained by the state-of-the-art PIP predictors. Cross-validation and independent test results consistently demonstrated that ProIn-Fuse is more precise and promising in the identification of PIPs than existing PIP predictors. The web server, datasets and online instructions are freely accessible at http://kurata14.bio.kyutech.ac.jp/ProIn-Fuse/.
We believe that the proposed ProIn-Fuse can facilitate faster and broader applications of PIPs in drug design and development. Tuberculosis (TB) is a leading killer caused by Mycobacterium tuberculosis. Recently, anti-TB peptides have provided an alternative approach to combat antibiotic tolerance. Herein, we have developed an effective computational predictor, iAntiTB (identification of anti-tubercular peptides), that integrates multiple feature vectors derived from the amino acid sequences via Random Forest (RF) and Support Vector Machine (SVM) classifiers. iAntiTB combines the RF and SVM scores via linear regression to enhance the prediction accuracy. To make a robust and accurate predictor, we prepared two datasets with different types of negative samples. iAntiTB achieved AUC values of 0.896 and 0.946 on the training datasets of the first and second datasets, respectively, and outperformed the other existing predictors. Thus, iAntiTB is a robust and accurate predictor that is helpful for researchers working on peptide therapeutics and immunotherapy. All the employed datasets and the software application are accessible at http://kurata14.bio.kyutech.ac.jp/iAntiTB/. Linear B-cell peptides are critically important for immunological applications such as vaccine design, immunodiagnostic tests, antibody production, and disease diagnosis and therapy. The accurate identification of linear B-cell peptides remains challenging despite several decades of research. In this work, we have developed a novel predictor, iLBE (Identification of B-Cell Epitope), by integrating evolutionary and sequence-based features. The successive feature vectors were optimized by a Wilcoxon rank-sum test. The random forest (RF) algorithm then used the optimal consecutive feature vectors to predict linear B-cell epitopes. We combined the RF scores by logistic regression to enhance the prediction accuracy.
The final iLBE yielded an AUC score of 0.809 on the training dataset and outperformed other existing prediction models on a comprehensive independent dataset. iLBE is suggested to be a powerful computational tool for identifying linear B-cell peptides and developing penetrating diagnostic tests. The iLBE web application with curated datasets is freely accessible at http://kurata14.bio.kyutech.ac.jp/iLBE/. Taken together, the above results suggest that our proposed predictors (PreAIP, ProIn-Fuse, iAntiTB, and iLBE) would be helpful computational resources for the prediction of anti-inflammatory, proinflammatory, anti-tuberculosis, and linear B-cell peptides.
Kyushu Institute of Technology doctoral dissertation. Degree number: 情工博甲第358号. Degree conferred: 25 March 2021 (Reiwa 3).
Contents: 1 Introduction | 2 Prediction of Anti-Inflammatory Peptides by Integrating Multiple Complementary Features | 3 Prediction of Proinflammatory Peptides by Fusing of Multiple Feature Representations | 4 Prediction of Anti-Tubercular Peptides by Exploiting Amino Acid Pattern and Properties | 5 Prediction of Linear B-Cell Epitopes by Integrating Sequence and Evolutionary Features | 6 Conclusions and Perspectives
Kyushu Institute of Technology, 2020 (Reiwa 2)