15 research outputs found

    Identification of Protein Pupylation Sites Using Bi-Profile Bayes Feature Extraction and Ensemble Learning

    Pupylation, one of the most important posttranslational modifications of proteins, typically takes place when prokaryotic ubiquitin-like protein (Pup) is attached to specific lysine residues on a target protein. Identification of pupylation substrates and their corresponding sites will facilitate the understanding of the molecular mechanism of pupylation. Compared with labor-intensive and time-consuming experimental approaches, computational prediction of pupylation sites is highly desirable for its convenience and speed. In this study, a new bioinformatics tool named EnsemblePup was developed that uses an ensemble of support vector machine classifiers to predict pupylation sites. The key feature of EnsemblePup is its use of Bi-profile Bayes feature extraction as the encoding scheme. Using 5-fold cross-validation on the training dataset, EnsemblePup achieved a sensitivity of 79.49%, a specificity of 82.35%, an accuracy of 85.43%, and a Matthews correlation coefficient of 0.617. When compared with other existing methods on a benchmark dataset, EnsemblePup provided better predictive performance, with a sensitivity of 80.00%, a specificity of 83.33%, an accuracy of 82.00%, and a Matthews correlation coefficient of 0.629. These results suggest that EnsemblePup may be useful for identifying and annotating potential pupylation sites in proteins of interest. A web server for predicting pupylation sites was developed.
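
    The four measures reported in this abstract can all be derived from a confusion matrix. The sketch below shows the standard formulas; the counts (16 TP, 4 FN, 25 TN, 5 FP) are hypothetical, chosen only because they happen to reproduce the benchmark percentages quoted above. The actual dataset sizes are not stated in the abstract.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy, MCC) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # fraction of true sites recovered
    specificity = tn / (tn + fp)            # fraction of non-sites rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, accuracy, mcc

# Hypothetical counts (an assumption, not taken from the paper).
sn, sp, acc, mcc = binary_metrics(tp=16, fn=4, tn=25, fp=5)
print(f"Sn={sn:.4f} Sp={sp:.4f} Acc={acc:.4f} MCC={mcc:.3f}")
```

    Because accuracy is a weighted average of sensitivity and specificity, it always lies between the two, which makes it a quick sanity check on reported figures.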

    JPPRED: Prediction of Types of J-Proteins from Imbalanced Data Using an Ensemble Learning Method

    Computational identification of microbial phosphorylation sites by the enhanced characteristics of sequence information

    Protein phosphorylation on serine (S) and threonine (T) has emerged as a key mechanism in the control of many biological processes. Recently, phosphorylation in microbial organisms has attracted much attention for its critical roles in various cellular processes such as cell growth and cell division. Here, a novel machine learning predictor, MPSite (Microbial Phosphorylation Site predictor), was developed to identify microbial phosphorylation sites using the enhanced characteristics of sequence features. The final feature vectors were optimized via a Wilcoxon rank-sum test. A random forest classifier was then trained using the optimal features to build the predictor. Benchmarking using 5-fold cross-validation and independent dataset tests showed that MPSite achieves robust performance on S- and T-phosphorylation site prediction. It also outperformed other existing methods on the comprehensive independent datasets. We anticipate that MPSite is a powerful tool for proteome-wide prediction of microbial phosphorylation sites and facilitates hypothesis-driven functional interrogation of phosphorylated proteins. A web application with the curated datasets is freely available at http://kurata14.bio.kyutech.ac.jp/MPSite/
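
    One common way to use a Wilcoxon rank-sum test for feature optimization is to keep only the features whose values differ significantly between positive and negative samples. The sketch below illustrates that reading with a dependency-free implementation (normal approximation, average ranks for ties, no tie correction of the variance). The threshold `alpha`, the helper names, and the toy data are assumptions; the abstract does not describe MPSite's exact procedure.

```python
import math

def rank_sum_z(pos, neg):
    """Wilcoxon rank-sum z-score for two samples (normal approximation)."""
    pooled = sorted([(v, 0) for v in pos] + [(v, 1) for v in neg])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg = (i + j) / 2 + 1               # average 1-based rank of a tie group
        for k in range(i, j + 1):
            ranks[k] = avg
        i = j + 1
    n1, n2 = len(pos), len(neg)
    w = sum(r for r, (_, grp) in zip(ranks, pooled) if grp == 0)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (w - mean) / sd

def two_sided_p(z):
    """Two-sided p-value of a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2))

def select_features(X_pos, X_neg, alpha=0.05):
    """Indices of feature columns whose rank-sum p-value is below alpha."""
    keep = []
    for f in range(len(X_pos[0])):
        z = rank_sum_z([row[f] for row in X_pos], [row[f] for row in X_neg])
        if two_sided_p(z) < alpha:
            keep.append(f)
    return keep

# Toy data (hypothetical): column 0 separates the classes, column 1 does not.
X_pos = [[5.1, 0.2], [5.5, 0.9], [6.0, 0.4], [5.8, 0.7], [6.2, 0.1], [5.9, 0.6]]
X_neg = [[1.0, 0.3], [1.4, 0.8], [0.9, 0.5], [1.2, 0.65], [1.1, 0.15], [1.3, 0.55]]
selected = select_features(X_pos, X_neg)
print(selected)
```

    The retained columns would then be passed to the classifier (a random forest in the abstract); for real analyses a tested routine such as SciPy's rank-sum implementation is preferable to this hand-rolled version.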

    Machine learning approaches to solving classification problems for four types of immune peptides

    Peptides play an important role in all aspects of the immunological reactions to invading cancer and pathogen cells. It has been known for over 40 years that peptides are a critical influence in mobilizing the immune system against foreign invaders. Since then, new knowledge about the generation and function of peptides in immunology has supported efforts to harness the immune system to treat disease. Yet, with little immunological insight, most of the highly productive treatments, including vaccines, have been developed empirically. Nonetheless, increased knowledge of the biology of antigen processing, as well as of the chemistry and pharmacological properties of antigenic and antimicrobial peptides, has now permitted the development of drugs and vaccines. Given advances in technology, it is vitally important to develop automatic computational methods for rapidly and accurately predicting immune peptides. In this thesis, the author focuses on machine learning approaches for addressing classification problems of four types of immune peptides (anti-inflammatory, proinflammatory, anti-tuberculosis, and linear B-cell peptides).
    Therapeutic peptides for numerous inflammatory diseases and autoimmune disorders have received substantial consideration; however, the exploration of anti-inflammatory peptides via biological experiments is often a time-consuming and expensive task. The development of novel in silico predictors is desired to classify potential anti-inflammatory peptides prior to in vitro investigation. Herein, an accurate predictor called PreAIP (Predictor of Anti-Inflammatory Peptides) was developed by integrating multiple complementary features. We systematically investigated different types of features, including primary sequence, evolutionary, and structural information, through a random forest classifier. The final PreAIP model achieved an AUC value of 0.833 on the training dataset via a 10-fold cross-validation test, which was better than that of existing models. Moreover, PreAIP achieved an AUC value of 0.840 on a test dataset, demonstrating that the proposed method outperformed the two existing methods. These results indicate that PreAIP is an accurate predictor for identifying anti-inflammatory peptides and contributes to the development of anti-inflammatory peptide therapeutics and biomedical research. The curated datasets and PreAIP are freely available at http://kurata14.bio.kyutech.ac.jp/PreAIP/.
    A proinflammatory peptide (PIP) is a type of signaling molecule secreted from immune cells, which contributes to the first line of defense against invading pathogens. Numerous experiments have shown that PIPs play an important role in human physiology and in applications such as vaccines and immunotherapeutic drugs. Because high-throughput laboratory methods are time-consuming and costly, effective computational methods are in great demand to identify PIPs quickly and accurately. Thus, in this study, we proposed a computational model in conjunction with a multiple feature representation, called ProIn-Fuse, to improve the performance of PIP identification. Specifically, a feature representation learning model was used to generate a set of informative probabilistic features by making use of random forest models with eight sequence encoding schemes. Finally, ProIn-Fuse was constructed as a linear combination of the models built on the informative probabilistic features. The generalization capability of the proposed method, evaluated through an independent test, showed that ProIn-Fuse yielded an accuracy of 0.746, which was over 10% higher than those obtained by the state-of-the-art PIP predictors. Cross-validation and independent test results consistently demonstrated that ProIn-Fuse is more precise and promising in the identification of PIPs than existing PIP predictors. The web server, datasets, and online instructions are freely accessible at http://kurata14.bio.kyutech.ac.jp/ProIn-Fuse/. We believe that the proposed ProIn-Fuse can facilitate faster and broader applications of PIPs in drug design and development.
    Tuberculosis (TB) is a leading killer caused by Mycobacterium tuberculosis. Recently, anti-TB peptides have provided an alternative approach to combat antibiotic tolerance. Herein, we have developed an effective computational predictor, iAntiTB (identification of anti-tubercular peptides), that integrates multiple feature vectors derived from the amino acid sequences via Random Forest (RF) and Support Vector Machine (SVM) classifiers. iAntiTB combined the RF and SVM scores via linear regression to enhance the prediction accuracy. To make a robust and accurate predictor, we prepared two datasets with different types of negative samples. iAntiTB achieved AUC values of 0.896 and 0.946 on the training datasets of the first and second datasets, respectively, and outperformed the other existing predictors. Thus, iAntiTB is a robust and accurate predictor that is helpful for researchers working on peptide therapeutics and immunotherapy. All the employed datasets and the software application are accessible at http://kurata14.bio.kyutech.ac.jp/iAntiTB/.
    Linear B-cell peptides are critically important for immunological applications such as vaccine design, immunodiagnostic tests, antibody production, and disease diagnosis and therapy. The accurate identification of linear B-cell peptides remains challenging despite several decades of research. In this work, we have developed a novel predictor, iLBE (Identification of B-Cell Epitope), by integrating evolutionary and sequence-based features. The successive feature vectors were optimized by a Wilcoxon rank-sum test. The random forest (RF) algorithm then used the optimal consecutive feature vectors to predict linear B-cell epitopes. We combined the RF scores by logistic regression to enhance the prediction accuracy. The final iLBE yielded an AUC score of 0.809 on the training dataset and outperformed other existing prediction models on a comprehensive independent dataset. iLBE is suggested to be a powerful computational tool for identifying linear B-cell peptides and for supporting the development of diagnostic tests. A web application of iLBE with curated datasets is freely accessible at http://kurata14.bio.kyutech.ac.jp/iLBE/. Taken together, the above results suggest that our proposed predictors (PreAIP, ProIn-Fuse, iAntiTB, and iLBE) would be helpful computational resources for the prediction of anti-inflammatory, proinflammatory, anti-tuberculosis, and linear B-cell peptides.
    Kyushu Institute of Technology doctoral dissertation. Degree number: 情工博甲第358号. Degree conferred: March 25, 2021 (Reiwa 3). Contents: 1 Introduction | 2 Prediction of Anti-Inflammatory Peptides by Integrating Multiple Complementary Features | 3 Prediction of Proinflammatory Peptides by Fusing of Multiple Feature Representations | 4 Prediction of Anti-Tubercular Peptides by Exploiting Amino Acid Pattern and Properties | 5 Prediction of Linear B-Cell Epitopes by Integrating Sequence and Evolutionary Features | 6 Conclusions and Perspectives. Kyushu Institute of Technology, 2020 (Reiwa 2).

    Identification of Biomarkers for Esophageal Squamous Cell Carcinoma Using Feature Selection and Decision Tree Methods

    Esophageal squamous cell cancer (ESCC) is one of the most common fatal human cancers. The identification of biomarkers for early detection could be a promising strategy to decrease mortality. Previous studies used microarray techniques to identify more than one hundred genes; however, it is desirable to identify a small set of biomarkers for clinical use. This study proposes a sequential forward feature selection algorithm to design decision tree models for discriminating ESCC from normal tissues. Two potential biomarkers, RUVBL1 and CNIH, were identified and validated on two publicly available microarray datasets. To test the discrimination ability of the two biomarkers, 17 pairs of expression profiles of ESCC and normal tissues from Taiwanese male patients were measured using microarray techniques. The classification accuracies of the two biomarkers in all three datasets were higher than 90%. Interpretable decision tree models were constructed to analyze expression patterns of the two biomarkers. RUVBL1 was consistently overexpressed in all three datasets, although we found inconsistent CNIH expression, possibly affected by the diverse major risk factors for ESCC across different areas.
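
    Sequential forward feature selection, as named in this abstract, greedily adds the single feature that most improves a score until no candidate helps. The sketch below shows the general loop only. To stay dependency-free it scores feature subsets with a nearest-centroid classifier on training accuracy, which is a stand-in for the decision tree models used in the study; the data and all function names are hypothetical.

```python
def forward_select(n_features, score, max_features):
    """Greedy forward selection; stops when no candidate improves the score."""
    selected, best_score = [], float("-inf")
    while len(selected) < max_features:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        top_score, top_f = max((score(selected + [f]), f) for f in candidates)
        if top_score <= best_score:
            break                            # no improvement: stop early
        selected.append(top_f)
        best_score = top_score
    return selected, best_score

def centroid_accuracy(X, y, features):
    """Training accuracy of a nearest-centroid classifier on the chosen columns."""
    def centroid(label):
        rows = [r for r, t in zip(X, y) if t == label]
        return [sum(r[f] for r in rows) / len(rows) for f in features]
    c0, c1 = centroid(0), centroid(1)
    correct = 0
    for r, t in zip(X, y):
        d0 = sum((r[f] - c) ** 2 for f, c in zip(features, c0))
        d1 = sum((r[f] - c) ** 2 for f, c in zip(features, c1))
        correct += (0 if d0 <= d1 else 1) == t
    return correct / len(X)

# Hypothetical expression values: 3 genes (columns), 3 normal (0) and 3 tumor (1) samples.
X = [[1.0, 5.0, 0.2], [1.2, 4.8, 0.9], [0.9, 5.1, 0.4],
     [3.0, 5.0, 0.3], [3.2, 4.9, 0.8], [2.9, 5.2, 0.5]]
y = [0, 0, 0, 1, 1, 1]
chosen, best = forward_select(3, lambda feats: centroid_accuracy(X, y, feats), max_features=2)
print(chosen, best)
```

    In practice the score would come from cross-validation rather than training accuracy, so that the selected subset is not simply the one that overfits best.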

    A comprehensive review of computation-based metal-binding prediction approaches at the residue level

    Clear evidence has shown that metal ions strongly connect and delicately tune the dynamic homeostasis in living bodies. They have been shown to be associated with protein structure, stability, regulation, and function. Even small changes in the concentration of metal ions can shift their effects from naturally beneficial to harmful, leading to degenerative diseases, malignant tumors, and cancers. Accurate characterizations and predictions of metalloproteins at the residue level promise informative clues to the investigation of intrinsic mechanisms of protein-metal ion interactions. Compared to biophysical or biochemical wet-lab technologies, computational methods provide open web interfaces of high-resolution databases and high-throughput predictors for efficient investigation of metal-binding residues. This review surveys and details 18 public databases of metal-protein binding. We collect a comprehensive set of 44 computation-based methods and classify them into four categories, namely, learning-, docking-, template-, and meta-based methods. We analyze the benchmark datasets, assessment criteria, feature construction, and algorithms. We also compare several methods on two benchmark testing datasets and include a discussion about currently publicly available predictive tools. Finally, we summarize the challenges and underlying limitations of the current studies and propose several prospective directions concerning the future development of the related databases and methods.

    iAVPs-ResBi: Identifying antiviral peptides by using deep residual network and bidirectional gated recurrent unit

    Human history is also the history of the fight against viral diseases. From the eradication of viruses to coexistence, advances in biomedicine have led to a more objective understanding of viruses and a corresponding increase in the tools and methods to combat them. More recently, antiviral peptides (AVPs) have been discovered, which, owing to their advantages, have had great impact as antiviral drugs. It is therefore necessary to develop a prediction model that accurately identifies AVPs. In this paper, we develop the iAVPs-ResBi model, which uses k-spaced amino acid pairs (KSAAP), encoding based on grouped weight (EBGW), enhanced grouped amino acid composition (EGAAC) based on the N5C5 sequence, and composition, transition and distribution (CTD) based on physicochemical properties for multi-feature extraction. We then adopt bidirectional long short-term memory (BiLSTM) to fuse the features and obtain the most discriminative information from the multiple original feature sets. Finally, the deep model is built by combining an improved residual network and a bidirectional gated recurrent unit (BiGRU) to perform classification. The results are better than those of existing methods, with accuracies of 95.07, 98.07, 94.29 and 97.50% on the four datasets, which shows that iAVPs-ResBi can be used as an effective tool for the identification of antiviral peptides. The datasets and codes are freely available at https://github.com/yunyunliang88/iAVPs-ResBi
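
    Of the encodings listed in this abstract, k-spaced amino acid pairs are the simplest to illustrate: for each gap k, count how often each ordered residue pair occurs with exactly k residues between them. The sketch below is a minimal version under stated assumptions: the gap range (`k_max`), the normalization by the number of windows, and the function name are not taken from the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues

def ksaap(sequence, k_max=2):
    """Frequencies of residue pairs separated by k = 0..k_max positions.

    Returns a flat list of 400 * (k_max + 1) values, ordered by gap and then
    by pair ("AA", "AC", ..., "YY").
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(pairs, 0)
        n_windows = len(sequence) - k - 1
        for i in range(n_windows):
            pair = sequence[i] + sequence[i + k + 1]
            if pair in counts:             # skip non-standard residues such as 'X'
                counts[pair] += 1
        total = max(n_windows, 1)          # avoid division by zero on short input
        features.extend(counts[p] / total for p in pairs)
    return features

vec = ksaap("ACDAC", k_max=1)
print(len(vec), vec[1])                    # vec[1] is the k=0 frequency of "AC"
```

    Each sequence therefore maps to a fixed-length numeric vector regardless of its length, which is what allows it to be concatenated with the other encodings before a downstream model.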

    Prediction of S-nitrosylation Sites by Integrating Support Vector Machine and Random Forest

    Cysteine S-nitrosylation is a type of reversible post-translational modification of proteins that controls many aspects of cellular plasticity and dynamics. It is associated with redox-based cellular signaling that protects against oxidative stress, and it has been linked to various diseases. The identification of S-nitrosylation sites is an important step in revealing the function of proteins; however, experimental identification of S-nitrosylation is expensive and time-consuming. Sequence-based computational prediction of potential S-nitrosylation sites is therefore highly sought before experimentation. Herein, to identify S-nitrosylation sites, a novel predictor, PreSNO, has been developed that integrates multiple encoding schemes by the support vector machine and random forest. PreSNO achieved an AUC score of 0.837 on the training model and greatly outperformed other existing computational models on comprehensive independent datasets.
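
    The AUC reported here is the probability that a randomly chosen positive site receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition and shows how averaging two classifiers' scores can change it. The scores and the equal weighting are hypothetical and are not PreSNO's actual integration scheme, which the abstract does not specify.

```python
def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

def combine(scores_a, scores_b, weight=0.5):
    """Weighted average of two score lists (the weight is an assumed parameter)."""
    return [weight * a + (1 - weight) * b for a, b in zip(scores_a, scores_b)]

# Hypothetical per-site scores from two classifiers.
rf_pos, rf_neg = [0.9, 0.6, 0.4], [0.5, 0.2, 0.3]
svm_pos, svm_neg = [0.7, 0.8, 0.6], [0.4, 0.65, 0.1]

auc_rf = auc(rf_pos, rf_neg)
auc_svm = auc(svm_pos, svm_neg)
auc_both = auc(combine(rf_pos, svm_pos), combine(rf_neg, svm_neg))
print(auc_rf, auc_svm, auc_both)
```

    In this toy case each classifier misranks a different pair, so the averaged score separates the classes better than either alone; on real data the gain depends on how complementary the two models are.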

    A Computational Framework for Host-Pathogen Protein-Protein Interactions

    Infectious diseases cause millions of illnesses and deaths every year and raise great health concerns worldwide. How to monitor and cure infectious diseases has become a prevalent and intractable problem. Since host-pathogen interactions are considered the key infection processes at the molecular level, a large body of research has focused on host-pathogen interactions to understand infection mechanisms and develop novel therapeutic solutions. For years, the continuous development of technologies in biology has benefitted wet-lab experiments, such as small-scale biochemical, biophysical and genetic experiments and large-scale methods (for example, yeast two-hybrid analysis and cryogenic electron microscopy). As a result of decades of effort, there has been an explosive accumulation of biological data, including multi-omics data such as genomics and proteomics data. Thus, an initial review of omics data is conducted in Chapter 2, which presents recent updates in 'omics' research, focusing particularly on proteomics and genomics. With high-throughput technologies, the amount of 'omics' data, including genomics and proteomics, has grown even further. An upsurge of interest in data analytics in bioinformatics comes as no surprise to researchers from a variety of disciplines. Specifically, the astonishing rate at which genomics and proteomics data are generated leads researchers into the realm of 'Big Data' research. Chapter 2 is thus developed to provide an update on the omics background and the state-of-the-art developments in the omics area, with a focus on genomics data, from the perspective of big data analytics.

    Image Compression Techniques: A Survey in Lossless and Lossy algorithms

    The bandwidth of communication networks has increased continuously as a result of technological advances. However, the introduction of new services and the expansion of existing ones have resulted in even higher demand for bandwidth. This explains the many efforts currently being invested in the area of data compression. The primary goal of this work is to develop techniques for coding information sources such as speech, image and video to reduce the number of bits required to represent a source without significantly degrading its quality. With the large increase in the generation of digital image data, there has been a correspondingly large increase in research activity in the field of image compression. The goal is to represent an image in the fewest number of bits without losing its essential information content. Images carry three main types of information: redundant, irrelevant, and useful. Redundant information is the deterministic part of the information, which can be reproduced without loss from other information contained in the image. Irrelevant information is the part that contains details beyond the limit of perceptual significance (i.e., psychovisual redundancy). Useful information, on the other hand, is the part that is neither redundant nor irrelevant. Decompressed images are usually viewed by humans; therefore, their fidelity is subject to the capabilities and limitations of the Human Visual System. This paper provides a survey of various image compression techniques, their limitations and compression rates, and highlights current research in medical image compression.