    Epitope prediction improved by multitask support vector machines

    Motivation: In silico methods for predicting antigenic peptides that bind to MHC class I molecules play an increasingly important role in the identification of T-cell epitopes. Statistical and machine learning methods, in particular, are widely used to score candidate epitopes based on their similarity to known epitopes and non-epitopes. The genes coding for the MHC molecules, however, are highly polymorphic, and statistical methods have difficulty building models for alleles with few known epitopes. In this case, recent work has demonstrated the utility of leveraging information across alleles to improve prediction performance. Results: We design a support vector machine algorithm that is able to learn epitope models for all alleles simultaneously, by sharing information across similar alleles. The sharing of information across alleles is controlled by a user-defined measure of similarity between alleles. We show that this similarity can be defined in terms of supertypes, or more directly by comparing key residues known to play a role in peptide-MHC binding. We illustrate the potential of this approach on various benchmark experiments, where it outperforms other state-of-the-art methods.
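
    One common way to share information across alleles in a kernel method is a product kernel over (peptide, allele) pairs. The abstract does not give the kernels actually used, so the following is only a minimal sketch under stated assumptions: the allele-to-supertype table, the position-match peptide kernel, and the `share` weight are all illustrative choices, not the authors' implementation.

```python
# Illustrative allele-to-supertype map (an assumption for this sketch).
SUPERTYPE = {
    "HLA-A*02:01": "A2",
    "HLA-A*02:06": "A2",
    "HLA-B*07:02": "B7",
}


def peptide_kernel(p, q):
    """Fraction of positions at which two equal-length peptides agree.

    Equivalent to a normalised linear kernel on one-hot encoded residues.
    """
    if len(p) != len(q):
        raise ValueError("peptides must have equal length")
    return sum(a == b for a, b in zip(p, q)) / len(p)


def allele_kernel(a, b, share=0.5):
    """Similarity between alleles: identity plus supertype membership.

    `share` (hypothetical parameter) sets how much weight is given to
    alleles that are different but belong to the same supertype.
    """
    same_allele = 1.0 if a == b else 0.0
    same_supertype = 1.0 if SUPERTYPE[a] == SUPERTYPE[b] else 0.0
    return (1.0 - share) * same_allele + share * same_supertype


def multitask_kernel(x, y):
    """Product kernel on (peptide, allele) pairs."""
    (pep_x, allele_x), (pep_y, allele_y) = x, y
    return peptide_kernel(pep_x, pep_y) * allele_kernel(allele_x, allele_y)


if __name__ == "__main__":
    a = ("SLYNTVATL", "HLA-A*02:01")
    b = ("SLYNTVATL", "HLA-A*02:06")
    c = ("SLYNTVATL", "HLA-B*07:02")
    for other in (a, b, c):
        print(other[1], multitask_kernel(a, other))
```

    With this construction, training examples for one allele contribute to the decision function of another allele only in proportion to the allele similarity, so unrelated alleles (similarity 0) are effectively learned independently.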

    Application of association rule and support vector machine technique for T-cell epitope prediction

    Data mining is an interdisciplinary field of computer science concerned with the automatic or semi-automatic discovery of knowledge in data. The basic task of data mining is to extract nontrivial, previously unknown, and potentially useful patterns, relations, and links in data, as well as statistically significant structures, from large data collections. It is imperative that the results obtained be new, valid, useful, and understandable. Data mining techniques include statistical models, mathematical algorithms, and machine learning methods... Data mining is an interdisciplinary subfield of computer science that draws on database systems, statistics, machine learning, artificial intelligence, and other disciplines. The main task of data mining is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown, nontrivial, and interesting patterns. Rapid development in immunology, genomics, proteomics, molecular biology, and related areas has caused a large increase in biological data. Drawing conclusions from these data requires sophisticated computational analyses. Without automatic methods for extracting information, it is almost impossible to investigate and analyze these data. Currently, one of the most active problems in immunoinformatics is T-cell epitope identification. Identification of T-cell epitopes, especially dominant T-cell epitopes widely represented in the population, is of immense relevance to vaccine development and to detecting immunological patterns characteristic of autoimmune diseases. Epitope-based vaccines are of great importance in combating infectious and chronic diseases and various types of cancer.
Experimental methods for identifying T-cell epitopes are expensive and time-consuming, and are not applicable to large-scale research (especially not to choosing the optimal group of epitopes for vaccines that will cover the whole population, or for personalized vaccines). Computational and mathematical models for T-cell epitope prediction, based on MHC-peptide binding, are crucial for enabling the systematic investigation and identification of T-cell epitopes on large datasets and for complementing expensive and time-consuming experimentation [16]. T cells (T lymphocytes) recognize protein antigens only when they are degraded to peptide fragments and complexed with Major Histocompatibility Complex (MHC) molecules on the surface of antigen-presenting cells [1]. The binding of these peptides (potential epitopes) to MHC molecules and their presentation to T cells is a crucial (and the most selective) step in both cellular and humoral adaptive immunity. Numerous methodologies currently exist for identifying these epitopes. The methods discussed in this PhD thesis are based exclusively on peptide sequence binding to MHC molecules. The thesis describes existing methodologies for T-cell epitope prediction, their shortcomings, and some of the available databases of experimentally determined linear T-cell epitopes. New models for T-cell epitope prediction are developed using data mining techniques, and an extensive analysis is carried out of whether disorder and hydropathy prediction methods could help in understanding epitope processing and presentation. Accurate computational prediction of T-cell epitopes, which is the aim of this thesis, can greatly expedite epitope screening by reducing costs and experimental effort. The thesis deals with predictive data mining tasks (classification and regression) and descriptive data mining tasks (clustering, association rules, and sequence analysis).
The newly developed models, which are the main contribution of the dissertation, are comparable in performance with the best existing methods, and in some cases better. The models are based on the support vector machine technique for classification and regression problems. A new approach to extracting the most important physicochemical properties that influence the classification of MHC-binding ligands is also presented. For that purpose, new clustering-based classification models are developed, based on the k-means clustering technique. The second part of the thesis concerns the establishment of rules and associations for T-cell epitopes that belong to different protein structures. The aim of this part of the research was to find out whether disorder and hydropathy prediction methods could help in understanding epitope processing and presentation. The results of applying an association rule technique, together with a thorough analysis of a large protein dataset in which T-cell epitopes, protein structure, and hydropathy were determined computationally using publicly available tools, are presented. During the research for this thesis, a new extensible open-source software system was developed that supports bioinformatics research and has wide applications in predicting various protein characteristics. Parts of this thesis are described in the works [71][82][45][42][43][44][72][73], which are published or submitted for publication in several journals. The dissertation is organized as follows. Section 1 introduces the problem of identifying T-cell epitopes, the importance of mathematical and computational methods in this area, the importance of T-cell epitopes to the immune system, and the basis of immune system function. Section 2 describes in detail the data mining techniques used in the thesis to develop the new models.
Section 3 provides an overview of existing methods for predicting T-cell epitopes and explains how they work. It points out the shortcomings of existing methods, which motivated the development of new models for T-cell epitope prediction. Some of the publicly available databases of experimentally determined MHC-binding peptides and T-cell epitopes are described. Section 4 presents the newly developed models for epitope prediction. The developed models include three new encoding schemes that represent peptide sequences as vectors, a form more suitable as input to models based on data mining techniques. Section 5 reports the results of the new classification and regression models. The new models are compared with each other as well as with existing methods for T-cell epitope prediction. Section 6 presents the results on the relationship of T-cell epitopes to ordered and disordered regions in proteins. This chapter presents summary results that are shown in more detail in the published works [71][82][45][44]. Section 7 concludes the dissertation with a discussion of the potential significance of the results and some directions for future work.

    Recent advances in B-cell epitope prediction methods

    Identification of epitopes that invoke strong responses from B-cells is one of the key steps in designing effective vaccines against pathogens. Because experimental determination of epitopes is expensive in terms of cost, time, and effort involved, there is an urgent need for computational methods for reliable identification of B-cell epitopes. Although several computational tools for predicting B-cell epitopes have become available in recent years, the predictive performance of existing tools remains far from ideal. We review recent advances in computational methods for B-cell epitope prediction, identify some gaps in the current state of the art, and outline some promising directions for improving the reliability of such methods

    Prediction of CTL epitopes using QM, SVM and ANN techniques

    Cytotoxic T lymphocyte (CTL) epitopes are potential candidates for subunit vaccine design for various diseases. Most existing T-cell epitope prediction methods are indirect: they predict MHC class I binders rather than CTL epitopes. In this study, a systematic attempt has been made to develop a direct method for predicting CTL epitopes from an antigenic sequence. The method is based on a quantitative matrix (QM) and on machine learning techniques, namely Support Vector Machine (SVM) and Artificial Neural Network (ANN). It was trained and tested on a non-redundant dataset of T-cell epitopes and non-epitopes that includes 1137 experimentally proven MHC class I restricted T-cell epitopes. The accuracy of the QM-, ANN- and SVM-based methods was 70.0, 72.2 and 75.2%, respectively. Performance was evaluated through Leave One Out Cross-Validation (LOOCV) at a cutoff score where sensitivity and specificity were nearly equal. Finally, both machine learning methods were used for consensus and combined prediction of CTL epitopes. The methods were also evaluated on a blind dataset, where the machine learning-based methods performed better than the QM-based method. We also demonstrated through subgroup analysis that our methods can discriminate between T-cell epitopes and MHC binders (non-epitopes). In brief, this method allows prediction of CTL epitopes using QM, SVM and ANN approaches. It also facilitates prediction of MHC restriction in predicted T-cell epitopes. The method is available at http://www.imtech.res.in/raghava/ctlpred/
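
    Reporting accuracy at the cutoff where sensitivity and specificity are nearly equal can be illustrated with a short sketch. It assumes that each example already has a held-out score (for instance from leave-one-out cross-validation) and a binary label; the function names and the toy scores are hypothetical and are not taken from the CTLPred method.

```python
def sensitivity_specificity(scores, labels, cutoff):
    """Sensitivity and specificity when score >= cutoff is called positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in pairs if y == 1 and s < cutoff)
    tn = sum(1 for s, y in pairs if y == 0 and s < cutoff)
    fp = sum(1 for s, y in pairs if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def balanced_cutoff(scores, labels):
    """Return (cutoff, sensitivity, specificity) minimising |sens - spec|.

    Candidate cutoffs are the observed scores; ties keep the lowest cutoff.
    """
    best = None
    for cutoff in sorted(set(scores)):
        sens, spec = sensitivity_specificity(scores, labels, cutoff)
        gap = abs(sens - spec)
        if best is None or gap < best[0]:
            best = (gap, cutoff, sens, spec)
    return best[1], best[2], best[3]


if __name__ == "__main__":
    # Toy held-out scores: 1 = epitope, 0 = non-epitope (made-up values).
    scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
    labels = [1, 1, 1, 1, 0, 0, 0, 0]
    print(balanced_cutoff(scores, labels))  # (0.6, 0.75, 0.75)
```

    Choosing the cutoff this way makes accuracy figures comparable across methods whose raw scores live on different scales, which is one plausible reason for reporting QM, ANN and SVM results at such a point.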

    ProInflam: a webserver for the prediction of proinflammatory antigenicity of peptides and proteins

    Additional file 2: Table S2. Dipeptide composition distribution between proinflammatory and non proinflammatory data
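
    For readers unfamiliar with the feature named in this table, dipeptide composition is commonly defined as the fraction of each of the 400 ordered amino acid pairs among all adjacent residue pairs in a sequence. The sketch below follows that common definition; whether ProInflam computes it in exactly this way is an assumption, and skipping pairs that contain non-standard residues is a choice made only for this example.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def dipeptide_composition(sequence):
    """Return a dict mapping each of the 400 dipeptides to its fraction.

    Adjacent pairs containing a non-standard residue are skipped
    (a choice made for this sketch).
    """
    sequence = sequence.upper()
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no valid dipeptide")
    return {pair: n / total for pair, n in counts.items()}


if __name__ == "__main__":
    comp = dipeptide_composition("ACAC")
    print(len(comp), comp["AC"], comp["CA"])
```

    The resulting 400-dimensional vector has a fixed length regardless of peptide length, which is why this kind of composition feature is convenient as input to classifiers.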

    Enhancing the Biological Relevance of Machine Learning Classifiers for Reverse Vaccinology.

    Reverse vaccinology (RV) is a bioinformatics approach that can predict antigens with protective potential from the protein-coding genomes of bacterial pathogens for subunit vaccine design. RV has become firmly established following the development of the BEXSERO® vaccine against Neisseria meningitidis serogroup B. RV studies have begun to incorporate machine learning (ML) techniques to distinguish bacterial protective antigens (BPAs) from non-BPAs. This research contributes to the RV field by using permutation analysis to demonstrate that a signal for protective antigens can be curated from published data. Furthermore, the effects of the following on an ML approach to RV were assessed: nested cross-validation, balancing the selection of non-BPAs for subcellular localization, increasing the training data, and incorporating greater numbers of protein annotation tools for feature generation. These enhancements yielded a support vector machine (SVM) classifier that could discriminate BPAs (n = 200) from non-BPAs (n = 200) with an area under the curve (AUC) of 0.787. In addition, hierarchical clustering of BPAs revealed that intracellular BPAs clustered separately from extracellular BPAs. However, no immediate benefit was derived when training SVM classifiers on data sets exclusively containing intra- or extracellular BPAs. In conclusion, this work demonstrates that ML classifiers have great utility in RV approaches and will lead to new subunit vaccines in the future.
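
    The idea behind a permutation analysis can be sketched as follows: compare the AUC obtained with the true labels against the AUCs obtained after shuffling the labels. This is a simplified illustration rather than the study's procedure. It permutes labels against fixed classifier scores, whereas a full analysis would typically retrain the classifier on every permuted label set; the function names and toy data are hypothetical.

```python
import random


def auc(scores, labels):
    """AUC as the probability that a positive outscores a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_p_value(scores, labels, n_permutations=1000, seed=0):
    """Return (observed AUC, empirical p-value) under label shuffling."""
    rng = random.Random(seed)
    observed = auc(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if auc(scores, shuffled) >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return observed, (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Toy scores: 1 = protective antigen, 0 = non-protective (made up).
    scores = [0.95, 0.9, 0.85, 0.8, 0.75, 0.3, 0.25, 0.2, 0.15, 0.1]
    labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    print(permutation_p_value(scores, labels))
```

    A small p-value indicates that the observed separation is unlikely to arise from labels that carry no signal, which is the kind of evidence the abstract refers to when it says a signal for protective antigens can be curated from published data.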