1,196 research outputs found

    Support Vector Machine-based Fuzzy Systems for Quantitative Prediction of Peptide Binding Affinity

    Reliable prediction of the binding affinity of peptides is one of the most challenging but important complex modelling problems in the post-genome era, owing to the diversity and functionality of the peptides discovered. Peptide binding prediction models are commonly used to determine whether binding occurs between a given peptide and a major histocompatibility complex (MHC) molecule. Recent research efforts have focused on quantifying the binding predictions. The objective of this thesis is to develop reliable real-value predictive models through the use of fuzzy systems. A non-linear system is proposed in which support vector-based regression improves the fuzzy system, and it is applied to real-value prediction of the degree of peptide binding. This research study introduces two novel methods to improve structure and parameter identification of fuzzy systems. First, support vector-based regression is used to identify initial parameter values of the consequent part of type-1 and interval type-2 fuzzy systems. Second, an overlapping clustering concept is used to derive interval-valued parameters of the premise part of the type-2 fuzzy system. Publicly available peptide binding affinity data sets obtained from the literature are used in the experimental studies of this thesis. First, the proposed models are blind validated using the peptide binding affinity data sets obtained from a modelling competition. In that competition, almost equal numbers of peptide sequences were provided to the participants in the training and testing data sets (89, 76, 133 and 133 peptides for training and 88, 76, 133 and 47 peptides for testing). Each peptide in the data sets was represented by 643 bio-chemical descriptors assigned to each amino acid. Second, the proposed models are cross validated using mouse class I MHC alleles (H2-Db, H2-Kb and H2-Kk). 
H2-Db, H2-Kb, and H2-Kk consist of 65 nona-peptides, 62 octa-peptides, and 154 octa-peptides, respectively. Compared to previously published results in the literature, the support vector-based type-1 and support vector-based interval type-2 fuzzy models yield an improvement in prediction accuracy. The quantitative predictive performance improved by as much as 33.6% for the first group of data sets and 1.32% for the second group of data sets. The proposed models not only improved the performance of the fuzzy system (which used support vector-based regression), but the support vector-based regression also benefited from the fuzzy concept. The results obtained here set the platform for the presented models to be considered for other application domains in computational and/or systems biology. Apart from improving the prediction accuracy, this research study has also identified specific features that play a key role in making reliable peptide binding affinity predictions. The amino acid features "Polarity", "Positive charge", "Hydrophobicity coefficient", and "Zimm-Bragg parameter" are considered highly discriminating features in the peptide binding affinity data sets. This information can be valuable in the design of peptides with strong binding affinity to an MHC I molecule, and may also be useful when designing drugs and vaccines
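
The descriptor-based representation mentioned in this abstract can be illustrated with a minimal sketch. It assumes that a peptide is encoded by concatenating per-residue descriptor values, so that a nona-peptide with 643 descriptors per amino acid would yield 9 × 643 = 5,787 features; the abstract does not spell out this layout. The function name `encode_peptide` and the two scales used here (the Kyte-Doolittle hydropathy index and a simplified positive-charge flag) are illustrative stand-ins, not the descriptor set used in the thesis.

```python
# Kyte-Doolittle hydropathy index, one published amino acid scale.
# It stands in here for the hundreds of descriptors a real study would use.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Simplified positive-charge flag (assumption: only K and R count as positive).
POSITIVE_CHARGE = {aa: (1.0 if aa in "KR" else 0.0) for aa in KYTE_DOOLITTLE}


def encode_peptide(sequence, scales):
    """Concatenate descriptor values residue by residue.

    For a peptide of length n and d scales the result has n * d entries,
    ordered as [res1_scale1, res1_scale2, ..., resn_scaled].
    """
    features = []
    for residue in sequence.upper():
        for scale in scales:
            if residue not in scale:
                raise ValueError(f"unknown residue: {residue!r}")
            features.append(scale[residue])
    return features


if __name__ == "__main__":
    # An octa-peptide with two descriptors per residue gives 16 features.
    vector = encode_peptide("SIINFEKL", [KYTE_DOOLITTLE, POSITIVE_CHARGE])
    print(len(vector), vector[:4])
```

A fixed-length vector of this kind is what a regression model such as support vector regression would take as input; peptides of different lengths (nona- versus octa-peptides) would need separate models or padding.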

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues

    Overlapping Clusters and Support Vector Machines Based Interval Type-2 Fuzzy System for the Prediction of Peptide Binding Affinity

    In the post-genome era, processing high-dimensional, nonlinear biological datasets with few available instances is becoming more complex. This paper aims to address these characteristics, as they have adverse effects on the performance of predictive models in bioinformatics. An interval type-2 Takagi-Sugeno fuzzy predictive model is proposed to manage the high dimensionality and nonlinearity that such datasets commonly exhibit in bioinformatics. A new clustering framework, based on overlapping regions between the clusters, is proposed to simplify antecedent operations for an interval type-2 fuzzy system. Cluster analysis of the partitions, together with statistical information derived from them, identifies the upper and lower membership functions forming the premise part. This is further enhanced by adapting the regression version of support vector machines in the consequent part. The proposed method is used in experiments to quantitatively predict the binding affinities of peptides to biomolecules. This case study poses a challenge in post-genome studies and remains an open problem due to the complexity of the biological system, the diversity of peptides, and the curse of dimensionality of the amino acid index representation characterizing the peptides. On four different peptide binding affinity datasets, the proposed method generalized better in every case, improving prediction accuracy by up to 58.2% on unseen peptides in comparison with the predictive methods presented in the literature. Source code of the algorithm is available at https://github.com/sekerbigdatalab
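
To make the idea of upper and lower membership functions concrete, here is a minimal sketch of an interval type-2 membership grade. It assumes a Gaussian membership function with a fixed centre and an uncertain width, which is one common textbook construction; the paper instead derives its bounds from overlapping cluster regions, and that procedure is not reproduced here. The names `interval_membership`, `sigma_lower` and `sigma_upper` are hypothetical.

```python
import math


def gaussian(x, center, sigma):
    """Type-1 Gaussian membership grade of x."""
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)


def interval_membership(x, center, sigma_lower, sigma_upper):
    """Return (lower, upper) membership grades for an interval type-2 set.

    Assumption: the footprint of uncertainty comes from an uncertain width.
    The narrower Gaussian bounds the grade from below and the wider one
    from above, because both share the same centre.
    """
    if not 0 < sigma_lower <= sigma_upper:
        raise ValueError("require 0 < sigma_lower <= sigma_upper")
    lower = gaussian(x, center, sigma_lower)
    upper = gaussian(x, center, sigma_upper)
    return lower, upper


if __name__ == "__main__":
    for x in (0.0, 0.5, 1.0, 2.0):
        low, high = interval_membership(x, center=0.0,
                                        sigma_lower=0.5, sigma_upper=1.0)
        print(f"x={x:.1f}  lower={low:.4f}  upper={high:.4f}")
```

In a Takagi-Sugeno system each rule would fire with an interval of strengths built from such pairs, and the consequent part (here, per the abstract, fitted with support vector regression) maps those strengths to a numeric prediction.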

    AmorProt: Amino Acid Molecular Fingerprints Repurposing based Protein Fingerprint

    As protein therapeutics play an important role in almost all medical fields, numerous studies have been conducted on proteins using artificial intelligence. Artificial intelligence has enabled data driven predictions without the need for expensive experiments. Nevertheless, unlike the various molecular fingerprint algorithms that have been developed, protein fingerprint algorithms have rarely been studied. In this study, we proposed the amino acid molecular fingerprints repurposing based protein (AmorProt) fingerprint, a protein sequence representation method that effectively uses the molecular fingerprints corresponding to 20 amino acids. Subsequently, the performances of the tree based machine learning and artificial neural network models were compared using (1) amyloid classification and (2) isoelectric point regression. Finally, the applicability and advantages of the developed platform were demonstrated through a case study and the following experiments: (3) comparison of dataset dependence with feature based methods; (4) feature importance analysis; and (5) protein space analysis. Consequently, the significantly improved model performance and data set independent versatility of the AmorProt fingerprint were verified. The results revealed that the current protein representation method can be applied to various fields related to proteins, such as predicting their fundamental properties or interaction with ligands

    Building an automated platform for the classification of peptides/proteins using machine learning

    Master's dissertation in Bioinformatics. One of the challenging problems in bioinformatics is to computationally characterize sequences, structures and functions of proteins. Sequence-derived structural and physicochemical properties of proteins have been used in the development of machine learning models for protein-related problems. However, tools and platforms to calculate features and perform machine learning (ML) with proteins are scarce and limited in effectiveness, user-friendliness and capacity. Here, a generic, modular, automated platform for the classification of proteins based on their physicochemical properties using different ML algorithms is proposed. The tool, developed as a Python package, facilitates the major tasks of ML and includes modules to read and alter sequences, calculate protein features, preprocess datasets, execute feature reduction and selection, perform clustering, train and optimize ML models and make predictions. As it is modular, the user retains the power to alter the code to fit specific needs. This platform was tested to predict membrane-active anticancer and antimicrobial peptides and further used to explore viral fusion peptides. Membrane-interacting peptides play a crucial role in several biological processes. Fusion peptides are a subclass found in enveloped viruses that are particularly relevant for membrane fusion. Determining which properties characterize fusion peptides and distinguish them from other proteins is a very relevant scientific question with important technological implications. Using three different datasets composed of well-annotated sequences, different feature extraction techniques and feature selection methods (resulting in a total of over 20 datasets), seven ML models were trained and tested, using cross-validation for error estimation and grid search for model selection. The different models, feature sets and feature selection techniques were compared. 
The best models obtained for distinct metrics were then used to predict the location of a known fusion peptide in a protein sequence from the Dengue virus. Feature importances were also analysed. The models obtained will be useful in future research, also providing biological insight into the distinctive physicochemical characteristics of fusion peptides. This work presents a freely available tool to perform ML-based protein classification and the first global analysis and prediction of viral fusion peptides using ML, reinforcing the usability and importance of ML in protein classification problems.
One of the most challenging problems in bioinformatics is the characterization of protein sequences, structures and functions. Physicochemical and structural properties derived from the protein sequence have been used in the development of machine learning (ML) models. However, tools to calculate these features are scarce and limited in efficiency, ease of use and adaptability to different problems. Here, a generic, automated, modular platform is described for the classification of proteins based on their physicochemical properties, making use of different ML algorithms. The tool facilitates the main ML tasks and includes modules to read and alter sequences, calculate protein features, preprocess data, perform feature reduction and selection, run clustering, build ML models and make predictions. Because it is built in a modular way, the user retains the power to alter the code to meet specific needs. This platform was tested with anticancer and antimicrobial peptides and was further used to explore viral fusion peptides. Fusion peptides are a class of membrane-interacting peptides found in enveloped viruses that are particularly relevant for the fusion of the viral membrane with the host membrane. Determining which properties characterize them is a very relevant scientific question with important technological implications. Using three different datasets composed of well-annotated sequences, four different feature extraction techniques and five different feature selection methods (24 datasets tested in total), seven ML models were trained and tested with 10-fold cross-validation and a grid search approach. The best models obtained, with MCC scores between 0.7 and 0.8 and accuracy between 0.85 and 0.9, were used to predict the location of a known fusion peptide in a sequence of the Dengue virus fusion protein. The models obtained for predicting the location of the fusion peptide are useful for future research and also provide biological insight into the distinctive physicochemical characteristics of fusion peptides. This work presents a freely available tool for ML-based protein classification and the first global analysis of viral fusion peptides using ML-based methods, reinforcing the usability and importance of ML in protein classification problems
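
The MCC figure quoted for the best models is the Matthews correlation coefficient. As a minimal sketch of how that metric is computed for a binary classifier, the following uses only the standard formula on confusion-matrix counts; the function names and the toy labels are illustrative and are not taken from the dissertation's package.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for labels in {0, 1}."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def matthews_corrcoef(y_true, y_pred):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when the denominator is zero, a common convention.
    """
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    # Toy labels: 1 = fusion peptide, 0 = other sequence (hypothetical data).
    truth = [1, 1, 1, 0, 0, 0]
    predicted = [1, 1, 0, 0, 0, 1]
    print(round(matthews_corrcoef(truth, predicted), 3))
```

MCC ranges from -1 to 1 and, unlike plain accuracy, stays informative when the two classes are of very different sizes, which is why it is often reported alongside accuracy for peptide classification.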

    Overlapping Clusters and Support Vector Machines based Interval Type-2 Fuzzy System for the Prediction of Peptide Binding Affinity

    In the post-genome era, processing high-dimensional, nonlinear biological datasets with few available instances is becoming more complex. This study aims to address these characteristics, as they have adverse effects on the performance of predictive models in bioinformatics. An interval type-2 Takagi-Sugeno fuzzy predictive model is proposed to manage the high dimensionality and nonlinearity that such datasets commonly exhibit in bioinformatics. A new clustering framework, based on overlapping regions between the clusters, is proposed to simplify antecedent operations for an interval type-2 fuzzy system. Cluster analysis of the partitions, together with statistical information derived from them, identifies the upper and lower membership functions forming the premise part. This is further enhanced by adapting the regression version of support vector machines in the consequent part. The proposed method is used in experiments to quantitatively predict the binding affinities of peptides to biomolecules. This case study poses a challenge in post-genome studies and remains an open problem due to the complexity of the biological system, the diversity of peptides and the curse of dimensionality of the amino acid index representation characterising the peptides. On four different peptide binding affinity datasets, the proposed method generalised better in every case, improving prediction accuracy by up to 58.2% on unseen peptides in comparison with the predictive methods presented in the literature

    Quantitative approaches for decoding the specificity of the human T cell repertoire

    T cell receptor (TCR)-peptide-major histocompatibility complex (pMHC) interactions play a vital role in initiating immune responses against pathogens, and the specificity of TCR-pMHC interactions is crucial for developing optimized therapeutic strategies. The advent of high-throughput immunological and structural evaluation of TCR and pMHC has provided an abundance of data for computational approaches that aim to predict favorable TCR-pMHC interactions. Current models are constructed using information on protein sequence, structures, or a combination of both, and utilize a variety of statistical learning-based approaches for identifying the rules governing specificity. This review examines the current theoretical, computational, and deep learning approaches for identifying TCR-pMHC recognition pairs, placing emphasis on each method's mathematical approach, predictive performance, and limitations

    Characterizing protein-ligand binding using atomistic simulation and machine learning: Application to drug resistance in HIV-1 protease

    Over the past several decades, atomistic simulations of biomolecules, whether carried out using molecular dynamics or Monte Carlo techniques, have provided detailed insights into their function. Comparing the results of such simulations for a few closely related systems has guided our understanding of the mechanisms by which changes like ligand binding or mutation can alter function. The general problem of detecting and interpreting such mechanisms from simulations of many related systems, however, remains a challenge. This problem is addressed here by applying supervised and unsupervised machine learning techniques to a variety of thermodynamic observables extracted from molecular dynamics simulations of different systems. As an important test case, these methods are applied to understanding the evasion by HIV-1 protease of darunavir, a potent inhibitor to which resistance can develop via the simultaneous mutation of multiple amino acids. Complex mutational patterns have been observed among resistant strains, presenting a challenge to developing a mechanistic picture of resistance in the protease. In order to dissect these patterns and gain mechanistic insight on the role of specific mutations, molecular dynamics simulations were carried out on a collection of HIV-1 protease variants, chosen to include highly resistant strains and susceptible controls, in complex with darunavir. Using a machine learning approach that takes advantage of the hierarchical nature in the relationships among sequence, structure and function, an integrative analysis of these trajectories reveals key details of the resistance mechanism, including changes in protein structure, hydrogen bonding and protein-ligand contacts