3 research outputs found

    Building an automated platform for the classification of peptides/proteins using machine learning

    Master's dissertation in Bioinformatics. One of the challenging problems in bioinformatics is to computationally characterize sequences, structures and functions of proteins. Sequence-derived structural and physicochemical properties of proteins have been used in the development of machine learning models in protein-related problems. However, tools and platforms to calculate features and perform machine learning (ML) with proteins are scarce and have their limitations in terms of effectiveness, user-friendliness and capacity. Here, a generic modular automated platform for the classification of proteins based on their physicochemical properties using different ML algorithms is proposed. The tool, developed as a Python package, facilitates the major tasks of ML and includes modules to read and alter sequences, calculate protein features, preprocess datasets, execute feature reduction and selection, perform clustering, train and optimize ML models and make predictions. As it is modular, the user retains the power to alter the code to fit specific needs. This platform was tested to predict membrane-active anticancer and antimicrobial peptides and further used to explore viral fusion peptides. Membrane-interacting peptides play a crucial role in several biological processes. Fusion peptides are a subclass found in enveloped viruses that is particularly relevant for membrane fusion. Determining which properties characterize fusion peptides and distinguish them from other proteins is a very relevant scientific question with important technological implications. Using three different datasets composed of well-annotated sequences, different feature extraction techniques and feature selection methods (resulting in a total of over 20 datasets), seven ML models were trained and tested, using cross-validation for error estimation and grid search for model selection. The different models, feature sets and feature selection techniques were compared.
The best models obtained for the different metrics were then used to predict the location of a known fusion peptide in a protein sequence from the Dengue virus. Feature importances were also analysed. The models obtained will be useful in future research, also providing biological insight into the distinctive physicochemical characteristics of fusion peptides. This work presents a freely available tool to perform ML-based protein classification and the first global analysis and prediction of viral fusion peptides using ML, reinforcing the usability and importance of ML in protein classification problems.
One of the most challenging problems in bioinformatics is the characterization of protein sequences, structures and functions. Physicochemical and structural properties derived from the protein sequence have been used in the development of machine learning (ML) models. However, tools to calculate these features are scarce and have limitations in terms of efficiency, ease of use and adaptability to different problems. Here, a generic, automated modular platform for the classification of proteins based on their physicochemical properties, making use of different ML algorithms, is described. The tool developed facilitates the main ML tasks and includes modules to read and alter sequences, calculate protein features, preprocess data, perform feature reduction and selection, run clustering, build ML models and make predictions. As it is built in a modular way, the user retains the power to alter the code to meet specific needs. This platform was tested with anticancer and antimicrobial peptides and was further used to explore viral fusion peptides.
Fusion peptides are a class of membrane-interacting peptides found in enveloped viruses and are particularly relevant for the fusion of the viral membrane with the host membrane. Determining which properties characterize them is a highly relevant scientific question with important technological implications. Using three different datasets composed of well-annotated sequences, four different feature extraction techniques and five different feature selection methods (24 datasets tested in total), seven ML models were trained and tested with 10-fold cross-validation and a grid search approach. The best models obtained, with MCC scores between 0.7 and 0.8 and accuracy between 0.85 and 0.9, were used to predict the location of a known fusion peptide in a sequence of the Dengue virus fusion protein. The models obtained to predict the fusion peptide location are useful for future research, also providing biological insight into the distinctive physicochemical characteristics of these peptides. This work presents a freely available tool to perform protein classification with ML and the first global analysis of viral fusion peptides using ML-based methods, reinforcing the usability and importance of ML in protein classification problems.
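
For readers unfamiliar with the two scores quoted in this abstract, the following is a minimal sketch of how the Matthews correlation coefficient (MCC) and accuracy are computed for a binary classifier. The function names `confusion_counts`, `accuracy` and `mcc` are hypothetical helpers written for illustration and are not taken from the dissertation or its package; labels are assumed to be encoded as 0 (negative) and 1 (positive).

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient, in the range [-1, 1].

    Returns 0.0 when the denominator is zero (for example when every
    prediction falls in a single class), a common convention.
    """
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Toy labels, not data from the dissertation.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
    print("accuracy:", accuracy(y_true, y_pred))
    print("MCC:", mcc(y_true, y_pred))
```

Unlike accuracy, MCC uses all four confusion-matrix cells, so it stays informative when the positive and negative classes are unbalanced, which is one common reason to report both.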

    ProPythia, an automated platform for the classification of peptides/proteins using machine learning

    One of the most challenging problems in bioinformatics is to computationally characterize sequences, structures and functions of proteins. Sequence-derived structural and physicochemical properties of proteins have been used in the development of machine learning models in protein-related problems. However, tools and platforms to calculate features and perform machine learning (ML) with proteins are scarce and have their limitations in terms of effectiveness, user-friendliness and applicability. Here, a generic modular automated ML-based platform for the classification of proteins based on their physicochemical properties is proposed. ProPythia, developed as a Python package, facilitates the major tasks of ML and includes modules to read and alter sequences, calculate protein features, pre-process datasets, execute feature reduction and selection, perform clustering, train and optimize ML models and make predictions. This platform was validated by testing its ability to classify anticancer and antimicrobial peptides and further used to explore viral fusion peptides. Membrane-interacting peptides play a crucial role in several biological processes. Fusion peptides are a subclass found in enveloped viruses that is particularly relevant for membrane fusion. Determining which properties characterize fusion peptides and distinguish them from other proteins is a very relevant scientific question with important technological implications. Using three different datasets composed of well-annotated sequences, different feature extraction techniques and feature selection methods, ML models were trained, tested and used to predict the location of a known fusion peptide in a protein sequence from the Dengue virus. Feature importance was also analysed. The models obtained will be useful in future research, also providing biological insight into the distinctive physicochemical characteristics of fusion peptides.
This work presents a freely available tool to perform ML-based protein classification and the first global analysis and prediction of viral fusion peptides using ML, reinforcing the usability and importance of ML in protein classification problems.
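
To make the idea of sequence-derived features and of locating a peptide within a longer sequence more concrete, here is a minimal sketch in plain Python. It does not use or reproduce the ProPythia API: the names `amino_acid_composition`, `mean_hydropathy`, `feature_vector`, `scan_windows` and `best_window` are hypothetical, the features shown (amino acid composition and mean Kyte-Doolittle hydropathy) are only two simple examples of physicochemical descriptors, and the sliding-window scan is an assumed strategy in which mean hydropathy stands in for the score a trained classifier would give.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy scale (higher values are more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def amino_acid_composition(seq):
    """Fraction of each of the 20 standard residues in the sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    # Non-standard symbols still count towards the sequence length.
    return {aa: counts[aa] / len(seq) for aa in AMINO_ACIDS}


def mean_hydropathy(seq):
    """Average Kyte-Doolittle value over the standard residues."""
    values = [KYTE_DOOLITTLE[aa] for aa in seq.upper() if aa in KYTE_DOOLITTLE]
    if not values:
        raise ValueError("no standard residues in sequence")
    return sum(values) / len(values)


def feature_vector(seq):
    """21 numbers: 20 composition fractions plus mean hydropathy."""
    comp = amino_acid_composition(seq)
    return [comp[aa] for aa in AMINO_ACIDS] + [mean_hydropathy(seq)]


def scan_windows(seq, window, score_fn=mean_hydropathy):
    """Score every window of the given length; returns (start, score) pairs.

    score_fn is a stand-in: with a trained model it would be the
    predicted probability for the window's feature vector.
    """
    if window <= 0 or window > len(seq):
        raise ValueError("window must be between 1 and len(seq)")
    return [
        (start, score_fn(seq[start:start + window]))
        for start in range(len(seq) - window + 1)
    ]


def best_window(seq, window, score_fn=mean_hydropathy):
    """Return the (start, score) pair with the highest score."""
    return max(scan_windows(seq, window, score_fn), key=lambda item: item[1])


if __name__ == "__main__":
    toy = "DDDDIIIIDDDD"  # toy sequence, not a real viral protein
    print(feature_vector("AAGG"))
    print(best_window(toy, 4))
```

In a real workflow the feature vectors would feed a classifier trained on annotated fusion and non-fusion peptides, and `score_fn` would wrap that classifier; the window length would also need to be chosen to match typical fusion peptide lengths.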

    Molecular determinants of the SARS-CoV-2 fusion peptide activity

    The COVID-19 pandemic, caused by the SARS-CoV-2 virus, emerged in late 2019 and quickly spread worldwide, resulting in over 125 million infections and 2.7 million deaths as of March 2021, according to the World Health Organization. Despite the great advances achieved by the scientific community in providing crucial information about this virus, we are still far from completely understanding it. SARS-CoV-2 is an enveloped virus, meaning that it is encapsulated by a lipid membrane, which needs to be fused to the host membrane to begin the infection process. Fusion between the viral and host membranes is catalyzed by the spike (S) glycoprotein. The S-protein contains elements essential to the infection mechanism, notably the receptor-binding domain, which is known to bind to angiotensin-converting enzyme 2 during the viral entry pathway. Another important region, known as the fusion peptide (FP), plays an essential part in the fusion mechanism by inserting into and disturbing the host membrane. There is still no consensus among scientists on the location of the fusion peptide in the S-protein sequence, with two major candidate regions having been proposed. We recently used a machine learning-based tool developed by us to identify viral FPs with accuracies over 85%. With this tool, a putative FP previously suggested in the literature was identified, along with other proposals, including the requirement of more than one FP. To further address this question, we are performing a systematic analysis of the SARS-CoV-2 putative FPs using Molecular Dynamics (MD) simulations, which provide a detailed perspective of how these peptides insert into and interact with the membrane. In parallel, we are characterizing these systems experimentally. Additionally, we are exploring therapeutic strategies targeting these regions.
Given the major role of the FP in the virus infection process, this work provides relevant insights and contributes to the fight against COVID-19.