Modelling the structural, functional and phenotypic consequences of protein coding mutations
Proteins are integral to all cellular processes and underpin the function of all extant organisms, meaning variants impacting them are a primary cause of phenotypic variation. Protein coding variants are a key area of study in biology, with relevance from structural and molecular biology to population genetics. They are also medically important, impacting inherited genetic diseases, cancer and response to pathogens. Recent advances in high-throughput experimental techniques have opened the door to many new approaches in biology, and protein variants are no exception. Deep mutational scanning experiments exhaustively measure the fitness of variants in a protein, which gives us more experimentally validated mutational consequence measurements than ever before. Such advances, together with ever larger sequence and structure databases, have created an opportunity to apply large scale analyses to coding variation, studying the effect on protein structure, function and phenotype.
In this thesis I perform three large scale variant analyses. First, I use the consequences of variation to learn about protein structure and function. I compile a dataset from 28 deep mutational scanning studies, covering 6291 positions in 30 proteins, and use the consequences of mutation at each position to define a mutational landscape. I show rich biophysical relationships in this landscape and identify functionally distinct positional subtypes of each amino acid. In the second analysis, I explore genotype to phenotype prediction using a dataset of 1011 S. cerevisiae strains, with genotypes, transcriptomics, proteomics and measured phenotypes, and comprehensive gene deletions in four strains. I show knowledge-based models of mutational consequences and pathway function can be used to associate genes with phenotypes and predict growth phenotypes across 34 growth conditions. However, genetic background is found to have a large effect on variant consequences, to such an extent that the same deletion can be highly significant in one strain and have no effect in another. Finally, I analyse computational variant effect prediction, benchmarking current predictors using deep mutational scanning data. I then develop a new end-to-end deep convolutional neural network predictor that predicts consequences directly from sequence and structure and show it improves on current methods. Together these projects advance our knowledge of protein coding variation and enhance our capacity to link variation to impacts on structure, function and phenotype.
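The benchmarking step mentioned in this abstract can be illustrated with a small sketch. The abstract does not say which metric was used; Spearman rank correlation between predicted scores and measured fitness is assumed here only because it is a common choice for this kind of comparison. The function names and the example values are hypothetical.

```python
import math
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(measured) != len(predicted) or len(measured) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(measured), average_ranks(predicted))


# Hypothetical fitness scores for five variants and one predictor's output.
measured_fitness = [0.9, 0.1, 0.5, 0.7, 0.3]
predicted_score = [0.8, 0.2, 0.4, 0.9, 0.1]
rho = spearman(measured_fitness, predicted_score)
```

A rank-based measure is convenient in this setting because predictors and experiments report scores on different scales, and only the ordering of variants needs to agree.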
Machine learning and mapping algorithms applied to proteomics problems
Proteins provide evidence that a given gene is expressed, and machine learning algorithms can be applied to various proteomics problems in order to gain information about the underlying biology. This dissertation applies machine learning algorithms to proteomics data in order to predict whether a given peptide is observable by mass spectrometry and whether a given peptide can serve as a cell penetrating peptide, and then uses the peptides observed through mass spectrometry to aid in the structural annotation of the chicken genome. Peptides observed by mass spectrometry are used to identify proteins, and accurately predicting which peptides will be seen allows researchers to analyze to what extent a given protein is observable. Cell penetrating peptides could enable targeted small molecule delivery across cellular membranes and may serve as drug delivery peptides. Peptides and proteins identified through mass spectrometry can help refine computational gene models and improve structural genome annotations.
Machine learning for the prediction of protein-protein interactions
The prediction of protein-protein interactions (PPI) has recently emerged as an important problem in the fields of bioinformatics and systems biology, because most essential cellular processes are mediated by these kinds of interactions. In this thesis we focussed on the prediction of co-complex interactions, where the objective is to identify and characterize protein pairs which are members of the same protein complex.
Although high-throughput methods for the direct identification of PPI have been developed in recent years, it has been demonstrated that the data obtained by these methods is often incomplete and suffers from high false-positive and false-negative rates. In order to deal with this technology-driven problem, several machine learning techniques have been employed in the past to improve the accuracy and reliability of predicted protein interacting pairs, demonstrating that the combined use of direct and indirect biological insights can improve the quality of predictive PPI models. This task has been commonly viewed as a binary classification problem. However, the nature of the data creates two major problems. Firstly, the classes are imbalanced, because the number of positive examples (pairs of proteins which really interact) is much smaller than the number of negative ones. Secondly, the selection of negative examples is based on some unreliable assumptions which could introduce bias in the classification results.
The first part of this dissertation addresses these drawbacks by exploring the use of one-class classification (OCC) methods to deal with the task of prediction of PPI. OCC methods utilize examples of just one class to generate a predictive model which is consequently independent of the kind of negative examples selected; additionally these approaches are known to cope with imbalanced class problems. We designed and carried out a performance evaluation study of several OCC methods for this task. We also undertook a comparative performance evaluation with several conventional learning techniques.
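The idea of learning from positive examples only can be sketched as follows. This is a minimal illustration, not the methods evaluated in the thesis, which the abstract does not name: it assumes a simple centroid-based one-class model that accepts a protein pair when its feature vector lies within a radius learned from the known interacting pairs. The class name, the `quantile` parameter and the two-dimensional feature vectors are all hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


class CentroidOneClassClassifier:
    """Accept samples that lie close to the centroid of the positive class."""

    def __init__(self, quantile: float = 0.95) -> None:
        if not 0.0 < quantile <= 1.0:
            raise ValueError("quantile must be in (0, 1]")
        self.quantile = quantile
        self.centroid: list[float] = []
        self.radius = 0.0

    def fit(self, positives: Sequence[Vector]) -> "CentroidOneClassClassifier":
        """Learn the centroid and acceptance radius from positive examples only."""
        if not positives:
            raise ValueError("need at least one positive example")
        n, dim = len(positives), len(positives[0])
        self.centroid = [sum(p[d] for p in positives) / n for d in range(dim)]
        distances = sorted(math.dist(p, self.centroid) for p in positives)
        # Radius covers the chosen fraction of the training positives.
        index = min(n - 1, max(0, math.ceil(self.quantile * n) - 1))
        self.radius = distances[index]
        return self

    def predict(self, sample: Vector) -> bool:
        """Return True if the sample is classified as an interacting pair."""
        return math.dist(sample, self.centroid) <= self.radius


# Hypothetical feature vectors for known interacting pairs
# (for instance two similarity scores scaled to [0, 1]).
known_interactions = [(0.9, 0.8), (0.8, 0.9), (1.0, 1.0), (0.9, 1.0)]
model = CentroidOneClassClassifier(quantile=1.0).fit(known_interactions)

similar_pair = model.predict((0.95, 0.9))
dissimilar_pair = model.predict((0.1, 0.2))
```

No negative examples enter `fit`, so the model does not depend on how non-interacting pairs are chosen, which is the property the paragraph above highlights.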
Furthermore, we pay attention to a new potential drawback which appears to affect the performance of PPI prediction. This is associated with the composition of the positive gold standard set, which contains a high proportion of examples associated with interactions of ribosomal proteins. We demonstrate that this situation indeed biases the classification task, resulting in an over-optimistic performance result. The prediction of non-ribosomal PPI is a much more difficult task. We investigate some strategies to improve performance on this subtask, integrating new kinds of data as well as combining diverse classification models generated from different sets of data.
In this thesis, we undertook a preliminary validation study of the new PPI predicted by OCC methods. To achieve this, we focus on three main aspects: searching the literature for biological evidence that supports the new predictions; analysing the properties of the predicted PPI networks; and identifying highly interconnected groups of proteins which can be associated with new protein complexes.
Finally, this thesis explores a slightly different area, related to the prediction of PPI types. This is associated with the classification of PPI structures (complexes) contained in the Protein Data Bank (PDB) database according to their function and binding affinity. Given the relatively small number of crystallized protein complexes available, it is not possible at the moment to link these results with the ones obtained previously for the prediction of PPI complexes. However, this could become possible in the near future as more PPI structures become available.
Risk assessment for progression of Diabetic Nephropathy based on patient history analysis
Diabetic nephropathy (DN) is one of the most common complications in patients with diabetes. It is a chronic disease that progressively affects the kidneys and can result in kidney failure. Digitalization has allowed hospitals to store patient information in electronic health records (EHR). Applying Machine Learning (ML) algorithms to these data can enable prediction of the risk of progression in these patients, leading to better disease management. The main objective of this work is to create a predictive model that takes advantage of the patient history present in EHRs. This work used the largest dataset of Portuguese patients with DN, followed for 22 years by the Associação Protetora dos Diabéticos de Portugal (APDP). A longitudinal approach was developed in the data pre-processing phase, allowing the data to serve as input to sixteen distinct ML algorithms. After evaluating and analysing the respective results, the Light Gradient Boosting Machine was identified as the best model, showing good predictive capabilities. This conclusion was supported not only by the evaluation of several classification metrics on training, test and validation data, but also by the evaluation of its performance for each stage of the disease. In addition, the models were analysed using feature ranking plots and through statistical analysis. As a complement, the interpretability of the results through the SHAP method is also presented, as well as the deployment of the model using Gradio and the Hugging Face servers. By integrating ML techniques, an interpretation method and a Web application that provides access to the model, this study offers a potentially effective approach to anticipating the progression of DN, allowing health professionals to make informed decisions for the delivery of personalised care and disease management.
Embryo selection through image analysis: a Deep Learning approach
Infertility affects about 186 million people worldwide and 9-10% of couples in Portugal, causing financial, social and medical problems. Evaluation of embryo quality based on morphological features is the standard in in vitro fertilization (IVF) clinics around the world. This process is subjective and time-consuming, and results in discrepant classifications among embryologists and clinics, which leads to inaccurate prediction of embryo implantation and live birth potential. Although assisted reproductive technologies (ART) such as IVF coupled with time-lapse imaging (TLI), which eliminates the periodic transfer of embryos to a microscope for assessment and keeps culture conditions stable during embryo development, have alleviated the infertility problem, significant limitations remain even when morphokinetic analysis is considered. Moreover, many patients require multiple IVF cycles to achieve pregnancy, making the selection of a single embryo for transfer a critical challenge. Here, we demonstrate, as a proof of concept, the reliability of machine learning, especially deep learning based on the open-source TensorFlow and Keras libraries, for feature extraction and classification of raw embryo TLI images in clinical practice. We also present a follow-up pipeline that lets clinicians and researchers with no expertise in machine learning use deep learning easily, rapidly and accurately as a clinical decision support tool in embryo viability studies, as well as in other medical fields where image analysis is prominent.
Master's in Molecular and Cell Biology
Understanding the functional roles of Intrinsic Protein disorder in NFkB Transcription factors
Master's, Master of Science