617 research outputs found

    SNARE-CNN: a 2D convolutional neural network architecture to identify SNARE proteins from high-throughput sequencing data

    Deep learning has been increasingly and widely used to solve numerous problems in various fields with state-of-the-art performance. It can also be applied in bioinformatics to reduce the requirement for feature extraction and reach high performance. This study uses deep learning to predict SNARE proteins, which carry out one of the most vital molecular functions in life science. A functional loss of SNARE proteins has been implicated in a variety of human diseases (e.g., neurodegenerative diseases, mental illness, and cancer). Therefore, creating a precise model to identify their functions is crucial for understanding these diseases and designing drug targets. Our SNARE-CNN model, which uses two-dimensional convolutional neural networks and position-specific scoring matrix profiles, identified SNARE proteins with a sensitivity of 76.6%, specificity of 93.5%, accuracy of 89.7%, and MCC of 0.7 on the cross-validation dataset. We also evaluated the model on an independent dataset, and the result shows that we are able to solve the overfitting problem. Compared with other state-of-the-art methods, this approach achieved significant improvement in all of the metrics. Through this study, we provide an effective model for identifying SNARE proteins and a basis for further research that applies deep learning in bioinformatics, especially in protein function prediction. SNARE-CNN is freely available at https://github.com/khanhlee/snare-cnn
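
    The four figures quoted in this abstract (sensitivity, specificity, accuracy, MCC) all derive from a binary confusion matrix. The sketch below shows the standard formulas. The function name `binary_metrics` and the counts passed to it are hypothetical illustrations, not values or code from the SNARE-CNN study.

```python
import math


def _safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion-matrix counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": _safe_div(tp, tp + fn),          # true positive rate
        "specificity": _safe_div(tn, tn + fp),          # true negative rate
        "accuracy": _safe_div(tp + tn, tp + fp + tn + fn),
        "mcc": _safe_div(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical counts chosen only to illustrate the arithmetic.
metrics = binary_metrics(tp=80, fp=10, tn=90, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

    MCC is worth reporting alongside accuracy on imbalanced data because it uses all four cells of the confusion matrix, so a classifier that favours the majority class tends to score poorly on it even when accuracy looks high.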

    APPLICATION OF MACHINE LEARNING APPROACHES TO EMPOWER DRUG DEVELOPMENT

    Human health, one of the major topics in life science, is facing intensified challenges, including cancer, pandemic outbreaks, and antimicrobial resistance. New medicines with unique advantages, including peptide-based vaccines and permeable small-molecule antimicrobials, are therefore urgently needed. However, the drug development process is long, complex, and risky, with no guarantee of success. In addition, improvements in the techniques applied in genomics, proteomics, computational biology, and clinical trials significantly increase data complexity and volume, which imposes higher requirements on the drug development pipeline. In recent years, machine learning (ML) methods have been employed to support drug development in various aspects and have been shown to be highly effective. Here, we explored the application of advanced ML approaches to empower the development of peptide-based vaccines and permeable antimicrobials. First, peptide-based vaccines targeting pancreatic cancer and COVID-19 were predicted and screened via multiple approaches. Next, novel structure-based methods to improve the performance of peptide:MHC binding affinity prediction were developed, including an HLA modeling pipeline that provides structures for docking-based peptide binder validation, and hierarchical clustering of HLA I into supertypes and subtypes that have similar peptide binding specificity. Finally, the physicochemical properties governing the permeability of small molecules into multidrug-resistant Pseudomonas aeruginosa cells were selected using a random forest model. In conclusion, the use of machine learning methods could accelerate the drug development process at a lower cost and promote data-based decision-making if used properly

    Computational Approaches to Understanding the Structure, Dynamics, Functions, and Mechanisms of Various Bacterial Proteins

    The 3D structure of a protein can be fundamentally useful for understanding protein function. In the absence of an experimentally determined structure, the most common way to obtain protein structures is to use homology modeling, or the mapping of the target sequence onto a closely related homolog with an available structure. However, despite recent efforts in structural biology, the 3D structures of many proteins remain unknown. Recent advances in genomic and metagenomic sequencing coupled with coevolution analysis and protein structure prediction have allowed for highly accurate models of proteins that were previously considered intractable to model due to the lack of suitable templates. Structural models obtained from homology modeling, coevolution-based modeling, or crystallography can then be used with other computational tools such as small molecule docking or molecular dynamics (MD) simulations to help understand protein function, dynamics, and mechanism. Here coevolution-based modeling was used to build a structural model of the HgcAB complex involved in mercury methylation (Chapter I). Based on the model it was proposed that conserved cysteines in HgcB are involved in shuttling mercury, methylmercury, or both. MD simulations and docking to a homology model of E. coli inosine monophosphate dehydrogenase (IMPDH) provided insights into how a single amino acid mutation could relieve inhibition by altering protein structure and dynamics (Chapter II). Coevolution-based structure prediction was also combined with docking and experimental activity data to generate machine learning models that predict enzyme substrate scope for a series of bacterial nitrilases (Chapter III). Machine learning was also used to identify physicochemical properties that describe outer membrane permeability and efflux in E. coli and P. aeruginosa, and new efflux pump inhibitors for the E. coli AcrAB-TolC efflux pump were identified using existing physicochemical guidelines in combination with small molecule docking to a homology model of AcrA (Chapter IV). Lastly, quantum mechanical/molecular mechanical simulations were used to study the mechanism of a key proton transfer step in Toho-1 beta-lactamase using experimentally determined structures of both the apo and cefotaxime-bound forms. These simulations revealed that substrate binding promotes catalysis by enhancing the favorability of this initial proton transfer step (Chapter V)

    Knowledge-based approaches for understanding structure-dynamics-function relationship in proteins

    Proteins accomplish their functions through conformational changes, often brought about by changes in environmental conditions or ligand binding. Predicting the functional mechanisms of proteins is impossible without a deeper understanding of conformational transitions. Dynamics is the key link between the structure and function of proteins. The protein data bank (PDB) contains multiple structures of the same protein, which have been solved under different conditions, using different experimental methods or in complexes with different ligands. These alternate conformations of the same protein (or similar proteins) can provide important information about what conformational changes take place and how they are brought about. Though there have been multiple computational approaches developed to predict dynamics from structure information, little work has been done to exploit this apparent, but potentially informative, redundancy in the PDB. In this work I bridge this gap by exploring various knowledge-based approaches to understand the structure-dynamics relationship and how it translates into protein function. First, a novel method for constructing free energy landscapes for conformational changes in proteins is proposed by combining principal motions with knowledge-based potential energies and entropies from coarse-grained models of protein dynamics. Second, an innovative method for computing knowledge-based entropies for proteins using an inverse Boltzmann approach is introduced, similar to the manner in which statistical potentials were previously extracted. We hypothesize that amino acid contact changes observed in the course of conformational changes within a large set of proteins can provide information about local pairwise flexibilities or entropies. 
By combining this new entropy measure with knowledge-based potential functions, we formulate a knowledge-based free energy (KBF) function that we demonstrate outperforms other statistical potentials in its ability to identify native protein structures embedded within sets of decoys. Third, I apply the methods developed above in collaboration with experimentalists to understand the molecular mechanisms of conformational changes in several protein systems including cadherins and membrane transporters. This work introduces several ways that the large volume of data in the PDB can be utilized to understand the underlying principles behind the structure-dynamics-function relationships of proteins. Results from this work have several important applications in structural bioinformatics such as structure prediction, molecular docking, protein engineering and design. In particular, the new KBFs developed in this dissertation have immediate applications in emerging topics such as prediction of 3D structure from coevolving residues in sequence alignments as well as in identifying the phenotypic effects of mutants
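
    The inverse Boltzmann relation mentioned in this abstract converts observed frequencies into pseudo-energies, E = -kT ln(f_observed / f_reference). The sketch below illustrates that generic relation for contact types only; it is not the dissertation's entropy measure or KBF function. The function name, the contact labels, the counts, and the choice of kT = 1 are all assumptions made for illustration.

```python
import math

KT = 1.0  # assumption: energies are reported in units of kT


def inverse_boltzmann(observed, reference, pseudocount=1.0):
    """Map contact counts to pseudo-energies via E = -kT * ln(f_obs / f_ref).

    `observed` and `reference` are dicts of contact type -> count.
    The pseudocount avoids log(0) for contact types that were never seen.
    """
    contact_types = sorted(set(observed) | set(reference))
    obs_total = sum(observed.get(t, 0) + pseudocount for t in contact_types)
    ref_total = sum(reference.get(t, 0) + pseudocount for t in contact_types)

    energies = {}
    for t in contact_types:
        f_obs = (observed.get(t, 0) + pseudocount) / obs_total
        f_ref = (reference.get(t, 0) + pseudocount) / ref_total
        energies[t] = -KT * math.log(f_obs / f_ref)
    return energies


# Hypothetical counts: contacts seen in native structures vs. a reference state.
observed_counts = {"LEU-ILE": 300, "LYS-LYS": 50, "ALA-GLY": 150}
reference_counts = {"LEU-ILE": 150, "LYS-LYS": 150, "ALA-GLY": 200}

energies = inverse_boltzmann(observed_counts, reference_counts, pseudocount=0.0)
for contact, energy in energies.items():
    print(f"{contact}: {energy:+.3f} kT")
```

    Contact types that occur more often than the reference state predicts receive negative (favourable) pseudo-energies, and under-represented ones receive positive values. The choice of reference state is the main modelling decision in any potential of this kind.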

    Identification, analysis and inference of point mutations associated to drug resistance in bacteria: a lesson learnt from the resistance of Streptococcus pneumoniae to quinolones

    Antibiotic resistance is one of the biggest public health challenges of our time. Bacterial chemoresistance is the phenomenon whereby bacteria develop the ability to survive and multiply in the presence of an antibacterial drug. The expression of a resistant phenotype may be due to three fundamental mechanisms: the expression of enzymes that inactivate the antibacterial drug, changes in membrane permeability to antibiotics, and the onset of point mutations causing the physical-chemical alteration of the antimicrobial targets. In recent decades, new antibiotic resistance mechanisms have emerged and are spreading globally, threatening human health and the ability to fight the most common infectious diseases. Quinolones, a novel class of antibiotics that bind bacterial topoisomerases and inhibit cell replication, have been important in limiting the spread of penicillin- and macrolide-resistant Streptococcus pneumoniae. Alarmingly, however, resistance to quinolones has recently been spreading. Resistance is caused by the appearance of point mutations in the bacterial topoisomerase and gyrase. Some mutations are well known, but others are not, and the information about known molecular mechanisms causing resistance is sparse and not systematically collected and organised. This means that it cannot be used to infer new mutations in newly sequenced bacterial genes and study how they may affect drug binding. The lack of structured, organized, and reusable information about point mutations associated with antibiotic resistance represents a critical issue and is a common pattern in the field. Here, we present a structural analysis of point mutations involved in the resistance to quinolones affecting the gyrase and topoisomerase genes in Streptococcus pneumoniae. Results, extended to other bacterial species, have been collected in a database, Quinores3D db, and can now be used, through a web server, Quinores3D finder, to analyze both known and yet unknown mutations occurring in bacterial topoisomerases and gyrases. The development, testing and deployment of Quinores3D db and Quinores3D finder are further results of this PhD thesis. Furthermore, structural data about point mutations associated with antibiotic resistance were used to train, test and validate a machine learning algorithm for the inference of still unknown mutations potentially involved in bacterial resistance to quinolones. As the performance of the algorithm, measured in terms of accuracy, sensitivity and specificity, is very promising, we plan to incorporate it in the web server to allow users to predict new mutations associated with bacterial resistance to quinolones

    Bacteriophage-host determinants: identification of bacteriophage receptors through machine learning techniques

    Master's dissertation in Bioinformatics. Bacterial resistance to antibiotics is becoming a major concern. Several reports indicate that bacteria are developing resistance mechanisms to various antibiotics. Moreover, the processes involved in the development of new antibiotics are lengthy and expensive. Therefore, an alternative to antibiotics is needed. One promising alternative is bacteriophages, viruses that specifically infect bacteria, causing their lysis. Hence, it would be interesting to discover which bacteria a specific phage recognizes. Phage specificity is determined by the bacterial surface receptors that a phage can recognize: phages use tail spikes/fibres as receptor-binding proteins to detect carbohydrates or proteins on the bacterial surface. Studying interactions between phage tail spikes/fibres and bacterial receptors can allow the identification of interaction pairs. Machine learning algorithms can be used to find patterns in these interactions and build models to make predictions. In this work, PhageHost, a tool that predicts hosts at the strain level for three species, E. coli, K. pneumoniae and A. baumannii, was developed. Data were extracted from GenBank, retrieving general, protein and coding information for both phages and bacteria. The protein data were used to build an important phage protein function database that allowed the classification of protein functions, namely phage tail spikes/fibres. In the end, several machine learning models with relevant protein features were created to predict phage-host strain interactions. Compared with previous work, these models show better predictive power and the ability to perform strain-level predictions. For the best model, a Matthews correlation coefficient (MCC) of 96.6% and an F-score of 98.3% were obtained. The best predictive models were implemented online, in a server under the name PhageHost (https://galaxy.bio.di.uminho.pt)

    The Structural and Functional Study of Efflux Pumps Belonging to the RND Transporters Family from Gram-Negative Bacteria

    Antimicrobial-resistant bacterial infections are a major and costly public health concern. Several pathogens are already pan-resistant, representing a major cause of mortality in patients suffering from nosocomial infections. Drug efflux pumps, which remove compounds from the bacterial cell, thereby lowering the antimicrobial concentration to sub-toxic levels, play a major role in multidrug resistance. In this Special Issue, we present up-to-date knowledge of the mechanism of RND efflux pumps, the identification and characterization of efflux pumps from emerging pathogens and their role in antimicrobial resistance, and progress made on the development of specific inhibitors. This collection of data could serve as a basis for antimicrobial drug discovery aimed at inhibiting drug efflux pumps to reverse resistance in some of the most resistant pathogens

    In silico analysis of membrane transport/permeability mechanisms

    Lipid membranes are a fundamental component of living cells, mediating the physical separation of intracellular components from the external environment, as well as of the different cellular organelles from the cytoplasm. Transmembrane transport proteins confer permeability to lipid membranes, which is essential for nutrient translocation and energy metabolism. Crystallography of transmembrane proteins is a particularly challenging problem. Due to their natural localization and chemical properties, only a limited number of structures are available to date at atomic resolution. In silico analysis can be successfully applied to address the structure and to propose testable models of transporters and pores and of their function. My PhD work focused on two main models: Pendrin (SLC26A4) and the Permeability Transition Pore (PTP). These two systems allowed me to investigate different membrane types and permeation mechanisms, i.e. the plasma membrane-specific anion exchange (SLC26A4) and the inner mitochondrial membrane (IMM) unselective PTP. Pendrin mutations are estimated to be the second most common genetic cause of human deafness, but a precise 3D structure of the protein is still missing. The aim of my work was to compensate for the absence of structural information for the pendrin transmembrane domain and to give a functional explanation for mutations collected in the MORL Deafness Variation Database. The human pendrin 3D model was inferred by homology with SLC26Dg and then validated by analyzing the surface distribution of hydrophobic residues. The resulting high-quality model was used to map 147 pathogenic human mutations. Three mutation clusters were found, while their localization suggested a novel 14-transmembrane-domain structure for pendrin. The nature of the PTP has long remained a mystery. In 2013 Giorgio et al. suggested that dimers of F1FO (F)-ATP synthase form the pore; however, the exact PTP composition and how a pore can form from the energy-conserving enzyme are still a matter of debate. PTP opening is triggered by an increased Ca2+ concentration in the mitochondrial matrix, and is favored by oxidative stress. To shed light on PTP function, I investigated the effect of Ca2+ binding to the Me2+ binding site of the F1 domain of F-ATP synthase through molecular dynamics (MD) simulations. A similar approach was also applied to the F-ATP synthase β subunit mutation T163S, which alters the relative affinity for Mg2+ and Ca2+. Experimental data show that Ca2+ binding stiffens the complex structure and that the T163S mutation induces resistance to PTP opening. Further, catalytic site rearrangement induced by different ion occupancy, as well as the mutation T163S, yields relevant variation of the interaction between the F1 domain and the OSCP subunit. I suggest that an unstructured loop between residues 82-131 of the β subunit transmits the structural rearrangement originating in the catalytic site to the OSCP subunit and then to the inner membrane through the rigid lateral stalk. The critical role emerging for OSCP in PTP regulation opens two parallel questions, i.e. (i) how the OSCP-mediated opening signal is transmitted to the transmembrane region and (ii) what the transmembrane PTP components are. Variation in pore conductivity among species suggested that the putative pore-forming subunits may be different in different species. Sequence alignment was performed for all the subunits of F-ATP synthase, but we mainly focused on subunits e, g and b due to their localization in the complex and sequence conservation. Specific mutations affecting F-ATP synthase were collected and their functional effect is currently under analysis. In parallel, the presence and features of e, g and f subunits across eukaryotes were investigated by means of phylogenetic analysis. Protein homologues of these specific subunits were found to be widespread in eukaryotes from yeast to plants, while we found that Oomycetes lack subunits e and g and green algae lack subunit e. This observation suggests an ancient evolution for the F-ATP synthase dimerization subunits and possibly for the PTP. Further analysis and experimental validation are planned to clarify this aspect

    Systems Biology of Protein Secretion in Human Cells: Multi-omics Analysis and Modeling of the Protein Secretion Process in Human Cells and its Application.

    Since the emergence of modern biotechnology, the production of recombinant pharmaceutical proteins has been an expanding field with high demand from industry. Pharmaceutical proteins have constituted the majority of top-selling drugs in the pharma industry during recent years. Many of these proteins require post-translational modifications and are therefore produced using mammalian cells such as Chinese Hamster Ovary cells. Despite frequent improvements in developing efficient cell factories for producing recombinant proteins, the natural complexity of the protein secretion process still poses serious challenges for the production of some proteins at the desired quantity and accepted quality. These challenges have been intensified by the growing demands of the pharma industry to produce novel products with greater structural complexity, as well as increasing expectations from regulatory authorities in the form of new quality control criteria to guarantee product safety. This thesis focuses on different aspects of the protein secretion process, including its engineering for cell factory development and analysis in diseases associated with its deregulation. A major part of this thesis involved the use of HEK293 cells as a human model cell line for investigating the protein secretion process by generating different types of omics data and developing a computational model of the human protein secretion pathway. We compared the transcriptomic profile of cell lines producing erythropoietin (EPO; as a model secretory protein) at different rates to identify key genes that potentially contributed to higher rates of protein secretion. Moreover, by performing a transcriptomic comparison of cells producing green fluorescent protein (GFP; as a model non-secretory protein) with EPO producers, we captured differences that specifically relate to secretory protein production. We sought to further investigate the factors contributing to increased recombinant protein production by analyzing additional omic layers such as proteomics and metabolomics in cells that exhibited different rates of EPO production. Moreover, we developed a toolbox (HumanSec) to extend the reference human genome-scale metabolic model (Human1) to encompass protein-specific reactions for each secretory protein detected in our proteomics dataset. By generating cell-line-specific protein secretion models and constraining the models using metabolomics data, we could predict the top host cell proteins (HCPs) that compete with EPO for metabolic and energetic resources. Finally, based on the detected patterns of changes in our multi-omics investigations combined with a protein secretion sensitivity analysis using the metabolic model, we identified a list of genes and pathways that potentially play a key role in recombinant protein production and could serve as promising candidates for targeted cell factory design. In another part of the thesis, we studied the link between the expression profiles of genes involved in the protein secretory pathway (PSP) and various hallmarks of cancer. By implementing a dual approach involving differential expression analysis and eight different machine learning algorithms, we investigated the expression changes in secretory pathway components across different cancer types to identify PSP genes whose expression was associated with tumor characteristics. We demonstrated that the machine learning and differential expression approaches are complementary and, combined, could highlight key PSP components relevant to features of tumor pathophysiology that may constitute potential therapeutic targets

    Exploiting Advanced Methods for Membrane Protein Structure Prediction

    Recent strides in computational structural biology have opened up an opportunity to understand previously uncharacterised proteins. The under-representation of transmembrane proteins in the Protein Data Bank highlights the need to apply new and advanced bioinformatics methods to shed light on their structure and function. A protein’s structural information is crucial to understand its function and evolution. Currently, there is only experimental structural data for a tiny fraction of proteins. For instance, membrane proteins are encoded by 30% of the protein-coding genes of the human genome, but they only have a 3.5% representation in the Protein Data Bank (PDB). Membrane protein families are particularly poorly understood due to experimental difficulties, such as over-expression, which can result in toxicity to host cells, as well as difficulty in finding a suitable membrane mimetic to reconstitute the protein. Additionally, membrane proteins are much less conserved across species compared to water-soluble proteins, making sequence-based homologue identification a challenge, and in turn rendering homology modelling of these proteins more difficult. Until the structure of poorly characterised protein families can be elucidated experimentally, ab initio protein modelling can be used to predict a fold allowing for structure based function inferences. Such methods have made significant strides recently due to the availability of contact predictions, with these methods addressing larger targets than conventional fragment-assembly-based ab initio methods. 
This study initially focusses on the structure and function of transmembrane proteins, specifically in the process of autophagosome construction, and demonstrates how covariance prediction data have multiple roles in modern structural bioinformatics: not just acting as restraints for model making and serving for validation of the final models, but also predicting domain boundaries and revealing the presence of cryptic internal repeats not evidenced by sequence analysis. Furthermore, we characterised a contact map feature characteristic of a re-entrant helix, which may in future allow detection of this feature in other protein families. The recent innovations in computational structural biology were employed further, giving rise to an opportunity to revise our current understanding of the structure and function of clinically important proteins. Through the modelling of the transmembrane Pfam families and subsequent mining of their structural libraries we identified the human Oca2 protein as a protein of interest. Oca2 is located on mature melanosomal membranes, and mutations of Oca2 can result in a form of oculocutaneous albinism which is the most prevalent and visually identifiable form of albinism. Sequence analysis predicts Oca2 to be a member of the SLC13 transporter family but it has not been classified into any existing SLC families. The modelling of Oca2 with AlphaFold2 and other advanced methods shows that, like SLC13 members, it consists of a scaffold and transport domain and displays a pseudo inverted repeat topology that includes re-entrant loops. This finding contradicts the prevailing consensus view of its topology. In addition to the scaffold and transport domains, the presence of a cryptic GOLD domain is revealed that is likely responsible for its trafficking from the endoplasmic reticulum to the Golgi prior to localisation at the melanosomes and possesses known glycosylation sites. 
Analysis of the putative ligand binding site of the model shows the presence of highly conserved key asparagine residues that suggest Oca2 may be a Na+/dicarboxylate symporter. Known critical pathogenic mutations map to structural features present in the repeat regions that form the transport domain. Exploiting the AlphaFold2 multimeric modelling protocol in combination with conventional homology modelling allowed the building of a plausible homodimer in both an inward- and outward-facing conformation supporting an elevator-type transport mechanism