Predicting the Most Tractable Protein Surfaces in the Human Proteome for Developing New Therapeutics
A critical step in the target identification phase of drug discovery is evaluating druggability, i.e., whether a protein can be targeted with high affinity using drug-like ligands. The overarching goal of my PhD thesis is to build a machine learning model that predicts the binding affinity that can be attained when addressing a given protein surface. I begin by examining the lead optimization phase of drug development, where I find that in a test set of 297 examples, 41 (14%) change binding mode when a ligand is elaborated. My analysis shows that while certain ligand physicochemical properties predispose ligands to changes in binding mode, particularly those properties that define fragments, simple structure-based modeling proves far more effective for identifying substitutions that alter the binding mode. My proposed measure, RMAC (RMSD after minimization of the aligned complex), can help determine whether a given ligand can be reliably elaborated without changing binding mode, thus enabling straightforward interpretation of the resulting structure-activity relationships. I next note that random forest, a very popular machine learning algorithm for regression tasks, has a systematic bias in the predictions it generates; this bias is present in both real-world and synthetic datasets. To address this, I define a numerical transformation that can be applied to the output of random forest models. This transformation fully removes the bias in the resulting predictions and yields improved predictions across all datasets. Finally, taking advantage of this improved machine learning approach, I describe a model that predicts the “attainable binding affinity” for a given binding pocket on a protein surface. This model uses 13 physicochemical and structural features calculated from the protein structure, without any information about the ligand.
While details of the ligand must (of course) contribute somewhat to the binding affinity, I find that this model still recapitulates the binding affinity for 848 different protein-ligand complexes (across 230 different proteins) with a correlation coefficient of 0.57. I further find that this model is not limited to “traditional” drug targets, but works just as well for emerging “non-traditional” drug targets such as inhibitors of protein-protein interactions. Collectively, I anticipate that the tools and insights generated in the course of my PhD research will play an important role in facilitating the key target selection phase of drug discovery projects.
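The RMSD at the heart of the RMAC measure can be illustrated with a minimal sketch. It covers only the final comparison step: it assumes the ligand coordinates before and after minimization are already available as equally ordered lists of (x, y, z) tuples, and it does not perform the alignment or minimization. The function names, the toy coordinates, and the 2.0 Å cutoff are hypothetical choices for illustration and are not taken from the thesis.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered atom lists."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared Euclidean distance between matching atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


def binding_mode_retained(before, after, cutoff=2.0):
    """Return True if the ligand moved no more than `cutoff` angstroms (RMSD).

    The cutoff is a hypothetical threshold, not a value from the thesis.
    """
    return rmsd(before, after) <= cutoff


# Toy ligand (hypothetical coordinates): every atom shifts by 1 angstrom in z.
ligand_before = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
ligand_after = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (1.5, 1.5, 1.0)]

print(rmsd(ligand_before, ligand_after))
print(binding_mode_retained(ligand_before, ligand_after))
```

In practice the coordinates would come from a structure file and a minimization run, and the cutoff would need to be calibrated against known examples.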
Machine learning in computational biology to accelerate high-throughput protein expression
Motivation: The Human Protein Atlas (HPA) enables the simultaneous characterization of thousands of proteins across various tissues to pinpoint their spatial location in the human body. This has been achieved through transcriptomics and high-throughput immunohistochemistry-based approaches, where over 40 000 unique human protein fragments have been expressed in E. coli. These datasets enable quantitative tracking of entire cellular proteomes and present new avenues for understanding molecular-level properties influencing expression and solubility.
Results: Combining computational biology and machine learning identifies protein properties that hinder the HPA high-throughput antibody production pipeline. We predict protein expression and solubility with accuracies of 70% and 80%, respectively, based on a subset of key properties (aromaticity, hydropathy and isoelectric point). We guide the selection of protein fragments based on these characteristics to optimize high-throughput experimentation.
Availability and implementation: We present the machine learning workflow as a series of IPython notebooks hosted on GitHub (https://github.com/SBRG/Protein_ML). The workflow can be used as a template for analysis of further expression and solubility datasets.
Contact: [email protected] or [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
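Two of the sequence properties named above can be computed directly from an amino acid sequence, as the following sketch shows. It is an illustration under stated assumptions, not the authors' workflow: aromaticity is taken here as the fraction of F, W and Y residues, hydropathy as the mean Kyte-Doolittle value (the GRAVY score), and the example fragment is hypothetical. The isoelectric point is omitted because it depends on a choice of pKa values.

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AROMATIC = frozenset("FWY")


def aromaticity(sequence):
    """Fraction of residues that are Phe, Trp or Tyr (assumed definition)."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(1 for aa in seq if aa in AROMATIC) / len(seq)


def gravy(sequence):
    """Mean Kyte-Doolittle hydropathy over the sequence (GRAVY score)."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    # A KeyError here signals a non-standard residue code.
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


# Hypothetical protein fragment, used only to show the call pattern.
fragment = "MKTAYIAKQRWF"
features = {"aromaticity": aromaticity(fragment), "gravy": gravy(fragment)}
print(features)
```

Feature dictionaries of this kind could then be fed to any classifier; the notebooks in the linked repository remain the reference for what the authors actually did.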
Mining of soluble enzymes from genomic databases
Enzymes are proteins that accelerate chemical reactions, which makes them attractive for both pharmaceutical and industrial applications. Enzyme function is mediated by several essential amino acids that form the active site, the chemical environment in which the reaction is catalysed. In this work, two integrated bioinformatics tools for mining and rational selection of novel soluble enzymes, EnzymeMiner and SoluProt, are presented. EnzymeMiner takes one or more enzyme sequences as input, along with a description of essential residues, and searches the protein database. The description of essential residues increases the probability that a hit has a function similar to that of the input enzyme. The EnzymeMiner output is a set of annotated database hits. To facilitate selection of a small number of promising candidates for experimental validation, EnzymeMiner integrates taxonomic, environmental, and protein domain annotations. The main prioritization criterion is solubility predicted by the second tool, SoluProt. SoluProt is a machine-learning method for predicting heterologous soluble protein expression in Escherichia coli. The input is a protein sequence and the output is the probability that the protein will be expressed in soluble form. SoluProt uses a gradient boosting machine and was trained on a dataset derived from the TargetTrack database. When evaluated against a balanced independent test set derived from the NESG database, SoluProt achieved an accuracy of 58.5% and an AUC of 0.62, slightly exceeding a suite of alternative solubility prediction tools. Both EnzymeMiner and SoluProt are frequently used by the protein engineering community to find novel soluble biocatalysts for chemical reactions. These biocatalysts have great potential to decrease the energy consumption and environmental burden of many industrial chemical processes.
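The two evaluation metrics quoted for SoluProt, accuracy and AUC, can be computed from predicted solubility probabilities as in the sketch below. The labels and probabilities are toy values invented for illustration, the 0.5 decision threshold is an assumption, and the pairwise AUC formula is a generic textbook formulation rather than the evaluation code used in the thesis.

```python
def accuracy(labels, probabilities, threshold=0.5):
    """Fraction of proteins whose thresholded prediction matches the label."""
    if not labels or len(labels) != len(probabilities):
        raise ValueError("labels and probabilities must be non-empty and aligned")
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(1 for pred, y in zip(predictions, labels) if pred == y)
    return correct / len(labels)


def roc_auc(labels, probabilities):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen soluble protein scores
    higher than a randomly chosen insoluble one; ties count as half.
    """
    positives = [p for p, y in zip(probabilities, labels) if y == 1]
    negatives = [p for p, y in zip(probabilities, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy balanced test set (hypothetical): 1 = soluble, 0 = insoluble.
toy_labels = [1, 1, 0, 0]
toy_probabilities = [0.9, 0.4, 0.6, 0.2]

print(accuracy(toy_labels, toy_probabilities))
print(roc_auc(toy_labels, toy_probabilities))
```

On a balanced test set such as the one described above, a random predictor would score about 50% accuracy and an AUC of about 0.5, which gives a baseline for reading the reported 58.5% and 0.62.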