13 research outputs found

    Computational Feature Selection and Classification of RET Phenotypic Severity

    Get PDF
    pre-print
    Although many reported mutations in the RET oncogene have been directly associated with hereditary thyroid carcinoma, other mutations are labelled as uncertain gene variants because they have not been clearly associated with a clinical phenotype. Determining the severity of a mutation is costly and time consuming. Informatics tools and methods may help bridge this genotype-phenotype gap. Towards this goal, machine-learning classification algorithms were evaluated for their ability to distinguish benign and pathogenic RET gene variants, as characterized by differences in the values of physicochemical properties between the wild-type residue and the one in the mutated sequence. Representative algorithms were chosen from different categories of machine-learning classification techniques, including rules, Bayes, regression, nearest neighbour, support vector machines and trees. The machine-learning models were then compared to well-established techniques used for mutation severity prediction. Machine-learning classification can accurately predict RET mutation status using primary sequence information only. Existing algorithms based on sequence homology (ortholog conservation) or protein structural data are not necessarily superior.
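    The idea of classifying a substitution from physicochemical property differences can be sketched as below. This is a minimal illustration, not the study's pipeline: the property table is truncated, the labelled variants and the scaling constants are invented placeholders, and a simple nearest-neighbour vote stands in for the full set of algorithms evaluated.

```python
# Represent each substitution by the change in a few physicochemical
# properties between the wild-type and mutant residue, then classify by a
# nearest-neighbour vote. Training variants below are illustrative only.

# Residue -> (Kyte-Doolittle hydrophobicity, approx. volume in cubic A, charge).
PROPS = {
    "A": (1.8, 88.6, 0.0),
    "C": (2.5, 108.5, 0.0),
    "D": (-3.5, 111.1, -1.0),
    "L": (3.8, 166.7, 0.0),
    "R": (-4.5, 173.4, 1.0),
}
SCALE = (4.5, 80.0, 1.0)   # rough property ranges, to balance the features

def delta_features(wt, mut):
    """Scaled property differences between mutant and wild-type residue."""
    return tuple((PROPS[mut][k] - PROPS[wt][k]) / SCALE[k] for k in range(3))

def knn_predict(train, x, k=3):
    """Majority vote among the k nearest labelled substitutions."""
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    votes = [lab for _, lab in sorted(train, key=lambda t: dist(t[0], x))[:k]]
    return max(set(votes), key=votes.count)

# Hypothetical labelled variants: (wild type -> mutant) with a severity label.
train = [
    (delta_features("C", "R"), "pathogenic"),  # drastic charge/polarity change
    (delta_features("C", "D"), "pathogenic"),
    (delta_features("A", "L"), "benign"),      # conservative substitution
    (delta_features("L", "A"), "benign"),
]
print(knn_predict(train, delta_features("C", "L")))
```

    With these toy values the conservative C-to-L substitution lands nearest the benign examples, which is the behaviour the feature encoding is meant to capture.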

    Prediction using step-wise L1, L2 regularization and feature selection for small data sets with large number of features

    Get PDF
    Background: Machine learning methods are nowadays used for many biological prediction problems involving drugs, ligands or polypeptide segments of a protein. In order to build a prediction model, a so-called training data set of molecules with measured target properties is needed. For many such problems the size of the training data set is limited, as measurements have to be performed in a wet lab. Furthermore, the considered problems are often complex, such that it is not clear which molecular descriptors (features) may be suitable to establish a strong correlation with the target property. In many applications all available descriptors are used. This can lead to difficult machine learning problems, when thousands of descriptors are considered and only few (e.g. below a hundred) molecules are available for training. Results: The CoEPrA contest provides four data sets, which are typical for biological regression problems (few molecules in the training data set and thousands of descriptors). We applied the same two-step training procedure to all four regression tasks. In the first stage, we used optimized L1 regularization to select the most relevant features. Thus, the initial set of more than 6,000 features was reduced to about 50. In the second stage, we used only the selected features from the preceding stage, applying a milder L2 regularization, which generally yielded further improvement of prediction performance. Our linear model employed a soft loss function which minimizes the influence of outliers. Conclusions: The proposed two-step method showed good results on all four CoEPrA regression tasks. Thus, it may be useful for many other biological prediction problems where only a small number of molecules, described by thousands of descriptors, are available for training.
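    The two-step procedure (L1 for selection, then milder L2 on the survivors) can be sketched as follows. The data are synthetic and the penalty strengths are arbitrary; this is an illustration of the scheme, not the authors' CoEPrA code, and it omits their outlier-robust soft loss.

```python
import numpy as np

# Few samples, many descriptors -- the regime described in the abstract.
rng = np.random.default_rng(0)
n, p, informative = 40, 500, 5
X = rng.normal(size=(n, p))
true_w = np.zeros(p)
true_w[:informative] = 3.0 * rng.normal(size=informative)
y = X @ true_w + 0.1 * rng.normal(size=n)

def lasso_cd(X, y, lam, n_iter=100):
    """L1-regularized least squares via cyclic coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    r = y.copy()                       # running residual y - X @ w
    for _ in range(n_iter):
        for j in range(p):
            r = r + X[:, j] * w[j]     # remove feature j's contribution
            rho = X[:, j] @ r
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r = r - X[:, j] * w[j]     # add the updated contribution back
    return w

def ridge(X, y, lam):
    """Closed-form L2-regularized least squares."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Step 1: strong L1 penalty gives a sparse weight vector -> feature selection.
selected = np.flatnonzero(lasso_cd(X, y, lam=20.0))
# Step 2: milder L2 penalty, refit on the selected features only.
w2 = ridge(X[:, selected], y, lam=1.0)
print(f"kept {len(selected)} of {p} features")
```

    The soft-thresholding update in `lasso_cd` is what drives most weights exactly to zero, which is why the L1 stage doubles as a feature selector.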

    Machine Learning-Assisted Directed Evolution Navigates a Combinatorial Epistatic Fitness Landscape with Minimal Screening Burden

    Get PDF
    Due to screening limitations, in directed evolution (DE) of proteins it is rarely feasible to fully evaluate combinatorial mutant libraries made by mutagenesis at multiple sites. Instead, DE often involves a single-step greedy optimization in which the mutation in the highest-fitness variant identified in each round of single-site mutagenesis is fixed. However, because the effects of a mutation can depend on the presence or absence of other mutations, the efficiency and effectiveness of a single-step greedy walk is influenced by both the starting variant and the order in which beneficial mutations are identified—the process is path-dependent. We recently demonstrated a path-independent machine learning-assisted approach to directed evolution (MLDE) that allows in silico screening of full combinatorial libraries made by simultaneous saturation mutagenesis, thus explicitly capturing the effects of cooperative mutations and bypassing the path-dependence that can limit greedy optimization. Here, we thoroughly investigate and optimize an MLDE workflow by testing a number of design considerations of the MLDE pipeline. Specifically, we (1) test the effects of different encoding strategies on MLDE efficiency, (2) integrate new models and a training procedure more amenable to protein engineering tasks, and (3) incorporate training set design strategies to avoid information-poor low-fitness protein variants (“holes”) in the training data. When applied to an epistatic, hole-filled, four-site combinatorial fitness landscape of protein G domain B1 (GB1), the resulting focused training MLDE (ftMLDE) protocol achieved the global fitness maximum up to 92% of the time at a total screening burden of 470 variants. 
    In contrast, minimal-screening-burden single-step greedy optimization over the GB1 fitness landscape reached the global maximum just 1.2% of the time; ftMLDE matching this minimal screening burden (80 total variants) achieved the global optimum up to 9.6% of the time and achieved a 49% higher expected maximum fitness. To facilitate further development of MLDE, we present the MLDE software package (https://github.com/fhalab/MLDE), which is designed for use by protein engineers without computational or machine-learning expertise.
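    The core MLDE loop (train on a small screened sample, then rank the full combinatorial library in silico) can be sketched as below. Everything here is a stand-in: a synthetic additive landscape over three sites rather than the paper's epistatic four-site GB1 landscape, plain ridge regression rather than the paper's model ensemble, and arbitrary sample sizes.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
AA = "ACDEFGHIKLMNPQRSTVWY"
SITES = 3                                   # three sites, for speed

def one_hot(variant):
    """Flattened one-hot encoding of a variant string."""
    v = np.zeros(SITES * len(AA))
    for i, aa in enumerate(variant):
        v[i * len(AA) + AA.index(aa)] = 1.0
    return v

# Synthetic per-site effects standing in for measured fitness.
site_effects = rng.normal(size=(SITES, len(AA)))
def fitness(variant):
    return float(sum(site_effects[i, AA.index(a)] for i, a in enumerate(variant)))

# The FULL combinatorial library, enumerable in silico.
library = ["".join(c) for c in itertools.product(AA, repeat=SITES)]

# "Wet-lab" budget: measure fitness for a small random training sample only.
train_ids = rng.choice(len(library), size=256, replace=False)
X = np.stack([one_hot(library[i]) for i in train_ids])
y = np.array([fitness(library[i]) for i in train_ids])

# Surrogate model: ridge regression on the one-hot encodings.
w = np.linalg.solve(X.T @ X + 1e-3 * np.eye(X.shape[1]), X.T @ y)

# In-silico screen of all 20^3 = 8,000 variants; carry the top 96 forward.
scores = np.stack([one_hot(v) for v in library]) @ w
top = [library[i] for i in np.argsort(scores)[::-1][:96]]
```

    Because every library member is scored, the procedure is path-independent: no single-site greedy walk is needed, which is the contrast the abstract draws.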

    Benchmarking of protein descriptor sets in proteochemometric modeling (part 2): modeling performance of 13 amino acid descriptor sets.

    Get PDF
    Background: While a large body of work exists on comparing and benchmarking descriptors of molecular structures, a similar comparison of protein descriptor sets is lacking. Hence, in the current work a total of 13 amino acid descriptor sets have been benchmarked with respect to their ability to establish bioactivity models. The descriptor sets included in the study are Z-scales (3 variants), VHSE, T-scales, ST-scales, MS-WHIM, FASGAI, BLOSUM, and a novel protein descriptor set (termed ProtFP (4 variants)); in addition, we created and benchmarked three pairs of descriptor combinations. Prediction performance was evaluated in seven structure-activity benchmarks which comprise Angiotensin Converting Enzyme (ACE) dipeptidic inhibitor data and three proteochemometric data sets, namely (1) GPCR ligands modeled against a GPCR panel, (2) enzyme inhibitors (NNRTIs) with associated bioactivities against a set of HIV enzyme mutants, and (3) enzyme inhibitors (PIs) with associated bioactivities on a large set of HIV enzyme mutants. Results: The amino acid descriptor sets compared here show similar performance (<0.3 log units RMSE difference and <0.7 difference in MCC). Combining different descriptor sets generally leads to better modeling performance than utilizing individual sets. The best performers were Z-scales (3) combined with ProtFP (Feature), or Z-scales (3) combined with an average Z-scale value for each target, while ProtFP (PCA8), ST-scales, and ProtFP (Feature) rank last. Conclusions: While amino acid descriptor sets capture different aspects of amino acids, their ability to be used for bioactivity modeling is still, on average, surprisingly similar. Still, combining sets describing complementary information consistently leads to small improvements in modeling performance (average MCC 0.01 better, average RMSE 0.01 log units lower). 
    Finally, performance differences exist between the targets compared, underlining that choosing an appropriate descriptor set is of fundamental importance for bioactivity modeling, from both the ligand and the protein side.
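    All of the descriptor sets above share the same mechanical form: each residue maps to a fixed numeric vector, and a peptide descriptor is the concatenation of per-residue vectors. A minimal sketch, with a Z-scale-like 3-component table whose numbers are illustrative placeholders (not the published Z-scale values), and a hypothetical ligand descriptor appended to show the proteochemometric combination:

```python
# Toy per-residue table: residue -> (z1, z2, z3). Values are placeholders.
ZSCALES = {
    "A": (0.2, -1.3, 0.1),
    "C": (0.8, -0.6, -1.0),
    "D": (3.9, 0.9, 1.9),
    "E": (3.1, 0.3, -0.4),
    "K": (2.8, 1.5, 0.8),
    "L": (-4.2, -1.0, -1.0),
}

def encode_peptide(seq):
    """Concatenate per-residue descriptor vectors into one feature vector."""
    vec = []
    for aa in seq:
        vec.extend(ZSCALES[aa])
    return vec

desc = encode_peptide("ACDK")
print(len(desc))  # 4 residues x 3 components = 12 features

# In proteochemometrics the protein descriptor is concatenated with a ligand
# descriptor, so one model sees (protein, ligand) pairs:
pcm_x = encode_peptide("ACDK") + [0.5, 1.2]   # hypothetical ligand features
```

    Combining two descriptor sets, as benchmarked above, amounts to concatenating two such tables per residue before encoding.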

    Best practices for machine learning in antibody discovery and development

    Full text link
    Over the past 40 years, the discovery and development of therapeutic antibodies to treat disease has become common practice. However, as therapeutic antibody constructs become more sophisticated (e.g., multi-specifics), conventional approaches to optimisation are increasingly inefficient. Machine learning (ML) promises to open up an in silico route to antibody discovery and to help accelerate the development of drug products with fewer experiments and hence lower cost. Over the past few years, we have observed rapid developments in the field of ML-guided antibody discovery and development (D&D). However, many of the results are difficult to compare or hard to assess for utility by other experts in the field, due to the high diversity in the datasets, evaluation techniques and metrics that are used across industry and academia. This limitation of the literature curtails the broad adoption of ML across the industry and slows down overall progress in the field, highlighting the need to develop standards and guidelines that may help improve the reproducibility of ML models across different research groups. To address these challenges, we set out in this perspective to critically review current practices, explain common pitfalls, and clearly define a set of method development and evaluation guidelines that can be applied to different types of ML-based techniques for therapeutic antibody D&D. Specifically, we address, in an end-to-end analysis, challenges associated with all aspects of the ML process and recommend a set of best practices for each stage.
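    One evaluation pitfall that guidelines of this kind commonly flag is random train/test splitting of sequence data: near-duplicate antibodies end up on both sides of the split and inflate metrics. A minimal sketch of the remedy, similarity-cluster-based splitting, using a naive greedy identity clustering (real pipelines typically use dedicated tools such as MMseqs2 or CD-HIT; the CDR sequences below are invented):

```python
def identity(a, b):
    """Fraction of matching positions, relative to the longer sequence."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def greedy_cluster(seqs, threshold=0.9):
    """Assign each sequence to the first cluster whose representative is at
    least `threshold` identical; otherwise start a new cluster."""
    reps, clusters = [], []
    for s in seqs:
        for i, r in enumerate(reps):
            if identity(s, r) >= threshold:
                clusters[i].append(s)
                break
        else:
            reps.append(s)
            clusters.append([s])
    return clusters

def cluster_split(seqs, test_fraction=0.25):
    """Whole clusters go to train or test -- a cluster is never split."""
    train, test = [], []
    for c in greedy_cluster(seqs):
        (test if len(test) < test_fraction * len(seqs) else train).extend(c)
    return train, test

cdrs = ["ARDYYGSSYFDY", "ARDYYGSSYFDV", "ARDYYGSSYLDY",  # one family
        "AKGGTAPFAY", "AKGGTAPFDY",                      # another family
        "TRWNDYFEH"]                                     # singleton
train, test = cluster_split(cdrs)
```

    After this split, no test sequence has a >=90%-identical relative in the training set, so the reported metric reflects generalization rather than memorization.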

    Hypoxia triggers TAZ phosphorylation in basal A triple negative breast cancer cells

    Get PDF
    Hypoxia and HIF signaling drive cancer progression and therapy resistance, and both have been demonstrated in breast cancer. To what extent breast cancer subtypes differ in their response to hypoxia has not been resolved. Here, we show that hypoxia similarly triggers HIF1 stabilization in luminal and basal A triple negative breast cancer cells, and we use high-throughput targeted RNA sequencing to analyze its effects on gene expression in these subtypes. We focus on regulation of YAP/TAZ/TEAD targets and find overlapping as well as distinct target genes being modulated in luminal and basal A cells under hypoxia. We reveal a HIF1-mediated, basal A-specific response to hypoxia by which TAZ, but not YAP, is phosphorylated at Ser89. While total YAP/TAZ localization is not affected by hypoxia, hypoxia drives a shift of p-TAZ(Ser89)/p-YAP(Ser127) from the nucleus to the cytoplasm in basal A, but not luminal, breast cancer cells. Cell fractionation and YAP knock-out experiments confirm cytoplasmic sequestration of TAZ(Ser89) in hypoxic basal A cells. Pharmacological and genetic interference experiments identify c-Src and CDK3 as kinases involved in such phosphorylation of TAZ at Ser89 in hypoxic basal A cells. Hypoxia attenuates growth of basal A cells, and the effect of verteporfin, a disruptor of YAP/TAZ-TEAD-mediated transcription, is diminished under those conditions, while expression of a TAZ-S89A mutant does not confer basal A cells with a growth advantage under hypoxic conditions, indicating that other hypoxia-regulated pathways suppressing cell growth are dominant.

    Doctor of Philosophy

    Get PDF
    dissertation
    Rapidly evolving technologies such as chip arrays and next-generation sequencing are uncovering human genetic variants at an unprecedented pace. Unfortunately, this ever-growing collection of gene sequence variation has limited clinical utility without clear association to disease outcomes. As electronic medical records begin to incorporate genetic information, gene variant classification and accurate interpretation of gene test results play a critical role in customizing patient therapy. To verify the functional impact of a given gene variant, laboratories rely on confirming evidence such as previous literature reports, patient history and disease segregation in a family. By definition, variants of uncertain significance (VUS) lack this supporting evidence, and in such cases computational tools are often used to evaluate the predicted functional impact of a gene mutation. This study leverages high-quality genotype-phenotype disease variant data from 20 genes and 3,986 variants to develop gene-specific predictors, using changes in primary amino acid sequence and amino acid properties as descriptors of mutation severity, together with Naïve Bayes classification. The resulting Primary Sequence Amino Acid Properties (PSAAP) prediction algorithm was then combined with well-established predictors in a weighted Consensus sum, in the context of gene-specific reference intervals for known phenotypes. PSAAP and Consensus were also used to evaluate known variants of uncertain significance in the RET proto-oncogene as a model gene. The PSAAP algorithm was successfully extended to many genes and diseases. Gene-specific algorithms typically outperform generalized prediction tools; characteristic mutation properties of a given gene and disease may be lost when diluted into genome-wide data sets. 
    A reliable computational phenotype classification framework with quantitative metrics and disease-specific reference ranges allows objective evaluation of novel or uncertain gene variants and augments decision making when confirming clinical information is limited.
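    The combination described above (amino acid property changes as features, Naïve Bayes as the classifier) can be sketched as follows. This is an illustration of the idea only: the two features and the labelled substitutions are invented placeholders, not the dissertation's 20-gene data, and no gene-specific reference intervals or consensus weighting are modelled.

```python
import math

def log_gauss(x, mean, var):
    """Log-density of a univariate Gaussian."""
    return -((x - mean) ** 2) / (2 * var) - 0.5 * math.log(2 * math.pi * var)

def fit_nb(X, y):
    """Per-class feature means and variances, plus class priors."""
    model = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        variances = [max(sum((v - m) ** 2 for v in col) / len(rows), 1e-6)
                     for col, m in zip(zip(*rows), means)]
        model[label] = (means, variances, len(rows) / len(X))
    return model

def predict_nb(model, x):
    """Pick the class with the highest log-posterior."""
    def log_post(label):
        means, variances, prior = model[label]
        return math.log(prior) + sum(
            log_gauss(v, m, s) for v, m, s in zip(x, means, variances))
    return max(model, key=log_post)

# Features per substitution: (delta hydrophobicity, delta charge) -- toy data.
X = [(0.3, 0.0), (0.5, 0.0), (-0.2, 0.0),    # conservative changes
     (4.0, 1.0), (3.5, -1.0), (5.0, 1.0)]    # drastic changes
y = ["benign", "benign", "benign",
     "pathogenic", "pathogenic", "pathogenic"]

model = fit_nb(X, y)
print(predict_nb(model, (0.1, 0.0)))
```

    Working in log space avoids the numerical underflow that multiplying many small Gaussian densities would otherwise cause.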

    Field-based Proteochemometric Models Derived from 3D Protein Structures: A Novel Approach to Visualize Affinity and Selectivity Features

    Get PDF
    Designing drugs that are selective is crucial in pharmaceutical research to avoid unwanted side effects. To decipher selectivity of drug targets, computational approaches that utilize the sequence and structural information of the protein binding pockets are frequently exploited. In addition to methods that rely only on protein information, quantitative approaches such as proteochemometrics (PCM) use the combination of protein and ligand descriptions to derive quantitative relationships with binding affinity. PCM aims to explain cross-interactions between the different proteins and ligands, hence facilitating our understanding of selectivity. The main goal of this dissertation is to develop and apply field-based PCM to improve the understanding of relevant molecular interactions through visual illustrations. Field-based description that depends on the 3D structural information of proteins enhances visual interpretability of PCM models relative to the frequently used sequence-based descriptors for proteins. In these field-based PCM studies, knowledge-based fields that explain polarity and lipophilicity of the binding pockets and WaterMap-derived fields that elucidate the positions and energetics of water molecules are used together with the various 2D / 3D ligand descriptors to investigate the selectivity profiles of kinases and serine proteases. Field-based PCM is first applied to protein kinases, for which designing selective inhibitors has always been a challenge, owing to their highly similar ATP binding pockets. Our studies show that the method could be successfully applied to pinpoint the regions influencing the binding affinity and selectivity of kinases. As an extension of the initial studies conducted on a set of 50 kinases and 80 inhibitors, field-based PCM was used to build classification models on a large dataset (95 kinases and 1572 inhibitors) to distinguish active from inactive ligands. 
    The prediction of the bioactivities of external test-set compounds or kinases with accuracies over 80% (Matthews correlation coefficient, MCC: ~0.50) and area under the ROC curve (AUC) above 0.8, together with the visual inspection of the regions promoting activity, demonstrates the ability of field-based PCM to generate both predictive and visually interpretable models. Further, the application of this method to serine proteases provides an overview of the sub-pocket specificities, which is crucial for inhibitor design. Additionally, alignment-independent Zernike descriptors derived from the fields were used in PCM models to study the influence of protein superimpositions on field comparisons and subsequent PCM modelling.
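    The basic PCM setup underlying this work can be sketched as below: each training example is a (protein, ligand) pair whose feature vector concatenates a protein descriptor block with a ligand descriptor block (plus products to let the model learn cross-interactions), so a single regression model covers the whole protein-ligand matrix. All numbers are synthetic stand-ins; this is not the dissertation's field-based pipeline, in which the protein block would hold grid-sampled field values.

```python
import numpy as np

rng = np.random.default_rng(7)
n_prot, n_lig = 6, 30
prot_desc = rng.normal(size=(n_prot, 12))   # stand-in for field descriptors
lig_desc = rng.normal(size=(n_lig, 8))      # stand-in for 2D/3D descriptors

def pcm_features(p, l):
    """Protein block + ligand block + products for cross-interactions."""
    return np.concatenate([p, l, np.outer(p, l).ravel()])

# Synthetic affinities with a genuine protein-ligand cross term.
w_p, w_l = rng.normal(size=12), rng.normal(size=8)
w_x = 0.3 * rng.normal(size=(12, 8))
def affinity(p, l):
    return p @ w_p + l @ w_l + p @ w_x @ l

pairs = [(i, j) for i in range(n_prot) for j in range(n_lig)]
X = np.stack([pcm_features(prot_desc[i], lig_desc[j]) for i, j in pairs])
y = np.array([affinity(prot_desc[i], lig_desc[j]) for i, j in pairs])

# One ridge regression over the combined descriptor space.
lam = 1e-3
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
pred = X @ w
print(f"train RMSE: {np.sqrt(np.mean((pred - y) ** 2)):.3f}")
```

    The cross-term block is what lets one model explain why a ligand binds one protein but not another, which is the selectivity question the dissertation visualizes through its fields.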