332 research outputs found

    Examples of SAR-centric patent mining using open resources

    Probabilistic Random Forest improves bioactivity predictions close to the classification threshold by taking into account experimental uncertainty.

    Measurements of protein-ligand interactions have reproducibility limits due to experimental errors. Any model based on such assays will consequently have such unavoidable errors influencing its performance, which should ideally be factored into modelling and output predictions, such as the actual standard deviation of experimental measurements (σ) or the associated comparability of activity values between the aggregated heterogeneous activity units (i.e., Ki versus IC50 values) during dataset assimilation. However, experimental errors are usually a neglected aspect of model generation. In order to improve upon the current state-of-the-art, we herein present a novel approach toward predicting protein-ligand interactions using a Probabilistic Random Forest (PRF) classifier. The PRF algorithm was applied toward in silico protein target prediction across ~550 tasks from ChEMBL and PubChem. Predictions were evaluated by taking into account various scenarios of experimental standard deviations in both training and test sets, and performance was assessed using fivefold stratified shuffled splits for validation. The largest benefit of incorporating the experimental deviation in PRF was observed for data points close to the binary threshold boundary, when such information was not considered in any way in the original RF algorithm. For example, in cases when σ ranged between 0.4 and 0.6 log units and when ideal probability estimates were between 0.4 and 0.6, the PRF outperformed RF with a median absolute error margin of ~17%. In comparison, the baseline RF outperformed PRF for cases with high confidence to belong to the active class (far from the binary decision threshold), although the RF models gave errors smaller than the experimental uncertainty, which could indicate that they were overtrained and/or over-confident.
Finally, the PRF models trained with putative inactives performed worse than PRF models without putative inactives; this could be because putative inactives were not assigned an experimental pXC50 value, and therefore they were considered inactives with a low uncertainty (which in practice might not be true). In conclusion, PRF can be useful for target prediction models, in particular for data where class boundaries overlap with the measurement uncertainty and where a substantial part of the training data is located close to the classification threshold.
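
The abstract does not spell out how an "ideal probability estimate" is obtained from a measurement and its σ. One plausible reading, offered here only as a minimal sketch and not as the authors' implementation, is to treat the measured pXC50 as the mean of a Gaussian with standard deviation σ and take the probability mass above the activity threshold. The function name `prob_active` and the example threshold of 5.0 are hypothetical.

```python
import math


def prob_active(pxc50: float, threshold: float, sigma: float) -> float:
    """Probability that the true activity lies at or above the threshold.

    Assumption (not stated in the abstract): measurement error is Gaussian
    with standard deviation `sigma`, in the same log units as `pxc50`.
    """
    if sigma <= 0:
        # No uncertainty: fall back to a hard binary label.
        return 1.0 if pxc50 >= threshold else 0.0
    z = (pxc50 - threshold) / sigma
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# A compound measured exactly at the threshold is maximally ambiguous.
print(prob_active(5.0, threshold=5.0, sigma=0.5))
# A compound two standard deviations above it is almost certainly active.
print(prob_active(6.0, threshold=5.0, sigma=0.5))
```

Under this reading, labels near the threshold become soft values around 0.5 instead of hard 0/1 classes, which is the region where the abstract reports the largest benefit for PRF.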

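    Design and modelling in the early phase of drug development for HIV-1 reverse transcriptase and malaria (Disain ja modelleerimine HIV-1 pöördtranskriptaasi ja Malaaria ravimite väljatöötamise varajases faasis)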

    The electronic version of the dissertation does not include the publications. This thesis focuses on two highly prevalent infections affecting many regions of the world: malaria and human immunodeficiency virus type 1 (HIV-1). Developing a new drug from scratch is a time-consuming and costly process that can be divided into five stages: basic research, lead target and lead compound(s) discovery, preclinical development, clinical development, and filing with the drug administration agency. The thesis focuses on the first two stages, i.e. early drug discovery. For HIV-1, the focus was two-fold. First, based on earlier multi-objective in silico screening, novel s-triazine derivatives were discovered, designed, and synthesized, and the findings were supported by modelling and validated by biological evaluation. The most potent compound has a small molecular size, favourable ligand efficiencies, and low measured toxicity, permitting further exploration and modification. Second, the newly discovered bioactive s-triazines motivated an analysis of the chemical landscape of HIV-1 RT inhibitors. For this, datasets were systematically created and curated for HIV-1 NNRT (non-nucleoside reverse transcriptase) and NRT (nucleoside reverse transcriptase) inhibitors based on data from the ChEMBL database. Hierarchical classification of the scaffold structures in the curated datasets revealed common chemical parent types for the compounds and a hierarchy in the chemical structures, and showed that the discovered s-triazines formed a separate structural parent type group. Each group of compounds related to a parent type was analysed together with the corresponding binding affinity equilibrium constants (Ki). The structural fragments affecting the potency and stability of the compounds were highlighted. The structurally diverse datasets for the HIV-1 NNRTIs and NRTIs with binding affinity equilibrium constants allowed the development of novel descriptive and predictive QSAR models for log Ki, which will help in the design of new compounds in future.
In order to discover promising new antimalarial compounds, experimental anti-Plasmodium data were gathered from in-house experimental studies, expanded with data from the ChEMBL database, and systematically curated. The extracted data were extensively curated, fused, filtered, and grouped into thirty datasets for modelling. Consensus models for the classification of active/inactive compounds were established for each dataset, and seventeen models with promising predictive ability were used in consensus predictions to identify a series of curcuminoids and their structural analogues as potential malaria inhibitors. The selected compounds were experimentally validated, i.e. tested in vitro, revealing seventeen potentially active compounds for further testing and modification. The validation showed that compounds computationally predicted to be inactive were also inactive in experiment, providing additional evidence for the quality of the data curation and dataset assembly process underlying the modelling.

    In Silico Mining for Antimalarial Structure-Activity Knowledge and Discovery of Novel Antimalarial Curcuminoids.

    Malaria is a parasitic tropical disease that kills around 600,000 patients every year. The emergence of Plasmodium falciparum parasites resistant to artemisinin-based combination therapies (ACTs) represents a significant public health threat, indicating the urgent need for new effective compounds to reverse ACT resistance and cure the disease. For this, extensive curation and homogenization of experimental anti-Plasmodium screening data from both in-house and ChEMBL sources were conducted. As a result, a coherent strategy was established that allowed compiling coherent training sets that associate compound structures with the respective antimalarial activity measurements. Seventeen of these training sets led to the successful generation of classification models discriminating whether a compound has a significant probability to be active under the specific conditions of the antimalarial test associated with each set. These models were used in consensus prediction of the most likely actives from a series of curcuminoids available in-house. Positive predictions, together with a few compounds predicted as inactive, were then submitted to experimental in vitro antimalarial testing. A large majority of the compounds predicted as active showed antimalarial activity, whereas those predicted as inactive did not, thus experimentally validating the in silico screening approach. The herein proposed consensus machine learning approach showed its potential to reduce the cost and duration of antimalarial drug discovery.
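
The abstract does not say how the seventeen models were combined. As a rough illustration only, one common form of consensus prediction ranks compounds by the fraction of models that vote "active". The function `consensus_rank`, the 0.5 cutoff, and the compound names below are all hypothetical.

```python
def consensus_rank(
    predictions: dict[str, list[float]], cutoff: float = 0.5
) -> list[tuple[str, float]]:
    """Rank compounds by the fraction of models predicting them active.

    `predictions` maps a compound name to one predicted probability per
    model. The voting rule and cutoff are assumptions for illustration,
    not the scheme used in the paper.
    """
    ranked = []
    for compound, probs in predictions.items():
        votes = sum(p >= cutoff for p in probs)
        ranked.append((compound, votes / len(probs)))
    # Highest agreement first; sorting is stable, so ties keep input order.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Three hypothetical models scoring two hypothetical compounds.
scores = {
    "compound_A": [0.9, 0.8, 0.2],
    "compound_B": [0.1, 0.2, 0.6],
}
for name, agreement in consensus_rank(scores):
    print(name, round(agreement, 2))
```

Compounds at the top of such a ranking would be the candidates for in vitro testing, with a few low-ranked compounds included as negative controls, as the abstract describes.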

    ChEMBL: a large-scale bioactivity database for drug discovery

    ChEMBL is an Open Data database containing binding, functional and ADMET information for a large number of drug-like bioactive compounds. These data are manually abstracted from the primary published literature on a regular basis, then further curated and standardized to maximize their quality and utility across a wide range of chemical biology and drug-discovery research problems. Currently, the database contains 5.4 million bioactivity measurements for more than 1 million compounds and 5200 protein targets. Access is available through a web-based interface, data downloads and web services at: https://www.ebi.ac.uk/chembldb

    Data Sharing in Chemistry: Lessons Learned and a Case for Mandating Structured Reaction Data

    The past decade has seen a number of impressive developments in predictive chemistry and reaction informatics driven by machine learning applications to computer-aided synthesis planning. While many of these developments have been made even with relatively small, bespoke data sets, in order to advance the role of AI in the field at scale, there must be significant improvements in the reporting of reaction data. Currently, the majority of publicly available data is reported in an unstructured format and heavily imbalanced toward high-yielding reactions, which influences the types of models that can be successfully trained. In this Perspective, we analyze several data curation and sharing initiatives that have seen success in chemistry and molecular biology. We discuss several factors that have contributed to their success and how we can take lessons from these case studies and apply them to reaction data. Finally, we spotlight the Open Reaction Database and summarize key actions the community can take toward making reaction data more findable, accessible, interoperable, and reusable (FAIR), including the use of mandates from funding agencies and publishers.

    The IUPHAR/BPS Guide to PHARMACOLOGY in 2016: towards curated quantitative interactions between 1300 protein targets and 6000 ligands

    The IUPHAR/BPS Guide to PHARMACOLOGY (GtoPdb, http://www.guidetopharmacology.org) provides expert-curated molecular interactions between successful and potential drugs and their targets in the human genome. Developed by the International Union of Basic and Clinical Pharmacology (IUPHAR) and the British Pharmacological Society (BPS), this resource, and its earlier incarnation as IUPHAR-DB, is described in our 2014 publication. This update incorporates changes over the intervening seven database releases. The unique model of content capture is based on established and new target class subcommittees collaborating with in-house curators. Most information comes from journal articles, but we now also index kinase cross-screening panels. Targets are specified by UniProtKB IDs. Small molecules are defined by PubChem Compound Identifiers (CIDs); ligand capture also includes peptides and clinical antibodies. We have extended the capture of ligands and targets linked via published quantitative binding data (e.g. Ki, IC50 or Kd). The resulting pharmacological relationship network now defines a data-supported druggable genome encompassing 7% of human proteins. The database also provides an expanded substrate for the biennially published compendium, the Concise Guide to PHARMACOLOGY. This article covers content increase, entity analysis, revised curation strategies, new website features and expanded download options