17 research outputs found
Consideration of predicted small-molecule metabolites in computational toxicology
Xenobiotic metabolism has evolved as a key protective system of organisms against potentially harmful chemicals or compounds typically not present in a particular organism. The system's primary purpose is to chemically transform xenobiotics into metabolites that can be excreted via renal or biliary routes. However, in a minority of cases, the metabolites formed are toxic, sometimes even more toxic than the parent compound. Therefore, the consideration of xenobiotic metabolism clearly is of importance to the understanding of the toxicity of a compound. Nevertheless, most of the existing computational approaches for toxicity prediction do not explicitly take metabolism into account and it is currently not known to what extent the consideration of (predicted) metabolites could lead to an improvement of toxicity prediction. In order to study how predictive metabolism could help to enhance toxicity prediction, we explored a number of different strategies to integrate predictions from a state-of-the-art metabolite structure predictor and from modern machine learning approaches for toxicity prediction. We tested the integrated models on five toxicological endpoints and assays, including in vitro and in vivo genotoxicity assays (AMES and MNT), two organ toxicity endpoints (DILI and DICC) and a skin sensitization assay (LLNA). Overall, the improvements in model performance achieved by including metabolism data were minor (up to +0.04 in the F1 scores and up to +0.06 in MCCs). In general, the best performance was obtained by averaging the probability of toxicity predicted for the parent compound and the maximum probability of toxicity predicted for any metabolite. Moreover, including metabolite structures as further input molecules for model training slightly improved the toxicity predictions obtained by this averaging approach. 
However, the high complexity of the metabolic system and associated uncertainty about the likely metabolites apparently limits the benefit of considering predicted metabolites in toxicity prediction.
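The averaging strategy described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and the fallback for compounds without predicted metabolites are assumptions, and the probabilities are assumed to come from any classifier that outputs a toxicity probability per molecule.

```python
from typing import Sequence


def combined_toxicity_probability(
    parent_prob: float, metabolite_probs: Sequence[float]
) -> float:
    """Average the parent's predicted toxicity probability with the
    maximum probability predicted for any of its metabolites."""
    if not metabolite_probs:
        # Assumption: with no predicted metabolites, use the parent alone.
        return parent_prob
    return 0.5 * (parent_prob + max(metabolite_probs))


# Hypothetical values: a parent predicted as unlikely toxic, with one
# predicted metabolite that the model flags as likely toxic.
score = combined_toxicity_probability(0.25, [0.10, 0.75, 0.30])
print(round(score, 3))
```

Taking the maximum over metabolites means a single suspicious metabolite raises the combined score, while averaging with the parent keeps uncertain metabolite predictions from dominating the result.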
Novel clinical phenotypes, drug categorization, and outcome prediction in drug-induced cholestasis: Analysis of a database of 432 patients developed by literature review and machine learning support
Consideration of predicted small-molecule metabolites in computational toxicology
Exploration of computational approaches for including metabolism information in machine learning models for toxicity prediction.
Predicting the Mitochondrial Toxicity of Small Molecules: Insights from Mechanistic Assays and Cell Painting Data
Mitochondrial toxicity
is a significant concern in the drug discovery
process, as compounds that disrupt the function of these organelles
can lead to serious side effects, including liver injury and cardiotoxicity.
Different in vitro assays exist to detect mitochondrial toxicity at
varying mechanistic levels: disruption of the respiratory chain, disruption
of the membrane potential, or general mitochondrial dysfunction. In
parallel, whole cell imaging assays like Cell Painting provide a phenotypic
overview of the cellular system upon treatment and enable the assessment
of mitochondrial health from cell profiling features. In this study,
we aim to establish machine learning models for the prediction of
mitochondrial toxicity, making the best use of the available data.
For this purpose, we first derived highly curated datasets of mitochondrial
toxicity, including subsets for different mechanisms of action. Due
to the limited amount of labeled data often associated with toxicological
endpoints, we investigated the potential of using morphological features
from a large Cell Painting screen to label additional compounds and
enrich our dataset. Our results suggest that models incorporating
morphological profiles perform better in predicting mitochondrial
toxicity than those trained on chemical structures alone (up to +0.08
and +0.09 mean MCC in random and cluster cross-validation, respectively).
Toxicity labels derived from Cell Painting images improved the predictions
on an external test set up to +0.08 MCC. However, we also found that
further research is needed to improve the reliability of Cell Painting
image labeling. Overall, our study provides insights into the importance
of considering different mechanisms of action when predicting a complex
endpoint like mitochondrial disruption as well as into the challenges
and opportunities of using Cell Painting data for toxicity prediction.
Characterization of the Chemical Space of Known and Readily Obtainable Natural Products
Natural products remain one of the
most productive sources of chemical
inspiration for the development of new drugs. The structures of more
than 250 000 natural products are available from public databases.
At least 10% of these compounds are readily obtainable for experimental
testing from commercial vendors and public research institutions.
While the physicochemical properties of known natural products have
been thoroughly studied and compared to those of drugs and other types
of small molecules, the information available on the content, coverage,
and relevance of individual virtual and physical natural product libraries
is clearly limited. The aim of this study was the development of a
detailed understanding of the coverage of chemical space by known
and readily obtainable natural products and by individual natural
product databases. For this purpose, we compiled comprehensive data
sets of known and readily obtainable natural products from 18 virtual
databases (including the Dictionary of Natural Products), nine physical
libraries, and the Protein Data Bank (PDB). We also developed and
employed an algorithm (“SugarBuster”) for the removal
of sugars and sugar-like moieties, which are generally not in the
focus of interest for drug discovery, from natural products. In addition,
we devised a rule-based approach for the automated classification
of natural products into natural product classes (alkaloids, steroids,
flavonoids, etc.). Among the most important results of this study
is the finding that the readily obtainable natural products are highly
diverse and populate regions of chemical space that are of high relevance
to drug discovery. In some cases, substantial differences in the coverage
of natural product classes and chemical space by the individual databases
are observed. More than 2000 natural products are identified for which
at least one X-ray crystal structure of the compound in complex with
a biomacromolecule is available from the PDB.
Studying and Mitigating the Effects of Data Drifts on ML Model Performance at the Example of Chemical Toxicity Data
Machine learning models are widely applied to predict molecular properties or the biological activity of small molecules on a specific protein. Models can be integrated in a conformal prediction (CP) framework which adds a calibration step to estimate the confidence of the predictions. CP models present the advantage of ensuring a predefined error rate under the assumption that test and calibration set are exchangeable. In cases where the test data have drifted away from the descriptor space of the training data, or where assay setups have changed, this assumption might not be fulfilled and the models are not guaranteed to be valid. In this study, the performance of internally valid CP models when applied to either newer time-split data or to external data was evaluated. In detail, temporal data drifts were analysed based on twelve datasets from the ChEMBL database. In addition, discrepancies between models trained on publicly available data and applied to proprietary data for the liver toxicity and MNT in vivo endpoints were investigated. In most cases, a drastic decrease in the validity of the models was observed when applied to the time-split or external (holdout) test sets. To overcome the decrease in model validity, a strategy for updating the calibration set with data more similar to the holdout set was investigated. Updating the calibration set generally improved the validity, restoring it completely to its expected value in many cases. The restored validity is the first requisite for applying the CP models with confidence. However, the increased validity comes at the cost of a decrease in model efficiency, as more predictions are identified as inconclusive. This study presents a strategy to recalibrate CP models to mitigate the effects of data drifts. Updating the calibration sets without having to retrain the model has proven to be a useful approach to restore the validity of most models.
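The recalibration idea can be illustrated with a small inductive conformal predictor. This is a sketch under stated assumptions, not the method used in the study: the class-conditional (Mondrian) calibration, the nonconformity score (one minus the predicted class probability), and all function names are choices made here for illustration. Efficiency is taken here as the fraction of single-label prediction sets.

```python
from typing import List, Sequence


def conformal_sets(
    cal_probs: Sequence[Sequence[float]],
    cal_labels: Sequence[int],
    test_probs: Sequence[Sequence[float]],
    significance: float = 0.2,
) -> List[List[bool]]:
    """Return, per test compound, which classes enter the prediction set."""
    n_classes = len(cal_probs[0])
    sets = [[False] * n_classes for _ in test_probs]
    for c in range(n_classes):
        # Nonconformity scores of calibration compounds whose true class is c.
        cal_scores = [1.0 - p[c] for p, y in zip(cal_probs, cal_labels) if y == c]
        for i, probs in enumerate(test_probs):
            score = 1.0 - probs[c]
            n_geq = sum(s >= score for s in cal_scores)
            p_value = (n_geq + 1) / (len(cal_scores) + 1)
            sets[i][c] = p_value > significance
    return sets


def validity(sets: Sequence[Sequence[bool]], labels: Sequence[int]) -> float:
    """Fraction of prediction sets that contain the true label."""
    return sum(s[y] for s, y in zip(sets, labels)) / len(labels)


def efficiency(sets: Sequence[Sequence[bool]]) -> float:
    """Fraction of prediction sets containing exactly one label."""
    return sum(sum(s) == 1 for s in sets) / len(sets)


# Hypothetical model outputs: [P(class 0), P(class 1)] per compound.
cal_probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.2, 0.8]]
cal_labels = [0, 0, 1, 1]
test_probs = [[0.95, 0.05]]
sets = conformal_sets(cal_probs, cal_labels, test_probs, significance=0.4)
print(sets, validity(sets, [0]), efficiency(sets))
```

In this sketch, recalibrating amounts to calling `conformal_sets` again with `cal_probs` and `cal_labels` taken from compounds closer to the holdout data, while the underlying model that produced the probabilities stays unchanged.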
