4 research outputs found
The development of models to predict melting and pyrolysis point data associated with several hundred thousand compounds mined from PATENTS
BACKGROUND: Melting point (MP) is an important property with regard to the solubility of chemical compounds. Its prediction from chemical structure remains a highly challenging task for quantitative structure-activity relationship studies. Success in this area of research critically depends on the availability of high quality MP data as well as accurate chemical structure representations in order to develop models. Currently, available datasets for MP predictions have been limited to around 50k molecules while much more data are routinely generated following the synthesis of novel materials. Significant amounts of MP data are freely available within the patent literature and, if they were available in the appropriate form, could potentially be used to develop predictive models. RESULTS: We have developed a pipeline for the automated extraction and annotation of chemical data from published PATENTS. Almost 300,000 data points have been collected and used to develop models to predict melting and pyrolysis (decomposition) points using tools available on the OCHEM modeling platform (http://ochem.eu). A number of technical challenges were simultaneously solved to develop models based on these data. These included the handling of sparse data matrices with >200,000,000,000 entries and parallel calculations using 32 × 6 cores per task using 13 descriptor sets totaling more than 700,000 descriptors. We showed that models developed using data collected from PATENTS had similar or better prediction accuracy compared to the highly curated data used in previous publications. The separation of data for chemicals that decomposed rather than melting, from compounds that did undergo a normal melting transition, was performed and models for both pyrolysis and MPs were developed. The accuracy of the consensus MP models for molecules from the drug-like region of chemical space was similar to their estimated experimental accuracy, 32 °C.
Last but not least, important structural features related to the pyrolysis of chemicals were identified, and a model to predict whether a compound will decompose instead of melting was developed. CONCLUSIONS: We have shown that automated tools for the analysis of chemical information have reached a mature stage allowing for the extraction and collection of high quality data to enable the development of structure-activity relationship models. The developed models and data are publicly available at http://ochem.eu/article/99826
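The scale quoted in the abstract above can be checked with simple arithmetic. The sketch below is illustrative only and is not the OCHEM implementation: the rounded sizes come from the abstract, while the 4-byte value size and the helper names `dense_size_gb`, `to_sparse_row` and `sparse_dot` are assumptions made for the example.

```python
# Illustrative only: sizes are rounded from the abstract, not exact counts.
n_compounds = 300_000      # "almost 300,000 data points"
n_descriptors = 700_000    # "more than 700,000 descriptors"

# Number of cells in the full compound-by-descriptor matrix.
dense_entries = n_compounds * n_descriptors


def dense_size_gb(entries, bytes_per_value=4):
    """Memory for a dense matrix, assuming 4-byte values (an assumption)."""
    return entries * bytes_per_value / 1e9


def to_sparse_row(values):
    """Keep only the non-zero descriptor values as {column_index: value}."""
    return {j: v for j, v in enumerate(values) if v != 0}


def sparse_dot(row, weights):
    """Dot product of a sparse row with a dense list of weights."""
    return sum(v * weights[j] for j, v in row.items())
```

With these rounded sizes the full matrix has 2.1 × 10^11 cells, consistent with the ">200,000,000,000 entries" quoted above, which is why storing only the non-zero values matters at this scale.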
A Computational perspective on the concerted cleavage mechanism of the natural targets of HIV-1 protease.
Doctoral Degree. University of KwaZulu-Natal, Durban. One infectious disease that has had both a profound health and cultural impact on the human race
in recent decades is the Acquired Immune Deficiency Syndrome (AIDS) caused by the Human
Immunodeficiency Virus (HIV). A major breakthrough in the treatment of HIV-1 was the use of
drugs inhibiting specific enzymes necessary for the replication of the virus. Among these
enzymes is HIV-1 protease (PR), which is an important degrading enzyme necessary for the
proteolytic cleavage of the Gag and Gag-Pol polyproteins, required for the development of
mature virion proteins. The mechanism of action of the HIV-1 PR on the proteolysis of these
polyproteins has been a subject of research over the past three decades.
Most investigations on this subject have been dedicated to exploring the reaction mechanism of
HIV-1 PR on its targets as a stepwise general acid-base process with little attention on a
concerted model. One of the shortcomings of the stepwise reaction pathway is the existence of
more than two TS moieties, which has led to varying opinions on the exact rate-determining
step of the reaction and the protonation pattern of the catalytic aspartate group at the HIV-1 PR
active site. Also, there is no consensus on the actual recognition mechanism of the natural
substrates by the HIV-1 PR.
By means of concerted transition state (TS) structural models, the recognition mode and the
reaction mechanism of HIV-1 PR with its natural targets were investigated in this present study.
The investigation was designed to elucidate the cleavage of natural substrates by HIV-1 PR using
the concerted TS model through the application of computational methods to unravel the
recognition and reaction process, compute activation parameters and elucidate quantum chemical
properties of the system.
Quantum mechanics (QM) methods including the density functional theory (DFT) models and
Hartree-Fock (HF), molecular mechanics (MM) and hybrid QM/MM were employed to provide
better insight into this topic. Based on experience with concerted TS modelling, the six-membered
ring TS structure was proposed. Using a small model system and QM methods (DFT and HF),
the enzymatic mechanism of HIV-1 PR was studied as a general acid-base model with both
catalytic aspartate groups participating and a water molecule attacking the natural substrate
synchronously. The natural substrate scissile bond strength was also investigated via changes of
electronic effects. The proposed concerted six-membered ring TS mechanism of the natural
substrate within the entire enzyme was studied using the hybrid QM/MM “Our own N-layered Integrated molecular Orbital and molecular Mechanics” (ONIOM) method. This investigation
led us to a new perspective in which an acyclic concerted pathway provided a better approach to
the subject than the proposed six-membered model. The natural substrate recognition pattern
was therefore investigated using concerted acyclic TS modelling with the ONIOM approach to
examine whether HIV-1 PRs (South Africa subtype C, C-SA, and subtype B) recognize their
substrates in the same manner.
A major outcome in the present investigation is the computational modelling of a new,
potentially active, substrate-based inhibitor through the six-membered concerted cyclic TS
modelling and a small system. By modelling the entire enzyme-substrate system using a
hybrid QM/MM (ONIOM) method, three different pathways were obtained: (1) a concerted
acyclic TS structure, (2) a concerted six-membered cyclic TS model and (3) another six-membered
ring TS model involving two water molecules. The activation free energies obtained
for the first and the last pathways were in agreement with in vitro HIV-1 PR hydrolysis data.
The mechanism that provides marginally the lowest activation barrier involves an acyclic TS
model with one water molecule at the HIV-1 PR active site. The outcome of the study provides
a plausible theoretical benchmark for the concerted enzymatic mechanism of HIV-1 PRs which
could be applied to related homodimeric protease and perhaps other enzymatic processes.
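Computed activation free energies are usually compared with measured rate constants through the Eyring equation of transition state theory, k = (k_B T / h) exp(-ΔG‡ / RT). The abstract does not state how the comparison with in vitro data was made, so the sketch below is only a generic illustration of that standard relation; the function names and the default temperature of 298.15 K are assumptions.

```python
import math

K_B = 1.380649e-23      # Boltzmann constant, J/K
H = 6.62607015e-34      # Planck constant, J*s
R = 1.98720425864e-3    # gas constant, kcal/(mol*K)


def eyring_rate(dg_kcal_mol, temperature_k=298.15):
    """Rate constant (1/s) for an activation free energy given in kcal/mol."""
    prefactor = K_B * temperature_k / H
    return prefactor * math.exp(-dg_kcal_mol / (R * temperature_k))


def eyring_barrier(rate_per_s, temperature_k=298.15):
    """Inverse relation: activation free energy (kcal/mol) from a rate constant."""
    prefactor = K_B * temperature_k / H
    return -R * temperature_k * math.log(rate_per_s / prefactor)
```

Because the rate depends exponentially on the barrier, two pathways whose activation free energies differ only marginally, as reported above, would be expected to have rates of similar magnitude.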
Applying the one-step concerted acyclic catalytic mechanism for two HIV-1 PR subtypes, the
recognition phenomena of both enzyme and substrate were studied. It was observed that the
studied HIV-1 PR subtypes (B and C-SA) recognize and cleave at both scissile and non-scissile
regions of the natural substrate sequences while maintaining preferential specificity for the scissile
bonds with characteristic lower activation free energies.
Future studies on the reaction mechanism of HIV-1 PR and natural substrates should involve the
application of advanced computational techniques to provide plausible answers to some
unresolved perspectives. Theoretical investigations of the enzymatic mechanism of HIV-1 PR with its
natural substrates in years to come will likely involve the application of sophisticated
computational techniques aimed at exploring more than the energetics of the system. The
possibility of integrated computational algorithms which do not involve
partitioned, restrained, constrained or cropped model systems of the enzyme-substrate
mechanism will likely surface in future to accurately elucidate the HIV-1 PR catalytic process on natural substrates/ligands.
Contributions to evaluation of machine learning models. Applicability domain of classification models
Artificial intelligence (AI) and machine learning (ML) present some application opportunities and
challenges that can be framed as learning problems. The performance of machine learning models
depends on algorithms and the data. Moreover, learning algorithms create a model of reality by
learning from and testing on data, and their performance reflects the degree of agreement between their
assumed model and reality. ML algorithms have been successfully used in numerous classification
problems. With the developing popularity of using ML models for many purposes in different domains,
the validation of such predictive models is currently required more formally. Traditionally, there are
many studies related to model evaluation, robustness, reliability, and the quality of the data and the
data-driven models. However, those studies do not consider the concept of the applicability domain
(AD) yet. The issue is that the AD is not often well defined, or it is not defined at all in many fields. This
work investigates the robustness of ML classification models from the applicability domain
perspective. A standard definition of the applicability domain is the region of the input space in which the model
provides results with a specified reliability.
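One simple way to make this definition concrete is a distance-based check: a new instance counts as inside the domain if it lies no farther from the training centroid than the most distant training instance. This is a minimal sketch of one common generic approach, not the method used in the thesis, whose AD definition is not given in this abstract; all function names are hypothetical.

```python
import math


def centroid(rows):
    """Column-wise mean of the training feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fit_domain(train_rows):
    """Return (centroid, radius); radius is the largest training distance."""
    c = centroid(train_rows)
    radius = max(distance(r, c) for r in train_rows)
    return c, radius


def in_domain(row, domain):
    """True if the instance lies within the fitted radius of the centroid."""
    c, radius = domain
    return distance(row, c) <= radius
```

A classifier's accuracy can then be reported separately for instances inside and outside the domain, which is the kind of reliability statement the definition above calls for.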
The main aim of this study is to investigate the connection between the applicability domain approach
and the classification model performance. We are examining the usefulness of assessing the AD for
the classification model, i.e. the reliability, reuse, and robustness of classifiers. The work is organized into
three approaches: firstly, assessing
the applicability domain for the classification model; secondly, investigating the robustness of the
classification model based on the applicability domain approach; thirdly, selecting an optimal model
using Pareto optimality. The experiments in this work are illustrated by considering different machine
learning algorithms for binary and multi-class classifications for healthcare datasets from public
benchmark data repositories. In the first approach, the decision trees algorithm (DT) is used for the
classification of data in the classification stage. The feature selection method is applied to choose
features for classification. The obtained classifiers are used in the third approach for selection of
models using Pareto optimality. The second approach is implemented in three steps: namely,
building the classification model, generating synthetic data, and evaluating the obtained results.
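Model selection by Pareto optimality, as in the third approach, can be sketched as follows. The abstract does not state which objectives are traded off or in which direction, so this example assumes every score is "higher is better"; the model names and scores in the usage example are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b on every score and better on one."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_front(candidates):
    """Names of candidates that no other candidate dominates.

    candidates maps a model name to a tuple of scores (higher is better).
    """
    front = []
    for name, scores in candidates.items():
        dominated = any(
            dominates(other_scores, scores)
            for other, other_scores in candidates.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Hypothetical models scored on two objectives.
models = {"A": (0.98, 0.60), "B": (0.94, 0.80), "C": (0.90, 0.70)}
selected = pareto_front(models)
```

Here model C is dropped because B is better on both scores, while A and B remain because each wins on one objective.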
The results obtained from the study provide an understanding of how the proposed approach can help
to define the model’s robustness and the applicability domain, for providing reliable outputs. These
approaches open opportunities for classification data and model management. The proposed
algorithms are implemented through a set of experiments on classification accuracy of instances,
which fall in the domain of the model. For the first approach, by considering all the features, the
highest accuracy obtained is 0.98, with a thresholds average of 0.34 for the Breast Cancer dataset. After
applying the recursive feature elimination (RFE) method, the accuracy is 0.96 with a thresholds
average of 0.27. For the robustness of the classification model based on the applicability domain approach,
the minimum accuracy is 0.62 for the Indian Liver Patient data at r=0.10, and the maximum accuracy is
0.99 for the Thyroid dataset at r=0.10. For the selection of an optimal model using Pareto optimality,
the optimally selected classifier gives an accuracy of 0.94 with a thresholds average of 0.35.
This research investigates critical aspects of the applicability domain as related to the robustness of
classification ML algorithms. However, the performance of machine learning techniques depends on
the degree of reliable predictions of the model. In the literature, the robustness of the ML model can
be defined as the ability of the model to provide the testing error close to the training error. Moreover,
the properties can describe the stability of the model performance when being tested on the new
datasets. In conclusion, this thesis introduced the concept of applicability domain for classifiers and
tested the use of this concept with some case studies on health-related public benchmark datasets. Ministry of Higher Education in Liby
The perspectives of computational chemistry modeling.
On-line tools for computational chemistry modeling will be increasingly used in the future. This will bring advantages for both authors and readers.