ChatGPT Chemistry Assistant for Text Mining and Prediction of MOF Synthesis
We use prompt engineering to guide ChatGPT in the automation of text mining
of metal-organic frameworks (MOFs) synthesis conditions from diverse formats
and styles of the scientific literature. This effectively mitigates ChatGPT's
tendency to hallucinate information -- an issue that previously made the use of
Large Language Models (LLMs) in scientific fields challenging. Our approach
involves the development of a workflow implementing three different processes
for text mining, programmed by ChatGPT itself. All of them enable parsing,
searching, filtering, classification, summarization, and data unification with
different tradeoffs between labor, speed, and accuracy. We deploy this system
to extract 26,257 distinct synthesis parameters pertaining to approximately 800
MOFs sourced from peer-reviewed research articles. This process incorporates
our ChemPrompt Engineering strategy to instruct ChatGPT in text mining,
resulting in impressive precision, recall, and F1 scores of 90-99%.
Furthermore, with the dataset built by text mining, we constructed a
machine-learning model with over 86% accuracy in predicting MOF experimental
crystallization outcomes and preliminarily identifying important factors in MOF
crystallization. We also developed a reliable data-grounded MOF chatbot to
answer questions on chemical reactions and synthesis procedures. Given that
this ChatGPT-based process reliably mines and tabulates diverse MOF synthesis
information in a unified format, using only narrative language and requiring
no coding expertise, we anticipate that our ChatGPT Chemistry Assistant will be
very useful across various other chemistry sub-disciplines.
Comment: Published in the Journal of the American Chemical Society (2023); 102
pages (18-page manuscript, 84 pages of supporting information).
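The reported precision, recall, and F1 scores can be illustrated with a minimal sketch (not the authors' code) that scores mined synthesis parameters against a hand-labeled gold set; the MOF names and parameter triples below are purely illustrative:

```python
# Minimal sketch, assuming extracted and gold data are comparable as
# (MOF, parameter, value) triples. All entries below are made up.

def prf1(extracted, gold):
    """Precision, recall, and F1 for two sets of (MOF, parameter, value) triples."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)                      # correctly mined entries
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("MOF-5", "temperature", "100 C"),
        ("MOF-5", "time", "24 h"),
        ("ZIF-8", "solvent", "methanol")}
mined = {("MOF-5", "temperature", "100 C"),
         ("ZIF-8", "solvent", "methanol"),
         ("ZIF-8", "time", "1 h")}                  # one spurious entry

p, r, f = prf1(mined, gold)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.67 0.67 0.67
```

In practice the evaluation would compare tabulated parameters per paper against expert annotations, but the metric arithmetic is exactly this.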
Learning to Evolve Structural Ensembles of Unfolded and Disordered Proteins Using Experimental Solution Data
We have developed a generative recurrent neural network (GRNN) that learns
the probability of the next residue's torsions,
$X_{i+1}=[\phi_{i+1},\psi_{i+1},\omega_{i+1},\chi_{i+1}]$, conditioned on the
previous residue $X_i$, to generate new IDP conformations. In addition, we couple
the GRNN with a Bayesian model, X-EISD, in a reinforcement learning step that
biases the probability distributions of torsions to take advantage of
experimental data types such as J-couplings, NOEs, and PREs. We show that
updating the generative model parameters according to the reward feedback on
the basis of the agreement between structures and data improves upon existing
approaches that simply reweight static structural pools for disordered
proteins. Instead, the GRNN "DynamICE" model learns to physically change the
conformations of the underlying pool to those that better agree with
experiment.
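The reward-feedback step can be pictured with a toy sketch (not the DynamICE code): a categorical distribution over torsion bins is biased toward bins whose conformers agree better with experimental data. The bin layout, reward function, and target value here are all hypothetical placeholders:

```python
import math

# Toy illustration of reward-biased torsion sampling. Everything below
# (bins, reward shape, target, learning rate) is a hypothetical stand-in
# for the experimental-agreement reward described in the abstract.

bins = [-150, -90, -30, 30, 90, 150]       # torsion bin centers (degrees)
logits = [0.0] * len(bins)                 # uniform prior over bins

def reward(torsion, target=-75.0, scale=30.0):
    """Hypothetical agreement score: larger when torsion is near target."""
    return math.exp(-((torsion - target) / scale) ** 2)

lr = 1.0                                   # learning rate for the update
for i, b in enumerate(bins):
    logits[i] += lr * reward(b)            # bias the distribution by reward

z = sum(math.exp(l) for l in logits)
probs = [math.exp(l) / z for l in logits]  # renormalized distribution
best = bins[probs.index(max(probs))]
print(best)  # -90 (the bin nearest the -75 degree target gains the most mass)
```

The actual model updates GRNN parameters rather than a flat table of logits, but the direction of the update is the same: probability mass moves toward torsions that improve agreement with the data.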
Local Disordered Region Sampling (LDRS) for ensemble modeling of proteins with experimentally undetermined or low confidence prediction segments.
SUMMARY: The Local Disordered Region Sampling (LDRS, pronounced loaders) tool is a new module developed for IDPConformerGenerator, a previously validated approach to model intrinsically disordered proteins (IDPs). The IDPConformerGenerator LDRS module provides a method for generating all-atom conformations of intrinsically disordered protein regions at N- and C-termini of, and in loops or linkers between, folded regions of an existing protein structure. These disordered elements often lead to missing coordinates in experimental structures or low confidence in predicted structures. Requiring only a pre-existing PDB- or mmCIF-formatted structural template of the protein with missing coordinates or with predicted confidence scores, and its full-length primary sequence, LDRS will automatically generate physically meaningful conformational ensembles of the missing flexible regions to complete the full-length protein. The capabilities of the LDRS tool of IDPConformerGenerator include modeling phosphorylation sites using enhanced Monte Carlo-Side Chain Entropy, transmembrane proteins within an all-atom bilayer, and multi-chain complexes. The modeling capacity of LDRS capitalizes on the modularity, the ability to be used as a library and via the command line, and the computational speed of the IDPConformerGenerator platform.
AVAILABILITY AND IMPLEMENTATION: The LDRS module is part of the IDPConformerGenerator modeling suite, which can be downloaded from GitHub at https://github.com/julie-forman-kay-lab/IDPConformerGenerator. IDPConformerGenerator is written in Python 3 and works on Linux, Microsoft Windows, and macOS versions that support DSSP. Users can utilize LDRS's Python API for scripting in the same way they can use any part of IDPConformerGenerator's API, by importing functions from the idpconfgen.ldrs_helper library. Otherwise, LDRS can be used as a command-line interface application within IDPConformerGenerator.
Full documentation is available within the command-line interface as well as on IDPConformerGenerator's official documentation pages (https://idpconformergenerator.readthedocs.io/en/latest/).
Learning Correlations between Internal Coordinates to Improve 3D Cartesian Coordinates for Proteins.
We consider a generic representation problem of internal coordinates (bond lengths, valence angles, and dihedral angles) and their transformation to 3-dimensional Cartesian coordinates of a biomolecule. We show that the internal-to-Cartesian process relies on correctly predicting chemically subtle correlations among the internal coordinates themselves, and learning these correlations increases the fidelity of the Cartesian representation. We developed a machine learning algorithm, Int2Cart, to predict bond lengths and bond angles from backbone torsion angles and residue types of a protein, which allows reconstruction of protein structures better than using fixed bond lengths and bond angles, or a static library method that relies on backbone torsion angles and residue types in a local environment. The method can also be used for structure validation, as we show that the agreement between Int2Cart-predicted bond geometries and those from an AlphaFold 2 model can be used to estimate model quality. Additionally, by using Int2Cart to reconstruct an IDP ensemble, we are able to decrease the clash rate during modeling. The Int2Cart algorithm has been implemented as a publicly accessible Python package at https://github.com/THGLab/int2cart.
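The internal-to-Cartesian step that Int2Cart feeds can be sketched with the standard extension-frame placement (this is the generic geometric construction, not the Int2Cart code): each new atom is positioned from a bond length, a bond angle, and a dihedral angle relative to three previously placed atoms. The coordinates and geometry values below are illustrative:

```python
import math

# Generic internal-to-Cartesian atom placement: given atoms A, B, C,
# place atom D at a given bond length |CD|, bond angle, and dihedral.

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def cross(a, b):
    return [a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0]]

def unit(a):
    n = math.sqrt(sum(x * x for x in a))
    return [x / n for x in a]

def place_atom(a, b, c, length, angle, dihedral):
    """Cartesian position of atom D from A, B, C and internal coordinates
    (angle and dihedral in radians)."""
    bc = unit(sub(c, b))
    n = unit(cross(sub(b, a), bc))        # normal of the A-B-C plane
    m = cross(n, bc)                      # completes the local orthonormal frame
    d = [-length * math.cos(angle),
         length * math.sin(angle) * math.cos(dihedral),
         length * math.sin(angle) * math.sin(dihedral)]
    return [c[i] + d[0]*bc[i] + d[1]*m[i] + d[2]*n[i] for i in range(3)]

# Illustrative placement with roughly peptide-like geometry values.
d = place_atom([0.0, 0.0, 0.0], [1.45, 0.0, 0.0], [2.0, 1.4, 0.0],
               1.52, math.radians(111.0), math.radians(180.0))
print(round(math.dist(d, [2.0, 1.4, 0.0]), 2))  # 1.52, the requested bond length
```

Chaining this step along a backbone turns a list of internal coordinates into Cartesian coordinates; Int2Cart's contribution is predicting the bond lengths and angles fed into it, rather than fixing them at idealized values.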
Supplemental Archive for LDRS
Supplemental archive for the manuscript: Local Disordered Region Sampling (LDRS) for Ensemble Modeling of Proteins with Experimentally Undetermined or Low Confidence Prediction Segments.
Protein Dynamics to Define and Refine Disordered Protein Ensembles
Intrinsically disordered proteins and unfolded proteins have fluctuating conformational ensembles that are fundamental to their biological function and impact protein folding, stability, and misfolding. Despite the importance of protein dynamics and conformational sampling, time-dependent data types are not fully exploited when defining and refining disordered protein ensembles. Here we introduce a computational framework using an elastic network model and normal-mode displacements to generate a dynamic disordered ensemble consistent with NMR-derived dynamics parameters, including transverse R2 relaxation rates and Lipari-Szabo order parameters (S2 values). We illustrate our approach using the unfolded state of the drkN SH3 domain to show that the dynamical ensembles give better agreement than a static ensemble for a wide range of experimental validation data including NMR chemical shifts, J-couplings, nuclear Overhauser effects, paramagnetic relaxation enhancements, residual dipolar couplings, hydrodynamic radii, single-molecule fluorescence Förster resonance energy transfer, and small-angle X-ray scattering.
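The normal-mode displacement idea can be pictured with a schematic sketch (not the authors' code): ensemble members are generated by displacing a conformer along unit-normalized low-frequency mode vectors with randomly drawn amplitudes. The toy conformer, mode shape, and amplitude distribution below are placeholders:

```python
import math
import random

# Schematic: displace a conformer along one placeholder normal mode to
# generate ensemble members. Real elastic-network modes come from
# diagonalizing the network's Hessian; this sine-shaped mode is made up.

random.seed(0)
n_atoms = 4
coords = [[float(i), 0.0, 0.0] for i in range(n_atoms)]   # toy conformer

# One placeholder low-frequency mode, unit-normalized over all 3N components.
mode = [[0.0, math.sin(math.pi * i / (n_atoms - 1)), 0.0]
        for i in range(n_atoms)]
s = math.sqrt(sum(x * x for atom in mode for x in atom))
mode = [[x / s for x in atom] for atom in mode]

ensemble = []
for _ in range(5):                         # five displaced ensemble members
    amp = random.gauss(0.0, 1.0)           # random amplitude along the mode
    ensemble.append([[c + amp * m for c, m in zip(atom, mvec)]
                     for atom, mvec in zip(coords, mode)])
print(len(ensemble), len(ensemble[0]))
```

In the actual framework the displaced ensembles are then filtered or reweighted for consistency with the NMR-derived R2 and S2 values; here only the displacement step is shown.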
A benchmark dataset for Hydrogen Combustion.
The generation of reference data for deep learning models is challenging for reactive systems, and more so for combustion reactions due to the extreme conditions that create radical species and alternative spin states during the combustion process. Here, we extend intrinsic reaction coordinate (IRC) calculations with ab initio MD simulations and normal mode displacement calculations to more extensively cover the potential energy surface for 19 reaction channels for hydrogen combustion. A total of ∼290,000 potential energies and ∼1,270,000 nuclear force vectors are evaluated with a high-quality range-separated hybrid density functional, ωB97X-V, to construct the reference dataset, including transition state ensembles, for the deep learning models to study hydrogen combustion reactions.