ChatGPT Chemistry Assistant for Text Mining and Prediction of MOF Synthesis
We use prompt engineering to guide ChatGPT in the automation of text mining
of metal-organic frameworks (MOFs) synthesis conditions from diverse formats
and styles of the scientific literature. This effectively mitigates ChatGPT's
tendency to hallucinate information -- an issue that previously made the use of
Large Language Models (LLMs) in scientific fields challenging. Our approach
involves the development of a workflow implementing three different processes
for text mining, programmed by ChatGPT itself. All of them enable parsing,
searching, filtering, classification, summarization, and data unification with
different tradeoffs between labor, speed, and accuracy. We deploy this system
to extract 26,257 distinct synthesis parameters pertaining to approximately 800
MOFs sourced from peer-reviewed research articles. This process incorporates
our ChemPrompt Engineering strategy to instruct ChatGPT in text mining,
resulting in impressive precision, recall, and F1 scores of 90-99%.
Furthermore, with the dataset built by text mining, we constructed a
machine-learning model with over 86% accuracy in predicting MOF experimental
crystallization outcomes and preliminarily identifying important factors in MOF
crystallization. We also developed a reliable data-grounded MOF chatbot to
answer questions on chemical reactions and synthesis procedures. Given that the
process of using ChatGPT reliably mines and tabulates diverse MOF synthesis
information in a unified format, while using only narrative language requiring
no coding expertise, we anticipate that our ChatGPT Chemistry Assistant will be
very useful across various other chemistry sub-disciplines.
Comment: Published in the Journal of the American Chemical Society (2023); 102
pages (18-page manuscript, 84 pages of supporting information)
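The precision, recall, and F1 figures quoted above can be illustrated with a minimal sketch. It assumes extraction output is scored as a set of (MOF, parameter, value) triples against a hand-labelled reference set; the function name and the example triples are hypothetical and are not taken from the paper.

```python
def precision_recall_f1(extracted, reference):
    """Score a set of extracted items against a hand-labelled reference set."""
    extracted, reference = set(extracted), set(reference)
    true_pos = len(extracted & reference)          # items found and correct
    precision = true_pos / len(extracted) if extracted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical triples, invented purely for illustration.
reference = {
    ("MOF-A", "temperature", "120 C"),
    ("MOF-A", "time", "24 h"),
    ("MOF-A", "solvent", "DMF"),
    ("MOF-B", "temperature", "85 C"),
}
extracted = {
    ("MOF-A", "temperature", "120 C"),
    ("MOF-A", "time", "24 h"),
    ("MOF-B", "temperature", "85 C"),
    ("MOF-B", "time", "12 h"),      # not in the reference: a false positive
}

p, r, f1 = precision_recall_f1(extracted, reference)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Exact-match scoring of triples is only one possible convention; a real evaluation would also have to decide how to treat unit variants and partially correct values.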
GPT-4 Reticular Chemist for MOF Discovery
We present a new framework integrating the AI model GPT-4 into the iterative
process of reticular chemistry experimentation, leveraging a cooperative
workflow of interaction between AI and a human apprentice. This GPT-4 Reticular
Chemist is an integrated system composed of three phases. Each of these
utilizes GPT-4 in various capacities, wherein GPT-4 provides detailed
instructions for chemical experimentation and the apprentice provides feedback
on the experimental outcomes, including both successes and failures, for the
in-text learning of AI in the next iteration. This iterative human-AI
interaction enabled GPT-4 to learn from the outcomes, much like an experienced
chemist, by a prompt-learning strategy. Importantly, the system is based on
natural language for both development and operation, eliminating the need for
coding skills and thus making it accessible to all chemists. Our GPT-4
Reticular Chemist demonstrated the discovery of an isoreticular series of
metal-organic frameworks (MOFs), each of which was made using distinct
synthesis strategies and optimal conditions. This workflow presents a potential
for broader applications in scientific research by harnessing the capability of
large language models like GPT-4 to enhance the feasibility and efficiency of
research activities.
Comment: 163 pages (an 8-page manuscript and 155 pages of supporting
information)
System log pre-processing to improve failure prediction
Log preprocessing, a process applied to the raw log before applying a predictive method, is of paramount importance to failure prediction and diagnosis. While existing filtering methods have demonstrated good compression rates, they fail to preserve important failure patterns that are crucial for failure analysis. To address the problem, in this paper we present a log preprocessing method. It consists of three integrated steps: (1) event categorization to uniformly classify system events and identify fatal events; (2) event filtering to remove temporally and spatially redundant records, while also preserving necessary failure patterns for failure analysis; (3) causality-related filtering to combine correlated events for filtering through apriori association rule mining. We demonstrate the effectiveness of our preprocessing method by using real failure logs collected from the Cray XT4 at ORNL and the Blue Gene/L system at SDSC. Experiments show that our method can preserve more failure patterns for failure analysis, thereby improving failure prediction by up to 174%
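Step (2) above can be illustrated with a deliberately naive sketch of threshold-based temporal and spatial filtering. This is not the paper's method: it omits the pattern-preserving logic and the association-rule step, and the `Event` fields, function names, and 60-second window are all assumptions made for illustration.

```python
from typing import NamedTuple


class Event(NamedTuple):
    time: float      # seconds since the start of the log
    location: str    # node or component that reported the event
    category: str    # label assigned by an earlier categorization step


def temporal_filter(events, window):
    """Drop repeats of the same category at the same location within `window` seconds."""
    last_seen = {}
    kept = []
    for ev in sorted(events, key=lambda e: e.time):
        key = (ev.location, ev.category)
        if key not in last_seen or ev.time - last_seen[key] > window:
            kept.append(ev)
        last_seen[key] = ev.time   # updated on every record, so a burst collapses to its first entry
    return kept


def spatial_filter(events, window):
    """Drop a category already reported, from any location, within `window` seconds."""
    last_seen = {}
    kept = []
    for ev in sorted(events, key=lambda e: e.time):
        if ev.category not in last_seen or ev.time - last_seen[ev.category] > window:
            kept.append(ev)
        last_seen[ev.category] = ev.time
    return kept


# Hypothetical log records, invented for illustration.
events = [
    Event(0, "node1", "mem_error"),
    Event(5, "node1", "mem_error"),    # temporal repeat on the same node
    Event(8, "node2", "mem_error"),    # same category reported by another node
    Event(400, "node1", "mem_error"),  # far enough away to count as a new occurrence
    Event(6, "node3", "link_fail"),
]

after_temporal = temporal_filter(events, window=60)
after_spatial = spatial_filter(after_temporal, window=60)
print([(e.time, e.location, e.category) for e in after_spatial])
```

A fixed window like this is exactly the kind of filter that can discard informative failure patterns, which is the weakness the abstract says its method is designed to avoid.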
Deep-STP: a deep learning-based approach to predict snake toxin proteins by using word embeddings
Snake venom contains many toxic proteins that can destroy the circulatory system or nervous system of prey. Studies have found that these snake venom proteins have the potential to treat cardiovascular and nervous system diseases. Therefore, the study of snake venom proteins is conducive to the development of related drugs. Research technologies based on traditional biochemistry can accurately identify these proteins, but they are costly and time-consuming. Artificial intelligence technology provides a new means and strategy for large-scale computational screening of snake venom proteins. In this paper, we developed a sequence-based computational method to recognize snake toxin proteins. Specifically, we utilized three different feature descriptors, namely g-gap, natural vector, and word2vec, to encode snake toxin protein sequences. Analysis of variance (ANOVA) and the gradient-boosting decision tree (GBDT) algorithm, combined with incremental feature selection (IFS), were used to optimize the features, and the optimized features were then input into the deep learning model for training. The results show that our model achieves an accuracy of 82.00% in 10-fold cross-validation. The model is further verified on independent data, where the accuracy reaches 81.14%, demonstrating that our model has excellent prediction performance and robustness
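The abstract does not define the g-gap descriptor. The sketch below assumes the usual g-gap dipeptide composition: the frequency of each ordered residue pair separated by g intervening residues, giving a 400-dimensional vector over the 20 standard amino acids. The function name and the toy sequence are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def g_gap_composition(sequence, g):
    """Return 400 frequencies of residue pairs separated by g intervening residues."""
    counts = dict.fromkeys(PAIRS, 0)
    total = 0
    for i in range(len(sequence) - g - 1):
        pair = sequence[i] + sequence[i + g + 1]
        if pair in counts:          # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:                  # sequence too short for this gap
        return [0.0] * len(PAIRS)
    return [counts[p] / total for p in PAIRS]


# Toy sequence, invented for illustration; with g=1 the pairs are AA, CC, AA.
features = g_gap_composition("ACACA", g=1)
print(len(features), features[PAIRS.index("AA")], features[PAIRS.index("CC")])
```

With g=0 this reduces to ordinary adjacent dipeptide composition; vectors for several values of g could be concatenated before feature selection.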
Lipid profiles in the cerebrospinal fluid of rats with 6-hydroxydopamine-induced lesions as a model of Parkinson’s disease
Background: Parkinson’s disease (PD) is a progressive neurodegenerative disease with characteristic pathological abnormalities, including the loss of dopaminergic (DA) neurons, a dopamine-depleted striatum, and microglial activation. Lipid accumulation exhibits a close relationship with these pathologies in PD. Methods: Here, 6-hydroxydopamine (6-OHDA) was used to construct a rat model of PD, and the lipid profile in cerebrospinal fluid (CSF) obtained from model rats was analyzed using lipidomic approaches. Results: Establishment of this PD model was confirmed by apomorphine-induced rotation behaviors, loss of DA neurons, depletion of dopamine in the striatum, and microglial activation after 6-OHDA-induced lesion generation. Unsupervised and supervised methods were employed for lipid analysis. A total of 172 lipid species were identified in CSF and subsequently classified into 18 lipid families. Lipid families, including eicosanoids, triglyceride (TG), cholesterol ester (CE), and free fatty acid (FFA), and 11 lipid species exhibited significantly altered profiles 2 weeks after 6-OHDA administration, and significant changes in eicosanoids, TG, CE, CAR, and three lipid species were noted 5 weeks after 6-OHDA administration. During the period of 6-OHDA-induced lesion formation, the lipid families and species showed concentration fluctuations related to the recovery of behavior and nigrostriatal abnormalities. Correlation analysis showed that the levels of eicosanoids, CE, TG families, and TG (16:0_20:0_18:1) exhibited positive relationships with apomorphine-induced rotation behaviors and negative relationships with tyrosine hydroxylase (TH) expression in the midbrain. Conclusion: These results revealed that non-progressive nigrostriatal degeneration induced by 6-OHDA promotes the expression of an impairment-related lipidomic signature in CSF, and the level of eicosanoids, CE, TG families, and TG (16:0_20:0_18:1) in CSF may reveal pathological changes in the midbrain after 6-OHDA insult
Evaluation of individual and ensemble probabilistic forecasts of COVID-19 mortality in the United States
Short-term probabilistic forecasts of the trajectory of the COVID-19 pandemic in the United States have served as a visible and important communication channel between the scientific modeling community and both the general public and decision-makers. Forecasting models provide specific, quantitative, and evaluable predictions that inform short-term decisions such as healthcare staffing needs, school closures, and allocation of medical supplies. Starting in April 2020, the US COVID-19 Forecast Hub (https://covid19forecasthub.org/) collected, disseminated, and synthesized tens of millions of specific predictions from more than 90 different academic, industry, and independent research groups. A multimodel ensemble forecast that combined predictions from dozens of groups every week provided the most consistently accurate probabilistic forecasts of incident deaths due to COVID-19 at the state and national level from April 2020 through October 2021. The performance of 27 individual models that submitted complete forecasts of COVID-19 deaths consistently throughout this year showed high variability in forecast skill across time, geospatial units, and forecast horizons. Two-thirds of the models evaluated showed better accuracy than a naïve baseline model. Forecast accuracy degraded as models made predictions further into the future, with probabilistic error at a 20-wk horizon three to five times larger than when predicting at a 1-wk horizon. This project underscores the role that collaboration and active coordination between governmental public-health agencies, academic modeling teams, and industry partners can play in developing modern modeling capabilities to support local, state, and federal response to outbreaks
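The abstract does not say how the ensemble combines its member forecasts or which score measures probabilistic error. As a sketch under stated assumptions, the code below combines models by taking the median at each quantile level and scores the result with the pinball (quantile) loss. Both choices, along with the model names and numbers, are illustrative assumptions rather than the Hub's documented procedure.

```python
from statistics import median


def median_ensemble(model_quantiles):
    """Combine models by taking the median prediction at each quantile level."""
    levels = sorted(next(iter(model_quantiles.values())))
    return {q: median(m[q] for m in model_quantiles.values()) for q in levels}


def pinball_loss(quantile_forecast, observed):
    """Average pinball (quantile) loss over all levels; lower is better."""
    total = 0.0
    for q, pred in quantile_forecast.items():
        diff = observed - pred
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(quantile_forecast)


# Hypothetical one-week-ahead forecasts of incident deaths, invented for illustration.
models = {
    "model_a": {0.1: 80, 0.5: 100, 0.9: 130},
    "model_b": {0.1: 60, 0.5: 90, 0.9: 150},
    "model_c": {0.1: 90, 0.5: 120, 0.9: 140},
}

ensemble = median_ensemble(models)
score = pinball_loss(ensemble, observed=110)
print(ensemble, round(score, 3))
```

Scoring each individual model with the same function and comparing against the ensemble is one way to reproduce, in miniature, the kind of comparison the abstract describes.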
The United States COVID-19 Forecast Hub dataset
Academic researchers, government agencies, industry groups, and individuals have produced forecasts at an unprecedented scale during the COVID-19 pandemic. To leverage these forecasts, the United States Centers for Disease Control and Prevention (CDC) partnered with an academic research lab at the University of Massachusetts Amherst to create the US COVID-19 Forecast Hub. Launched in April 2020, the Forecast Hub is a dataset with point and probabilistic forecasts of incident cases, incident hospitalizations, incident deaths, and cumulative deaths due to COVID-19 at county, state, and national levels in the United States. Included forecasts represent a variety of modeling approaches, data sources, and assumptions regarding the spread of COVID-19. The goal of this dataset is to establish a standardized and comparable set of short-term forecasts from modeling teams. These data can be used to develop ensemble models, communicate forecasts to the public, create visualizations, compare models, and inform policies regarding COVID-19 mitigation. These open-source data are available via download from GitHub, through an online API, and through R packages
- …