Using ChatGPT and Other AI Engines to Vocalize Medieval Hebrew
Hebrew is usually written without vowel points, making it challenging for some readers to decipher. This is especially true of medieval Hebrew, which can have nonstandard grammar and orthography. This paper tested four artificial intelligence (AI) tools by asking them to add vowel points to an unpublished medieval Hebrew translation of the Lord’s Prayer. The vocalization tools tested were OpenAI’s ChatGPT-3.5 and ChatGPT-4, Pellaworks’ DoItInHebrew, and Dicta’s Nakdan. ChatGPT-3.5 freely changed the text, even rewriting some phrases and adding an entire sentence. ChatGPT-3.5 also provided erroneous vowels in its rewritten Hebrew text. ChatGPT-4 did a moderately good job with only a few errors, but also modified the orthography. One of ChatGPT-4’s errors was not trivial, resulting in the invention of a word. When challenged, ChatGPT-4 corrected this confabulation by inventing another word, which it claimed was a “rare form” for which it provided a fictitious derivation. When challenged on this second made-up word, ChatGPT-4 replaced the word from the input text with a word based on an entirely different root. DoItInHebrew inserted vowels that produced a gibberish text. In contrast, Dicta’s Nakdan provided near perfect vocalization, with only one genuine error, but like ChatGPT-4 it modified the orthography. ChatGPT-3.5, ChatGPT-4, and DoItInHebrew exhibited serious “hallucinations,” of both the “factual” and the “untruthful” varieties, typical of other AIs, making them counterproductive for vocalizing historic Hebrew texts. Nakdan can be a powerful tool but still requires someone with expertise in Hebrew grammar to verify and correct the vocalization. Nakdan’s interface simplified correcting the vocalization, although it required its user to have advanced knowledge of Hebrew
Comparing generative and retrieval-based chatbots in answering patient questions regarding age-related macular degeneration and diabetic retinopathy
BACKGROUND/AIMS: To compare the performance of generative versus retrieval-based chatbots in answering patient inquiries regarding age-related macular degeneration (AMD) and diabetic retinopathy (DR). METHODS: We evaluated four chatbots: generative models (ChatGPT-4, ChatGPT-3.5 and Google Bard) and a retrieval-based model (OcularBERT) in a cross-sectional study. Their response accuracy to 45 questions (15 AMD, 15 DR and 15 others) was evaluated and compared. Three masked retinal specialists graded the responses using a three-point Likert scale: either 2 (good, error-free), 1 (borderline) or 0 (poor with significant inaccuracies). The scores were aggregated, ranging from 0 to 6. Based on majority consensus among the graders, the responses were also classified as ‘Good’, ‘Borderline’ or ‘Poor’ quality. RESULTS: Overall, ChatGPT-4 and ChatGPT-3.5 outperformed the other chatbots, both achieving median scores (IQR) of 6 (1), compared with 4.5 (2) in Google Bard, and 2 (1) in OcularBERT (all p ≤ 8.4×10⁻³). Based on the consensus approach, 83.3% of ChatGPT-4’s responses and 86.7% of ChatGPT-3.5’s were rated as ‘Good’, surpassing Google Bard (50%) and OcularBERT (10%) (all p ≤ 1.4×10⁻²). ChatGPT-4 and ChatGPT-3.5 had no ‘Poor’ rated responses. Google Bard produced 6.7% ‘Poor’ responses, and OcularBERT produced 20%. Across question types, ChatGPT-4 outperformed Google Bard only for AMD, and ChatGPT-3.5 outperformed Google Bard for DR and others. CONCLUSION: ChatGPT-4 and ChatGPT-3.5 demonstrated superior performance, followed by Google Bard and OcularBERT. Generative chatbots are potentially capable of answering domain-specific questions outside their original training. Further validation studies are still required prior to real-world implementation.
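For illustration only (this is not the authors' code), the grading scheme described above can be sketched in Python: three graders each assign 0, 1 or 2, the aggregate is their sum on a 0 to 6 scale, and the quality label follows the majority consensus. The tie-break for a three-way split is an assumption, since the abstract does not specify one.

    from collections import Counter

    LABELS = {0: "Poor", 1: "Borderline", 2: "Good"}

    def aggregate_score(grades):
        """Sum of the three graders' 0-2 ratings, giving the 0-6 aggregate."""
        return sum(grades)

    def consensus_label(grades):
        """Majority label across graders; a three-way split falls back to the
        median grade (assumption: the abstract states no tie-break rule)."""
        grade, votes = Counter(grades).most_common(1)[0]
        return LABELS[grade] if votes >= 2 else LABELS[sorted(grades)[1]]

    # Example: two graders rate a response "good" (2), one rates it "borderline" (1).
    print(aggregate_score([2, 2, 1]), consensus_label([2, 2, 1]))  # -> 5 Good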
Benchmarking ChatGPT-4 on ACR Radiation Oncology In-Training (TXIT) Exam and Red Journal Gray Zone Cases: Potentials and Challenges for AI-Assisted Medical Education and Decision Making in Radiation Oncology
The potential of large language models in medicine for education and decision-making purposes has been demonstrated as they achieve decent scores on medical exams such as the United States Medical Licensing Exam (USMLE) and the MedQA exam. In this work, we evaluate the performance of ChatGPT-4 in the specialized field of radiation oncology using the 38th American College of Radiology (ACR) radiation oncology in-training (TXIT) exam and the 2022 Red Journal gray zone cases. For the TXIT exam, ChatGPT-3.5 and ChatGPT-4 achieved scores of 63.65% and 74.57%, respectively, highlighting the advantage of the latest ChatGPT-4 model. Based on the TXIT exam, ChatGPT-4's strong and weak areas in radiation oncology are identified to some extent. Specifically, ChatGPT-4 demonstrates good knowledge of statistics, CNS & eye, pediatrics, biology, and physics but has limitations in bone & soft tissue and gynecology, as per the ACR knowledge domains. Regarding clinical care paths, ChatGPT-4 performs well in diagnosis, prognosis, and toxicity but lacks proficiency in topics related to brachytherapy and dosimetry, as well as in-depth questions from clinical trials. For the gray zone cases, ChatGPT-4 is able to suggest a personalized treatment approach to each case with high correctness and comprehensiveness. Most importantly, it provides novel treatment aspects for many cases, which were not suggested by any of the human experts. Both evaluations demonstrate the potential of ChatGPT-4 in medical education for the general public and cancer patients, as well as the potential to aid clinical decision-making, while acknowledging its limitations in certain domains. Because of the risk of hallucination, facts provided by ChatGPT always need to be verified.
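As a minimal, illustrative sketch of the kind of per-domain breakdown used above to flag strong and weak areas (placeholder column names and data, not the paper's dataset):

    import pandas as pd

    # One row per exam question: its ACR knowledge domain and whether the model
    # answered correctly (illustrative values only).
    graded = pd.DataFrame({
        "domain":  ["statistics", "statistics", "gynecology", "gynecology", "physics"],
        "correct": [True, True, False, True, True],
    })

    # Accuracy per knowledge domain, weakest first.
    print(graded.groupby("domain")["correct"].mean().sort_values())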
What Did I Miss? A Demonstration of the Differences Between ChatGPT-4 and 3.5 that Impact Legal Research and Writing
Many news sources are raving about how much more advanced ChatGPT-4 is than 3.5. You may have heard that ChatGPT-4 outscored 90% of test takers on the Uniform Bar Exam, while ChatGPT 3.5 only outscored 10% of test takers. But what does this mean for teaching legal research and writing? In this presentation, we will compare specific examples of ChatGPT 3.5 (the free version many of us tried in the spring) and ChatGPT-4 (the paid version released in March)
Evaluating AI-Generated informed consent documents in oral surgery: A comparative study of ChatGPT-4, Bard Gemini Advanced, and human-written consents
This study evaluates the quality and readability of informed consent documents generated by AI platforms ChatGPT-4 and Bard Gemini Advanced compared to those written by a first-year oral surgery resident for common oral surgery procedures. The evaluation, conducted by 18 experienced oral and maxillofacial surgeons, assessed consents for accuracy, completeness, readability, and overall quality. ChatGPT-4 consistently outperformed both Bard and human-written consents. ChatGPT-4 consents had a median accuracy score of 4 [IQR 4–4], compared to Bard's 3 [IQR 3–4] and human's 4 [IQR 3–4]. Completeness scores were higher for ChatGPT-4 (4 [IQR 4–5]) than Bard (3 [IQR 3–4]) and human (4 [IQR 3–4]). Readability was also superior for ChatGPT-4, with a median score of 4 [IQR 4–5], compared to Bard and human consents at 4 [IQR 4–4] and 4 [IQR 3–4], respectively. The Gunning Fog Index for ChatGPT-4 was 17.2 [IQR 16.5–18.2], better than Bard's 23.1 [IQR 20.5–24.7] and the human consents' 20 [IQR 19.2–20.9]. Overall, ChatGPT-4's consents received the highest quality ratings, underscoring AI's potential in enhancing patient communication and the informed consent process. The study suggests AI can reduce misinformation risks and improve patient understanding, but continuous evaluation, oversight, and patient feedback integration are crucial to ensure the effectiveness and appropriateness of AI-generated content in clinical practice.
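For context, the Gunning Fog Index reported above is a standard readability measure: 0.4 × (average words per sentence + percentage of words with three or more syllables). A minimal sketch follows; the crude vowel-run syllable counter is an assumption, as the study does not describe its tooling.

    import re

    def syllables(word):
        # Rough estimate: count runs of vowels (sufficient for a sketch).
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def gunning_fog(text):
        """0.4 * (words per sentence + 100 * share of words with 3+ syllables)."""
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z]+", text)
        complex_words = [w for w in words if syllables(w) >= 3]
        return 0.4 * (len(words) / len(sentences)
                      + 100 * len(complex_words) / len(words))

    print(round(gunning_fog("The tooth is removed under local anaesthetic. "
                            "Swelling and discomfort are expected afterwards."), 1))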
Assessing ChatGPT’s theoretical knowledge and prescriptive accuracy in bacterial infections: a comparative study with infectious diseases residents and specialists
Objectives: Advancements in Artificial Intelligence (AI) have made platforms like ChatGPT increasingly relevant in medicine. This study assesses ChatGPT's utility in addressing bacterial infection-related questions and antibiogram-based clinical cases. Methods: This study was a collaborative effort involving infectious disease (ID) specialists and residents. A group of experts formulated six true/false questions, six open-ended questions, and six clinical cases with antibiograms for four types of infections (endocarditis, pneumonia, intra-abdominal infections, and bloodstream infection), for a total of 96 questions. The questions were submitted to four senior residents and four specialists in ID and inputted into ChatGPT-4 and a trained version of ChatGPT-4. A total of 720 responses were obtained and reviewed by a blinded panel of experts in antibiotic treatments. They evaluated the responses for accuracy and completeness, the ability to identify correct resistance mechanisms from antibiograms, and the appropriateness of antibiotic prescriptions. Results: No significant difference was noted among the four groups for true/false questions, with approximately 70% correct answers. The trained ChatGPT-4 and ChatGPT-4 offered more accurate and complete answers to the open-ended questions than both the residents and specialists. Regarding the clinical cases, we observed lower accuracy from ChatGPT-4 in recognizing the correct resistance mechanism. ChatGPT-4 tended not to prescribe newer antibiotics like cefiderocol or imipenem/cilastatin/relebactam, favoring less recommended options like colistin. Both the trained ChatGPT-4 and ChatGPT-4 recommended longer than necessary treatment periods (p-value = 0.022). Conclusions: This study highlights ChatGPT's capabilities and limitations in medical decision-making, specifically regarding bacterial infections and antibiogram analysis. While ChatGPT demonstrated proficiency in answering theoretical questions, it did not consistently align with expert decisions in clinical case management. Despite these limitations, the potential of ChatGPT as a supportive tool in ID education and preliminary analysis is evident. However, it should not replace expert consultation, especially in complex clinical decision-making.
ChatGPT-4 with Code Interpreter can be used to solve introductory college-level vector calculus and electromagnetism problems
We evaluated ChatGPT 3.5, 4, and 4 with Code Interpreter on a set of college-level engineering-math and electromagnetism problems, such as those often given to sophomore electrical engineering majors. We selected a set of 13 problems, and had ChatGPT solve them multiple times, using a fresh instance (chat) each time. We found that ChatGPT-4 with Code Interpreter was able to satisfactorily solve most problems we tested most of the time -- a major improvement over the performance of ChatGPT-4 (or 3.5) without Code Interpreter. The performance of ChatGPT was observed to be somewhat stochastic, and we found that solving the same problem N times in new ChatGPT instances and taking the most-common answer was an effective strategy. Based on our findings and observations, we provide some recommendations for instructors and students of classes at this level.
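The repeat-and-vote strategy the authors describe can be sketched as follows; ask_model is a placeholder for a fresh chat-completion call per attempt, not a function from the paper.

    from collections import Counter

    def ask_model(problem):
        """Placeholder: send `problem` to a fresh model instance (new chat) and
        return its final answer as a normalized string."""
        raise NotImplementedError("wire this up to your chat-completion client")

    def majority_answer(problem, n=5):
        """Solve the same problem n times in independent chats and keep the mode."""
        answers = [ask_model(problem) for _ in range(n)]
        return Counter(answers).most_common(1)[0][0]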
Evaluating ChatGPT-4’s historical accuracy: a case study on the origins of SWOT analysis
In this study we test ChatGPT-4’s ability to provide accurate information about the origins and evolution of SWOT analysis, perhaps the most widely used strategy tool in practice worldwide. ChatGPT-4 is tested for historical accuracy and hallucinations. The API is prompted using a Python script with a series of structured questions from an Excel file and the results are recorded in another Excel file and rated on a binary scale. Our findings present a nuanced view of ChatGPT-4’s capabilities. We observe that while ChatGPT-4 demonstrates a high level of proficiency in describing and outlining the general concept of SWOT analysis, there are notable discrepancies when it comes to detailing its origins and evolution. These inaccuracies range from minor factual errors to more serious hallucinations that deviate from evidence in scholarly publications. However, we also find that ChatGPT-4 comes up with spontaneous historically accurate facts. Our interpretation of the result is that ChatGPT is largely trained on easily available websites and to a very limited extent has been trained on scholarly publications on SWOT analysis, especially when these are behind a paywall. We conclude with four propositions for future research.
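A minimal sketch of the kind of prompting pipeline described above; the file names, column names, and model string are illustrative assumptions, not the authors' actual script.

    import pandas as pd
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    questions = pd.read_excel("swot_questions.xlsx")  # assumed to contain a "question" column

    def ask(question):
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": question}],
        )
        return response.choices[0].message.content

    questions["answer"] = questions["question"].apply(ask)
    questions["accurate"] = ""  # binary rating to be filled in by the human raters
    questions.to_excel("swot_answers.xlsx", index=False)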
Assessment of Quality and Readability of Information Provided by ChatGPT in Relation to Anterior Cruciate Ligament Injury
The aim of our study was to evaluate the potential role of Artificial Intelligence tools like ChatGPT in patient education. To do this, we assessed both the quality and readability of information provided by ChatGPT 3.5 and 4 in relation to Anterior Cruciate Ligament (ACL) injury and treatment. ChatGPT 3.5 and 4 were used to answer common patient queries relating to ACL injuries and treatment. The quality of the information was assessed using the DISCERN criteria. Readability was assessed with the use of seven readability formulae: the Flesch-Kincaid Reading Grade Level, the Flesch Reading Ease Score, the Raygor Estimate, the SMOG, the Fry, the FORCAST, and the Gunning Fog. The mean reading grade level (RGL) was compared with the recommended 8th-grade reading level, the mean RGL among adults in America. The perceived quality and mean RGL of answers given by both ChatGPT 3.5 and 4 were also compared. Both ChatGPT 3.5 and 4 yielded DISCERN scores suggesting "good" quality of information, with ChatGPT 4 slightly outperforming 3.5. However, readability levels for both versions significantly exceeded the average 8th-grade reading level for American patients. ChatGPT 3.5 had a mean RGL of 18.08, while the mean RGL of ChatGPT 4 was 17.9, exceeding the average American reading grade level by 10.08 grade levels and 9.09 grade levels, respectively. While ChatGPT can provide both reliable and good quality information on ACL injuries and treatment options, the readability of the content may limit its utility. Additionally, the consistent lack of source citation represents a significant area of concern for patients and clinicians alike. If AI is to play a role in patient education, it must reliably produce information which is accurate, easily comprehensible, and clearly sourced.
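For context, the Flesch-Kincaid Reading Grade Level used above is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. A minimal sketch with a crude syllable heuristic (an assumption, not the study's tooling):

    import re

    def syllables(word):
        # Rough estimate: count runs of vowels.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def flesch_kincaid_grade(text):
        """0.39 * (words/sentence) + 11.8 * (syllables/word) - 15.59"""
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z]+", text)
        syllable_count = sum(syllables(w) for w in words)
        return (0.39 * len(words) / len(sentences)
                + 11.8 * syllable_count / len(words) - 15.59)

    grade = flesch_kincaid_grade("Anterior cruciate ligament reconstruction usually "
                                 "requires several months of supervised rehabilitation.")
    print(f"RGL {grade:.1f} ({grade - 8:.1f} grades above the recommended 8th-grade level)")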
Evaluating Large Language Models on a Highly-specialized Topic, Radiation Oncology Physics
We present the first study to investigate Large Language Models (LLMs) in answering radiation oncology physics questions. Because popular exams like AP Physics, LSAT, and GRE have large test-taker populations and ample test preparation resources in circulation, they may not allow for accurately assessing the true potential of LLMs. This paper proposes evaluating LLMs on a highly-specialized topic, radiation oncology physics, which may be more pertinent to scientific and medical communities in addition to being a valuable benchmark of LLMs. We developed an exam consisting of 100 radiation oncology physics questions based on our expertise at Mayo Clinic. Four LLMs, ChatGPT (GPT-3.5), ChatGPT (GPT-4), Bard (LaMDA), and BLOOMZ, were evaluated against medical physicists and non-experts. ChatGPT (GPT-4) outperformed all other LLMs as well as medical physicists, on average. The performance of ChatGPT (GPT-4) was further improved when prompted to explain first, then answer. ChatGPT (GPT-3.5 and GPT-4) showed a high level of consistency in its answer choices across a number of trials, whether correct or incorrect, a characteristic that was not observed in the human test groups. In evaluating ChatGPT's (GPT-4) deductive reasoning ability using a novel approach (substituting the correct answer with "None of the above choices is the correct answer."), ChatGPT (GPT-4) demonstrated surprising accuracy, suggesting the potential presence of an emergent ability. Finally, although ChatGPT (GPT-4) performed well overall, its intrinsic properties did not allow for further improvement when scoring based on a majority vote across trials. In contrast, a team of medical physicists was able to greatly outperform ChatGPT (GPT-4) using a majority vote. This study suggests a great potential for LLMs to work alongside radiation oncology experts as highly knowledgeable assistants.
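The answer-substitution probe described above replaces the correct option with "None of the above choices is the correct answer.", so the model only scores if it rejects every remaining distractor. A minimal sketch; the question dataclass and example item are illustrative, not from the study's 100-question exam.

    from dataclasses import dataclass, replace

    NONE_OPTION = "None of the above choices is the correct answer."

    @dataclass
    class MCQuestion:
        stem: str
        options: list
        correct_index: int

    def substitute_correct(q):
        """Return a copy of the question with its correct option swapped for NONE_OPTION."""
        options = list(q.options)
        options[q.correct_index] = NONE_OPTION
        return replace(q, options=options)

    q = MCQuestion(stem="What is the SI unit of absorbed dose?",
                   options=["Gray", "Sievert", "Becquerel", "Electronvolt"],
                   correct_index=0)
    print(substitute_correct(q).options)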
