Measuring reasoning capabilities of ChatGPT
I shall quantify the logical faults generated by ChatGPT when applied to
reasoning tasks. For experiments, I use the 144 puzzles from the library
\url{https://users.utcluj.ro/~agroza/puzzles/maloga}~\cite{groza:fol}. The
library contains puzzles of various types, including arithmetic puzzles,
logical equations, Sudoku-like puzzles, zebra-like puzzles, truth-telling
puzzles, grid puzzles, strange numbers, and self-reference puzzles. The correct
solutions for these puzzles were verified using the theorem prover
Prover9~\cite{mccune2005release} and the finite model finder
Mace4~\cite{mccune2003mace4}, based on human modelling in equational first-order
logic. A first output of this study is a benchmark of 100 logical puzzles.
For this dataset, ChatGPT provided both a correct answer and a correct
justification for only 7\% of the puzzles. Since the dataset seems challenging,
researchers are invited to test it on more advanced or tuned models than
ChatGPT3.5, with more carefully crafted prompts. A second output is the
classification of reasoning faults conveyed by ChatGPT. This classification
forms a basis for a taxonomy of reasoning faults generated by large language
models. I have identified 67 such logical faults, among them inconsistencies,
implications that do not hold, unsupported claims, lack of commonsense, and wrong
justifications. The 100 solutions generated by ChatGPT contain 698 logical
faults, that is, on average about 7 fallacies per reasoning task. A third output
is the set of ChatGPT answers annotated with the corresponding logical faults.
Each wrong statement within a ChatGPT answer was manually annotated, aiming
to quantify the amount of faulty text generated by the language model. On
average, 26.03\% of the generated text was a logical fault.
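As a quick check on the averages reported above, using only the numbers stated in the abstract (698 faults across 100 generated solutions):
\[
\frac{698}{100} = 6.98 \approx 7 \quad \text{faults per reasoning task.}
\]
The 26.03\% figure can be read in a similar way, as the share of generated text that was annotated as faulty. The abstract does not state the unit of measurement (characters, words, or statements), so the expression below is an illustrative assumption rather than the author's definition:
\[
\text{fault share} = \frac{|\text{text annotated as faulty}|}{|\text{generated text}|} \times 100\% .
\]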
Case Study: Using AI-Assisted Code Generation In Mobile Teams
This study evaluates the performance of AI-assisted
programming in working mobile development teams that focus on native
mobile languages such as Kotlin and Swift. The case study involves 16
participants and 2 technical reviewers from a software development department
and is designed to understand the impact of using LLMs trained for code generation in
specific phases of a team's work, namely technical onboarding and
technical stack switch. The study uses technical problems dedicated to each
phase and requests solutions from the participants with and without
AI code generators. It measures time, correctness, and technical integration
using ReviewerScore, a metric specific to the paper and derived from actual
industry practice: the code review of merge requests. The output is
converted and analyzed together with feedback from the participants to
determine whether AI-assisted programming tools have an impact
on onboarding developers to a project or on helping them transition smoothly
between the two native mobile development environments,
Android and iOS. The study was performed between May and June 2023
with members of the mobile department of a software development company based
in Cluj-Napoca, with Romanian ownership and management.
Comment: 8 pages, 10 figures, 1 table, ICCP conference
Assuring safety in an air traffic control system with defeasible logic programming
Assuring safety in complex technical systems is a crucial issue in several critical applications, such as air traffic control or medical devices.
We present a preliminary framework based on argumentation for assisting flight controllers in reaching decisions related to safety constraints in an ever-changing environment in which sensor data is gathered in real time.
Sociedad Argentina de Informática e Investigación Operativa (SADIO)