Using Large Language Models to Support Thematic Analysis in Empirical Legal Studies
Thematic analysis and other variants of inductive coding are widely used
qualitative analytic methods within empirical legal studies (ELS). We propose a
novel framework facilitating effective collaboration of a legal expert with a
large language model (LLM) for generating initial codes (phase 2 of thematic
analysis), searching for themes (phase 3), and classifying the data in terms of
the themes (to kick-start phase 4). We employed the framework for an analysis
of a dataset (n=785) of facts descriptions from criminal court opinions
regarding thefts. The goal of the analysis was to discover classes of typical
thefts. Our results show that the LLM, namely OpenAI's GPT-4, generated
reasonable initial codes, and it was capable of improving the quality of the
codes based on expert feedback. They also suggest that the model performed well
in zero-shot classification of facts descriptions in terms of the themes.
Finally, the themes autonomously discovered by the LLM appear to map fairly
well to the themes arrived at by legal experts. These findings can be leveraged
by legal researchers to guide their decisions in integrating LLMs into their
thematic analyses, as well as other inductive coding projects.
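As an illustration of the kind of zero-shot theme classification the paper describes, here is a minimal Python sketch using the OpenAI chat API; the theme labels and prompt wording are hypothetical placeholders, not the paper's actual prompts or discovered themes.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Hypothetical theme labels; the paper's themes were discovered from the data.
THEMES = ["shoplifting", "burglary", "pickpocketing", "vehicle theft"]

def classify_facts(facts: str) -> str:
    """Assign a facts description from a theft case to exactly one theme."""
    prompt = (
        "Classify the following facts description from a criminal court "
        f"opinion into exactly one of these themes: {', '.join(THEMES)}.\n\n"
        f"Facts: {facts}\n\nRespond with the theme name only."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic output for classification
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```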
Efficient Classification of Student Help Requests in Programming Courses Using Large Language Models
The accurate classification of student help requests with respect to the type
of help being sought can enable the tailoring of effective responses.
Automatically classifying such requests is non-trivial, but large language
models (LLMs) appear to offer an accessible, cost-effective solution. This
study evaluates the performance of the GPT-3.5 and GPT-4 models for classifying
help requests from students in an introductory programming class. In zero-shot
trials, GPT-3.5 and GPT-4 exhibited comparable performance on most categories,
while GPT-4 outperformed GPT-3.5 in classifying sub-categories for requests
related to debugging. Fine-tuning the GPT-3.5 model improved its performance to
such an extent that it approximated the accuracy and consistency across
categories observed between two human raters. Overall, this study demonstrates
the feasibility of using LLMs to enhance educational systems through the
automated classification of student needs.
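The fine-tuning step could look roughly like the following sketch against OpenAI's fine-tuning API; the category label, system prompt, and training example are hypothetical stand-ins, not the study's data or prompts.

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical chat-formatted training example; a real run would serialize
# the study's full labeled dataset of student help requests.
examples = [
    {"messages": [
        {"role": "system", "content": "Classify the student's help request."},
        {"role": "user", "content": "My loop never stops. What am I doing wrong?"},
        {"role": "assistant", "content": "debugging"},
    ]},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

train_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=train_file.id, model="gpt-3.5-turbo")
print(job.id)  # poll client.fine_tuning.jobs.retrieve(job.id) until it completes
```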
Large Language Models (GPT) Struggle to Answer Multiple-Choice Questions about Code
We analyzed the effectiveness of three generative pre-trained transformer (GPT)
models in answering multiple-choice question (MCQ) assessments, often involving
short snippets of code, from introductory and intermediate programming courses
at the postsecondary level. This emerging technology stirs countless
discussions of its potential uses (e.g., exercise generation, code explanation)
as well as misuses in programming education (e.g., cheating). However, the
capabilities and limitations of GPT models in reasoning about and/or analyzing
code in educational settings have been under-explored. We evaluated several of
OpenAI's GPT models on formative and summative MCQ assessments from three
Python courses (530 questions). We found that MCQs containing code snippets are
not answered as successfully as those that only contain natural language. While
questions that require filling in a blank in the code or completing a natural
language statement about the snippet are handled rather successfully, MCQs that
require analysis and/or reasoning about the code (e.g., what is true/false
about the snippet, or what is its output) appear to be the most challenging.
These findings can be leveraged by educators to adapt their instructional
practices and assessments in programming courses, so that GPT becomes a
valuable assistant for a learner as opposed to a source of confusion and/or
potential hindrance in the learning process.
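A minimal sketch of how one might pose an MCQ with a code snippet to a GPT model; the prompt format and the single-letter answer extraction are assumptions, not the paper's evaluation harness.

```python
from openai import OpenAI

client = OpenAI()

def answer_mcq(stem: str, options: list[str]) -> str:
    """Ask the model to pick one option; returns a letter such as 'A'."""
    letters = "ABCDE"[: len(options)]
    body = "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{
            "role": "user",
            "content": f"{stem}\n\n{body}\n\nAnswer with a single letter only.",
        }],
    )
    return resp.choices[0].message.content.strip()[:1]  # naive extraction

# Example: an output-tracing question, the category the paper found hardest.
print(answer_mcq(
    "What is the output of the following code?\n\nx = [1, 2, 3]\nprint(x[-1])",
    ["1", "2", "3", "IndexError"],
))
```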
CodeHelp: Using Large Language Models with Guardrails for Scalable Support in Programming Classes
Computing educators face significant challenges in providing timely support
to students, especially in large class settings. Large language models (LLMs)
have emerged recently and show great promise for providing on-demand help at a
large scale, but there are concerns that students may over-rely on the outputs
produced by these models. In this paper, we introduce CodeHelp, a novel
LLM-powered tool designed with guardrails to provide on-demand assistance to
programming students without directly revealing solutions. We detail the design
of the tool, which incorporates a number of useful features for instructors,
and elaborate on the pipeline of prompting strategies we use to ensure
generated outputs are suitable for students. To evaluate CodeHelp, we deployed
it in a first-year computer and data science course with 52 students and
collected student interactions over a 12-week period. We examine students'
usage patterns and perceptions of the tool, and we report reflections from the
course instructor and a series of recommendations for classroom use. Our
findings suggest that CodeHelp is well-received by students who especially
value its availability and help with resolving errors, and that for instructors
it is easy to deploy and complements, rather than replaces, the support that
they provide to students.
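In the spirit of CodeHelp's guardrails, here is a minimal sketch of a system prompt that withholds solutions; CodeHelp's actual prompting pipeline is more elaborate (multiple stages with instructor-configurable settings), so this wording is purely illustrative.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative guardrail prompt; the real tool's pipeline includes additional
# steps, such as checking that the request contains enough information.
GUARDRAIL = (
    "You are an assistant for an introductory programming course. Help the "
    "student understand their problem, but never write or complete code for "
    "them. Respond with explanations, guiding questions, and pointers only."
)

def help_student(code: str, error: str, question: str) -> str:
    user = f"Code:\n{code}\n\nError:\n{error}\n\nQuestion: {question}"
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": GUARDRAIL},
            {"role": "user", "content": user},
        ],
    )
    return resp.choices[0].message.content
```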
Thrilled by Your Progress! Large Language Models (GPT-4) No Longer Struggle to Pass Assessments in Higher Education Programming Courses
This paper studies recent developments in large language models' (LLM)
abilities to pass assessments in introductory and intermediate Python
programming courses at the postsecondary level. The emergence of ChatGPT
resulted in heated debates about its potential uses (e.g., exercise generation,
code explanation) as well as misuses in programming classes (e.g., cheating).
Recent studies show that while the technology performs surprisingly well on
diverse sets of assessment instruments employed in typical programming classes,
the performance is usually not sufficient to pass the courses. The release of
GPT-4 emphasized notable improvements in the model's capabilities for
handling assessments originally designed for human test-takers. This study
provides a necessary analysis in the context of this ongoing transition towards
mature generative AI systems. Specifically, we report the performance of GPT-4,
comparing it to the previous generations of GPT models, on three Python courses
with assessments ranging from simple multiple-choice questions (no code
involved) to complex programming projects with code bases distributed into
multiple files (599 exercises overall). Additionally, we analyze the
assessments that were not handled well by GPT-4 to understand the current
limitations of the model, as well as its capabilities to leverage feedback
provided by an auto-grader. We found that the GPT models evolved from
completely failing a typical programming class's assessments (the original
GPT-3) to confidently passing the courses with no human involvement (GPT-4).
While we identified certain limitations in GPT-4's handling of MCQs and coding
exercises, the rate of improvement across the recent generations of GPT models
strongly suggests their potential to handle almost any type of assessment
widely used in higher education programming courses. These findings could be
leveraged by educators and institutions to adapt the design of programming
assessments, as well as to fuel the necessary discussions of how programming
classes should be updated to reflect the recent technological developments.
This study provides evidence that programming instructors need to prepare for a
world in which there is an easy-to-use, widely accessible technology that can be
utilized by learners to collect passing scores, with no effort whatsoever, on
what today counts as viable assessments of programming knowledge and skills.
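The feedback-leveraging behavior the study measures can be pictured as a simple loop: generate a solution, run the auto-grader, and feed failures back to the model. This sketch assumes a shell test command standing in for the grader and does no response cleanup (e.g., stripping markdown fences), so it is illustrative only.

```python
import subprocess
from openai import OpenAI

client = OpenAI()

def solve_with_feedback(task: str, test_cmd: str, max_attempts: int = 3) -> str | None:
    """Iteratively refine a solution using auto-grader output as feedback."""
    messages = [{"role": "user", "content": f"Write a Python solution.\n\n{task}"}]
    for _ in range(max_attempts):
        reply = client.chat.completions.create(model="gpt-4", messages=messages)
        code = reply.choices[0].message.content  # real code would strip fences
        with open("solution.py", "w") as f:
            f.write(code)
        result = subprocess.run(test_cmd, shell=True, capture_output=True, text=True)
        if result.returncode == 0:
            return code  # the auto-grader accepted the solution
        messages.append({"role": "assistant", "content": code})
        messages.append({
            "role": "user",
            "content": (f"The auto-grader reported:\n{result.stdout}"
                        f"{result.stderr}\nPlease fix the solution."),
        })
    return None  # failed within the attempt budget
```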
Can Generative Pre-trained Transformers (GPT) Pass Assessments in Higher Education Programming Courses?
We evaluated the capability of generative pre-trained transformers (GPT) to
pass assessments in introductory and intermediate Python programming courses at
the postsecondary level. Discussions of potential uses (e.g., exercise
generation, code explanation) and misuses (e.g., cheating) of this emerging
technology in programming education have intensified, but to date there has not
been a rigorous analysis of the models' capabilities in the realistic context
of a full-fledged programming course with a diverse set of assessment
instruments. We evaluated GPT on three Python courses that employ assessments
ranging from simple multiple-choice questions (no code involved) to complex
programming projects with code bases distributed into multiple files (599
exercises overall). Further, we studied if and how successfully GPT models
leverage feedback provided by an auto-grader. We found that the current models
are not capable of passing the full spectrum of assessments typically involved
in a Python programming course (<70% on even entry-level modules). Yet, it is
clear that a straightforward application of these easily accessible models
could enable a learner to obtain a non-trivial portion of the overall available
score (>55%) in introductory and intermediate courses alike. While the models
exhibit remarkable capabilities, including correcting solutions based on
an auto-grader's feedback, some limitations exist (e.g., poor handling of
exercises requiring complex chains of reasoning steps). These findings can be
leveraged by instructors wishing to adapt their assessments so that GPT becomes
a valuable assistant for a learner as opposed to an end-to-end solution.
Explaining Legal Concepts with Augmented Large Language Models (GPT-4)
Interpreting the meaning of legal open-textured terms is a key task of legal
professionals. An important source for this interpretation is how the term was
applied in previous court cases. In this paper, we evaluate the performance of
GPT-4 in generating factually accurate, clear and relevant explanations of
terms in legislation. We compare the performance of a baseline setup, where
GPT-4 is directly asked to explain a legal term, to an augmented approach,
where a legal information retrieval module is used to provide relevant context
to the model, in the form of sentences from case law. We found that the direct
application of GPT-4 yields explanations that appear to be of very high quality
on their surface. However, detailed analysis uncovered limitations in terms of
the factual accuracy of the explanations. Further, we found that the
augmentation leads to improved quality, and appears to eliminate the issue of
hallucination, where models invent incorrect statements. These findings open
the door to the building of systems that can autonomously retrieve relevant
sentences from case law and condense them into a useful explanation for legal
scholars, educators, or practicing lawyers alike.
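The augmented setup can be sketched as simple retrieval-then-prompt: select the case-law sentences most similar to the term and instruct the model to ground its explanation in them. The TF-IDF retriever and example sentences below are stand-ins; the paper uses a dedicated legal information retrieval module.

```python
from openai import OpenAI
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

client = OpenAI()

# Hypothetical corpus of sentences extracted from case law.
case_sentences = [
    "The court held that a dwelling includes any structure used for lodging.",
    "A vehicle is not a dwelling merely because the defendant slept in it.",
]

def explain_term(term: str, k: int = 5) -> str:
    """Retrieve the k sentences most similar to the term, then prompt GPT-4."""
    vec = TfidfVectorizer().fit(case_sentences + [term])
    sims = cosine_similarity(vec.transform([term]), vec.transform(case_sentences))[0]
    top = [case_sentences[i] for i in sims.argsort()[::-1][:k]]
    prompt = (
        f"Explain the statutory term '{term}'. Ground the explanation strictly "
        "in the following sentences from case law:\n" + "\n".join(top)
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```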
Discovering sentences for argumentation about the meaning of statutory terms
In this work I studied, designed, and evaluated computational methods to support the interpretation of statutory terms. Understanding statutes is difficult because the abstract rules they express must account for diverse situations, even those not yet encountered. The interpretation involves an investigation of how a particular term has been referred to, explained, interpreted, or applied in the past. This is an important step that enables a lawyer to then construct arguments in support of or against particular interpretations. A response to a search query may consist of hundreds or thousands of documents, and going through the list of results manually is labor intensive. I investigated the feasibility of developing a system that would respond to a query with a list of sentences that mention the term in a way that is useful for understanding and elaborating its meaning. I treat the discovery of sentences for argumentation about the meaning of statutory terms as a special case of ad hoc document retrieval. The specifics include retrieval of short texts (sentences), specialized document types (legal case texts), and, above all, the unique definition of document relevance.
This work makes a number of contributions to the areas of legal information retrieval and legal text analytics. First, a novel task of discovering sentences for argumentation about the meaning of statutory terms is proposed. The task includes analyzing past treatment of a statutory term, a task lawyers routinely perform using a combination of manual and computational approaches. Second, a data set comprising 42 queries (26,959 sentences) was assembled to support the experiments presented here. Third, by systematically assessing the performance of a considerable number of traditional information retrieval techniques, I position this novel task in the context of a large body of work on ad hoc document retrieval. Fourth, I assembled a unique list of 129 descriptive features that model the retrieved sentences, their relationships to the terms of interest, as well as the statutory provisions they come from. I demonstrate how the proposed feature set could be utilized in learning-to-rank settings by showing how a number of machine learning algorithms learn to rank the sentences with very reasonable effectiveness. Fifth, I analyze the effectiveness of fine-tuning pre-trained language models in the context of this special task and demonstrate a very promising direction for future work.
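A learning-to-rank setup of the kind described can be sketched with LightGBM's LGBMRanker; the 129-feature matrix, relevance grades, and query groupings below are synthetic placeholders for the dissertation's actual feature set and labels.

```python
import numpy as np
from lightgbm import LGBMRanker

rng = np.random.default_rng(0)

# Synthetic stand-ins: 3 queries x 20 candidate sentences, 129 features each.
X_train = rng.normal(size=(60, 129))
y_train = rng.integers(0, 4, size=60)  # graded sentence relevance, 0..3
groups = [20, 20, 20]                  # candidates per query, in row order

ranker = LGBMRanker(objective="lambdarank", n_estimators=100)
ranker.fit(X_train, y_train, group=groups)

# Rank candidate sentences retrieved for a new query, best first.
X_new = rng.normal(size=(10, 129))
order = np.argsort(-ranker.predict(X_new))
print(order)
```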
The Unreasonable Effectiveness of Large Language Models in Zero-Shot Semantic Annotation of Legal Texts
The emergence of ChatGPT has sensitized the general public, including the legal profession, to large language models' (LLMs) potential uses (e.g., document drafting, question answering, and summarization). Although recent studies have shown how well the technology performs in diverse semantic annotation tasks focused on legal texts, an influx of newer, more capable (GPT-4) or cost-effective (GPT-3.5-turbo) models requires another analysis. This paper addresses recent developments in the ability of LLMs to semantically annotate legal texts in zero-shot learning settings. Given the transition to mature generative AI systems, we examine the performance of GPT-4 and GPT-3.5-turbo(-16k), comparing them to the previous generation of GPT models, on three legal text annotation tasks involving diverse documents such as adjudicatory opinions, contractual clauses, or statutory provisions. We also compare the models' performance and cost to better understand the trade-offs. We found that the GPT-4 model clearly outperforms the GPT-3.5 models on two of the three tasks. The cost-effective GPT-3.5-turbo matches the performance of the 20× more expensive text-davinci-003 model. While one can annotate multiple data points within a single prompt, the performance degrades as the size of the batch increases. This work provides valuable information relevant for many practical applications (e.g., in contract review) and research projects (e.g., in empirical legal studies). Legal scholars and practicing lawyers alike can leverage these findings to guide their decisions in integrating LLMs in a wide range of workflows involving semantic annotation of legal texts.
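The batching trade-off the paper reports can be illustrated with a sketch that annotates several data points in a single prompt; the task, labels, and JSON output convention are assumptions, and per the paper's findings accuracy tends to degrade as the batch grows.

```python
import json
from openai import OpenAI

client = OpenAI()

def annotate_batch(clauses: list[str], labels: list[str]) -> list[str]:
    """Zero-shot annotation of multiple texts within one prompt."""
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(clauses))
    prompt = (
        f"Label each contractual clause with one of: {', '.join(labels)}. "
        "Return only a JSON array of labels, one per clause, in order.\n\n"
        + numbered
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model returns bare JSON; production code would validate.
    return json.loads(resp.choices[0].message.content)
```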
Toward Automatically Identifying Legally Relevant Factors
In making legal decisions, courts apply relevant law to facts. While the law typically changes slowly over time, facts vary from case to case. Nevertheless, underlying patterns of fact may emerge. This research focuses on underlying fact patterns commonly present in cases where motorists are stopped for a traffic violation and subsequently detained while a police officer conducts a canine sniff of the vehicle for drugs. We present a set of underlying patterns of fact, that is, factors of suspicion, that police and courts apply in determining reasonable suspicion. We demonstrate how these fact patterns can be identified and annotated in legal cases and how these annotations can be employed to fine-tune a transformer model to identify the factors in previously unseen legal opinions