4 research outputs found

    Time Travel in LLMs: Tracing Data Contamination in Large Language Models

    Data contamination, i.e., the presence of test data from downstream tasks in the training data of large language models (LLMs), is a potentially major issue in understanding LLMs' effectiveness on other tasks. We propose a straightforward yet effective method for identifying data contamination within LLMs. At its core, our approach starts by identifying potential contamination in individual instances drawn from a small random sample; using this information, it then assesses whether an entire dataset partition is contaminated. To estimate contamination of individual instances, we employ a "guided instruction": a prompt consisting of the dataset name, partition type, and the initial segment of a reference instance, asking the LLM to complete it. An instance is flagged as contaminated if the LLM's output exactly or closely matches the latter segment of the reference. To determine whether an entire partition is contaminated, we propose two ideas. The first marks a dataset partition as contaminated if the average overlap score with the reference instances (as measured by ROUGE or BLEURT) is statistically significantly better under the guided instruction than under a general instruction that does not include the dataset and partition names. The second marks a dataset as contaminated if a GPT-4-based classifier with in-context learning prompting flags multiple instances as contaminated. Our best method achieves an accuracy between 92% and 100% in detecting whether an LLM is contaminated, across seven datasets containing train and test/validation partitions, when contrasted with manual evaluation by human experts. Further, our findings indicate that GPT-4 is contaminated with the AG News, WNLI, and XSum datasets.
    Comment: v1 preprint
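
    As a rough illustration of the partition-level check described above, the sketch below contrasts guided and general instructions over a small sample and applies a significance test to the overlap scores. Here `query_llm` is a hypothetical stand-in for any completion API, ROUGE-L approximates the paper's overlap scoring, and Welch's t-test stands in for the authors' significance test.

```python
# A minimal sketch of the guided-instruction contamination check, not the
# authors' exact implementation. `query_llm` is a hypothetical stand-in.
from rouge_score import rouge_scorer
from scipy.stats import ttest_ind

def guided_prompt(dataset: str, partition: str, prefix: str) -> str:
    # The guided instruction names the dataset and partition explicitly.
    return (f"Complete the following instance from the {partition} split "
            f"of the {dataset} dataset:\n{prefix}")

def general_prompt(prefix: str) -> str:
    # The general instruction omits the dataset and partition names.
    return f"Complete the following text:\n{prefix}"

def partition_contaminated(instances, dataset, partition, query_llm, alpha=0.05):
    """instances: list of (prefix, reference_continuation) pairs sampled from
    the partition. Returns True if guided completions overlap the references
    significantly more than general completions do."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    guided_scores, general_scores = [], []
    for prefix, reference in instances:
        guided_out = query_llm(guided_prompt(dataset, partition, prefix))
        general_out = query_llm(general_prompt(prefix))
        guided_scores.append(scorer.score(reference, guided_out)["rougeL"].fmeasure)
        general_scores.append(scorer.score(reference, general_out)["rougeL"].fmeasure)
    # One-sided test: is the mean overlap significantly higher when guided?
    _stat, p_value = ttest_ind(guided_scores, general_scores,
                               equal_var=False, alternative="greater")
    return p_value < alpha
```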

    Large Language Models As MOOCs Graders

    Massive open online courses (MOOCs) unlock the doors to free education for anyone around the globe with access to a computer and the internet. Despite this democratization of learning, the massive enrollment in these courses means it is almost impossible for one instructor to assess every student's writing assignments. As a result, peer grading, often guided by a straightforward rubric, is the method of choice. While convenient, peer grading often falls short in terms of reliability and validity. In this study, using 18 distinct settings, we explore the feasibility of leveraging large language models (LLMs) to replace peer grading in MOOCs. Specifically, we focus on two state-of-the-art LLMs, GPT-4 and GPT-3.5, across three distinct courses: Introductory Astronomy, Astrobiology, and the History and Philosophy of Astronomy. To instruct the LLMs, we use three different prompts based on a variant of the zero-shot chain-of-thought (Zero-shot-CoT) prompting technique: Zero-shot-CoT combined with instructor-provided correct answers; Zero-shot-CoT in conjunction with both instructor-formulated answers and rubrics; and Zero-shot-CoT with instructor-provided correct answers and LLM-generated rubrics. Our results show that Zero-shot-CoT, when integrated with instructor-provided answers and rubrics, produces grades that are more aligned with those assigned by instructors than peer grading is. However, the History and Philosophy of Astronomy course proves more challenging to grade than the other courses. Finally, our study points to a promising direction for automating grading systems in MOOCs, especially for subjects with well-defined rubrics.
    Comment: v1.3 preprint
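
    As an illustration of the best-performing setting above (Zero-shot-CoT with instructor-provided answers and rubrics), a grading call might be sketched as follows. The prompt wording, the 10-point scale, and `query_llm` are assumptions for illustration, not the paper's verbatim template.

```python
# A minimal sketch of Zero-shot-CoT grading with an instructor answer and a
# rubric; prompt wording and scale are illustrative assumptions.
import re

def build_grading_prompt(question, correct_answer, rubric, student_answer,
                         max_points=10):
    return (f"Question: {question}\n"
            f"Instructor's correct answer: {correct_answer}\n"
            f"Rubric:\n{rubric}\n"
            f"Student's answer: {student_answer}\n\n"
            f"Grade the student's answer out of {max_points} points, "
            'then state the grade as "Score: X". '
            "Let's think step by step.")  # the Zero-shot-CoT trigger phrase

def grade(question, correct_answer, rubric, student_answer, query_llm):
    reply = query_llm(build_grading_prompt(question, correct_answer,
                                           rubric, student_answer))
    # The reasoning chain precedes the final score line; extract the number.
    match = re.search(r"Score:\s*([0-9]+(?:\.[0-9]+)?)", reply)
    return float(match.group(1)) if match else None
```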

    Do not Mask Randomly: Effective Domain-adaptive Pre-training by Masking In-domain Keywords

    We propose a novel task-agnostic in-domain pre-training method that sits between generic pre-training and fine-tuning. Our approach selectively masks in-domain keywords, i.e., words that provide a compact representation of the target domain. We identify such keywords using KeyBERT (Grootendorst, 2020). We evaluate our approach using six different settings: three datasets combined with two distinct pre-trained language models (PLMs). Our results reveal that fine-tuned PLMs adapted with our in-domain pre-training strategy outperform both PLMs that used in-domain pre-training with random masking and those that followed the common pre-train-then-fine-tune paradigm. Further, the overhead of identifying in-domain keywords is reasonable, e.g., 7-15% of the pre-training time (for two epochs) for BERT Large (Devlin et al., 2019).
    Comment: final version: accepted at ACL'23 RepL4NLP. arXiv admin note: text overlap with arXiv:2208.1236
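
    A minimal sketch of the keyword-driven masking idea, assuming KeyBERT for keyword extraction (as named in the abstract) and a Hugging Face BERT tokenizer; matching keywords against whole wordpiece tokens is a simplification of the subword alignment a full implementation would need.

```python
# Sketch: mask in-domain keywords rather than random tokens before
# continued (in-domain) masked-language-model pre-training.
from keybert import KeyBERT
from transformers import AutoTokenizer

kw_model = KeyBERT()
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")

def domain_keywords(corpus, top_n=20):
    """Collect a compact set of in-domain keywords from sample documents."""
    keywords = set()
    for doc in corpus:
        for kw, _score in kw_model.extract_keywords(doc, top_n=top_n):
            keywords.add(kw.lower())
    return keywords

def mask_in_domain(text, keywords):
    """Replace in-domain keyword tokens with [MASK] instead of masking randomly."""
    tokens = tokenizer.tokenize(text.lower())
    masked = [tokenizer.mask_token if tok in keywords else tok for tok in tokens]
    return tokenizer.convert_tokens_to_string(masked)
```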