71 research outputs found

    Can LLMs Grade Short-answer Reading Comprehension Questions : Foundational Literacy Assessment in LMICs

    This paper presents emerging evidence of using generative large language models (i.e., GPT-4) to reliably evaluate short-answer reading comprehension questions. Specifically, we explore how various configurations of generative LLMs are able to evaluate student responses from a new dataset, drawn from a battery of reading assessments conducted with over 150 students in Ghana. As this dataset is novel and hence not used in training runs of GPT, it offers an opportunity to test for domain shift and evaluate the generalizability of generative LLMs, which are predominantly designed and trained on data from high-income North American countries. We found that GPT-4, with minimal prompt engineering, performed extremely well on evaluating the novel dataset (Quadratic Weighted Kappa 0.923, F1 0.88), substantially outperforming transfer-learning-based approaches, and even exceeding expert human raters (Quadratic Weighted Kappa 0.915, F1 0.87). To the best of our knowledge, our work is the first to empirically evaluate the performance of generative LLMs on short-answer reading comprehension questions, using real student data, and suggests that generative LLMs have the potential to reliably evaluate foundational literacy. Currently, the assessment of formative literacy and numeracy is infrequent in many low and middle-income countries (LMICs) due to the cost and operational complexities of conducting such assessments at scale. Automating the grading process for reading assessment could enable wider usage, and in turn improve decision-making regarding curricula, school management, and teaching practice at the classroom level. Importantly, in contrast to transfer-learning-based approaches, generative LLMs generalize well and the technical barriers to their use are low, making them more feasible to implement and scale in lower-resource educational contexts.

    Using State-of-the-Art Speech Models to Evaluate Oral Reading Fluency in Ghana

    This paper reports on a set of three recent experiments utilizing large-scale speech models to evaluate the oral reading fluency (ORF) of students in Ghana. While ORF is a well-established measure of foundational literacy, assessing it typically requires one-on-one sessions between a student and a trained evaluator, a process that is time-consuming and costly. Automating the evaluation of ORF could support better literacy instruction, particularly in education contexts where formative assessment is uncommon due to large class sizes and limited resources. To our knowledge, this research is among the first to examine the use of the most recent versions of large-scale speech models (Whisper V2, wav2vec2.0) for ORF assessment in the Global South. We find that Whisper V2 produces transcriptions of Ghanaian students reading aloud with a Word Error Rate of 13.5. This is close to the model's average WER on adult speech (12.8) and would have been considered state-of-the-art for children's speech transcription only a few years ago. We also find that when these transcriptions are used to produce fully automated ORF scores, they closely align with scores generated by expert human graders, with a correlation coefficient of 0.96. Importantly, these results were achieved on a representative dataset (i.e., students with regional accents, recordings taken in actual classrooms), using a free and publicly available speech model out of the box (i.e., no fine-tuning). This suggests that using large-scale speech models to assess ORF may be feasible to implement and scale in lower-resource, linguistically diverse educational contexts.

    Retrieval-augmented Generation to Improve Math Question-Answering: Trade-offs Between Groundedness and Human Preference

    For middle-school math students, interactive question-answering (QA) with tutors is an effective way to learn. The flexibility and emergent capabilities of generative large language models (LLMs) have led to a surge of interest in automating portions of the tutoring process - including interactive QA to support conceptual discussion of mathematical concepts. However, LLM responses to math questions can be incorrect or mismatched to the educational context - such as being misaligned with a school's curriculum. One potential solution is retrieval-augmented generation (RAG), which involves incorporating a vetted external knowledge source in the LLM prompt to increase response quality. In this paper, we designed prompts that retrieve and use content from a high-quality open-source math textbook to generate responses to real student questions. We evaluate the efficacy of this RAG system for middle-school algebra and geometry QA by administering a multi-condition survey, finding that humans prefer responses generated using RAG, but not when responses are too grounded in the textbook content. We argue that while RAG is able to improve response quality, designers of math QA systems must consider trade-offs between generating responses preferred by students and responses closely matched to specific educational resources. Comment: 6 pages, presented at NeurIPS'23 Workshop on Generative AI for Education (GAIED

    Contribution to the Diffuse Radio Background from Extragalactic Radio Sources

    We examine the brightness of the Cosmic Radio Background (CRB) by comparing the contribution from individual source counts to absolute measurements. We use a compilation of radio counts to estimate the contribution of detected sources to the CRB in several different frequency bands. We apply a Monte Carlo Markov Chain technique to estimate the brightness values and uncertainties, paying attention to various sources of systematic error. We compare our results to absolute measurements from the ARCADE 2 experiment. At ν = 150 MHz, 325 MHz, 408 MHz, 610 MHz, 1.4 GHz, 4.8 GHz, and 8.4 GHz our calculated contributions to the background sky temperature are 18, 2.8, 1.6, 0.71, 0.11, 0.0032, 0.0059 K, respectively. If the ARCADE 2 measurements are correct and come from sources, then there must be an additional population of radio galaxies, fainter than where current data are probing. More specifically, the Euclidean-normalized counts at 1.4 GHz have to have an additional bump below about 10 μJy. Comment: 9 pages, 7 figures, 3 tables, accepted MNRA

    Isotopic abundances of carbon and nitrogen in Jupiter-family and Oort Cloud comets

    The 12C14N/12C15N and 12C14N/13C14N isotopic ratios are determined for the first time in a Jupiter-family comet, 88P/1981 Q1 Howell, and in the chemically peculiar Oort Cloud comet C/1999 S4 (LINEAR). By comparing these measurements to previous ones derived for six other Oort Cloud comets (including one of Halley-type), we find that both the carbon and nitrogen isotopic ratios are constant within the uncertainties. The mean values are 12C/13C ~ 90 and 14N/15N ~ 145 for the eight comets. These results strengthen the view that CN radicals originate from refractory organics formed in the protosolar molecular cloud and subsequently incorporated in comets. Comment: Accepted for publication in A&A letter

    The experience of Community Programme, unemployment and employment : mental health and individual differences.

    This thesis explores some theoretical, conceptual, empirical and methodological issues concerning psychological research into unemployment. A review of the literature revealed some important limitations in the approach which has hitherto been taken to examine this phenomenon. Specific weaknesses included an undervaluation of the role of theory, a dearth of empirical research on intervention programmes or other responses to unemployment, as well as oversimplification, overgeneralisation, imprecision and unfalsifiability in the theoretical contributions which have been offered. Moreover, it was noted that there had been a lack of attention to dispositional factors in empirical research or theory, and inadequate (particularly undifferentiated) conceptualisation and operationalisation of mental health variables. The empirical part of the study, therefore, was developed as an initial exploration of (a) individual differences in the mental health of unemployed adults, and (b) the experience of participation on Community Programme (CP), a UK government intervention for long-term unemployed adults. A multi-method, multivariate design was used, adopting a theoretically grounded, guiding conceptual framework. Qualitative in-depth interviews (N = 60) were conducted with CP participants from two CP managing agencies. In addition, a large-scale cross-sectional quantitative survey (N = 484) was undertaken incorporating individuals who were (a) participating on CP, (b) employed, or (c) unemployed. The findings of the study demonstrated a number of relationships between personal characteristics (i.e. demographic and personality-related variables), intervening variables and dimensions of mental health. Some theoretical and empirical implications of these findings were discussed and directions for future empirical research and theoretical development were suggested.
    With respect to the experience of Community Programme, the findings suggested that within these two managing agencies, the content of the scheme (i.e. the nature of the work) was evaluated positively by the respondents, but that the context of the scheme and its temporary nature were perceived in a negative light. Some suggestions are made as to how these different aspects of the scheme impacted upon the mental health of the participants.