Automatic Essay Scoring (AES) is a well-established task in education that
employs machine learning to evaluate student-authored essays. While much effort
has been made in this area, current research primarily focuses on either (i)
boosting the predictive accuracy of an AES model for a specific prompt (i.e.,
developing prompt-specific models), which often relies heavily on labeled data
from the target prompt itself; or (ii) assessing the applicability of AES
models developed on non-target prompts to the target prompt (i.e., developing
AES models in a cross-prompt setting).
Given the inherent bias in machine learning and its potential impact on
marginalized groups, it is imperative to investigate whether such bias exists
in current AES methods and, if identified, how it interacts with an AES
model's accuracy and generalizability. Thus, our study aimed to uncover the
intricate relationship between an AES model's accuracy, fairness, and
generalizability, contributing practical insights for developing effective AES
models in real-world education. To this end, we meticulously selected nine
prominent AES methods and evaluated their performance using seven metrics on an
open-source dataset containing over 25,000 essays along with demographic
information about students, such as gender, English language learner status, and
economic status. Through extensive evaluations, we demonstrated that: (1)
prompt-specific models tend to outperform their cross-prompt counterparts in
terms of predictive accuracy; (2) prompt-specific models frequently exhibit
greater bias with respect to students' economic status than cross-prompt
models; and (3) in the pursuit of generalizability, traditional
machine learning models coupled with carefully engineered features hold greater
potential for achieving both high accuracy and fairness than complex neural
network models.
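
For illustration, the sketch below shows one way a per-group scoring accuracy and a simple fairness gap could be computed, assuming quadratic weighted kappa (QWK) as the accuracy metric and the max-min spread across demographic subgroups as the disparity measure. The column names and the grouping variable are hypothetical placeholders and do not reflect the study's actual dataset schema or its seven evaluation metrics.

```python
# Minimal sketch: per-group quadratic weighted kappa (QWK) and a simple
# fairness gap. Column names ("true_score", "pred_score", "economic_status")
# are illustrative placeholders, not the actual dataset schema.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

def group_qwk(df: pd.DataFrame, group_col: str) -> dict:
    """Compute QWK between true and predicted scores for each subgroup."""
    return {
        group: cohen_kappa_score(
            sub["true_score"], sub["pred_score"], weights="quadratic"
        )
        for group, sub in df.groupby(group_col)
    }

def fairness_gap(per_group: dict) -> float:
    """A simple disparity measure: the max-min spread of per-group QWK."""
    values = list(per_group.values())
    return max(values) - min(values)

# Toy usage example
df = pd.DataFrame({
    "true_score": [3, 4, 2, 5, 3, 4, 1, 2],
    "pred_score": [3, 3, 2, 4, 3, 4, 2, 2],
    "economic_status": ["low", "low", "low", "low",
                        "high", "high", "high", "high"],
})
per_group = group_qwk(df, "economic_status")
print(per_group, fairness_gap(per_group))
```

A small gap indicates that the model scores essays with similar reliability across subgroups; a large gap flags the kind of group-level disparity the study examines.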