Gated Convolutional Bidirectional Attention-based Model for Off-topic Spoken Response Detection
Off-topic spoken response detection, the task of predicting whether a response is off-topic for the corresponding prompt, is important for an automated speaking assessment system. In many real-world educational applications, off-topic spoken response detectors are required to achieve high recall for off-topic responses not only on seen prompts but also on prompts that are unseen during training. In this paper, we propose a novel approach for off-topic spoken response detection with high off-topic recall on both seen and unseen prompts. We introduce a new model, the Gated Convolutional Bidirectional Attention-based Model (GCBiA), which applies a bi-attention mechanism and convolutions to extract topic words of prompts and key phrases of responses, and introduces gated units and residual connections between major layers to better represent the relevance of responses and prompts. Moreover, a new negative sampling method is proposed to augment the training data. Experimental results demonstrate that our novel approach achieves significant improvements in detecting off-topic responses with extremely high on-topic recall, for both seen and unseen prompts. Comment: ACL 2020 long paper.
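As a rough, illustrative sketch of the bi-attention idea the abstract refers to (not the authors' GCBiA implementation), the following Python/PyTorch snippet computes prompt-to-response and response-to-prompt attention from a token-level similarity matrix; the tensor shapes and the 64-dimensional encodings are assumptions made purely for the example.

    # Hypothetical bi-attention sketch; all dimensions and data are invented.
    import torch
    import torch.nn.functional as F

    def bi_attention(prompt, response):
        """prompt: (Tp, d) and response: (Tr, d) token encodings."""
        scores = prompt @ response.T                 # (Tp, Tr) similarity matrix
        p2r = F.softmax(scores, dim=1) @ response    # each prompt token attends to the response
        r2p = F.softmax(scores, dim=0).T @ prompt    # each response token attends to the prompt
        return p2r, r2p

    prompt = torch.randn(12, 64)     # 12 prompt tokens, 64-dim encodings
    response = torch.randn(80, 64)   # 80 response tokens
    p2r, r2p = bi_attention(prompt, response)
    print(p2r.shape, r2p.shape)      # torch.Size([12, 64]) torch.Size([80, 64])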
Incorporating uncertainty into deep learning for spoken language assessment
There is a growing demand for automatic assessment of spoken English proficiency. These systems need to handle large variations in input data owing to the wide range of candidate skill levels and L1s, and errors from ASR. Some candidates will be a poor match to the training data set, undermining the validity of the predicted grade. For high-stakes tests it is essential for such systems not only to grade well, but also to provide a measure of the uncertainty in their predictions, enabling rejection to human graders. Previous work examined Gaussian Process (GP) graders which, though successful, do not scale well with large data sets. Deep Neural Networks (DNNs) may also be used to provide uncertainty using Monte-Carlo Dropout (MCD). This paper proposes a novel method to yield uncertainty and compares it to GPs and DNNs with MCD. The proposed approach explicitly teaches a DNN to have low uncertainty on training data and high uncertainty on generated artificial data. In experiments conducted on data from the Business Language Testing Service (BULATS), the proposed approach is found to outperform GPs and DNNs with MCD in uncertainty-based rejection whilst achieving comparable grading performance.
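For context, the Monte-Carlo Dropout baseline mentioned above can be pictured in a few lines: keep dropout active at test time, run the grader several times, and treat the spread of predictions as the uncertainty used for rejection. The small network, feature dimensionality, and rejection threshold below are illustrative assumptions, not the paper's system.

    # Minimal Monte-Carlo Dropout sketch for grade uncertainty (illustrative only).
    import torch
    import torch.nn as nn

    grader = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 1))

    def mcd_grade(features, n_samples=50):
        grader.train()                          # keep dropout active at test time
        with torch.no_grad():
            preds = torch.stack([grader(features) for _ in range(n_samples)])
        return preds.mean(0), preds.std(0)      # predicted grade, uncertainty estimate

    x = torch.randn(1, 32)                      # one candidate's feature vector (toy)
    grade, uncertainty = mcd_grade(x)
    if uncertainty.item() > 0.5:                # threshold chosen purely for illustration
        print("reject to a human grader")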
Impact of ASR performance on free speaking language assessment
In free speaking tests candidates respond in spontaneous speech to prompts. This form of test allows the spoken language proficiency of a non-native speaker of English to be assessed more fully than read-aloud tests. As the candidate's responses are unscripted, transcription by automatic speech recognition (ASR) is essential for automated assessment. ASR will never be 100% accurate, so any assessment system must seek to minimise and mitigate ASR errors. This paper considers the impact of ASR errors on the performance of free speaking test auto-marking systems. Firstly, rich linguistically related features, based on part-of-speech tags from statistical parse trees, are investigated for assessment. Then, the impact of ASR errors on how well the system can detect whether a learner's answer is relevant to the question asked is evaluated. Finally, the impact that these errors may have on the ability of the system to provide detailed feedback to the learner is analysed. In particular, pronunciation and grammatical errors are considered as these are important in helping a learner to make progress. As feedback resulting from an ASR error would be highly confusing, an approach to mitigate this problem using confidence scores is also analysed.
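One simple way to picture the confidence-score mitigation discussed at the end of the abstract: only raise feedback on words the recogniser is confident about, so ASR errors are less likely to be reported back to the learner as their own mistakes. The word/confidence values and the 0.8 threshold below are invented for illustration and are not from the paper.

    # Illustrative confidence-score filter for learner feedback (toy values).
    hypothesis = [("the", 0.97), ("weather", 0.91), ("are", 0.42), ("nice", 0.88)]

    def feedback_candidates(words, min_conf=0.8):
        # Only flag potential learner errors on words the ASR is confident about.
        return [w for w, conf in words if conf >= min_conf]

    print(feedback_candidates(hypothesis))      # ['the', 'weather', 'nice']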
An attention based model for off-topic spontaneous spoken response detection: An Initial Study
Automatic spoken language assessment systems are gaining popularity due to the rising demand for English second language learning. Current systems primarily assess fluency and pronunciation, rather than semantic content and relevance of a candidate's response to a prompt. However, to increase reliability and robustness, relevance assessment and off-topic response detection are desirable, particularly for spontaneous spoken responses to open-ended prompts. Previously proposed approaches usually require prompt-response pairs for all prompts. This limits flexibility as example responses are required whenever a new test prompt is introduced.
This paper presents an initial study of an attention-based neural model which assesses the relevance of prompt-response pairs without the need to see them in training. This model uses a bidirectional Recurrent Neural Network (BiRNN) embedding of the prompt to compute attention over the hidden states of a BiRNN embedding of the response. The resulting fixed-length embedding is fed into a binary classifier to predict the relevance of the response. Due to a lack of off-topic responses, negative examples for both training and evaluation are created by randomly shuffling prompts and responses. On spontaneous spoken data this system is able to assess relevance to both seen and unseen prompts.
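The negative-example construction described above (randomly pairing prompts with responses from other prompts) is simple to reproduce; a minimal sketch follows, with toy prompt-response pairs and a derangement check that are assumptions for the example rather than details from the paper.

    # Create off-topic (negative) examples by shuffling responses across prompts.
    import random

    pairs = [("Describe your last holiday.", "I went to the coast last summer ..."),
             ("Talk about your job.",        "I work as a nurse in a busy clinic ..."),
             ("What do you do at weekends?", "Mostly I play football with friends ...")]

    def make_negatives(pairs, seed=0):
        rng = random.Random(seed)
        prompts, responses = zip(*pairs)
        shuffled = list(responses)
        while True:
            rng.shuffle(shuffled)
            # make sure no response stays paired with its original prompt
            if all(s != r for s, r in zip(shuffled, responses)):
                break
        return [(p, r, 0) for p, r in zip(prompts, shuffled)]   # label 0 = off-topic

    negatives = make_negatives(pairs)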
Off-topic response detection for spontaneous spoken English assessment
Automatic spoken language assessment systems are becoming increasingly important to meet the demand for English second language learning. This is a challenging task due to the high error rates of non-native speech recognition, even with state-of-the-art systems. Consequently, current systems primarily assess fluency and pronunciation. However, content assessment is essential for full automation. As a first stage, it is important to judge whether the speaker responds on topic to test questions designed to elicit spontaneous speech. Standard approaches to off-topic response detection assess similarity between the response and question based on bag-of-words representations. An alternative framework based on Recurrent Neural Network Language Models (RNNLM) is proposed in this paper. The RNNLM is adapted to the topic of each test question. It learns to associate example responses to questions with points in a topic space constructed using these example responses. Classification is done by ranking the topic-conditional posterior probabilities of a response. The RNNLMs associate a broad range of responses with each topic, incorporate sequence information, and scale better with additional training data, unlike standard methods. In experiments conducted on data from the Business Language Testing Service (BULATS), this approach outperforms standard approaches.
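A compact way to see the classification-by-ranking step: score the response under each topic-adapted language model and check where the asked question's topic falls in the ranking. In the sketch below, the per-topic scores stand in for the topic-conditional RNNLM log-probabilities the paper computes; the topic names, score values, and rank threshold are illustrative assumptions.

    # Rank topic-conditional scores to decide whether a response is on topic.
    def is_on_topic(topic_scores, asked_topic, rank_threshold=1):
        # topic_scores: {topic_id: log P(response | topic)} from topic-adapted LMs
        ranked = sorted(topic_scores, key=topic_scores.get, reverse=True)
        return ranked.index(asked_topic) < rank_threshold

    scores = {"holidays": -210.4, "work": -254.9, "weekends": -248.1}   # toy values
    print(is_on_topic(scores, asked_topic="holidays"))   # True  -> on topic
    print(is_on_topic(scores, asked_topic="work"))       # False -> off topic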
Complementary systems for Off-Topic spoken response detection
Increased demand to learn English for business and education has led to growing interest in automatic spoken language assessment and teaching systems. With this shift to automated approaches, it is important that systems reliably assess all aspects of a candidate's responses. This paper examines one form of spoken language assessment: whether the response from the candidate is relevant to the prompt provided. This will be referred to as off-topic spoken response detection. Two forms of previously proposed approaches are examined in this work: the hierarchical attention-based topic model (HATM) and the similarity grid model (SGM). The work focuses on the scenario when the prompt, and associated responses, have not been seen in the training data, enabling the system to be applied to new test scripts without the need to collect data or retrain the model. To improve the performance of the systems for unseen prompts, data augmentation based on easy data augmentation (EDA) and translation-based approaches are applied. Additionally, for the HATM, a form of prompt dropout is described. The systems were evaluated on both seen and unseen prompts from Linguaskill Business and General English tests. For unseen data, the performance of the HATM was improved using data augmentation, in contrast to the SGM where no gains were obtained. The two approaches were found to be complementary to one another, yielding a combined F(0.5) score of 0.814 for off-topic response detection where the prompts have not been seen in training.
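Since the combined result is reported as an F(0.5) score, a small helper for that metric may be useful; F(0.5) weights precision more heavily than recall. The counts passed in below are invented for illustration and are unrelated to the paper's 0.814 result.

    # F-beta score with beta = 0.5 (precision-weighted), computed from raw counts.
    def f_beta(tp, fp, fn, beta=0.5):
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    print(round(f_beta(tp=80, fp=15, fn=25), 3))   # example counts only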
Automatic Essay Scoring Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses
Deep-learning based Automatic Essay Scoring (AES) systems are being actively used in various high-stakes applications in education and testing. However, little research has been devoted to understanding and interpreting the black-box nature of deep-learning-based scoring algorithms. While previous studies indicate that scoring models can be easily fooled, in this paper, we explore the reason behind their surprising adversarial brittleness. We utilize recent advances in interpretability to find the extent to which features such as coherence, content, vocabulary, and relevance are important for automated scoring mechanisms. We use this to investigate the oversensitivity (i.e., large change in output score with a little change in input essay content) and overstability (i.e., little change in output scores with large changes in input essay content) of AES. Our results indicate that autoscoring models, despite getting trained as “end-to-end” models with rich contextual embeddings such as BERT, behave like bag-of-words models. A few words determine the essay score without the requirement of any context, making the model largely overstable. This is in stark contrast to recent probing studies on pre-trained representation learning models, which show that rich linguistic features such as parts-of-speech and morphology are encoded by them. Further, we also find that the models have learnt dataset biases, making them oversensitive. The presence of a few words with high co-occurrence with a certain score class makes the model associate the essay sample with that score. This causes score changes in ∼95% of samples with an addition of only a few words. To deal with these issues, we propose detection-based protection models that can detect oversensitivity and samples causing overstability with high accuracy. We find that our proposed models are able to detect unusual attribution patterns and flag adversarial samples successfully.
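The oversensitivity probe described above (a large score change from adding just a few words) is easy to mimic with a toy scorer. In the sketch below, score_essay is a stand-in for a bag-of-words-style model and the three "trigger" words are invented; a real probe would query the actual AES model with high-attribution tokens instead.

    # Illustrative oversensitivity probe: append a few trigger words, measure the score shift.
    def score_essay(text):
        # placeholder scorer: rewards "advanced" connectives to mimic a bag-of-words bias
        triggers = {"moreover", "consequently", "nevertheless"}
        return 2.0 + sum(word.lower().strip(".,") in triggers for word in text.split())

    essay = "The school trip was fun. We saw many animals at the zoo."
    attacked = essay + " Moreover, consequently, nevertheless."

    delta = score_essay(attacked) - score_essay(essay)
    print(f"score change after adding 3 words: {delta:+.1f}")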