273,463 research outputs found
On the Value of Out-of-Distribution Testing: An Example of Goodhart's Law
Out-of-distribution (OOD) testing is increasingly popular for evaluating a
machine learning system's ability to generalize beyond the biases of a training
set. OOD benchmarks are designed to present a different joint distribution of
data and labels between training and test time. VQA-CP has become the standard
OOD benchmark for visual question answering, but we discovered three troubling
practices in its current use. First, most published methods rely on explicit
knowledge of the construction of the OOD splits, often exploiting it by
"inverting" the distribution of labels, e.g. answering mostly 'yes' when the
common training answer is 'no'. Second, the OOD test set is used for model
selection. Third, a model's in-domain performance is assessed after retraining
it on in-domain splits (VQA v2) that exhibit a more balanced distribution of
labels. These three practices defeat the objective of evaluating
generalization and call into question the value of methods specifically
designed for this dataset. We show that embarrassingly simple methods,
including one that generates answers at random, surpass the state of the art on
some question types. We provide short- and long-term solutions to avoid these
pitfalls and realize the benefits of OOD evaluation.
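The abstract does not detail how its random baseline works; below is a minimal sketch of one plausible variant, assuming the baseline answers each test question with a uniform draw from the answers observed for that question type in training. The data layout and the field names ("qtype", "answer") are illustrative assumptions, not the authors' exact procedure.

```python
import random
from collections import defaultdict

def build_answer_pools(train_examples):
    """Collect the answers seen for each question type in the training set."""
    pools = defaultdict(set)
    for ex in train_examples:  # each ex is e.g. {"qtype": "yes/no", "answer": "no"}
        pools[ex["qtype"]].add(ex["answer"])
    return {qtype: sorted(answers) for qtype, answers in pools.items()}

def random_answer_baseline(test_examples, pools, seed=0):
    """Answer each test question with a uniform draw from its type's pool."""
    rng = random.Random(seed)
    return [rng.choice(pools[ex["qtype"]]) for ex in test_examples]

# Toy usage: under an OOD split with an inverted label distribution, even
# this baseline can look competitive on some question types.
train = [{"qtype": "yes/no", "answer": "no"}, {"qtype": "yes/no", "answer": "yes"}]
print(random_answer_baseline([{"qtype": "yes/no"}], build_answer_pools(train)))
```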
Is Summary Useful or Not? An Extrinsic Human Evaluation of Text Summaries on Downstream Tasks
Research on automated text summarization relies heavily on human and
automatic evaluation. While recent work on human evaluation has mainly adopted
intrinsic methods that judge the generic quality of text summaries,
e.g. informativeness and coherence, our work focuses on evaluating the
usefulness of text summaries with extrinsic methods. We carefully design three
different downstream tasks for extrinsic human evaluation of summaries, i.e.,
question answering, text classification and text similarity assessment. We
carry out experiments using system rankings and user behavior data to evaluate
the performance of different summarization models. We find that summaries are
particularly useful in tasks that rely on an overall judgment of the text,
while being less effective for question answering tasks. The results show that
summaries generated by fine-tuned models lead to higher consistency in
usefulness across all three tasks, as rankings of fine-tuned summarization
systems are close across downstream tasks according to the proposed extrinsic
metrics. Summaries generated by models in the zero-shot setting, however, are
found to be biased towards the text classification and similarity assessment
tasks, due to their general and less detailed summary style. We further evaluate
the correlation of 14 intrinsic automatic metrics with human criteria and show
that intrinsic automatic metrics perform well in evaluating the usefulness of
summaries in the question-answering task, but are less effective in the other
two tasks. This highlights the limitations of relying solely on intrinsic
automatic metrics in evaluating the performance and usefulness of summaries.
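The abstract does not name the correlation measure used for the 14 intrinsic metrics; here is a minimal sketch of one standard choice, Spearman's rank correlation between an automatic metric's scores and human usefulness judgments. The function name and the toy numbers are assumptions for illustration.

```python
from scipy.stats import spearmanr

def metric_human_correlation(metric_scores, human_scores):
    """Rank-correlate one automatic metric with human usefulness judgments.

    metric_scores, human_scores: parallel lists with one score per summary.
    Returns Spearman's rho and its p-value.
    """
    rho, p_value = spearmanr(metric_scores, human_scores)
    return rho, p_value

# Toy usage: one (rho, p) pair per intrinsic metric and downstream task.
rho, p = metric_human_correlation([0.42, 0.35, 0.51, 0.28],
                                  [4.0, 3.5, 4.5, 2.0])
```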
WiSeBE: Window-based Sentence Boundary Evaluation
Sentence Boundary Detection (SBD) has been a major research topic since
Automatic Speech Recognition transcripts began to be used for downstream
Natural Language Processing tasks like Part of Speech Tagging, Question
Answering or Automatic Summarization. But what about evaluation? Do standard
metrics like precision, recall, F-score or classification error, and, more
importantly, evaluation against a single reference, suffice to conclude how
well an SBD system performs given the final application of the transcript? In
this paper we propose Window-based Sentence Boundary
Evaluation (WiSeBE), a semi-supervised metric for evaluating Sentence Boundary
Detection systems based on multi-reference (dis)agreement. We evaluate and
compare the performance of different SBD systems on a set of YouTube
transcripts using WiSeBE and standard metrics. This dual evaluation shows that
WiSeBE is a more reliable metric for the SBD task.
Comment: In proceedings of the 17th Mexican International Conference on
Artificial Intelligence (MICAI), 2018
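The abstract does not give WiSeBE's exact formulation; below is a hypothetical sketch of the general idea behind window-based, multi-reference boundary evaluation: a reference position counts only when enough annotators place a boundary near it, and a prediction scores a hit when it falls within a tolerance window of such an agreed position. The function name, the window and min_agree parameters, and the precision/recall form are assumptions, not the published metric.

```python
def windowed_multi_ref_eval(pred_boundaries, reference_sets, window=2, min_agree=2):
    """Hypothetical window-based, multi-reference boundary evaluation.

    pred_boundaries: predicted boundary positions (token indices).
    reference_sets: one collection of boundary positions per annotator.
    """
    # Keep reference positions on which at least `min_agree` annotators
    # (approximately) agree, i.e. place a boundary within `window` tokens.
    agreed = set()
    for refs in reference_sets:
        for pos in refs:
            support = sum(
                any(abs(pos - r) <= window for r in other)
                for other in reference_sets
            )
            if support >= min_agree:
                agreed.add(pos)

    # A prediction is a hit if it lands within `window` tokens of an agreed
    # position; recall mirrors this from the agreed positions' side.
    precision = (
        sum(any(abs(p - a) <= window for a in agreed) for p in pred_boundaries)
        / len(pred_boundaries) if pred_boundaries else 0.0
    )
    recall = (
        sum(any(abs(a - p) <= window for p in pred_boundaries) for a in agreed)
        / len(agreed) if agreed else 0.0
    )
    return precision, recall

# Toy usage: three annotators, one system's predicted boundaries.
refs = [{10, 25, 40}, {11, 26, 40}, {10, 39}]
print(windowed_multi_ref_eval([9, 27, 41], refs))
```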