79,818 research outputs found
On the Value of Out-of-Distribution Testing: An Example of Goodhart's Law
Out-of-distribution (OOD) testing is increasingly popular for evaluating a
machine learning system's ability to generalize beyond the biases of a training
set. OOD benchmarks are designed to present a different joint distribution of
data and labels between training and test time. VQA-CP has become the standard
OOD benchmark for visual question answering, but we discovered three troubling
practices in its current use. First, most published methods rely on explicit
knowledge of the construction of the OOD splits. They often rely on
``inverting'' the distribution of labels, e.g. answering mostly 'yes' when the
common training answer is 'no'. Second, the OOD test set is used for model
selection. Third, a model's in-domain performance is assessed after retraining
it on in-domain splits (VQA v2) that exhibit a more balanced distribution of
labels. These three practices defeat the objective of evaluating
generalization, and put into question the value of methods specifically
designed for this dataset. We show that embarrassingly-simple methods,
including one that generates answers at random, surpass the state of the art on
some question types. We provide short- and long-term solutions to avoid these
pitfalls and realize the benefits of OOD evaluation
Semantic bottleneck for computer vision tasks
This paper introduces a novel method for the representation of images that is
semantic by nature, addressing the question of computation intelligibility in
computer vision tasks. More specifically, our proposition is to introduce what
we call a semantic bottleneck in the processing pipeline, which is a crossing
point in which the representation of the image is entirely expressed with
natural language , while retaining the efficiency of numerical representations.
We show that our approach is able to generate semantic representations that
give state-of-the-art results on semantic content-based image retrieval and
also perform very well on image classification tasks. Intelligibility is
evaluated through user centered experiments for failure detection
Interpretation of Natural Language Rules in Conversational Machine Reading
Most work in machine reading focuses on question answering problems where the
answer is directly expressed in the text to read. However, many real-world
question answering problems require the reading of text not because it contains
the literal answer, but because it contains a recipe to derive an answer
together with the reader's background knowledge. One example is the task of
interpreting regulations to answer "Can I...?" or "Do I have to...?" questions
such as "I am working in Canada. Do I have to carry on paying UK National
Insurance?" after reading a UK government website about this topic. This task
requires both the interpretation of rules and the application of background
knowledge. It is further complicated due to the fact that, in practice, most
questions are underspecified, and a human assistant will regularly have to ask
clarification questions such as "How long have you been working abroad?" when
the answer cannot be directly derived from the question and text. In this
paper, we formalise this task and develop a crowd-sourcing strategy to collect
32k task instances based on real-world rules and crowd-generated questions and
scenarios. We analyse the challenges of this task and assess its difficulty by
evaluating the performance of rule-based and machine-learning baselines. We
observe promising results when no background knowledge is necessary, and
substantial room for improvement whenever background knowledge is needed.Comment: EMNLP 201
Detecting missing content queries in an SMS-Based HIV/AIDS FAQ retrieval system
Automated Frequently Asked Question (FAQ) answering systems use pre-stored sets of question-answer pairs as an information source to answer natural language questions posed by the users. The main problem with this kind of information source is that there is no guarantee that there will be a relevant question-answer pair for all user queries. In this paper, we propose to deploy a binary classifier in an existing SMS-Based HIV/AIDS FAQ retrieval system to detect user queries that do not have the relevant question-answer pair in the FAQ document collection. Before deploying such a classifier, we first evaluate different feature sets for training in order to determine the sets of features that can build a model that yields the best classification accuracy. We carry out our evaluation using seven different feature sets generated from a query log before and after retrieval by the FAQ retrieval system. Our results suggest that, combining different feature sets markedly improves the classification accuracy
- …