Comparing Humans and Models on a Similar Scale: Towards Cognitive Gender Bias Evaluation in Coreference Resolution
Spurious correlations have been found to be an important factor in model
performance on various NLP tasks (e.g., gender or racial artifacts), and are
often considered "shortcuts" around the actual task. However, humans similarly
tend to make quick (and sometimes wrong) predictions based on societal and
cognitive presuppositions. In this work we address the question: can we
quantify the extent to which model biases reflect human behaviour? Answering
this question will help shed light on model performance and provide meaningful
comparisons against humans. We approach this question through the lens of the
dual-process theory for human decision-making. This theory differentiates
between an automatic, unconscious (and sometimes biased) "fast system" and a
"slow system" that, when triggered, may revisit earlier automatic reactions.
We make several observations from two crowdsourcing experiments on gender bias
in coreference resolution, using self-paced reading to study the "fast"
system, and question answering under a constrained time setting to study the
"slow" system. On real-world data, humans make 3% more gender-biased
decisions than models, while on synthetic data, models are 12% more biased.
Crowdsourcing Question-Answer Meaning Representations
We introduce Question-Answer Meaning Representations (QAMRs), which represent
the predicate-argument structure of a sentence as a set of question-answer
pairs. We also develop a crowdsourcing scheme to show that QAMRs can be labeled
with very little training, and gather a dataset with over 5,000 sentences and
100,000 questions. A detailed qualitative analysis demonstrates that the
crowd-generated question-answer pairs cover the vast majority of
predicate-argument relationships in existing datasets (including PropBank,
NomBank, QA-SRL, and AMR) along with many previously under-resourced ones,
including implicit arguments and relations. The QAMR data and annotation code
are made publicly available to enable future work on how best to model these
complex phenomena.
Comment: 8 pages, 6 figures, 2 tables
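The abstract's idea of encoding predicate-argument structure as question-answer pairs can be illustrated with a minimal sketch. The record layout and field names below are assumptions for illustration, not the dataset's actual schema:

```python
# Hypothetical sketch of a QAMR-style record: predicate-argument structure
# expressed as question-answer pairs over a sentence (field names assumed).
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str  # wh-question targeting one argument of a predicate
    answer: str    # answer span copied from the sentence

sentence = "The doctor asked the nurse to help her in the operation"
qamr = [
    QAPair("Who asked someone for help?", "The doctor"),
    QAPair("Who was asked to help?", "the nurse"),
    QAPair("Who should help someone?", "the nurse"),
]

# Each answer is a contiguous span of the source sentence, which is what
# lets non-expert crowd workers produce them with very little training.
assert all(pair.answer in sentence for pair in qamr)
```

Because both questions and answers are plain natural-language strings, the scheme avoids a fixed role inventory, which is how it can cover relations (e.g., implicit arguments) that frame-based schemes leave out.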
Evaluating Gender Bias in Machine Translation
We present the first challenge set and evaluation protocol for the analysis
of gender bias in machine translation (MT). Our approach uses two recent
coreference resolution datasets composed of English sentences which cast
participants into non-stereotypical gender roles (e.g., "The doctor asked the
nurse to help her in the operation"). We devise an automatic gender bias
evaluation method for eight target languages with grammatical gender, based on
morphological analysis (e.g., the use of female inflection for the word
"doctor"). Our analyses show that four popular industrial MT systems and two
recent state-of-the-art academic MT models are significantly prone to
gender-biased translation errors for all tested target languages. Our data and
code are made publicly available.
Comment: Accepted to ACL 201
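The morphological evaluation the abstract describes can be sketched as follows. The toy lexicon and function name are assumptions for illustration; the actual method relies on a morphological analyzer for each target language:

```python
# Minimal sketch of the evaluation idea: check whether a translated noun's
# grammatical gender inflection matches the gender expected from the source
# coreference. The tiny Spanish lexicon below is an assumption for
# illustration, standing in for a real morphological analyzer.
GENDER_LEXICON = {
    "doctor": "male", "doctora": "female",
    "enfermero": "male", "enfermera": "female",
}

def is_biased(translated_word: str, expected_gender: str) -> bool:
    """Flag a translation whose inflection contradicts the expected gender."""
    observed = GENDER_LEXICON.get(translated_word.lower())
    return observed is not None and observed != expected_gender

# "The doctor asked the nurse to help her": "her" marks the doctor as female.
assert is_biased("doctor", "female")       # male inflection: biased
assert not is_biased("doctora", "female")  # female inflection: correct
```

Aggregating such per-word checks over a challenge set of non-stereotypical sentences yields an automatic bias score per MT system and target language.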
Evaluating and Improving the Coreference Capabilities of Machine Translation Models
Machine translation (MT) requires a wide range of linguistic capabilities,
which current end-to-end models are expected to learn implicitly by observing
aligned sentences in bilingual corpora. In this work, we ask: how well do
MT models learn coreference resolution from implicit signal? To answer this
question, we develop an evaluation methodology that derives coreference
clusters from MT output and evaluates them without requiring annotations in the
target language. We further evaluate several prominent open-source and
commercial MT systems, translating from English to six target languages, and
compare them to state-of-the-art coreference resolvers on three challenging
benchmarks. Our results show that the monolingual resolvers greatly outperform
MT models. Motivated by this result, we experiment with different methods for
incorporating the output of coreference resolution models in MT, showing
improvement over strong baselines.
Comment: EACL paper
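One simple way to compare coreference clusters from two systems, in the spirit of the evaluation described above, is by the pairwise mention links they imply. This is an illustrative sketch (a link-based F1, not necessarily the paper's exact metric); the function names are assumptions:

```python
# Illustrative sketch: compare two sets of coreference clusters by the
# pairwise coreferent mention links they imply (a link-based F1 score).
from itertools import combinations

def mention_links(clusters):
    """All coreferent mention pairs implied by a set of clusters."""
    links = set()
    for cluster in clusters:
        links.update(frozenset(pair) for pair in combinations(cluster, 2))
    return links

def link_f1(gold_clusters, pred_clusters):
    """Harmonic mean of link precision and recall; 0.0 if either side is empty."""
    gold, pred = mention_links(gold_clusters), mention_links(pred_clusters)
    if not gold or not pred:
        return 0.0
    precision = len(gold & pred) / len(pred)
    recall = len(gold & pred) / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [["the doctor", "her"], ["the nurse"]]
pred = [["the doctor"], ["the nurse", "her"]]  # pronoun attached to wrong entity
assert link_f1(gold, gold) == 1.0
assert link_f1(gold, pred) == 0.0
```

A gap between an MT system's implied clusters and a monolingual resolver's, measured this way, quantifies how much coreference signal the MT model failed to pick up implicitly.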