Getting Gender Right in Neural Machine Translation
Speakers of different languages must attend to and encode strikingly different aspects of the world in order to use their language correctly (Sapir, 1921; Slobin, 1996). One such difference relates to the way gender is expressed in a language. Saying "I am happy" in English does not encode any additional knowledge about the speaker who uttered the sentence. Many other languages, however, have grammatical gender systems, and in them such knowledge would be encoded. To translate such a sentence correctly into, say, French, the inherent gender information needs to be retained or recovered: the same sentence becomes either "Je suis heureux" for a male speaker or "Je suis heureuse" for a female one. Apart from morphological agreement, demographic factors (gender, age, etc.) also influence our use of language in terms of word choices or even syntactic constructions (Tannen, 1991; Pennebaker et al., 2003). We integrate gender information into NMT systems. Our contribution is twofold: (1) the compilation of large datasets with speaker information for 20 language pairs, and (2) a simple set of experiments that incorporate gender information into NMT for multiple language pairs. Our experiments show that adding a gender feature to an NMT system significantly improves translation quality for some language pairs.
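One common way to supply such a speaker-gender feature to a sequence-to-sequence system is to prepend a tag token to each source sentence at training and inference time. The sketch below illustrates that idea; the tag names and the helper function are assumptions for illustration, not necessarily the paper's exact mechanism.

```python
def add_gender_tag(source_sentence: str, gender: str) -> str:
    """Prepend a speaker-gender token (e.g. <F> or <M>) to the source sentence.

    Unknown or missing gender falls back to a neutral <UNK> tag so the
    model can still be applied when no speaker metadata is available.
    """
    tag = {"female": "<F>", "male": "<M>"}.get(gender, "<UNK>")
    return f"{tag} {source_sentence}"

print(add_gender_tag("I am happy", "female"))  # <F> I am happy
```

The tag tokens would be added to the source vocabulary, letting the decoder condition on them when choosing gender-marked target forms such as "heureux" vs. "heureuse".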
Joint translation and unit conversion for end-to-end localization
A variety of natural language tasks require processing of textual data which
contains a mix of natural language and formal languages such as mathematical
expressions. In this paper, we take unit conversions as an example and propose
a data augmentation technique which leads to models learning both the translation and the conversion task, as well as how to adequately switch between them, for end-to-end localization.
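Such augmentation can be pictured as synthesising parallel sentence pairs in which a measurement is simultaneously translated and converted to the target locale's units, so the model observes both tasks in a single example. The generator below is a hypothetical sketch of that idea (the sentence templates and the English-French pair are illustrative assumptions).

```python
MILES_TO_KM = 1.609344  # exact length of the international mile in km

def unit_conversion_pair(miles: int) -> tuple[str, str]:
    """Build one synthetic (source, target) training pair in which the
    target is both translated into French and converted to kilometres."""
    km = miles * MILES_TO_KM
    src = f"The town is {miles} miles away."
    tgt = f"La ville est à {km:.1f} km."
    return src, tgt

print(unit_conversion_pair(10))
# ('The town is 10 miles away.', 'La ville est à 16.1 km.')
```

Varying the quantities, unit types, and templates yields augmented data from which a model can learn when to translate text and when to hand off to arithmetic conversion.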
Disembodied Machine Learning: On the Illusion of Objectivity in NLP
Machine Learning seeks to identify and encode bodies of knowledge within
provided datasets. However, data encodes subjective content, which determines
the possible outcomes of the models trained on it. Because such subjectivity
enables marginalisation of parts of society, it is termed (social) `bias' and
sought to be removed. In this paper, we contextualise this discourse of bias in
the ML community against the subjective choices in the development process.
Through a consideration of how choices in data and model development construct
subjectivity, or biases that are represented in a model, we argue that
addressing and mitigating biases is near-impossible. This is because both data and ML models are objects for which meaning is made at each step of the development pipeline, from data selection through annotation to model training and analysis. Accordingly, we find the prevalent discourse of bias limited in its ability to address social marginalisation. We recommend being conscientious of this, and accepting that de-biasing methods correct for only a fraction of biases.
Comment: In review
Gender Representation in Open Source Speech Resources
With the rise of artificial intelligence (AI) and the growing use of
deep-learning architectures, the question of ethics, transparency and fairness
of AI systems has become a central concern within the research community. We
address transparency and fairness in spoken language systems by proposing a
study about gender representation in speech resources available through the
Open Speech and Language Resource platform. We show that finding gender
information in open source corpora is not straightforward and that gender
balance depends on other corpus characteristics (elicited/non-elicited speech, low- or high-resource language, targeted speech task). The paper ends with recommendations about metadata and gender information to help researchers ensure better transparency of the speech systems built using such corpora.
Comment: accepted to LREC202
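A gender-representation audit of this kind ultimately reduces to tallying speaker-gender labels across a corpus's metadata, including the utterances for which no label can be recovered. The helper below is a minimal sketch assuming each utterance is a record with an optional "gender" field; the field name and label scheme are assumptions, not the OpenSLR corpora's actual metadata format.

```python
from collections import Counter

def gender_balance(records: list[dict]) -> dict[str, float]:
    """Return the fraction of utterances per speaker-gender label.

    Records lacking gender metadata are counted under 'unknown', so the
    result also surfaces how much of the corpus is undocumented.
    """
    counts = Counter(r.get("gender", "unknown") for r in records)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

corpus = [{"gender": "F"}, {"gender": "F"}, {"gender": "M"}, {}]
print(gender_balance(corpus))  # {'F': 0.5, 'M': 0.25, 'unknown': 0.25}
```

Reporting the 'unknown' share alongside the female/male split directly supports the paper's recommendation that corpus metadata make gender information explicit.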