A Resource for Computational Experiments on Mapudungun
We present a resource for computational experiments on Mapudungun, a
polysynthetic indigenous language spoken in Chile with upwards of 200,000
speakers. We provide 142 hours of culturally significant conversations in the
domain of medical treatment. The conversations are fully transcribed and
translated into Spanish. The transcriptions also include annotations for
code-switching and non-standard pronunciations. We also provide baseline
results on three core NLP tasks: speech recognition, speech synthesis, and
machine translation between Spanish and Mapudungun. We further explore other
applications for which the corpus will be suitable, including the study of
code-switching, historical orthography change, linguistic structure, and
sociological and anthropological studies.
Comment: accepted at LREC 202
Predicting Performance for Natural Language Processing Tasks
Given the complexity of combinations of tasks, languages, and domains in
natural language processing (NLP) research, it is computationally prohibitive
to exhaustively test newly proposed models on each possible experimental
setting. In this work, we attempt to explore the possibility of gaining
plausible judgments of how well an NLP model can perform under an experimental
setting, without actually training or testing the model. To do so, we build
regression models to predict the evaluation score of an NLP experiment given
the experimental settings as input. Experimenting on 9 different NLP tasks, we
find that our predictors can produce meaningful predictions over unseen
languages and different modeling architectures, outperforming reasonable
baselines as well as human experts. Going further, we outline how our predictor
can be used to find a small subset of representative experiments that should be
run in order to obtain plausible predictions for all other experimental
settings.
Comment: Accepted at ACL'2
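
The core idea of the second abstract — fitting a regression model that maps experimental settings to an evaluation score, then querying it for unseen settings — can be sketched in a few lines. The sketch below is illustrative only: the features (log dataset size, a related-language indicator), the scores, and the simple linear model are all invented here for clarity; the paper's actual predictors and feature set are richer.

```python
import math

# Toy "experimental settings": (log dataset size, related-language flag)
# paired with a BLEU-like evaluation score. All numbers are invented.
experiments = [
    ((math.log(10_000), 1.0), 21.0),
    ((math.log(50_000), 1.0), 27.5),
    ((math.log(10_000), 0.0), 14.0),
    ((math.log(100_000), 0.0), 24.0),
    ((math.log(200_000), 1.0), 33.0),
]

def fit_linear(data):
    """Least-squares fit of score ~ w0 + w1*x1 + w2*x2 via normal equations."""
    n = 3  # bias + 2 features
    xtx = [[0.0] * n for _ in range(n)]
    xty = [0.0] * n
    for (x1, x2), y in data:
        row = [1.0, x1, x2]
        for i in range(n):
            xty[i] += row[i] * y
            for j in range(n):
                xtx[i][j] += row[i] * row[j]
    # Solve the 3x3 system by Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[pivot] = xtx[pivot], xtx[col]
        xty[col], xty[pivot] = xty[pivot], xty[col]
        for r in range(n):
            if r != col:
                f = xtx[r][col] / xtx[col][col]
                for c in range(n):
                    xtx[r][c] -= f * xtx[col][c]
                xty[r] -= f * xty[col]
    return [xty[i] / xtx[i][i] for i in range(n)]

w = fit_linear(experiments)

def predict(log_size, related):
    """Predicted score for a setting that was never actually run."""
    return w[0] + w[1] * log_size + w[2] * related

# Query the predictor for an unseen setting: 75k examples, related language.
estimate = predict(math.log(75_000), 1.0)
```

Once such a predictor is fit, the "representative subset" idea from the abstract amounts to choosing the few settings whose observed scores most improve predictions for all the rest.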