The educational attainment, labour market participation and living conditions of young Roma in Bulgaria, Hungary and Romania
This paper investigates the educational attainment, labour market participation and living conditions of young Roma adults in Bulgaria, Hungary and Romania, based on data from the Generations and Gender Surveys and other sources of information. It shows that, despite a small improvement in the educational attainment of young Roma relative to their parents' generation, the educational achievement and employment gaps have widened considerably during the post-communist period. The paper also compares the living conditions of the Roma with those of other population groups. It concludes with a discussion of policy challenges.

Keywords: minorities, Roma, discrimination, employment, education, transition
Zero-Shot Cross-Lingual Transfer with Meta Learning
Learning what to share between tasks has been a topic of great importance
recently, as strategic sharing of knowledge has been shown to improve
downstream task performance. This is particularly important for multilingual
applications, as most languages in the world are under-resourced. Here, we
consider the setting of training models on multiple different languages at the
same time, when little or no data is available for languages other than
English. We show that this challenging setup can be approached using
meta-learning, where, in addition to training a source language model, another
model learns to select which training instances are the most beneficial to the
first. We experiment using standard supervised, zero-shot cross-lingual, as
well as few-shot cross-lingual settings for different natural language
understanding tasks (natural language inference, question answering). Our
extensive experimental setup demonstrates the consistent effectiveness of
meta-learning for a total of 15 languages. We improve upon the state-of-the-art
for zero-shot and few-shot NLI (on MultiNLI and XNLI) and QA (on the MLQA
dataset). A comprehensive error analysis indicates that the correlation of
typological features between languages can partly explain when parameter
sharing learned via meta-learning is beneficial.

Comment: Accepted as a long paper at the EMNLP 2020 main conference.
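The core idea above (a second model that learns which source-language instances help the task model most) can be sketched as instance weighting. This is our own toy illustration, not the authors' code: the `selector_score` features and all numbers are hypothetical, and a real implementation would train both models jointly with gradient-based meta-learning.

```python
import math

# Toy sketch of meta-learned instance selection: a "selector" scores each
# source-language training instance, and the task loss is weighted by the
# softmaxed scores, so instances judged more transferable contribute more.

def selector_score(instance, weights):
    """Score an instance from hand-picked features (hypothetical, e.g.
    sentence length, typological similarity to the target language)."""
    return sum(w * f for w, f in zip(weights, instance["features"]))

def weighted_loss(instances, per_instance_loss, selector_weights):
    """Aggregate task loss, weighted by the selector's softmaxed scores."""
    scores = [selector_score(x, selector_weights) for x in instances]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]          # stable softmax
    probs = [e / sum(exps) for e in exps]
    return sum(p, l_w := None) if False else sum(
        p * l for p, l in zip(probs, per_instance_loss))

# Usage: two instances; this selector up-weights the second one,
# pulling the aggregate loss below the unweighted mean of 1.5.
batch = [{"features": [1.0, 0.0]}, {"features": [0.0, 1.0]}]
loss = weighted_loss(batch, per_instance_loss=[2.0, 1.0],
                     selector_weights=[0.0, 1.0])
```

With uniform selector weights the same call reduces to a plain mean, which is the degenerate "no selection" baseline.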
English Machine Reading Comprehension Datasets: A Survey
This paper surveys 60 English Machine Reading Comprehension datasets, with a
view to providing a convenient resource for other researchers interested in
this problem. We categorize the datasets according to their question and answer
form and compare them across various dimensions including size, vocabulary,
data source, method of creation, human performance level, and first question
word. Our analysis reveals that Wikipedia is by far the most common data source
and that there is a relative lack of why, when, and where questions across
datasets.

Comment: Will appear at EMNLP 2021. Dataset survey paper: 9 pages, 5 figures, 2 tables + attachment.
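The "first question word" dimension the survey compares datasets on can be illustrated with a small counting sketch. This is our illustration, not the survey's code, and the sample questions are made up:

```python
from collections import Counter

def first_question_word(question):
    """Return the lower-cased first token of a question string."""
    return question.split()[0].lower().rstrip("?,")

def question_word_counts(questions):
    """Bucket questions by leading word, as in a first-question-word analysis."""
    return Counter(first_question_word(q) for q in questions)

# Hypothetical sample; a real analysis would iterate over a dataset's questions.
sample = [
    "What is the capital of Bulgaria?",
    "Who wrote this passage?",
    "Why did the experiment fail?",
    "What year was it published?",
]
counts = question_word_counts(sample)
```

Low counts for `why`, `when`, and `where` across many datasets would surface exactly the gap the survey reports.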
bgGLUE: A Bulgarian General Language Understanding Evaluation Benchmark
We present bgGLUE (Bulgarian General Language Understanding Evaluation), a
benchmark for evaluating language models on Natural Language Understanding
(NLU) tasks in Bulgarian. Our benchmark includes NLU tasks targeting a variety
of NLP problems (e.g., natural language inference, fact-checking, named entity
recognition, sentiment analysis, question answering, etc.) and machine learning
tasks (sequence labeling, document-level classification, and regression). We
run the first systematic evaluation of pre-trained language models for
Bulgarian, comparing and contrasting results across the nine tasks in the
benchmark. The evaluation results show strong performance on sequence labeling
tasks, but there is a lot of room for improvement for tasks that require more
complex reasoning. We make bgGLUE publicly available together with the
fine-tuning and the evaluation code, as well as a public leaderboard at
https://bgglue.github.io/, and we hope that it will enable further advancements
in developing NLU models for Bulgarian.

Comment: Accepted to ACL 2023 (Main Conference).
The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants
We present Belebele, a multiple-choice machine reading comprehension (MRC)
dataset spanning 122 language variants. Significantly expanding the language
coverage of natural language understanding (NLU) benchmarks, this dataset
enables the evaluation of text models in high-, medium-, and low-resource
languages. Each question is based on a short passage from the Flores-200
dataset and has four multiple-choice answers. The questions were carefully
curated to discriminate between models with different levels of general
language comprehension. The English dataset on its own proves difficult enough
to challenge state-of-the-art language models. Being fully parallel, this
dataset enables direct comparison of model performance across all languages. We
use this dataset to evaluate the capabilities of multilingual masked language
models (MLMs) and large language models (LLMs). We present extensive results
and find that despite significant cross-lingual transfer in English-centric
LLMs, much smaller MLMs pretrained on balanced multilingual data still
understand far more languages. We also observe that larger vocabulary size and
conscious vocabulary construction correlate with better performance on
low-resource languages. Overall, Belebele opens up new avenues for evaluating
and analyzing the multilingual capabilities of NLP systems.

Comment: 27 pages, 13 figures.
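Because Belebele is fully parallel (every passage/question exists in every language variant), per-language accuracies are directly comparable. A minimal evaluation sketch, with field names and toy data of our own choosing rather than the official schema:

```python
# Evaluate a predictor on a parallel multiple-choice MRC dataset.
# predict(passage, question, choices) returns the index of the chosen answer.

def accuracy(items, predict):
    """Fraction of items where the predicted choice index matches the gold one."""
    correct = sum(
        1 for x in items
        if predict(x["passage"], x["question"], x["choices"]) == x["answer"]
    )
    return correct / len(items)

# Toy parallel data: the same item rendered in two language variants,
# keyed by hypothetical Flores-style language codes.
data = {
    "eng_Latn": [{"passage": "The sky is blue.",
                  "question": "What colour is the sky?",
                  "choices": ["red", "blue", "green", "grey"], "answer": 1}],
    "bul_Cyrl": [{"passage": "Небето е синьо.",
                  "question": "Какъв цвят е небето?",
                  "choices": ["червено", "синьо", "зелено", "сиво"], "answer": 1}],
}

always_first = lambda p, q, c: 0  # trivial baseline: always pick choice A
per_lang = {lang: accuracy(items, always_first) for lang, items in data.items()}
```

Since the items are translations of one another, any gap between the per-language scores of a real model reflects language ability rather than item difficulty.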