A Span-Extraction Dataset for Chinese Machine Reading Comprehension
Machine Reading Comprehension (MRC) has become enormously popular recently
and has attracted a lot of attention. However, the existing reading
comprehension datasets are mostly in English. In this paper, we introduce a
Span-Extraction dataset for Chinese machine reading comprehension to add
language diversity to this area. The dataset consists of nearly 20,000 real
questions annotated on Wikipedia paragraphs by human experts. We also annotated
a challenge set containing questions that require comprehensive
understanding and multi-sentence inference across the context. We present
several baseline systems as well as anonymous submissions to demonstrate the
difficulty of this dataset. With the release of the dataset, we hosted the
Second Evaluation Workshop on Chinese Machine Reading Comprehension (CMRC
2018). We hope the release of the dataset will further accelerate Chinese
machine reading comprehension research. Resources are available:
https://github.com/ymcui/cmrc2018
Comment: 6 pages, accepted as a conference paper at EMNLP-IJCNLP 2019 (short paper)
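To make the span-extraction setting concrete, here is a minimal sketch of one example in which the answer is a contiguous slice of the passage. The class name, field names, and the simplified exact-match check are assumptions chosen for illustration; they are not taken from the CMRC 2018 release or its official evaluation script.

```python
from dataclasses import dataclass


@dataclass
class SpanExample:
    """One span-extraction item (hypothetical schema, for illustration only)."""
    context: str
    question: str
    answer_text: str
    answer_start: int  # character offset of the answer inside context

    def answer_end(self) -> int:
        # Exclusive end offset of the answer span.
        return self.answer_start + len(self.answer_text)

    def is_consistent(self) -> bool:
        # The defining property of span extraction: the answer must be
        # recoverable by slicing the passage at the annotated offsets.
        return self.context[self.answer_start:self.answer_end()] == self.answer_text


def exact_match(prediction: str, gold: str) -> bool:
    # Simplified metric: string equality after trimming whitespace.
    return prediction.strip() == gold.strip()


example = SpanExample(
    context="北京是中华人民共和国的首都。",
    question="中华人民共和国的首都是哪里?",
    answer_text="北京",
    answer_start=0,
)
```

A system for this setting would typically predict a start and end offset, and the predicted slice would then be compared with the gold answer.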
Revisiting Pre-Trained Models for Chinese Natural Language Processing
Bidirectional Encoder Representations from Transformers (BERT) has shown
marvelous improvements across various NLP tasks, and successive variants have
been proposed to further improve the performance of pre-trained language
models. In this paper, we revisit Chinese pre-trained language
models to examine their effectiveness in a non-English language and release a
Chinese pre-trained language model series to the community. We also propose a
simple but effective model called MacBERT, which improves upon RoBERTa in
several ways, especially the masking strategy that adopts MLM as correction
(Mac). We carried out extensive experiments on eight Chinese NLP tasks to
revisit the existing pre-trained language models as well as the proposed
MacBERT. Experimental results show that MacBERT achieves state-of-the-art
performance on many NLP tasks, and we also ablate the details, reporting several
findings that may help future research. Resources available:
https://github.com/ymcui/MacBERT
Comment: 12 pages, to appear at Findings of EMNLP 202
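The abstract only names the masking strategy, so the following is a rough sketch under stated assumptions rather than the released implementation. It assumes that "MLM as correction" means corrupting selected tokens with similar words instead of a [MASK] symbol and training the model to recover the originals. The similar-word table, the fallback vocabulary, the function name `mac_mask`, and the default rate are all hypothetical; other details of the actual recipe are omitted.

```python
import random

# Hypothetical similar-word table and fallback vocabulary (illustration only).
SIMILAR = {"快乐": ["高兴", "开心"], "喜欢": ["喜爱"]}
VOCAB = ["我", "你", "他", "喜欢", "快乐"]


def mac_mask(tokens, similar, vocab, rate=0.15, rng=None):
    """Corrupt tokens with similar words; return (corrupted, labels).

    labels[i] holds the original token at corrupted positions and None
    elsewhere, so the training target is to correct the corrupted tokens.
    """
    if rng is None:
        rng = random.Random(0)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= rate:
            continue  # position not selected for corruption
        labels[i] = tok
        candidates = similar.get(tok)
        if candidates:
            corrupted[i] = rng.choice(candidates)  # similar-word replacement
        else:
            corrupted[i] = rng.choice(vocab)  # assumed fallback: random word
    return corrupted, labels


corrupted, labels = mac_mask(["我", "喜欢", "快乐"], SIMILAR, VOCAB, rate=1.0)
```

Under this reading, the model never sees an artificial [MASK] token during pre-training, which is the motivation usually given for correction-style objectives.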
Cross-Lingual Machine Reading Comprehension
Although the community has made great progress on the Machine Reading Comprehension
(MRC) task, most previous work addresses English-based MRC problems,
and there have been few efforts on other languages, mainly due to the lack of
large-scale training data. In this paper, we propose the Cross-Lingual Machine
Reading Comprehension (CLMRC) task for languages other than English.
First, we present several back-translation approaches for the CLMRC task, which
are straightforward to adopt. However, accurately aligning the answer in
another language is difficult and could introduce additional noise. In this
context, we propose a novel model called Dual BERT, which takes advantage of
the large-scale training data provided by a rich-resource language (such as
English) and learns the semantic relations between the passage and question in a
bilingual context, and then utilizes the learned knowledge to improve the reading
comprehension performance of a low-resource language. We conduct experiments on
two Chinese machine reading comprehension datasets, CMRC 2018 and DRCD. The
results show consistent and significant improvements over various
state-of-the-art systems by a large margin, demonstrating the potential of the
CLMRC task. Resources available: https://github.com/ymcui/Cross-Lingual-MRC
Comment: 10 pages, accepted as a conference paper at EMNLP-IJCNLP 2019 (long paper)
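The answer-alignment difficulty mentioned in this abstract can be illustrated with a naive heuristic: after an answer has been translated back into the target language, search the target passage for the span that looks most like it. This sketch is not one of the paper's approaches. The function name `align_answer`, the `slack` parameter, and the use of character-level similarity from Python's difflib are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def align_answer(passage: str, translated_answer: str, slack: int = 2):
    """Return (start, end, score) of the passage span most similar to the answer.

    Candidate spans have lengths within `slack` characters of the translated
    answer. Ties keep the earliest, shortest candidate.
    """
    best = (0, 0, 0.0)
    n = len(translated_answer)
    if n == 0 or not passage:
        return best
    for start in range(len(passage)):
        for length in range(max(1, n - slack), n + slack + 1):
            end = start + length
            if end > len(passage):
                break
            score = SequenceMatcher(
                None, passage[start:end], translated_answer
            ).ratio()
            if score > best[2]:
                best = (start, end, score)
    return best


passage = "北京是中华人民共和国的首都。"
# A slightly noisy translated answer: the passage says "北京", not "北京市".
start, end, score = align_answer(passage, "北京市")
```

When the translated answer does not appear verbatim in the passage, the heuristic can only return an approximate span, which is one way the noise described above enters the training data.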