InDEX: Indonesian Idiom and Expression Dataset for Cloze Test

Qiu, Xinying; Shi, Guofeng

Authors: Xinying Qiu, Guofeng Shi
Publication date: 23 November 2022

Abstract

We propose InDEX, an Indonesian Idiom and Expression dataset for cloze tests. The dataset contains 10,438 unique sentences covering 289 idioms and expressions, for which we generate 15 different types of distractors, resulting in a large cloze-style corpus. Many baseline models for cloze-test reading comprehension use BERT with random initialization to learn embedding representations. Idioms and fixed expressions are different, however, in that the literal meaning of the phrases may or may not be consistent with their contextual meaning. We therefore explore different ways of combining static and contextual representations to build a stronger baseline model. Experiments show that combining definitions with random initialization better supports cloze-test model performance for idioms, whether they are evaluated independently or mixed with fixed expressions. For fixed expressions with no special meaning, static embedding with random initialization is sufficient for the cloze-test model.

Comment: Accepted to "2022 International Conference on Asian Language Processing (IALP)"

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2211.13376

Last time updated on 30/12/2022