We propose InDEX, an Indonesian Idiom and Expression dataset for cloze tests.
The dataset contains 10438 unique sentences for 289 idioms and expressions for
which we generate 15 different types of distractors, resulting in a large
cloze-style corpus. Many baseline models for cloze-test reading comprehension
apply BERT with random initialization to learn embedding representations.
Idioms and fixed expressions differ, however, in that the literal meaning of a
phrase may or may not be consistent with its contextual meaning. Therefore,
we explore different ways to combine static and contextual representations for
a stronger baseline model. Experiments show that combining definitions with
random initialization better supports cloze test model performance for idioms,
whether evaluated independently or mixed with fixed expressions. For fixed
expressions with no special meaning, static embeddings with random
initialization are sufficient for the cloze test model.

Comment: Accepted to the 2022 International Conference on Asian Language
Processing (IALP)