    Semi-supervised Thai Sentence Segmentation Using Local and Distant Word Representations

A sentence is typically treated as the minimal syntactic unit used to extract valuable information from long text. However, written Thai has no explicit sentence markers. Prior works have applied machine learning to this task, but a deep learning approach has not previously been employed. We propose a deep learning model for sentence segmentation with three main contributions. First, we integrate n-gram embedding as a local representation to capture word groups near sentence boundaries. Second, to focus on the keywords of dependent clauses, we combine the model with a distant representation obtained from self-attention modules. Finally, because labeled data is scarce and annotation is difficult and time-consuming, we also investigate two techniques that allow us to utilize unlabeled data: Cross-View Training (CVT) as a semi-supervised learning technique, and a pre-trained language model (ELMo) to improve word representation. In the experiments, our model reduced the relative error by 7.4% and 18.5% compared with the baseline models on the Orchid and UGWC datasets, respectively. Ablation studies revealed that the main contributing factor was the adoption of n-gram features; further analysis with an interpretation technique indicated that the model utilizes these features in the same way that humans do.
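
The following is a minimal PyTorch sketch, not the authors' implementation, of the general idea described in the abstract: per-token sentence-boundary tagging that combines a local n-gram-style representation with a distant self-attention representation. The class name, layer sizes, the use of a 1-D convolution to approximate n-gram embeddings, and the omission of CVT and ELMo are all illustrative assumptions.

# Sketch only: combines a "local" n-gram-like representation (Conv1d over a
# small window) with a "distant" self-attention representation, then makes a
# per-token boundary/non-boundary decision. Hyperparameters are placeholders.
import torch
import torch.nn as nn

class BoundaryTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, ngram=3, hidden=128, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # Local representation: convolution over an n-gram window around each
        # token, capturing word groups near candidate sentence boundaries.
        self.local = nn.Conv1d(emb_dim, hidden, kernel_size=ngram, padding=ngram // 2)
        # Distant representation: self-attention lets each token attend to
        # keywords of dependent clauses anywhere in the sequence.
        self.distant = nn.MultiheadAttention(emb_dim, heads, batch_first=True)
        self.proj = nn.Linear(emb_dim, hidden)
        # Per-token binary decision: sentence boundary vs. not.
        self.classifier = nn.Linear(2 * hidden, 2)

    def forward(self, token_ids):                               # (batch, seq_len)
        x = self.embed(token_ids)                               # (batch, seq_len, emb_dim)
        local = self.local(x.transpose(1, 2)).transpose(1, 2)   # (batch, seq_len, hidden)
        attn_out, _ = self.distant(x, x, x)                     # (batch, seq_len, emb_dim)
        distant = self.proj(attn_out)                           # (batch, seq_len, hidden)
        features = torch.cat([local, distant], dim=-1)
        return self.classifier(features)                        # (batch, seq_len, 2) logits

if __name__ == "__main__":
    model = BoundaryTagger(vocab_size=1000)
    dummy = torch.randint(1, 1000, (2, 20))                     # two sequences of 20 tokens
    print(model(dummy).shape)                                   # torch.Size([2, 20, 2])

Training such a tagger on labeled data with per-token cross-entropy loss would correspond to the supervised part of the setup; the semi-supervised CVT training and ELMo word representations described in the abstract are not shown here.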