Thai Multi-Document Summarization: Unit Segmentation, Unit-Graph Formulation, and Unit Selection

Ketui, Nongnuch; Theeramunkong, Thanaruk

Authors: Nongnuch Ketui, Thanaruk Theeramunkong
Publication date: 31 May 2016
Publisher: Institute of Informatics, Slovak Academy of Sciences

Abstract

Summarizing multiple Thai documents poses several challenges because the Thai language lacks explicit word, phrase, and sentence boundaries. This paper defines the Thai Elementary Discourse Unit (TEDU) and then presents our three-stage summarization process. To implement this process, we propose unit segmentation using TEDUs and their derivatives, unit-graph formation using iterative unit weighting and cosine similarity, and unit selection using highest-weight priority, redundancy removal, and post-selection weight recalculation. To examine the performance of the proposed methods, a number of experiments are conducted using fifty sets of Thai news articles with their manually constructed reference summaries. Under three common evaluation measures, ROUGE-1, ROUGE-2, and ROUGE-SU4, the results indicate that (1) our TEDU-based summarization outperforms paragraph-based summarization, (2) our iterative weighting is superior to traditional TF-IDF, (3) highest-weight priority without centroid preference, together with unit redundancy consideration, helps improve summary quality, and (4) post-selection weight recalculation tends to raise summarization performance under certain circumstances.
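
The graph-based stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the weighting formula, so a PageRank-style update over cosine-similarity edges is swapped in as a stand-in for the iterative unit weighting. The function names, the parameters `k`, `redundancy`, `damping`, and `iterations`, and the toy English tokens are all assumptions for illustration. Units are assumed to be already segmented into token lists, and post-selection weight recalculation is omitted.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summarize(units, k=2, redundancy=0.7, damping=0.85, iterations=20):
    """Return indices of up to k selected units (hypothetical helper).

    units: list of token lists, one per pre-segmented unit.
    """
    n = len(units)
    if n == 0:
        return []
    vectors = [Counter(u) for u in units]

    # Unit graph: edge weight is the cosine similarity of two units.
    sim = [[cosine(vectors[i], vectors[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    out = [sum(row) for row in sim]

    # Iterative weighting (assumed PageRank-style update, not the paper's).
    weights = [1.0 / n] * n
    for _ in range(iterations):
        weights = [
            (1 - damping) / n + damping * sum(
                sim[j][i] / out[j] * weights[j]
                for j in range(n) if out[j] > 0)
            for i in range(n)
        ]

    # Highest-weight priority with redundancy removal: skip a unit that
    # is too similar to one already selected.
    selected = []
    for i in sorted(range(n), key=lambda idx: -weights[idx]):
        if all(sim[i][j] < redundancy for j in selected):
            selected.append(i)
        if len(selected) == k:
            break
    return selected


# Toy input: units 0 and 1 are identical, unit 2 is unrelated to the rest.
units = [
    ["flood", "bangkok", "rain"],
    ["flood", "bangkok", "rain"],
    ["election", "vote"],
    ["flood", "rain", "damage"],
]
result = summarize(units, k=2)
```

In this toy input the two identical units cannot both be selected, because their similarity exceeds the assumed `redundancy` threshold; the second slot goes to the next-highest-weighted unit instead.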

Available Versions

Computing and Informatics (E-Journal - Institute of Informatics, SAS, Bratislava)

oai:ojs.cai.ui.sav.sk:article/...

Last time updated on 15/12/2019