    Improvement of Statistical Machine Translation using Charater-Based Segmentation with Monolingual and Bilingual Information

    We present a novel segmentation approach for Phrase-Based Statistical Machine Translation (PB-SMT) to languages where word boundaries are not explicitly marked, using both monolingual and bilingual information. We demonstrate that (1) an unsegmented corpus can yield results nearly identical to those of a manually segmented corpus in a PB-SMT task when a good heuristic character clustering algorithm is applied to it, and (2) the performance of the PB-SMT task increases significantly when bilingual information is used on top of the monolingually segmented result. Instead of focusing on word separation, our technique concentrates mainly on groups of characters. First, we group characters in an unsegmented corpus by applying predetermined constraints and certain heuristic algorithms. Second, we enhance the segmented result by repacking character groups based on alignment confidence. We evaluate the effectiveness of our method on the PB-SMT task using the English-Thai, English-Lao, and English-Burmese language pairs and report a best improvement of an 8.1% increase in BLEU score on the English-Thai pair.
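
The two stages described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not give their constraints or heuristics, so the clustering rule used here (attach Unicode combining marks to the preceding base character) is an assumption chosen only for illustration. The function names `cluster_characters` and `repack`, the `(target_index, confidence)` alignment format, and the `threshold` value are likewise hypothetical.

```python
import unicodedata


def cluster_characters(text):
    """Stage 1 (assumed rule): group each base character with the
    combining marks that follow it, e.g. Thai upper vowels and tone marks."""
    clusters = []
    for ch in text:
        # Mn / Mc are the Unicode categories for combining marks.
        if clusters and unicodedata.category(ch) in ("Mn", "Mc"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def repack(clusters, alignments, threshold=0.8):
    """Stage 2 (hypothetical): merge adjacent clusters that align to the
    same target word, when both alignments meet the confidence threshold.

    alignments[i] is a (target_index, confidence) pair for clusters[i],
    or None when the cluster is unaligned.
    """
    packed = []
    prev = None  # alignment of the most recently processed cluster
    for cluster, align in zip(clusters, alignments):
        mergeable = (
            bool(packed)
            and prev is not None
            and align is not None
            and align[0] == prev[0]
            and min(align[1], prev[1]) >= threshold
        )
        if mergeable:
            packed[-1] += cluster
        else:
            packed.append(cluster)
        prev = align
    return packed


if __name__ == "__main__":
    groups = cluster_characters("\u0e01\u0e34\u0e19")  # Thai: ko kai, sara i, no nu
    print(groups)
    # Made-up alignment scores, for illustration only.
    print(repack(groups, [(0, 0.9), (0, 0.85)]))
```

In a real pipeline, the alignment confidences would come from a word aligner run over the clustered source text and the target sentences; the scores in the usage lines are invented purely to exercise the function.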