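Re-translating Number-Expression Unknown Words in Chinese-Vietnamese Statistical Machine Translation (Dịch lại từ chưa biết dạng biểu thức số trong dịch thống kê Hoa-Việt)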

Authors: Trần Phước Thanh, Đinh Điền
Publication venue: 'Publishing House for Science and Technology, Vietnam Academy of Science and Technology'
Publication date: 10/06/2014

Word boundaries in Chinese and Vietnamese are not marked by spaces. Word segmentation is therefore always the first step in Chinese-Vietnamese natural language processing in general, and in Chinese-Vietnamese statistical machine translation (SMT) in particular. Segmentation improves final translation quality, but it also leaves many unknown words (UKW) in the target translation. A common type of unknown word in Chinese-Vietnamese translation systems is the named entity (NE). In this paper, we present a hybrid method that combines statistics and rules to re-translate unknown words that are number-expression named entities (NumExp-NE-UKW). Applied to Chinese-Vietnamese SMT, the method significantly improves translation performance in our experiments.

Vietnam Academy of Science and Technology: Journals Online