State-of-the-Art Vietnamese Word Segmentation
Word segmentation is the first step of any task in Vietnamese language
processing. This paper reviews state-of-the-art approaches and systems for word
segmentation in Vietnamese. To give an overview of all stages, from building
corpora to developing toolkits, we discuss corpus construction, the
approaches applied to word segmentation, and the existing toolkits for
segmenting words in Vietnamese sentences. In addition, this study sets out
the motivations for building corpora and applying machine learning techniques
to improve the accuracy of Vietnamese word segmentation. Based on our
observations, this study also reports some achievements and limitations of
existing Vietnamese word segmentation systems.
Comment: 2016 2nd International Conference on Science in Information
Technology (ICSITech
Using Search Engine to Construct a Scalable Corpus for Vietnamese Lexical Development for Word Segmentation
Doan Nguyen, Hewlett-Packard Company
As web content becomes more accessible to the Vietnamese community across the globe, there is a need to process Vietnamese query texts properly in order to find relevant information. The recent deployment of a Vietnamese translation tool on a well-known search engine reflects the growing importance of Vietnamese on the World Wide Web. Problems remain in the translation and retrieval of Vietnamese text because word recognition is not fully addressed. In this paper we introduce a semi-supervised approach to building a general, scalable web corpus for Vietnamese, using a search engine to facilitate the word segmentation process. We also propose a segmentation algorithm that effectively recognizes Out-Of-Vocabulary (OOV) words. The results indicate that our solution is scalable and can be applied to real-time translation programs and other linguistic applications. This work is a continuation of Nguyen D. (2008).