
    A Unified Model of Thai Romanization and Word Segmentation

    Thai romanization is the practice of writing the Thai language in the Roman alphabet. It can be based on orthographic form (transliteration), on pronunciation (transcription), or on both. As a result, many romanization systems are in use. The Royal Institute has established a standard by proposing principles of romanization based on transcription. To support the standard, a fully automatic Thai romanization system should be made publicly available. In this paper, we discuss the problems of Thai romanization. We argue that automatic Thai romanization is difficult because ambiguities of pronunciation are caused not only by ambiguities of syllable segmentation but also by ambiguities of word segmentation. A model of automatic romanization is then designed and implemented on this basis. The problems of romanization and word segmentation are handled simultaneously. A syllable-segmented corpus and a corpus of word pronunciations are used for training the system. The accuracy of the system is 94.44% for unseen names and 99.58% for general texts. When the training corpus includes some proper names, the accuracy of romanizing unseen names increases from 94.44% to 97%. Our system performs well because it is designed to better suit the problem.

    English-Chinese Name Transliteration with Bi-Directional Syllable-Based Maximum Matching


    Are You Finding the Right Person? A Name Translation System Towards Web 2.0

    In a multilingual world, the information available in global information systems is increasing rapidly. Searching for proper names in foreign languages has become an important task in multilingual search and knowledge discovery. However, these names are the most difficult to handle because they are often unknown words that cannot be found in a translation dictionary, and even human experts cannot handle the variation generated during translation. Furthermore, existing research on name translation has focused on translation algorithms, while user experience during name translation and name search is often ignored. With Web technology moving towards Web 2.0, creating a platform that allows easier distributed collaboration and information sharing, we seek methods to incorporate Web 2.0 technologies into a name translation system. In this research, we review challenges in name translation and propose an interactive name translation and search system: NameTran. This system takes English names and translates them into Chinese using a hybrid Hidden Markov Model-based (HMM-based) transliteration approach combined with a web mining approach. Evaluation results showed that web mining consistently boosted the performance of a pure HMM approach. Our system achieved a top-1 accuracy of 0.64 and a top-8 accuracy of 0.96. To cope with changing popularity and variation in name translations, we demonstrated the feasibility of allowing users to rank translations, with the new ranking serving as feedback to the originally trained HMM. We believe that such user input will significantly improve system usability.

    Semantic similarity framework for Thai conversational agents

    Conversational Agents integrate computational linguistics techniques and natural language to support human-like communication with complex computer systems. There are a number of applications in business, education and entertainment, including unmanned call centres and personal shopping or navigation assistants. Initial research has been performed on Conversational Agents in languages other than English, but there has been no significant publication on Thai Conversational Agents. Moreover, no research has been conducted on supporting algorithms for Thai word similarity measures and Thai sentence similarity measures. Consequently, this thesis details the development of a novel Thai sentence semantic similarity measure that can be used to create a Thai Conversational Agent. This measure, the Thai Sentence Semantic Similarity measure (TSTS), is inspired by the seminal English measure, Sentence Similarity based on Semantic Nets and Corpus Statistics (STASIS). A Thai sentence benchmark dataset, called the 65 Thai Sentence pairs benchmark dataset (TSS-65), is also presented in this thesis for the evaluation of TSTS. The research starts with the development of a simple Thai word similarity measure called TWSS. Additionally, a novel word measure called a Semantic Similarity Measure based on a Lexical Chain Created from a Search Engine (LCSS) is proposed, which uses a search engine instead of WordNet to create the knowledge base. LCSS overcomes the problem that the prototype Thai Word Semantic Similarity measure (TWSS) has with word pairs related to Thai culture. Two Thai word benchmark datasets, the 30 Thai Word Pair benchmark dataset (TWS-30) and the 65 Thai Word Pair benchmark dataset (TWS-65), are also presented for the evaluation of TWSS and LCSS, respectively. The result of TSTS is considered a starting point for a Thai sentence measure that can be used to create semantic-based Conversational Agents in the future. This is illustrated using a small sample of real English Conversational Agent human dialogue utterances translated into Thai.

    Comparison between English Loanwords in Thai and Indonesian: A Comparative Study in Phonology and Morphology

    Loanwords are very influential in language learning because learners tend to pronounce or write words in the target language based on the corresponding loanwords in their first languages. For that reason, research on English loanwords in both Thai and Indonesian is a potential source for Thai and Indonesian language learning, and even for English as a foreign language (EFL) learning in Thailand and Indonesia. The objective of this research is to find the differences and similarities between English loanwords in Thai and those in Indonesian in terms of their phonological and morphological adaptation. Since not all suprasegmental features, such as tone, exist in both languages, the phonological analysis pays more attention to vowel and consonant changes. The morphological analysis, on the other hand, focuses on morphological changes of polymorphemic words. The data for this research were collected from previous studies and interviews with native speakers of each language. The findings show that Thai and Indonesian have both different and similar processes of phonological and morphological adaptation of English loanwords.