Thai to Khmer Rule-Based Machine Translation Using Reordering Word to Phrase
In this paper, an effective machine translation system from Thai to the Khmer language, delivered as a website, is proposed. To create a web application for high-performance Thai-Khmer machine translation (ThKh-MT), the principles and methods of translation rely on a lexical base. Word reordering is applied by considering the previous word, the next word, and subject-verb agreement. Word adjustment is also required to attain acceptable outputs. Additional steps based on structure patterns are combined with the classical methods to deal with translation issues. PHP is used to build the application, with MySQL as the tool for creating the lexical databases. For testing, 5,100 phrases and sentences are selected to evaluate the system. The result shows 89.25 percent accuracy and an F-measure of 0.84, indicating higher efficiency than Google Translate and other systems.
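The reordering step described above can be sketched as a word-for-word lexical transfer pass followed by tag-based swaps of adjacent words. This is a minimal illustration only: the romanized lexicon entries, tags, and rules below are invented and are not the paper's actual ThKh-MT resources.

```python
# Toy Thai -> Khmer entries (romanized); real systems use a MySQL lexical database.
LEXICON = {"phom": "khnhom", "kin": "nham", "khao": "bay"}

# Hypothetical reordering rules keyed on (previous tag, current tag):
# True means the adjacent pair is swapped in the output.
REORDER_RULES = {
    ("NOUN", "ADJ"): False,  # noun-adjective order is shared, keep as-is
    ("ADV", "VERB"): True,   # toy rule: move the adverb after the verb
}

def translate(words, tags):
    # Word-for-word lexical transfer; unknown words pass through unchanged.
    out = [LEXICON.get(w, w) for w in words]
    # Reorder by looking at each word together with its neighbour's tag.
    i = 0
    while i < len(out) - 1:
        if REORDER_RULES.get((tags[i], tags[i + 1])):
            out[i], out[i + 1] = out[i + 1], out[i]
            tags[i], tags[i + 1] = tags[i + 1], tags[i]
        i += 1
    return out
```

A real system would add the word-adjustment and structure-pattern steps on top of this lexical pass.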
Khmer Treebank Construction via Interactive Tree Visualization
Although a number of studies have addressed the Khmer language in the field of Natural Language Processing, along with some resources for word segmentation and POS tagging, high-level syntactic resources such as treebanks and grammars are still lacking. This paper presents a semi-automatic framework for constructing a Khmer treebank and extracting Khmer grammar rules from a set of sentences taken from Khmer grammar books. These sentences are first manually annotated; once the treebank is obtained, it is processed to generate grammar rules with their probabilities. In our experiments, the annotated trees and the extracted grammar rules are analyzed both quantitatively and qualitatively. Finally, the results are evaluated with three evaluation processes, Self-Consistency, 5-Fold Cross-Validation, and Leave-One-Out Cross-Validation, using three metrics: Precision, Recall, and F1-Measure. Across the three validations, Self-Consistency shows the best result at more than 92%, followed by Leave-One-Out Cross-Validation and 5-Fold Cross-Validation with averages of 88% and 75%, respectively. On the other hand, the crossing-brackets data show that Leave-One-Out Cross-Validation holds the highest average at 96%, while the other two reach 85% and 89%, respectively.
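Once trees are annotated, each internal node yields one production, and rule probabilities follow from relative frequencies per left-hand side. A minimal sketch of that extraction step, with invented bracketed example trees and helper names (not the paper's framework):

```python
from collections import Counter, defaultdict

def read_tree(s):
    """Parse a bracketed tree like (S (NP Dara) (VP eats)) into nested lists."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    def helper(i):
        if tokens[i] == "(":
            label = tokens[i + 1]
            i += 2
            children = []
            while tokens[i] != ")":
                child, i = helper(i)
                children.append(child)
            return [label] + children, i + 1
        return tokens[i], i + 1
    tree, _ = helper(0)
    return tree

def extract_rules(tree, counts):
    """Collect one LHS -> RHS production per internal node."""
    if isinstance(tree, str):
        return
    lhs = tree[0]
    rhs = tuple(c[0] if isinstance(c, list) else c for c in tree[1:])
    counts[(lhs, rhs)] += 1
    for c in tree[1:]:
        extract_rules(c, counts)

def rule_probabilities(trees):
    """P(LHS -> RHS) = count(LHS -> RHS) / count(LHS), per left-hand side."""
    counts = Counter()
    for t in trees:
        extract_rules(read_tree(t), counts)
    lhs_totals = defaultdict(int)
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}
```

From a grammar extracted this way, crossing-bracket and Precision/Recall/F1 scores can then be computed against held-out gold trees.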
Development and Validation of a Scale of Subjective Well-being for Cambodian Refugees
This is a study of the Subjective Well-Being (SWB) of refugees from Cambodia. A correlational design based on questionnaires was used to assess subjective well-being in a Cambodian population in the USA. The purpose of this study was to develop and validate a newly constructed Scale of Subjective Well-Being for Khmer Refugees (SSWB-KR) for use with Cambodian refugees living in the US. The scale is a 49-item, 4-point Likert-type scale that was administered to a sample of 20 Cambodian refugees in Philadelphia, PA, along with three other measures: the Satisfaction With Life Scale (SWLS; Diener, Emmons, Larsen, & Griffin, 1985), the Hopkins Symptom Checklist-25 (HSCL-25; Mollica, Wyshak, de Marneffe, Khuon, & Lavelle, 1987b), and the Khmer Acculturation Scale (KAS; Lim, Heiby, Brislin, & Griffin, 2002). A demographics questionnaire was also administered. The SSWB-KR was validated against the SWLS. A group of expert informants provided information that was used to create items classified under 11 domains of SWB. Correlations were obtained among the above scales. The SSWB-KR achieved significant positive correlations with the SWLS; no relationship was found between the SSWB-KR and the KAS. Results were also obtained for demographics and SWB. The SSWB-KR could be a useful clinical and research tool. Implications for CBT and recommendations for further validation are discussed.
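Validating a new scale against an established one amounts to correlating total scores across respondents, typically with Pearson's r. A minimal sketch with hypothetical score data (not the study's actual sample):

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scale scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical total scores for the new scale and for the SWLS,
# one pair per respondent; invented for illustration.
sswb_kr = [120, 98, 134, 110, 125]
swls = [25, 18, 30, 22, 27]
r = pearson(sswb_kr, swls)
```

A high positive r against the SWLS supports convergent validity, while a near-zero r against an unrelated construct (as found here with the KAS) is consistent with discriminant validity.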
Mismatching-Aware Unsupervised Translation Quality Estimation For Low-Resource Languages
Translation Quality Estimation (QE) is the task of predicting the quality of
machine translation (MT) output without any reference. This task has gained
increasing attention as an important component in the practical applications of
MT. In this paper, we first propose XLMRScore, which is a cross-lingual
counterpart of BERTScore computed via the XLM-RoBERTa (XLMR) model. This metric
can be used as a simple unsupervised QE method, but employing it raises two
issues: first, untranslated tokens lead to unexpectedly high translation
scores, and second, mismatching errors arise between source and hypothesis
tokens when applying greedy matching in XLMRScore. To mitigate these issues,
we suggest replacing untranslated words with the unknown token and
cross-lingually aligning the pre-trained model so that aligned words are
represented closer to each other. We evaluate the proposed method on four
low-resource language pairs of the WMT21 QE shared task, as well as a new
English-Farsi test dataset introduced in this paper. Experiments show that
our method obtains results comparable to the supervised baseline in two
zero-shot scenarios, i.e., less than 0.01 difference in Pearson correlation,
while outperforming unsupervised rivals by more than 8% on average across
all the low-resource language pairs.
Comment: Submitted to Language Resources and Evaluation
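The greedy matching underlying BERTScore-style metrics such as XLMRScore pairs each token with its most similar token on the other side and combines the two directions into an F1. A minimal sketch using toy two-dimensional vectors in place of XLM-R token embeddings (not the authors' implementation):

```python
from math import sqrt

def cos(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def greedy_match_f1(src_vecs, hyp_vecs):
    """BERTScore-style score: precision matches each hypothesis token to its
    closest source token, recall matches each source token to its closest
    hypothesis token, and the two are combined as a harmonic mean."""
    precision = sum(max(cos(h, s) for s in src_vecs) for h in hyp_vecs) / len(hyp_vecs)
    recall = sum(max(cos(s, h) for h in hyp_vecs) for s in src_vecs) / len(src_vecs)
    return 2 * precision * recall / (precision + recall)
```

The mitigation described in the abstract would act before this step: untranslated hypothesis tokens are mapped to the unknown token so they cannot match their identical source copies and inflate the score.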
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond
We introduce an architecture to learn joint multilingual sentence
representations for 93 languages, belonging to more than 30 different language
families and written in 28 different scripts. Our system uses a single BiLSTM
encoder with a shared BPE vocabulary for all languages, which is coupled with
an auxiliary decoder and trained on publicly available parallel corpora. This
enables us to learn a classifier on top of the resulting sentence embeddings
using English annotated data only, and transfer it to any of the 93 languages
without any modification. Our approach sets a new state-of-the-art on zero-shot
cross-lingual natural language inference for all but one of the 14 languages
in the XNLI dataset. We also achieve very competitive results in cross-lingual
document classification (MLDoc dataset). Our sentence embeddings are also
strong at parallel corpus mining, establishing a new state-of-the-art in the
BUCC shared task for 3 of its 4 language pairs. Finally, we introduce a new
test set of aligned sentences in 122 languages based on the Tatoeba corpus, and
show that our sentence embeddings obtain strong results in multilingual
similarity search even for low-resource languages. Our PyTorch implementation,
pre-trained encoder and the multilingual test set will be freely available.
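The multilingual similarity search and parallel-corpus mining mentioned above reduce, at their core, to nearest-neighbour lookup over sentence embeddings. A minimal sketch with toy vectors standing in for the encoder's output (real systems use higher-dimensional embeddings and approximate search):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two sentence-embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def nearest(query_vec, pool_vecs):
    """Index of the pool sentence whose embedding is closest to the query."""
    sims = [cosine(query_vec, p) for p in pool_vecs]
    return max(range(len(sims)), key=sims.__getitem__)
```

Because the encoder maps all 93 languages into one joint space, the query and the pool can be in different languages: a Tatoeba-style evaluation asks whether the nearest neighbour of each source sentence is its gold translation.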