105 research outputs found

Tibetan Word Segmentation as Syllable Tagging Using Conditional Random Field

Author: He Yeping
Liu Huidan
Ma Longlong
Nuo Minghua
Wu Jian
Publication venue: Institute of Digital Enhancement of Cognitive Processing, Waseda University
Publication date: 01/01/2011

In this paper, we propose a novel approach to Tibetan word segmentation using a conditional random field. We reformulate segmentation as a syllable tagging problem. The approach labels each syllable with a word-internal position tag and combines syllables into words according to their tags. As there is no publicly available Tibetan word segmentation corpus, the training corpus is generated by another segmenter, which has an F-score of 96.94% on the test set. Two feature template sets, TMPT-6 and TMPT-10, are used and compared, and the results show that the former is better. Experiments also show that a larger training set improves performance significantly. Trained on a set of 131,903 sentences, the segmenter achieves an F-score of 95.12% on the test set of 1,000 sentences. © 2011 by Huidan Liu, Minghua Nuo, Longlong Ma, Jian Wu, and Yeping He.

Waseda University Repository

Institute Of Software, Chinese Academy Of Sciences
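
The abstract above describes labelling each syllable with a word-internal position tag and then joining syllables into words. A minimal sketch of that joining step follows. The abstract does not name the tag set, so the B/M/E/S labels (word-initial, word-medial, word-final, single-syllable word), the function name `combine_syllables`, and the placeholder Latin syllables are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Tuple

# Assumed tag set (not stated in the abstract):
#   B = word-initial syllable, M = word-medial syllable,
#   E = word-final syllable,   S = single-syllable word.


def combine_syllables(tagged: List[Tuple[str, str]], sep: str = "") -> List[str]:
    """Join (syllable, tag) pairs into words according to their position tags."""
    words: List[str] = []
    current: List[str] = []
    for syllable, tag in tagged:
        if tag in ("B", "S") and current:
            # A new word starts while one is still open (malformed tag
            # sequence), so flush the open word before continuing.
            words.append(sep.join(current))
            current = []
        current.append(syllable)
        if tag in ("E", "S"):
            # This syllable closes the word.
            words.append(sep.join(current))
            current = []
    if current:
        # Flush a trailing word that never received an E tag.
        words.append(sep.join(current))
    return words


if __name__ == "__main__":
    tagged = [("a", "B"), ("b", "E"), ("c", "S"), ("d", "B"), ("e", "M"), ("f", "E")]
    print(combine_syllables(tagged))
```

For real Tibetan text, `sep` would likely be the syllable delimiter (tsheg) rather than the empty string; that choice is left to the caller.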

The Study of Graininess for Tibetan Named Entity Recognition

Author: Cai
Dou
Li
Li
Liu
Shi
Publication venue: 'EDP Sciences'
Publication date: 01/01/2017

Tibetan named entity recognition (NER), a fundamental part of Tibetan natural language processing, is an important subtask of information extraction. In this paper, we survey the methods, effectiveness, and problems of Tibetan NER, and discuss which kind of token should be taken as the granularity ("graininess") for the Tibetan NER task. We compare two granularities, syllables and words, in an experiment on Tibetan person names, location names, and organization names. The results show that person names are recognized better with syllable-based tokens than with word-based tokens. Location names show only a small difference between the two granularities, although results differ across types. Organization names are better suited to word-based tokens.

Crossref

EDP Sciences OAI-PMH repository (1.2.0)

Directory of Open Access Journals

Introduction (to Special Issue on Tibetan Natural Language Processing)

Author: Di Jiang
Hill Nathan W.
Publication venue: 'eScholarship'
Publication date: 01/01/2016

This introduction surveys research on Tibetan NLP, both in China and in the West, and contextualizes the articles contained in the special issue.

SOAS Research Online

eScholarship - University of California

The contribution of corpus linguistics to lexicography and the future of Tibetan dictionaries

Author: Garrett Edward
Hill Nathan W.
Kilgarriff Adam
Vadlapudi Ravikiran
Zadoks Abel
Publication venue: 'INIST-CNRS'
Publication date: 01/01/2015

The first alphabetized dictionary of Tibetan appeared in 1829 (cf. Bray 2008) and the intervening 184 years have witnessed the publication of scores of other Tibetan dictionaries (cf. Simon 1964). Hundreds of Tibetan dictionaries are now available; these include bilingual dictionaries, both to and from such languages as English, French, German, Latin, Japanese, etc. and specialized dictionaries focusing on medicine, plants, dialects, archaic terms, neologisms, etc. (cf. Walter 2006, McGrath 2008). However, if one classifies Tibetan dictionaries by the methods of their compilation the accomplishments of Tibetan lexicography are less impressive. Methodologies of dictionary compilation divide heuristically into three types. First, some dictionaries lack explicit methodology; these works assemble words in an ad hoc manner and illustrate them with invented examples. Second, there are dictionaries that are compiled over very long periods of time on the basis of collections of slips recording attestations of words as used in context. Third, more recent dictionaries are compiled on the basis of electronic text corpora, which are processed computationally to aid in the precision, consistency and speed of dictionary compilation. These methods may be called respectively the 'informal method', the 'traditional method', and the 'modern method'. The overwhelming majority of Tibetan dictionaries were compiled with the informal method. Only five Tibetan dictionaries use the traditional methodology. No Tibetan dictionary yet compiled makes use of the modern method.

SOAS Research Online

Segmenting and POS tagging Classical Tibetan using a Memory-Based Tagger

Author: Hill Nathan
Meelen Marieke
Publication venue: Himalayan Linguistics
Publication date: 01/01/2017

This paper presents a new approach to two challenging NLP tasks in Classical Tibetan: word segmentation and Part-of-Speech (POS) tagging. We demonstrate how both these problems can be approached in the same way, by generating a memory-based tagger that assigns 1) segmentation tags and 2) POS tags to a test corpus consisting of unsegmented lines of Tibetan characters. We propose a three-stage workflow and evaluate the results of both the segmenting and the POS tagging tasks. We argue that the Memory-Based Tagger (MBT) and the proposed workflow not only provide an adequate solution to these NLP challenges, they are also highly efficient tools for building larger annotated corpora of Tibetan. ERC grant IDs 609823 & 269752

SOAS Research Online

eScholarship - University of California

Apollo (Cambridge)

MiLMo: Minority Multilingual Pre-trained Language Model

Author: Bao Wugedele
Deng Junjie
Shi Hanru
Sun Yuan
Yu Xinhe
Zhao Xiaobing
Publication date: 10/04/2023

Pre-trained language models are trained on large-scale unsupervised data and can then be fine-tuned on only small-scale labeled datasets to achieve good results. Multilingual pre-trained language models can be trained on multiple languages, so that a single model can understand multiple languages at the same time. At present, research on pre-trained models mainly focuses on resource-rich languages, while there is relatively little research on low-resource languages such as minority languages, and public multilingual pre-trained language models do not work well for minority languages. Therefore, this paper constructs a multilingual pre-trained model named MiLMo that performs better on minority language tasks, covering Mongolian, Tibetan, Uyghur, Kazakh and Korean. To address the scarcity of datasets for minority languages and to verify the effectiveness of the MiLMo model, this paper constructs a minority multilingual text classification dataset named MiTC and trains a word2vec model for each language. By comparing the word2vec model and the pre-trained model on the text classification task, this paper provides an optimal scheme for downstream task research on minority languages. The final experimental results show that the pre-trained model performs better than the word2vec model and achieves the best results in minority multilingual text classification. The multilingual pre-trained model MiLMo, the multilingual word2vec model and the multilingual text classification dataset MiTC are published at http://milmo.cmli-nlp.com/

arXiv.org e-Print Archive

The lexicography of Tibetan

Author: Garrett Edward
Hill Nathan W.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 12/09/2017

This chapter provides an overview of Tibetan lexicography, from the ninth century to today. While most Tibetan dictionaries were compiled in an ad hoc manner, some used citation collections. Electronic corpora have been built for Tibetan, but they have not as yet been used to assist dictionary compilation. The various obstacles that need to be overcome first in order to be able to compile corpus-based dictionaries are discussed

SOAS Research Online

Emotion detection on social media status in Myanmar language

Author: Swe Thiri Marlar
Wah Naw Lay
Publication venue: 'Institute of Advanced Engineering and Science'
Publication date: 01/10/2023

Many social media platforms have emerged and provided services in recent years. Most people, especially in Myanmar, use them to express their emotions or moods, learn subjects, sell products, read up-to-date news, and communicate with each other. Emotion detection for social media users is a critical task in opinion mining and sentiment analysis. This paper presents an emotion detection system for social media (Facebook) user statuses or posts written in the Myanmar (Burmese) language. Before the emotion detection process, the user posts are pre-processed through segmentation, stemming, part-of-speech (POS) tagging, and stop word removal. The system then uses our preconstructed Myanmar word-emotion lexicon, M-Lexicon, to extract the emotion words from the segmented, POS-tagged post. The system covers six types of emotion: surprise, disgust, fear, anger, sadness, and happiness. The system applies a naïve Bayes (NB) emotion classifier to determine the emotion when more than two words with different emotion values are extracted. The classifiers also classify the emotion of users based on their posts. The experiment shows that the system achieves 85% accuracy with NB-based emotion detection and 86% with a recurrent neural network (RNN).

Institute of Advanced Engineering and Science
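
The lexicon lookup step in the Myanmar emotion detection abstract above can be sketched as follows. This is only an illustration under stated assumptions: the function name `lexicon_emotion`, the toy English lexicon, and the rule of returning `None` whenever the matched words disagree (to signal that a classifier such as naïve Bayes should decide) are all hypothetical. The actual M-Lexicon contents and the exact threshold for invoking the classifier are not given in the abstract.

```python
from typing import Dict, List, Optional

# The six emotion labels listed in the abstract.
EMOTIONS = {"surprise", "disgust", "fear", "anger", "sadness", "happiness"}


def lexicon_emotion(tokens: List[str], lexicon: Dict[str, str]) -> Optional[str]:
    """Look up emotion words in a word-emotion lexicon.

    Returns the emotion when every matched word carries the same label.
    Returns None when nothing matches or the matched words disagree;
    in this sketch, None means "defer to a trained classifier".
    """
    found = [lexicon[t] for t in tokens if t in lexicon]
    if len(set(found)) == 1:
        return found[0]
    return None


if __name__ == "__main__":
    # Hypothetical toy lexicon; real entries would be Myanmar words.
    toy_lexicon = {"glad": "happiness", "joyful": "happiness", "afraid": "fear"}
    assert set(toy_lexicon.values()) <= EMOTIONS
    print(lexicon_emotion(["glad", "joyful", "today"], toy_lexicon))
    print(lexicon_emotion(["glad", "afraid"], toy_lexicon))
```

The input tokens are assumed to have already passed through the segmentation, stemming, POS tagging, and stop word removal stages that the abstract describes.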