Tibetan Word Segmentation as Syllable Tagging Using Conditional Random Field
In this paper, we propose a novel approach to Tibetan word segmentation using conditional random fields. We reformulate segmentation as a syllable tagging problem. The approach labels each syllable with a word-internal position tag and combines syllables into words according to their tags. As there is no publicly available Tibetan word segmentation corpus, the training corpus is generated by another segmenter, which has an F-score of 96.94% on the test set. Two feature template sets, TMPT-6 and TMPT-10, are used and compared, and the results show that the former is better. Experiments also show that a larger training set improves performance significantly. Trained on a set of 131,903 sentences, the segmenter achieves an F-score of 95.12% on a test set of 1,000 sentences. © 2011 by Huidan Liu, Minghua Nuo, Longlong Ma, Jian Wu, and Yeping He.
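The tag-then-combine step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name its tag set, so the B/M/E/S scheme (begin, middle, end, single) is an assumption, as are the function name `tags_to_words` and the `sep` parameter. The tags would normally come from a trained CRF; here they are supplied by hand.

```python
from typing import List


def tags_to_words(syllables: List[str], tags: List[str], sep: str = "") -> List[str]:
    """Combine syllables into words from per-syllable position tags.

    Assumed tag set (not taken from the paper):
    B = word-initial, M = word-internal, E = word-final, S = single-syllable word.
    """
    if len(syllables) != len(tags):
        raise ValueError("syllables and tags must have the same length")

    words: List[str] = []
    current: List[str] = []
    for syllable, tag in zip(syllables, tags):
        # A B or S tag starts a new word, so flush any unfinished word first.
        if tag in ("B", "S") and current:
            words.append(sep.join(current))
            current = []
        current.append(syllable)
        # An E or S tag closes the current word.
        if tag in ("E", "S"):
            words.append(sep.join(current))
            current = []
    # Flush a trailing word that was never closed by an E tag.
    if current:
        words.append(sep.join(current))
    return words


# Illustrative, transliterated placeholder syllables with hand-written tags.
example = tags_to_words(["bkra", "shis", "bde", "legs"], ["B", "E", "B", "E"], sep=" ")
print(example)
```

Flushing on an unexpected B or S tag keeps the function tolerant of inconsistent tag sequences, which a statistical tagger can produce.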
The Study of Graininess for Tibetan Named Entity Recognition
Tibetan named entity recognition (NER), a fundamental part of Tibetan natural language processing, is an important subtask of information extraction. In this paper, we survey the methods, results, and problems of Tibetan NER, and we discuss which kind of token should be taken as the graininess (granularity) for the Tibetan NER task. The paper uses two granularities, syllables and words, in a comparative experiment on Tibetan person names, location names, and organization names. The results show that person names are recognized better with syllables than with words. Location names show only a small difference, although results vary by type. Organization names are better suited to word-based recognition
Introduction (to Special Issue on Tibetan Natural Language Processing)
This introduction surveys research on Tibetan NLP, both in China and in the West, as well as contextualizing the articles contained in the special issue
The contribution of corpus linguistics to lexicography and the future of Tibetan dictionaries
The first alphabetized dictionary of Tibetan appeared in 1829 (cf. Bray 2008) and the intervening 184 years have witnessed the publication of scores of other Tibetan dictionaries (cf. Simon 1964). Hundreds of Tibetan dictionaries are now available; these include bilingual dictionaries, both to and from such languages as English, French, German, Latin, Japanese, etc. and specialized dictionaries focusing on medicine, plants, dialects, archaic terms, neologisms, etc. (cf. Walter 2006, McGrath 2008). However, if one classifies Tibetan dictionaries by the methods of their compilation the accomplishments of Tibetan lexicography are less impressive.
Methodologies of dictionary compilation divide heuristically into three types. First, some dictionaries lack explicit methodology; these works assemble words in an ad hoc manner and illustrate them with invented examples. Second, there are dictionaries that are compiled over very long periods of time on the basis of collections of slips recording attestations of words as used in context. Third, more recent dictionaries are compiled on the basis of electronic text corpora, which are processed computationally to aid in the precision, consistency and speed of dictionary compilation. These methods may be called respectively the 'informal method', the 'traditional method', and the 'modern method'. The overwhelming majority of Tibetan dictionaries were compiled with the informal method. Only five Tibetan dictionaries use the traditional methodology. No Tibetan dictionary yet compiled makes use of the modern method
Segmenting and POS tagging Classical Tibetan using a Memory-Based Tagger
This paper presents a new approach to two challenging NLP tasks in Classical Tibetan: word segmentation and Part-of-Speech (POS) tagging. We demonstrate how both these problems can be approached in the same way, by generating a memory-based tagger that assigns 1) segmentation tags and 2) POS tags to a test corpus consisting of unsegmented lines of Tibetan characters. We propose a three-stage workflow and evaluate the results of both the segmenting and the POS tagging tasks. We argue that the Memory-Based Tagger (MBT) and the proposed workflow not only provide an adequate solution to these NLP challenges, they are also highly efficient tools for building larger annotated corpora of Tibetan. ERC grant IDs 609823 & 269752
MiLMo: Minority Multilingual Pre-trained Language Model
Pre-trained language models are trained on large-scale unlabeled data and
can then be fine-tuned on only small-scale labeled datasets to achieve
good results. Multilingual pre-trained language models can be trained on
multiple languages, and the model can understand multiple languages at the same
time. At present, research on pre-trained models mainly focuses on high-resource
languages, while there is relatively little research on low-resource languages
such as minority languages, and publicly available multilingual pre-trained language
models do not work well for minority languages. Therefore, this paper
constructs a multilingual pre-trained model named MiLMo that performs better on
minority language tasks, including Mongolian, Tibetan, Uyghur, Kazakh and
Korean. To solve the problem of scarcity of datasets on minority languages and
verify the effectiveness of the MiLMo model, this paper constructs a minority
multilingual text classification dataset named MiTC, and trains a word2vec
model for each language. By comparing the word2vec model and the pre-trained
model in the text classification task, this paper provides an optimal scheme
for the downstream task research of minority languages. The final experimental
results show that the performance of the pre-trained model is better than that
of the word2vec model, and it has achieved the best results in minority
multilingual text classification. The multilingual pre-trained model MiLMo,
multilingual word2vec model and multilingual text classification dataset MiTC
are published on http://milmo.cmli-nlp.com/
The lexicography of Tibetan
This chapter provides an overview of Tibetan lexicography, from the ninth century to today. While most Tibetan dictionaries were compiled in an ad hoc manner, some used citation collections. Electronic corpora have been built for Tibetan, but they have not as yet been used to assist dictionary compilation. The various obstacles that need to be overcome first in order to be able to compile corpus-based dictionaries are discussed
Emotion detection on social media status in Myanmar language
Many social media platforms have emerged and provided services in recent years. Most people, especially in Myanmar, use them to express their emotions or moods, learn subjects, sell products, read up-to-date news, and communicate with each other. Emotion detection on social media users is a critical task in opinion mining and sentiment analysis. This paper presents an emotion detection system for social media (Facebook) user statuses or posts written in the Myanmar (Burmese) language. Before emotion detection, the user posts are pre-processed with segmentation, stemming, part-of-speech (POS) tagging, and stop word removal. The system then uses our preconstructed Myanmar word-emotion lexicon, M-Lexicon, to extract the emotion words from the segmented, POS-tagged post. The system covers six types of emotion: surprise, disgust, fear, anger, sadness, and happiness. The system applies a naïve Bayes (NB) emotion classifier to determine the emotion when more than two words with different emotion values are extracted. The classifiers also classify the emotion of the users on their posts. The experiment shows that the system achieves 85% accuracy with NB-based emotion detection and 86% with a recurrent neural network (RNN)
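The lexicon lookup step in this abstract can be illustrated with a small sketch. It is not the authors' system: `TOY_LEXICON` is an invented English stand-in for M-Lexicon, the function names are hypothetical, and the rule for handing a post to a classifier is simplified to "no emotion words or conflicting emotions", whereas the abstract gives its own condition for applying the NB classifier.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for a word-emotion lexicon; entries are invented.
TOY_LEXICON: Dict[str, str] = {
    "glad": "happiness",
    "joyful": "happiness",
    "scared": "fear",
    "furious": "anger",
}


def extract_emotion_words(tokens: List[str], lexicon: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (token, emotion) pairs for every token found in the lexicon."""
    return [(token, lexicon[token]) for token in tokens if token in lexicon]


def lexicon_decision(tokens: List[str], lexicon: Dict[str, str]) -> Optional[str]:
    """Return the emotion if all extracted words agree, else None.

    None means the lexicon alone cannot decide (no emotion words, or
    conflicting emotions), so a trained classifier would be consulted.
    """
    emotions = {emotion for _, emotion in extract_emotion_words(tokens, lexicon)}
    if len(emotions) == 1:
        return next(iter(emotions))
    return None


# Tokens are assumed to be already segmented, stemmed, and stop-word filtered.
print(lexicon_decision(["so", "glad", "and", "joyful"], TOY_LEXICON))
print(lexicon_decision(["glad", "but", "scared"], TOY_LEXICON))
```

Returning `None` instead of guessing keeps the lexicon stage separate from the statistical stage, so either can be replaced independently.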