105 research outputs found

    Tibetan Word Segmentation as Syllable Tagging Using Conditional Random Field

    In this paper, we propose a novel approach to Tibetan word segmentation using a conditional random field. We reformulate segmentation as a syllable tagging problem. The approach labels each syllable with a word-internal position tag and combines syllables into words according to their tags. As there is no publicly available Tibetan word segmentation corpus, the training corpus is generated by another segmenter, which has an F-score of 96.94% on the test set. Two feature template sets, TMPT-6 and TMPT-10, are used and compared, and the results show that the former is better. Experiments also show that a larger training set improves performance significantly. Trained on a set of 131,903 sentences, the segmenter achieves an F-score of 95.12% on a test set of 1,000 sentences. © 2011 by Huidan Liu, Minghua Nuo, Longlong Ma, Jian Wu, and Yeping He.
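
    The tag-combination step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name its tag set, so the B/M/E/S scheme (begin, middle, end, single) is an assumption, and the function name `combine_syllables` and the placeholder syllables are hypothetical. The tags would normally come from a trained CRF; here they are given by hand.

```python
from typing import List


def combine_syllables(syllables: List[str], tags: List[str], joiner: str = "") -> List[str]:
    """Merge syllables into words from word-internal position tags.

    Assumed tag set (not taken from the paper):
    B = word begin, M = word middle, E = word end, S = single-syllable word.
    """
    if len(syllables) != len(tags):
        raise ValueError("syllables and tags must have the same length")

    words: List[str] = []
    current: List[str] = []
    for syllable, tag in zip(syllables, tags):
        # A new word starts at B or S; flush any unfinished word first
        # so that an inconsistent tag sequence still yields output.
        if tag in ("B", "S") and current:
            words.append(joiner.join(current))
            current = []
        current.append(syllable)
        # A word is complete at E or S.
        if tag in ("E", "S"):
            words.append(joiner.join(current))
            current = []
    # Flush a trailing word that never received an E tag.
    if current:
        words.append(joiner.join(current))
    return words


if __name__ == "__main__":
    # Placeholder syllables; the tags stand in for a tagger's predictions.
    print(combine_syllables(["a", "b", "c", "d", "e"], ["B", "E", "S", "B", "M"], joiner="-"))
```

    The sketch tolerates inconsistent tag sequences (for example a B followed directly by another B) by closing the open word instead of raising an error, which is one possible design choice for handling imperfect tagger output.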

    The Study of Graininess for Tibetan Named Entity Recognition

    Tibetan named entity recognition (NER), a fundamental part of Tibetan natural language processing, is an important subtask of information extraction. In this paper, we survey the methods, results, and problems of Tibetan NER, and we discuss which kind of token should be taken as the unit of graininess (granularity) for the Tibetan NER task. The paper uses two different levels of graininess, syllables and words, in a comparative experiment on Tibetan person names, location names, and organization names. The results show that person names are recognized better with syllables than with words. Location names show only a small difference, though results differ by species. Organization names, however, are better suited to words.

    Introduction (to Special Issue on Tibetan Natural Language Processing)

    This introduction surveys research on Tibetan NLP, both in China and in the West, and contextualizes the articles contained in the special issue.

    The contribution of corpus linguistics to lexicography and the future of Tibetan dictionaries

    The first alphabetized dictionary of Tibetan appeared in 1829 (cf. Bray 2008) and the intervening 184 years have witnessed the publication of scores of other Tibetan dictionaries (cf. Simon 1964). Hundreds of Tibetan dictionaries are now available; these include bilingual dictionaries, both to and from such languages as English, French, German, Latin, Japanese, etc. and specialized dictionaries focusing on medicine, plants, dialects, archaic terms, neologisms, etc. (cf. Walter 2006, McGrath 2008). However, if one classifies Tibetan dictionaries by the methods of their compilation the accomplishments of Tibetan lexicography are less impressive. Methodologies of dictionary compilation divide heuristically into three types. First, some dictionaries lack explicit methodology; these works assemble words in an ad hoc manner and illustrate them with invented examples. Second, there are dictionaries that are compiled over very long periods of time on the basis of collections of slips recording attestations of words as used in context. Third, more recent dictionaries are compiled on the basis of electronic text corpora, which are processed computationally to aid in the precision, consistency and speed of dictionary compilation. These methods may be called respectively the 'informal method', the 'traditional method', and the 'modern method'. The overwhelming majority of Tibetan dictionaries were compiled with the informal method. Only five Tibetan dictionaries use the traditional methodology. No Tibetan dictionary yet compiled makes use of the modern method.

    Segmenting and POS tagging Classical Tibetan using a Memory-Based Tagger

    This paper presents a new approach to two challenging NLP tasks in Classical Tibetan: word segmentation and Part-of-Speech (POS) tagging. We demonstrate how both these problems can be approached in the same way, by generating a memory-based tagger that assigns 1) segmentation tags and 2) POS tags to a test corpus consisting of unsegmented lines of Tibetan characters. We propose a three-stage workflow and evaluate the results of both the segmenting and the POS tagging tasks. We argue that the Memory-Based Tagger (MBT) and the proposed workflow not only provide an adequate solution to these NLP challenges, they are also highly efficient tools for building larger annotated corpora of Tibetan. ERC grant IDs 609823 & 269752.

    MiLMo: Minority Multilingual Pre-trained Language Model

    Pre-trained language models are trained on large-scale unsupervised data and can then be fine-tuned on small-scale labeled datasets to achieve good results. Multilingual pre-trained language models are trained on multiple languages, so that a single model can understand several languages at the same time. At present, research on pre-trained models mainly focuses on resource-rich languages, while there is relatively little research on low-resource languages such as minority languages, and public multilingual pre-trained language models do not work well for minority languages. Therefore, this paper constructs a multilingual pre-trained model named MiLMo that performs better on minority language tasks, covering Mongolian, Tibetan, Uyghur, Kazakh and Korean. To address the scarcity of datasets for minority languages and to verify the effectiveness of the MiLMo model, this paper constructs a minority multilingual text classification dataset named MiTC and trains a word2vec model for each language. By comparing the word2vec models and the pre-trained model on the text classification task, this paper provides an optimal scheme for downstream task research on minority languages. The final experimental results show that the pre-trained model performs better than the word2vec models and achieves the best results in minority multilingual text classification. The multilingual pre-trained model MiLMo, the multilingual word2vec models and the multilingual text classification dataset MiTC are published at http://milmo.cmli-nlp.com/

    The lexicography of Tibetan

    This chapter provides an overview of Tibetan lexicography, from the ninth century to today. While most Tibetan dictionaries were compiled in an ad hoc manner, some used citation collections. Electronic corpora have been built for Tibetan, but they have not as yet been used to assist dictionary compilation. The various obstacles that need to be overcome first in order to be able to compile corpus-based dictionaries are discussed

    Emotion detection on social media status in Myanmar language

    Many social media platforms have emerged and provided services in recent years. Most people, especially in Myanmar, use them to express their emotions or moods, learn subjects, sell products, read up-to-date news, and communicate with each other. Emotion detection for social media users is a critical task in opinion mining and sentiment analysis. This paper presents an emotion detection system for social media (Facebook) user statuses or posts written in the Myanmar (Burmese) language. Before the emotion detection process, the user posts are pre-processed through segmentation, stemming, part-of-speech (POS) tagging, and stop word removal. The system then uses our preconstructed Myanmar word-emotion lexicon, M-Lexicon, to extract the emotion words from the segmented, POS-tagged post. The system covers six types of emotion: surprise, disgust, fear, anger, sadness, and happiness. The system applies a naïve Bayes (NB) emotion classifier to determine the emotion when more than two words with different emotion values are extracted. The classifiers also classify the emotion of the users on their posts. The experiment shows that the system achieves 85% accuracy with NB-based emotion detection and 86% with a recurrent neural network (RNN).