7,988 research outputs found

    Penalizing unknown words’ emissions in HMM POS tagger based on Malay affix morphemes

    The challenge in unsupervised Hidden Markov Model (HMM) training for a POS tagger is that the training depends on an untagged corpus; the only supervised data limiting the possible tagging of words is a dictionary. Therefore, training cannot properly map possible tags. The exact morphemes of prefixes, suffixes and circumfixes in the agglutinative Malay language are examined to assign unknown words’ probable tags based on linguistically meaningful affixes, using a morpheme-based POS guessing algorithm for tagging. The algorithm has been integrated into the Viterbi algorithm, which uses HMM-trained parameters for tagging new sentences. In the experiment, the tagger first uses character-based prediction to handle unknown words; next, it uses the morpheme-based POS guessing algorithm; lastly, it uses a combination of the first and second. Keywords: Malay POS tagger; morpheme-based; HMM
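
    The idea of guessing an unknown word's tags from its affixes and feeding that guess into Viterbi decoding can be sketched roughly as follows. This is not the paper's algorithm: the affix table `AFFIX_TAGS`, the `penalty` value, the tag names and the helper functions are all illustrative assumptions.

```python
import math

# Hypothetical affix-to-tag hints, for illustration only (not from the paper).
AFFIX_TAGS = {
    ("prefix", "me"): {"VERB"},
    ("prefix", "ber"): {"VERB"},
    ("suffix", "an"): {"NOUN"},
}

def guess_tags(word):
    """Return the tags hinted by any matching affix (empty set if none match)."""
    tags = set()
    for (kind, affix), hinted in AFFIX_TAGS.items():
        if len(word) <= len(affix):
            continue
        if kind == "prefix" and word.startswith(affix):
            tags |= hinted
        if kind == "suffix" and word.endswith(affix):
            tags |= hinted
    return tags

def emission(word, tag, emit, tags, penalty=0.01):
    """Emission score; unknown words fall back to affix hints (assumed scheme)."""
    if word in emit:                      # known word: use trained probability
        return emit[word].get(tag, 0.0)
    hinted = guess_tags(word)
    if not hinted:                        # no affix evidence: uniform over tags
        return 1.0 / len(tags)
    return 1.0 if tag in hinted else penalty  # penalize tags the affixes do not hint

def viterbi(words, tags, start, trans, emit):
    """Most likely tag sequence under the HMM, computed in log space."""
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")
    best = [{t: lg(start[t]) + lg(emission(words[0], t, emit, tags)) for t in tags}]
    back = []
    for w in words[1:]:
        scores, ptr = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p] + lg(trans[p][t]))
            scores[t] = best[-1][prev] + lg(trans[prev][t]) + lg(emission(w, t, emit, tags))
            ptr[t] = prev
        best.append(scores)
        back.append(ptr)
    path = [max(tags, key=lambda t: best[-1][t])]
    for ptr in reversed(back):            # follow back-pointers to recover the path
        path.append(ptr[path[-1]])
    return path[::-1]
```

    With a toy model in which only "saya" is known, `viterbi(["saya", "membaca"], ...)` would tag the unknown "membaca" as a verb because of the assumed "me" prefix hint.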

    A Machine learning approach to POS tagging

    We have applied inductive learning of statistical decision trees and relaxation labelling to the Natural Language Processing (NLP) task of morphosyntactic disambiguation (Part Of Speech Tagging). The learning process is supervised and obtains a language model oriented to resolve POS ambiguities. This model consists of a set of statistical decision trees expressing distribution of tags and words in some relevant contexts. The acquired language models are complete enough to be directly used as sets of POS disambiguation rules, and include more complex contextual information than simple collections of n-grams usually used in statistical taggers. We have implemented a quite simple and fast tagger that has been tested and evaluated on the Wall Street Journal (WSJ) corpus with a remarkable accuracy. However, better results can be obtained by translating the trees into rules to feed a flexible relaxation labelling based tagger. In this direction we describe a tagger which is able to use information of any kind (n-grams, automatically acquired constraints, linguistically motivated manually written constraints, etc.), and in particular to incorporate the machine learned decision trees. Simultaneously, we address the problem of tagging when only small training material is available, which is crucial in any process of constructing, from scratch, an annotated corpus. We show that quite high accuracy can be achieved with our system in this situation. Postprint (published version)

    Implementation of Morphological Analysis for Handling Out-of-Vocabulary Words in an Indonesian Part-of-Speech Tagger Using a Hidden Markov Model

    Part-of-speech (PoS) tagging is a task in natural language processing (NLP) that assigns a part-of-speech category to each word in an input sentence. The hidden Markov model (HMM) is a probabilistic PoS tagging algorithm, so it depends heavily on the training corpus. The limited coverage of the training corpus and the breadth of the Indonesian vocabulary give rise to the problem of out-of-vocabulary (OOV) words. Morphological analysis is used to address this problem. This research develops two systems, an HMM PoS tagger with morphological analysis (AM) and an HMM PoS tagger without AM, using the same training and testing corpora. The testing corpus has a 30% OOV rate over 6,676 tokens, or 740 input sentences. The HMM-only system achieves 97.54% accuracy, while the HMM system with morphological analysis achieves a highest accuracy of 99.14%.

    Automatic processing of code-mixed social media content

    Code-mixing or language-mixing is a linguistic phenomenon where multiple languages mix together during conversation. Standard natural language processing (NLP) tools such as part-of-speech (POS) taggers and parsers perform poorly on such content because they are generally trained on monolingual content. Thus there is a need for code-mixed NLP. This research focuses on creating a code-mixed corpus in English-Hindi-Bengali and using it to develop a word-level language identifier and a POS tagger for such code-mixed content. The first target of this research is word-level language identification. A data set of romanised and code-mixed content written in English, Hindi and Bengali was created and annotated. Word-level language identification (LID) was performed on this data using dictionaries and machine learning techniques. We find that among a dictionary-based system, a character-n-gram based linear model, a character-n-gram based first order Conditional Random Field (CRF) and a recurrent neural network in the form of a Long Short Term Memory (LSTM) that considers words as well as characters, the LSTM outperformed the other methods. We also took part in the First Workshop of Computational Approaches to Code-Switching, EMNLP, 2014, where we achieved the highest token-level accuracy in the word-level language identification task of Nepali-English. The second target of this research is part-of-speech (POS) tagging. POS tagging methods for code-mixed data (e.g. pipeline and stacked systems and LSTM-based neural models) have been implemented; among them, the neural approach outperformed the others. Further, we investigate building a joint model to perform language identification and POS tagging jointly. We compare a factorial CRF (FCRF) based joint model with three LSTM-based multi-task models for word-level language identification and POS tagging. The neural models achieve good accuracy in language identification and POS tagging, outperforming the FCRF approach. Furthermore, we found that a multi-task learning approach is better than performing each task (e.g. language identification and POS tagging) individually with a neural approach. Comparison between the three neural approaches revealed that, without using task-specific recurrent layers, it is possible to achieve good accuracy by careful handling of the output layers for the two tasks, i.e. LID and POS tagging
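
    As a rough illustration of the character-n-gram features that a word-level language identifier of this kind might consume, the sketch below extracts padded character n-grams from a single token. The boundary markers `^` and `$`, the function name and the default n-gram range are assumptions, not details from the thesis.

```python
def char_ngrams(word, n_min=2, n_max=3):
    """Return all character n-grams of a token, with assumed boundary markers."""
    padded = "^" + word.lower() + "$"   # mark word start and end
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams
```

    Each n-gram would typically become one sparse feature for a linear model or CRF, so that romanised words never seen in training still share features with known words.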

    Comparison of different POS tagging techniques for some South Asian languages

    This thesis report is submitted in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science and Engineering, 2006. Cataloged from PDF version of thesis report. Includes bibliographical references (page 47). There are different approaches to the problem of assigning a part of speech (POS) tag to each word of a natural language sentence. We present a comparison of the different approaches to POS tagging for the Bangla language and two other South Asian languages, as well as the baseline performances of different POS tagging techniques for the English language. The most widely used methods for English are the statistical methods, such as n-gram based tagging or Hidden Markov Model (HMM) based tagging, and the rule-based or transformation-based methods, such as Brill’s tagger. Subsequent research adds various modifications to these basic approaches to improve the performance of the taggers for English. Here, we present an elaborate review of previous work in the area with a focus on South Asian languages such as Hindi and Bangla. We experiment with Brill’s transformation-based tagger and the supervised HMM-based tagger, without modifications for added improvement in accuracy, on English using training corpora of different sizes from the Brown corpus. We also compare the performances of these taggers on three South Asian languages, with a focus on Bangla, using two different tagsets and corpora of different sizes, which reveals that Brill’s transformation-based tagger performs considerably well for South Asian languages. We also check the baseline performances of the taggers for English and try to conclude how these approaches might perform if we use a considerable amount of annotated training corpus. Fahim Muhammad Hasan. B. Computer Science and Engineerin

    Part of Speech Tagging of Marathi Text Using Trigram Method

    In this paper we present a part-of-speech tagger for Marathi, a morphologically rich language spoken by the native people of Maharashtra. The tagger is developed with a statistical approach using the trigram method. The main concept of the trigram method is to find the most likely POS for a token given the previous two tags, calculating probabilities to determine the best sequence of tags. In this paper we show the development of the tagger and also present its evaluation
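
    The trigram idea, estimating how likely a tag is given the two preceding tags, can be sketched with plain maximum-likelihood counts. This is a minimal illustration and not the paper's implementation: the `<s>` padding symbol, the function names and the absence of smoothing are assumptions.

```python
from collections import Counter

def trigram_model(tag_sequences):
    """Build P(t3 | t1, t2) from tagged sentences using unsmoothed MLE counts."""
    tri, bi = Counter(), Counter()
    for tags in tag_sequences:
        padded = ["<s>", "<s>"] + list(tags)    # assumed start-of-sentence padding
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1  # count (t1, t2, t3)
            bi[tuple(padded[i - 2:i])] += 1       # count the context (t1, t2)

    def prob(t1, t2, t3):
        context = bi[(t1, t2)]
        return tri[(t1, t2, t3)] / context if context else 0.0

    return prob
```

    A real tagger would combine these transition estimates with word emission probabilities and some smoothing for unseen trigrams; the sketch covers only the tag-context part.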