
    Improve and Implement an Open Source Question Answering System

    A question answering system takes natural-language queries from the user and returns a short, concise answer that best fits the question. This report discusses the integration and implementation of question answering systems for English and Hindi as part of the open source search engine Yioop. We have implemented a question answering system for English and Hindi, keeping in mind users who use these languages as their primary language. The user should be able to query a set of documents and get answers in the same language. English and Hindi differ considerably in language structure, characters, and so on. We have implemented the question answering system so that it supports localization, and we improved Part of Speech tagging performance by storing the lexicon in the database instead of in a file-based lexicon. We have implemented a Brill tagger variant for Part of Speech tagging of Hindi phrases, along with grammar rules for triplet extraction. We also improve Yioop's lexical data handling by allowing the user to add named entities. Our improvements to Yioop were then evaluated by comparing the retrieved answers against a dataset of answers known to be true. For testing, we created two indexes, one each for English and Hindi, by configuring Yioop to crawl 200,000 Wikipedia pages for each crawl. The crawls were configured to be domain specific, so that the English index is restricted to pages with English text and the Hindi index is restricted to pages with Hindi text. We then ran a set of 50 questions on the English and Hindi systems. We recorded an accuracy of about 55% for the Hindi system on simple factoid questions and an accuracy of 63% for the English question answering system.

    A Robust Transformation-Based Learning Approach Using Ripple Down Rules for Part-of-Speech Tagging

    In this paper, we propose a new approach to construct a system of transformation rules for the Part-of-Speech (POS) tagging task. Our approach is based on an incremental knowledge acquisition method in which rules are stored in an exception structure and new rules are added only to correct the errors of existing rules, thus allowing systematic control of the interaction between the rules. Experimental results on 13 languages show that our approach is fast in terms of training time and tagging speed. Furthermore, our approach obtains very competitive accuracy in comparison to state-of-the-art POS and morphological taggers. Comment: Version 1: 13 pages. Version 2: Submitted to AI Communications - the European Journal on Artificial Intelligence. Version 3: Resubmitted after major revisions. Version 4: Resubmitted after minor revisions. Version 5: to appear in AI Communications (accepted for publication on 3/12/2015).
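
The exception structure described in this abstract can be illustrated with a minimal sketch of a single-classification ripple-down rule tree. This is not the authors' implementation: the `Rule` class, the `classify` function, the context keys (`word`, `prev_word`), and the toy rules are all assumptions made for illustration. Each rule has an "except" child, tried when the rule fires, and an "else" child, tried when it does not; the conclusion of the last rule that fired is returned.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    """One node of a (hypothetical) ripple-down rule tree."""
    condition: Callable[[dict], bool]
    tag: str
    except_rule: Optional["Rule"] = None  # tried when this rule fires
    else_rule: Optional["Rule"] = None    # tried when this rule does not fire


def classify(root: Optional[Rule], context: dict, default: Optional[str] = None) -> Optional[str]:
    """Walk the tree and return the tag of the last rule whose condition held."""
    conclusion = default
    node = root
    while node is not None:
        if node.condition(context):
            conclusion = node.tag      # a more specific rule overrides its parent
            node = node.except_rule
        else:
            node = node.else_rule
    return conclusion


# Toy rules (assumed): the default tag is NOUN, corrected by two exceptions.
root = Rule(
    condition=lambda c: True,
    tag="NOUN",
    except_rule=Rule(
        condition=lambda c: c["prev_word"] == "to",
        tag="VERB",
        else_rule=Rule(condition=lambda c: c["word"].endswith("ly"), tag="ADV"),
    ),
)

print(classify(root, {"word": "run", "prev_word": "to"}))
```

Because a new rule is attached only as an exception to the rule that produced an error, existing rules keep their behaviour on the cases they already handled correctly, which is the controlled interaction the abstract refers to.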

    Improving the quality of Gujarati-Hindi Machine Translation through part-of-speech tagging and stemmer-assisted transliteration

    Machine Translation for Indian languages is an emerging research area. Transliteration is one module that has to be designed when building a translation system. Transliteration means mapping source language text into the target language. Simple mapping reduces the efficiency of the overall translation system. We propose the use of stemming and part-of-speech tagging for transliteration. The effectiveness of translation can be improved by using transliteration assisted by part-of-speech tagging and stemming. We show that much of the content in Gujarati gets transliterated while being processed for translation into Hindi.

    Part of Speech Tagging of Marathi Text Using Trigram Method

    In this paper we present a Marathi part-of-speech tagger. Marathi is a morphologically rich language spoken by the native people of Maharashtra. The tagger follows a statistical approach based on the trigram method. The main idea of the trigram method is to find the most likely POS tag for a token given the previous two tags, calculating probabilities to determine the best tag sequence. We describe the development of the tagger and present its evaluation.
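
The trigram idea in this abstract, choosing a tag from the two preceding tags, can be sketched as follows. This is an assumed, simplified illustration and not the paper's tagger: it uses unsmoothed maximum-likelihood counts and a greedy left-to-right decoder rather than a full best-sequence search, and the function names, the `<s>` padding symbol, and the English toy corpus are all hypothetical.

```python
from collections import Counter


def train_trigram(tagged_sentences):
    """Count tag trigrams, their two-tag contexts, and word/tag emissions."""
    trigrams, contexts, emissions, tag_counts = Counter(), Counter(), Counter(), Counter()
    for sentence in tagged_sentences:
        tags = ["<s>", "<s>"] + [tag for _, tag in sentence]
        for i in range(2, len(tags)):
            trigrams[(tags[i - 2], tags[i - 1], tags[i])] += 1
            contexts[(tags[i - 2], tags[i - 1])] += 1
        for word, tag in sentence:
            emissions[(word, tag)] += 1
            tag_counts[tag] += 1
    return trigrams, contexts, emissions, tag_counts


def transition_prob(model, t2, t1, t):
    """Estimate P(t | t2, t1) as count(t2, t1, t) / count(t2, t1)."""
    trigrams, contexts, _, _ = model
    total = contexts[(t2, t1)]
    return trigrams[(t2, t1, t)] / total if total else 0.0


def greedy_tag(model, words):
    """Tag left to right, picking the best-scoring tag given the previous two tags."""
    _, _, emissions, tag_counts = model
    history = ["<s>", "<s>"]
    for word in words:
        known = [t for t in sorted(tag_counts) if emissions[(word, t)] > 0]
        candidates = known or sorted(tag_counts)  # unknown word: consider every tag

        def score(tag):
            emit = emissions[(word, tag)] / tag_counts[tag] if known else 1.0
            return transition_prob(model, history[-2], history[-1], tag) * emit

        history.append(max(candidates, key=score))
    return history[2:]


# Toy training data (assumed, in English for readability).
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
]
model = train_trigram(corpus)
print(greedy_tag(model, ["the", "dog", "sleeps"]))
```

A practical tagger would add smoothing for unseen trigrams and replace the greedy step with a dynamic-programming search over whole tag sequences.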

    Part-Of-Speech Tagging Of Urdu in Limited Resources Scenario

    We address the problem of Part-of-Speech (POS) tagging of Urdu. POS tagging is the process of assigning a part-of-speech or lexical class marker to each word in a given text. Tagging for natural languages is similar to tokenization and lexical analysis for computer languages, except that we encounter ambiguities that have to be resolved. It plays a fundamental role in various Natural Language Processing (NLP) applications such as word sense disambiguation, parsing, named entity recognition and chunking. POS tagging plays a particularly important role in processing free-word-order languages because such languages have relatively complex morphological structure. Urdu is a morphologically rich language. Forms of the verb, as well as case, gender, and number, are expressed by the morphology. It shares its morphology, phonology and grammatical structures with Hindi. It shares its vocabulary with Arabic, Persian, Sanskrit, Turkish and Pashto. Urdu is written using the Perso-Arabic script. POS tagging of Urdu is a necessary component for most NLP applications of Urdu. Development of an Urdu POS tagger will influence several pipelined modules of a natural language understanding system, including machine translation, partial parsing and word sense disambiguation. Our objective is to develop a robust POS tagger for Urdu. We have worked on the automatic annotation of part-of-speech for Urdu. We have defined a tag-set for Urdu. We manually annotated a corpus of 10,000 sentences. We have used different machine learning methods, namely Hidden Markov Model (HMM), Maximum Entropy Model (ME) and Conditional Random Field (CRF). Further, to deal with a small annotated corpus, we explored the use of semi-supervised learning with an additional un-annotated corpus. We also explored the use of a dictionary to provide all possible POS labels for a given word. Since Urdu is morphologically productive, we augmented the Hidden Markov Model, Maximum Entropy Model and Conditional Random Field with morphological features, word suffixes and POS categories of words to develop a robust POS tagger for Urdu in the limited resources scenario.
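
The kind of feature augmentation mentioned at the end of this abstract (word suffixes plus dictionary-derived POS categories) might look like the sketch below. It is only an assumed illustration of per-token feature extraction for a ME- or CRF-style tagger: the function `token_features`, the feature names, the suffix length limit, and the transliterated toy words and lexicon are hypothetical and do not come from the paper.

```python
def token_features(words, index, lexicon, max_suffix=3):
    """Build a feature dictionary for the token at `index`.

    `lexicon` maps a word to the set of POS tags a dictionary allows for it.
    """
    word = words[index]
    features = {"word": word}
    # Suffixes act as a cheap stand-in for morphological analysis.
    for n in range(1, max_suffix + 1):
        if len(word) > n:
            features[f"suffix{n}"] = word[-n:]
    # Dictionary lookup: every POS category listed for this word (empty if unknown).
    features["possible_tags"] = tuple(sorted(lexicon.get(word, ())))
    features["prev_word"] = words[index - 1] if index > 0 else "<s>"
    return features


# Toy transliterated tokens and a one-entry lexicon (both assumed).
words = ["larka", "kitab", "parhta"]
lexicon = {"kitab": {"NN"}}
print(token_features(words, 1, lexicon))
```

Feature dictionaries of this shape can be fed to most sequence-labelling toolkits; which features actually help for Urdu is an empirical question that the sketch does not address.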