3,875 research outputs found

    Improving the quality of Gujarati-Hindi Machine Translation through part-of-speech tagging and stemmer-assisted transliteration

    Get PDF
    Machine Translation for Indian languages is an emerging research area. Transliteration is one such module that we design while designing a translation system. Transliteration means mapping of source language text into the target language. Simple mapping decreases the efficiency of overall translation system. We propose the use of stemming and part-of-speech tagging for transliteration. The effectiveness of translation can be improved if we use part-of-speech tagging and stemming assisted transliteration.We have shown that much of the content in Gujarati gets transliterated while being processed for translation to Hindi language

    Sub-word indexing and blind relevance feedback for English, Bengali, Hindi, and Marathi IR

    Get PDF
    The Forum for Information Retrieval Evaluation (FIRE) provides document collections, topics, and relevance assessments for information retrieval (IR) experiments on Indian languages. Several research questions are explored in this paper: 1. how to create create a simple, languageindependent corpus-based stemmer, 2. how to identify sub-words and which types of sub-words are suitable as indexing units, and 3. how to apply blind relevance feedback on sub-words and how feedback term selection is affected by the type of the indexing unit. More than 140 IR experiments are conducted using the BM25 retrieval model on the topic titles and descriptions (TD) for the FIRE 2008 English, Bengali, Hindi, and Marathi document collections. The major findings are: The corpus-based stemming approach is effective as a knowledge-light term conation step and useful in case of few language-specific resources. For English, the corpusbased stemmer performs nearly as well as the Porter stemmer and significantly better than the baseline of indexing words when combined with query expansion. In combination with blind relevance feedback, it also performs significantly better than the baseline for Bengali and Marathi IR. Sub-words such as consonant-vowel sequences and word prefixes can yield similar or better performance in comparison to word indexing. There is no best performing method for all languages. For English, indexing using the Porter stemmer performs best, for Bengali and Marathi, overlapping 3-grams obtain the best result, and for Hindi, 4-prefixes yield the highest MAP. However, in combination with blind relevance feedback using 10 documents and 20 terms, 6-prefixes for English and 4-prefixes for Bengali, Hindi, and Marathi IR yield the highest MAP. Sub-word identification is a general case of decompounding. It results in one or more index terms for a single word form and increases the number of index terms but decreases their average length. The corresponding retrieval experiments show that relevance feedback on sub-words benefits from selecting a larger number of index terms in comparison with retrieval on word forms. Similarly, selecting the number of relevance feedback terms depending on the ratio of word vocabulary size to sub-word vocabulary size almost always slightly increases information retrieval effectiveness compared to using a fixed number of terms for different languages

    A Lightweight Stemmer for Gujarati

    Get PDF
    Gujarati is a resource poor language with almost no language processing tools being available. In this paper we have shown an implementation of a rule based stemmer of Gujarati. We have shown the creation of rules for stemming and the richness in morphology that Gujarati possesses. We have also evaluated our results by verifying it with a human expert

    Hindi Sentiment Analysis

    Get PDF
    With the evolution of web technology, a huge amount of data is present on the web. In addition to exploring the resources present on web, the users also provide feedback thus generating additional useful data. Thus mining of data and identifying user sentiments is the need of the hour. Sentiment analysis is the natural language processing task that mines information from various text forms such as blogs, reviews and classify them on the basis of polarity such as positive, negative or neutral. Hindi is the national language of India and is spoken by 366 million people across the world. The percentage of web content in Hindi is growing at lightning speed. A lot of research in opinion mining is carried out in English language but there are not many instances of research done in Hindi language. In this paper we have proposed a strategy for classifying given Hindi texts in to different classes and then extract sentiments in terms of positive, negative and neutral for identified classes. Naive Bayes, Modified Maximum entropy are used for classification and HindiSentiWordNet (HSWN) is used to determine the polarity of individual class
    corecore