216 research outputs found

    Transliteration Based Text Input Methods for Telugu

    Comparison of Different Orthographies for Machine Translation of Under-Resourced Dravidian Languages

    Under-resourced languages are a significant challenge for statistical approaches to machine translation, and it has recently been shown that using training data from closely related languages can improve machine translation quality for these languages. While languages within the same language family share many properties, many under-resourced languages are written in their own native script, which makes it difficult to take advantage of these similarities. In this paper, we propose to alleviate the problem of different scripts by transcribing the native script into a common representation, i.e., the Latin script or the International Phonetic Alphabet (IPA). In particular, we compare coarse-grained transliteration into the Latin script with fine-grained IPA transcription. We performed experiments on the English-Tamil, English-Telugu, and English-Kannada translation tasks. Our results show improvements in BLEU, METEOR, and chrF scores from transliteration, and we find that transliteration into the Latin script outperforms the fine-grained IPA transcription.

    NLP Challenges for Machine Translation from English to Indian Languages

    This natural language processing work is carried out particularly on English-Kannada/Telugu. Kannada is a language of India. The Kannada language has a classification of Dravidian, Southern, Tamil-Kannada, and Kannada. Regions spoken: Kannada is spoken in Karnataka, Andhra Pradesh, Tamil Nadu, and Maharashtra. Population: the total number of people who speak Kannada was 35,346,000 as of 1997. Alternate names: other names for Kannada are Kanarese, Canarese, Banglori, and Madrassi. Dialects: some dialects of Kannada are Bijapur, Jeinu Kuruba, and Aine Kuruba. There are about 20 dialects, and Badaga may be one. Kannada is the state language of Karnataka. About 9,000,000 people speak Kannada as a second language. The literacy rate for people who speak Kannada as a first language is about 60%, which is the same as for those who speak Kannada as a second language (in India). Kannada was used in the Bible from 1831 to 2000. Statistical machine translation (SMT) is a machine translation paradigm where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. The statistical approach contrasts with the rule-based approaches to machine translation as well as with example-based machine translation.

    YIOOP! INTRODUCING AUTOSUGGEST AND SPELL CHECK

    This project adds autosuggest and spell-check for queries in Yioop [1], a PHP-based search engine. These features help a user by reducing typing, by catching spelling errors, and by making it easier to repeat searches. Commercial search engines like Google run on machine clusters and use lists of popular queries from their logs to provide relevant suggestions to users. Efficient storage of data on multiple servers is responsible for minimizing response times. Yioop typically runs on a smaller number of machines than commercial search engines do. This project aims to implement these computationally intensive functionalities in this constrained environment. This is achieved by performing any needed processing on the client side without sending queries to the Yioop server.
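
The abstract above does not say which data structure drives the suggestions, so the following is only a minimal sketch of one common approach: a prefix tree (trie) that can be searched locally without a server round trip. The names `TrieNode`, `Trie`, `insert`, and `suggest` are hypothetical and are not taken from Yioop, and Python is used purely for illustration even though the project itself works in the browser.

```python
class TrieNode:
    """One node of a prefix tree; children are keyed by character."""

    def __init__(self):
        self.children = {}
        self.is_word = False


class Trie:
    """Hypothetical local dictionary used to complete query prefixes."""

    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def suggest(self, prefix, limit=5):
        """Return up to `limit` stored words starting with `prefix`,
        in alphabetical order."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []  # no stored word has this prefix
        results = []
        stack = [(node, prefix)]
        # Depth-first walk; pushing children in reverse order means the
        # alphabetically smallest branch is popped (visited) first.
        while stack and len(results) < limit:
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            for ch in sorted(current.children, reverse=True):
                stack.append((current.children[ch], text + ch))
        return results


if __name__ == "__main__":
    trie = Trie(["search", "seat", "sea", "telugu"])
    print(trie.suggest("sea"))
```

A real system would likely rank suggestions by query popularity rather than alphabetically; the alphabetical order here is an assumption made to keep the sketch short.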

    A Comprehensive Review of Sentiment Analysis on Indian Regional Languages: Techniques, Challenges, and Trends

    Sentiment analysis (SA) is the process of understanding emotion within a text. It helps identify the opinion, attitude, and tone of a text, categorizing it as positive, negative, or neutral. SA is frequently used today as more and more people get a chance to share their thoughts thanks to the advent of social media. Sentiment analysis benefits industries around the globe, such as finance, advertising, marketing, travel, and hospitality. Although the majority of work done in this field is on global languages like English, in recent years the importance of SA in local languages has also been widely recognized. This has led to considerable research in the analysis of Indian regional languages. This paper comprehensively reviews SA in the following major Indian regional languages: Marathi, Hindi, Tamil, Telugu, Malayalam, Bengali, Gujarati, and Urdu. Furthermore, this paper presents techniques, challenges, findings, recent research trends, and future scope for improving the accuracy of results.

    Bridging Language Gaps in Health Information Access: Konkani-English CLIR System for Medical Knowledge

    This paper addresses the challenges posed by linguistic diversity in terms of medical information by introducing a Cross-Language Information Retrieval System attuned to the needs of Konkani language information seekers. The proposed system takes Konkani queries entered by the user, translates them to English, and retrieves the documents using a thesaurus-based approach. Various strategies have also been considered to address the challenges posed by the source language, Konkani, which is a minority language spoken in the Indian subcontinent. The proposed approach showcases the potential of combining language technology, information retrieval, and medical domain expertise to bridge linguistic barriers. As healthcare information remains a critical societal need, this work holds promise in facilitating equitable access to medical knowledge.

    Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

    The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system mostly depends on parallel data, and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data but using external morphological resources. A set of new phrase associations is added to the translation and reordering models; each of them corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations, and the results showed improvements in performance in terms of automatic scores (BLEU and Meteor) and a reduction of out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.

    Unsupervised Machine Translation On Dravidian Languages

    Unsupervised neural machine translation (UNMT) is especially beneficial for low-resource languages such as those from the Dravidian family. However, UNMT systems tend to fail in realistic scenarios involving actual low-resource languages. Recent works propose to utilize auxiliary parallel data and have achieved state-of-the-art results. In this work, we focus on unsupervised translation between English and Kannada, a low-resource Dravidian language. We additionally utilize a limited amount of auxiliary data between English and other related Dravidian languages. We show that unifying the writing systems is essential in unsupervised translation between the Dravidian languages. We explore several model architectures that use the auxiliary data in order to maximize knowledge sharing and enable UNMT for distant language pairs. Our experiments demonstrate that it is crucial to include auxiliary languages that are similar to our focal language, Kannada. Furthermore, we propose a metric to measure language similarity and show that it serves as a good indicator for selecting the auxiliary languages.
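
The abstract above does not state how the writing systems were unified, so the following is only an illustrative sketch of one simple option, not the authors' method. It relies on the fact that the Unicode Telugu block (U+0C00 to U+0C7F) and Kannada block (U+0C80 to U+0CFF) are laid out in parallel, so most characters can be mapped by a fixed code point offset. The function name `telugu_to_kannada` is hypothetical.

```python
# Assumption: scripts are unified by shifting code points between the
# parallel Unicode blocks for Telugu and Kannada. A few positions are
# assigned in only one of the two blocks, so real data needs extra checks.
TELUGU_START, TELUGU_END = 0x0C00, 0x0C7F
KANNADA_START = 0x0C80


def telugu_to_kannada(text: str) -> str:
    """Map Telugu-block characters to the same offset in the Kannada block.

    Characters outside the Telugu block (Latin letters, digits,
    punctuation, spaces) are passed through unchanged.
    """
    converted = []
    for ch in text:
        code_point = ord(ch)
        if TELUGU_START <= code_point <= TELUGU_END:
            converted.append(chr(code_point - TELUGU_START + KANNADA_START))
        else:
            converted.append(ch)
    return "".join(converted)


if __name__ == "__main__":
    # The word "Telugu" written in Telugu script, converted to Kannada script.
    print(telugu_to_kannada("\u0c24\u0c46\u0c32\u0c41\u0c17\u0c41"))
```

The same offset idea extends to other Indic blocks that follow the shared layout, although each pair has its own gaps and script-specific characters that a production mapping would have to handle explicitly.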