84 research outputs found

    Comparative Analysis of Urdu Based Stemming Techniques

    Stemming reduces the many variant forms of a word to its base, stem, or root, which is necessary for many language processing applications, including those for Urdu. Urdu is a morphologically rich and resourceful language. Multilingual Urdu words are very challenging to process due to the complexity of their morphology. Research on Urdu stemming is about a decade old. The present work introduces research on Urdu stemmers with better performance compared to the existing Urdu stemmer.
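
As a rough illustration of what "reducing variant forms to a stem" means, the following minimal Python sketch strips a suffix from a word. The two-entry suffix list, the `MIN_STEM_LEN` guard, and the function name `light_stem` are assumptions made for this example only; they are not taken from the paper and do not represent any of the stemmers it compares.

```python
# Illustrative suffix list only: two common Urdu plural endings.
# This is NOT the rule set of any published Urdu stemmer.
SUFFIXES = [
    "\u0648\u06ba",  # وں (oblique plural ending)
    "\u06cc\u06ba",  # یں (feminine plural ending)
]

# Hypothetical guard: never strip a suffix if fewer than this many
# characters would remain.
MIN_STEM_LEN = 2


def light_stem(word, suffixes=SUFFIXES, min_stem_len=MIN_STEM_LEN):
    """Strip the longest matching suffix, if enough of the word remains."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


book = "\u06a9\u062a\u0627\u0628"  # کتاب ("book")
variants = [
    book,                   # کتاب
    book + "\u06cc\u06ba",  # کتابیں ("books")
    book + "\u0648\u06ba",  # کتابوں ("books", oblique)
]
stems = {light_stem(w) for w in variants}
print(stems)
```

All three variants collapse to the same stem here. A real Urdu stemmer needs far larger rule sets and exception lists, since blind suffix stripping over-stems many words.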

    QutNocturnal@HASOC'19: CNN for Hate Speech and Offensive Content Identification in Hindi Language

    We describe our top-team solution to Task 1 for Hindi in the HASOC contest organised by FIRE 2019. The task is to identify hate speech and offensive language in Hindi. More specifically, it is a binary classification problem where a system is required to classify tweets into two classes: (a) Hate and Offensive (HOF) and (b) Not Hate or Offensive (NOT). In contrast to the popular idea of pretraining word vectors (a.k.a. word embeddings) with a large corpus from a general domain such as Wikipedia, we used a relatively small collection of relevant tweets (i.e. random and sarcasm tweets in Hindi and Hinglish) for pretraining. We trained a Convolutional Neural Network (CNN) on top of the pretrained word vectors. This approach allowed us to be ranked first for this task out of all teams. Our approach could easily be adapted to other applications where the goal is to predict the class of a text when the provided context is limited.

    A Comprehensive Review of Sentiment Analysis on Indian Regional Languages: Techniques, Challenges, and Trends

    Sentiment analysis (SA) is the process of understanding emotion within a text. It helps identify the opinion, attitude, and tone of a text, categorizing it as positive, negative, or neutral. SA is frequently used today as more and more people get a chance to share their thoughts thanks to the advent of social media. Sentiment analysis benefits industries around the globe, such as finance, advertising, marketing, travel, and hospitality. Although the majority of work done in this field is on global languages like English, in recent years the importance of SA in local languages has also been widely recognized. This has led to considerable research in the analysis of Indian regional languages. This paper comprehensively reviews SA in the following major Indian regional languages: Marathi, Hindi, Tamil, Telugu, Malayalam, Bengali, Gujarati, and Urdu. Furthermore, this paper presents techniques, challenges, findings, recent research trends, and future scope for enhancing the accuracy of results.

    DCU@FIRE-2014: fuzzy queries with rule-based normalization for mixed script information retrieval

    We describe the participation of Dublin City University (DCU) in the FIRE-2014 transliteration search task (TST). The TST involves an ad-hoc search over a collection of Hindi film song lyrics. The Hindi language content of each document in the collection is either written in the native Devanagari script, transliterated in Roman script, or a combination of both. The queries can be in mixed script as well. The task is challenging primarily because of the vocabulary mismatch that may arise from the multiple transliteration alternatives. We attempt to address the vocabulary mismatch problem during both the indexing and retrieval stages. During indexing, we apply a rule-based normalization to some character sequences of the transliterated words in order to have a single representation in the index for the multiple transliteration alternatives. During the retrieval phase, we make use of prefix-matched fuzzy query terms to account for the morphological variations of the transliterated words. The results show significant improvement over a standard baseline query likelihood language modelling (LM) approach. Additionally, we apply statistical machine transliteration to train a transliteration model in order to predict the transliteration of out-of-vocabulary words. Surprisingly, even with satisfactory transliteration accuracy, we found that automatic transliteration of query terms degraded retrieval effectiveness.
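
The two stages described in this abstract, rule-based normalization at indexing time and prefix-matched fuzzy terms at query time, can be sketched in Python roughly as follows. The rewrite rules in `RULES`, the thresholds `prefix_len` and `min_ratio`, and every function name are hypothetical choices for illustration. The abstract does not list the actual DCU rules or matching parameters, and the similarity measure used here (`difflib.SequenceMatcher`) is a stand-in, not necessarily what the authors used.

```python
import re
from difflib import SequenceMatcher

# Hypothetical normalization rules, applied in order. They are chosen
# only to illustrate the idea, not taken from the DCU system.
RULES = [
    (re.compile(r"ee"), "i"),        # "deewana" and "diwana" share a form
    (re.compile(r"oo"), "u"),
    (re.compile(r"(.)\1+"), r"\1"),  # collapse repeated letters: "pyaar" -> "pyar"
    (re.compile(r"w"), "v"),         # treat w and v as interchangeable
]


def normalize(word):
    """Map transliteration variants of a word to one canonical form."""
    word = word.lower()
    for pattern, replacement in RULES:
        word = pattern.sub(replacement, word)
    return word


def build_index(docs):
    """Build an inverted index from normalized term to document ids."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(normalize(token), set()).add(doc_id)
    return index


def fuzzy_lookup(index, query_term, prefix_len=2, min_ratio=0.7):
    """Return documents whose terms share a prefix with the query term
    and are similar enough to it (both thresholds are assumptions)."""
    q = normalize(query_term)
    hits = set()
    for term, doc_ids in index.items():
        if term[:prefix_len] != q[:prefix_len]:
            continue  # prefix filter keeps the fuzzy comparison cheap
        if SequenceMatcher(None, q, term).ratio() >= min_ratio:
            hits |= doc_ids
    return hits


docs = {"d1": "pyaar kiya to darna kya", "d2": "dil deewana"}
index = build_index(docs)
print(fuzzy_lookup(index, "diwana"))  # spelling variant of "deewana"
print(fuzzy_lookup(index, "pyare"))   # inflected variant of "pyaar"
```

Normalizing both documents and queries means spelling variants meet at the same index key, while the prefix-plus-similarity check catches inflected forms that normalization alone does not merge.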

    Importance of Hindi Language and Its Significance in Nation-Building

    Hindi is the second most spoken language on earth, after Mandarin Chinese. It is estimated that a large part of a billion people speak this language. Hindi is one of the many languages of India and is regarded as the national and official language of India. Indian songs and modified versions of them have been widely used by mainstream rap and popular music artists across the globe. In India, music gets excellent with jams in the United States, just as with the rest of the world. Modern Hindi is a language that has developed into a well-established form in India since independence and is used in various fields. Three specific types of Hindi have developed and serve three different functions. Today, Hindi is one of India's most important official languages, with over 1025 million people speaking it worldwide. "In the Indian provinces of Uttar Pradesh, Gujarat, Madhya Pradesh, Bihar, Madhya Pradesh, Chhattisgarh, Jharkhand, Uttarakhand, Himachal Pradesh, Rajasthan, Haryana, and Delhi, Hindi is the Official Language. It is also widely spoken and perceived in several other Indian states, including Punjab, Andhra Pradesh, West Bengal, and Maharashtra". People who relocate to North India from other states learn Hindi. A sample of 119 respondents was collected through a "standard questionnaire" based on a five-point interval scale.

    Bridging Language Gaps in Health Information Access: Konkani-English CLIR System for Medical Knowledge

    This paper addresses the challenges that linguistic diversity poses for access to medical information by introducing a Cross-Language Information Retrieval system attuned to the needs of Konkani-language information seekers. The proposed system takes Konkani queries entered by the user, translates them to English, and retrieves documents using a thesaurus-based approach. Various strategies have also been considered to address the challenges posed by the source language, Konkani, which is a minority language spoken in the Indian subcontinent. The proposed approach showcases the potential of combining language technology, information retrieval, and medical domain expertise to bridge linguistic barriers. As healthcare information remains a critical societal need, this work holds promise in facilitating equitable access to medical knowledge.

    Improving Search via Named Entity Recognition in Morphologically Rich Languages – A Case Study in Urdu

    University of Minnesota Ph.D. dissertation. February 2018. Major: Computer Science. Advisors: Vipin Kumar, Blake Howald. 1 computer file (PDF); xi, 236 pages. Search is not a solved problem, even in the world of Google's and Bing's state-of-the-art engines. Google and similar search engines are keyword based. Keyword-based searching suffers from the vocabulary mismatch problem: the terms in the document and in the user's information request don't overlap, as with "cars" and "automobiles". This phenomenon is called synonymy. Similarly, the user's term may be polysemous: a user is inquiring about a river's bank, but documents about financial institutions are matched. Vocabulary mismatch is exacerbated when the search occurs in a Morphologically Rich Language (MRL). Concept search techniques like dimensionality reduction do not improve search in Morphologically Rich Languages. Names frequently occur in news text and determine the "what," "where," "when," and "who" in the news text. Named Entity Recognition (NER) attempts to recognize names automatically in text, but these techniques are far from mature in MRLs, especially in Arabic-script languages. Urdu is one of the focus MRLs of this dissertation, along with Arabic, Farsi, Hindi, and Russian, but it does not have the enabling technologies for NER and search. A corpus, a stop word generation algorithm, a light stemmer, a baseline, and an NER algorithm are created so that NER-aware search can be accomplished for Urdu. This dissertation demonstrates that NER-aware search on Arabic, Russian, Urdu, and English shows significant improvement over the baseline. Furthermore, this dissertation highlights the challenges of research in low-resource MRLs.