804 research outputs found

    A Comprehensive Review of Sentiment Analysis on Indian Regional Languages: Techniques, Challenges, and Trends

    Sentiment analysis (SA) is the process of understanding emotion within a text. It helps identify the opinion, attitude, and tone of a text, categorizing it as positive, negative, or neutral. SA is frequently used today because social media gives more and more people a chance to share their thoughts. Sentiment analysis benefits industries around the globe, such as finance, advertising, marketing, travel, and hospitality. Although the majority of work in this field is on global languages like English, in recent years the importance of SA in local languages has also been widely recognized. This has led to considerable research on the analysis of Indian regional languages. This paper comprehensively reviews SA in the following major Indian regional languages: Marathi, Hindi, Tamil, Telugu, Malayalam, Bengali, Gujarati, and Urdu. Furthermore, this paper presents techniques, challenges, findings, recent research trends, and future scope for improving the accuracy of results.

    The TXM Portal Software giving access to Old French Manuscripts Online

    Full text online: http://www.lrec-conf.org/proceedings/lrec2012/workshops/13.ProceedingsCultHeritage.pdf
    This paper presents the new TXM software platform, which gives online access to images and tagged transcriptions of Old French text manuscripts for concordancing and text mining. The platform can import medieval sources encoded in XML according to the TEI Guidelines, link manuscript images to transcriptions, and encode several diplomatic levels of transcription, including abbreviations and word-level corrections. It includes a sophisticated tokenizer able to deal with TEI tags at different levels of the linguistic hierarchy. Words are tagged on the fly during the import process using the IMS TreeTagger tool with a specific language model. Synoptic editions displaying manuscript images and text transcriptions side by side are produced automatically during the import process. Texts are organized in a corpus with their own metadata (title, author, date, genre, etc.), and several word-property indexes are produced for the CQP search engine to allow efficient word-pattern searches for building different types of frequency lists or concordances. For syntactically annotated texts, special indexes are produced for the Tiger Search engine to allow efficient building of syntactic concordances. The platform has also been tested on classical Latin, ancient Greek, Old Slavonic and Old Hieroglyphic Egyptian corpora (including various types of encoding and annotations).

    Hierarchical classification for Multilingual Language Identification and Named Entity Recognition

    This paper describes the approach for Subtask-1 of the FIRE-2015 Shared Task on Mixed Script Information Retrieval. The subtask involved multilingual language identification (including mixed words and anomalous foreign words), named entity recognition (NER) and subclassification. The proposed methodology starts by cleaning the data and then extracts structural and contextual features from the text for further processing. A subset of these features is selected (based on validation) for training supervised classifiers, separately for language identification and NER. Finally, the classifiers are applied hierarchically to annotate the entire text. The detected named entities are further subclassified by a novel unsupervised technique based on query refinement and keyword-based scoring. On the testing dataset of the shared task, the proposed approach showed promising results, with a weighted F-measure of 0.8082. However, it is worth noting that the classifiers were sub-optimal at discriminating between certain linguistically similar languages (e.g., Gujarati in Hindi-Gujarati pairs). The proposed approach is flexible and robust enough to handle additional languages for identification as well as anomalous foreign or extraneous words. The implementation of the approach has also been shared for future research use.
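The abstract reports a weighted F-measure but does not spell out the weighting. A minimal sketch follows, assuming the common convention of averaging per-class F1 scores weighted by each class's share of the gold labels; the function name `weighted_f_measure` and the toy labels are illustrative only and are not taken from the shared task.

```python
from collections import Counter


def weighted_f_measure(gold, predicted):
    """Support-weighted average of per-class F1 (assumed convention)."""
    support = Counter(gold)  # number of gold items per class
    total = len(gold)
    pairs = list(zip(gold, predicted))
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = count - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += (count / total) * f1  # weight each class by its support
    return score


# Toy word-level labels, purely hypothetical.
gold = ["Hi", "Hi", "Gu", "En"]
predicted = ["Hi", "Gu", "Gu", "En"]
print(weighted_f_measure(gold, predicted))
```

Under this convention, confusions between frequent classes such as a Hindi word labelled as Gujarati lower the score more than errors on rare classes.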

    NELIS - Named Entity and Language Identification System: Shared Task System Description

    This paper proposes a simple and elegant solution for language identification and named entity (NE) recognition at the word level, as part of Subtask-1: Query Word Labeling of FIRE 2015. Given any query q1: w1 w2 w3 … wn in Roman script, the task calls for labeling each word of the query as English (En) or as a member of L, where L = {Bengali (Bn), Gujarati (Gu), Hindi (Hi), Kannada (Kn), Malayalam (Ml), Marathi (Mr), Tamil (Ta), Telugu (Te)}. The approach presented in this paper combines a dictionary lookup with a Naïve Bayes classifier trained over character n-grams. We also devise an algorithm to resolve ambiguities between languages for any given word in a query. Our system achieved impressive F-measure scores of 85-90% in four languages and 74-80% in another four languages.
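The combination of dictionary lookup and a character n-gram Naïve Bayes classifier can be sketched roughly as below. This is an illustration under stated assumptions, not the NELIS implementation: the n-gram range, the add-one smoothing, the rule of falling back to the classifier whenever the dictionaries give zero or several hits, and all names (`char_ngrams`, `NgramNaiveBayes`, `label_query`) are hypothetical, and the paper's own ambiguity-resolution algorithm is not reproduced here.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(word, n_min=1, n_max=3):
    """Character n-grams of a word padded with boundary markers."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams (assumed setup)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # additive smoothing constant
        self.ngram_counts = defaultdict(Counter)
        self.word_counts = Counter()
        self.vocab = set()

    def fit(self, labelled_words):
        for word, label in labelled_words:
            self.word_counts[label] += 1
            for gram in char_ngrams(word):
                self.ngram_counts[label][gram] += 1
                self.vocab.add(gram)
        return self

    def predict(self, word):
        total_words = sum(self.word_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_words in self.word_counts.items():
            counts = self.ngram_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_words / total_words)  # log prior
            for gram in char_ngrams(word):
                score += math.log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def label_query(query, dictionaries, classifier):
    """Label each word: unique dictionary hit first, classifier otherwise."""
    labels = []
    for word in query.split():
        hits = [lang for lang, words in dictionaries.items()
                if word.lower() in words]
        labels.append(hits[0] if len(hits) == 1 else classifier.predict(word))
    return labels


# Toy data, purely hypothetical.
training = [("the", "En"), ("this", "En"), ("that", "En"),
            ("mera", "Hi"), ("tera", "Hi"), ("gaana", "Hi")]
dictionaries = {"En": {"the", "song"}, "Hi": {"mera", "gaana"}}
clf = NgramNaiveBayes().fit(training)
print(label_query("tera song", dictionaries, clf))
```

In this sketch the dictionary acts as a high-precision first pass, while the n-gram model covers out-of-dictionary spellings, which are common for transliterated words in Roman script.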

    Tradition and Technology: A Design-Based Prototype of an Online Ginan Semantization Tool

    The heritage of ginans of the Nizari Ismaili community comprises over 1,000 individual hymn-like poems of varying lengths and languages. The ginans were originally composed to spread the teachings of the Satpanth Ismaili faith and served as scriptural texts that guided the normative understanding of the community in South Asia. The emotive melodies of the ginans continue to enchant members of the community in the diaspora who do not necessarily understand the language of the ginans. The language of the ginans is mixed and borrows vocabulary from Indo-Aryan and Perso-Arabic dialects. With deliberate and purposeful use of information technology, the online tool blends Western best practices of language learning with the traditional transmission methods and materials of the Ismaili community. This study is based on the premise that for the teachings of the ginans to survive in the Euro-American diaspora, successive generations must learn and understand the vocabulary of the ginans. The process through which humans learn and master vocabulary is called semantization, which refers to learning and understanding the various senses and uses of words in a language. To this end, a sample ginan corpus was chosen and semantically analyzed to develop an online ginan lexicon. This lexicon was then used to enrich ginan texts with online glosses to facilitate semantization of ginan vocabulary. The design-based research methodology for prototyping the tool comprised two design iterations of analysis, design, and review. In the first iteration, the initial design of the prototype was based on a multidisciplinary literature review and an in-depth semantic analysis of ginan materials. The initial design was then reviewed by community ginan experts and teachers to inform the next design iteration.
    In the second design iteration, the initial design was enhanced into a functional prototype by adding features based on the expert suggestions as well as the needs of community learners, gathered by surveying a convenience sample of 515 community members across the globe. The analysis of the survey data revealed that over 90% of the survey participants preferred English materials for learning and understanding the language of the ginans. In addition, online access to ginan materials was expressed as a dire need for the community to engage with the ginans. The development and dissemination of curriculum-based educational programs and supporting resources for the ginans emerged as the community's most urgent unmet expectations. The study also confirmed that wide availability of an online ginan learning tool, such as the one designed in this study, is highly desired by English-speaking community members who want to learn and understand the tradition and teachings of ginans. However, such a tool is only part of the solution for fostering sustainable community engagement in the preservation of ginans. To ensure that the tradition is carried forward by future generations with compassion and understanding, the community institutions must make ginans an educational priority and ensure that educational resources for ginans are widely available to community members.

    Cross-Platform Text Mining and Natural Language Processing Interoperability - Proceedings of the LREC2016 conference

    No abstract available