382 research outputs found

    Sentiment Lexicon Construction Using SentiWordNet 3.0

    Opinion mining and sentiment analysis have become popular in languages rich in linguistic resources. Opinions for such analysis are drawn from many freely available online and electronic sources, such as websites, blogs, news reports and product reviews. Less-resourced languages, however, have received significantly less attention. This is because the success of any opinion mining algorithm depends on the availability of resources, such as special lexicons and WordNet-type tools. In this research, we implemented a simple but effective approach that could be used to classify comments in less-resourced languages. We tested the approach on the Sinhala language, for which no such opinion mining or sentiment analysis had previously been carried out. Our algorithm gives promising results for analyzing sentiment in Sinhala for the first time.

    Sinhala-English Parallel Word Dictionary Dataset

    Parallel datasets are vital for performing and evaluating any kind of multilingual task. However, where one language of the pair is a low-resource language, the existing top-down parallel data such as corpora are lacking in both quantity and quality due to the dearth of human annotation. Therefore, for low-resource languages, it is more feasible to move in the bottom-up direction, where finer-grained pairs such as dictionary datasets are developed first. They may then be used for mid-level tasks such as supervised multilingual word embedding alignment. These in turn can later guide higher-level tasks such as aligning the sentence or paragraph text corpora used for Machine Translation (MT). Although dictionaries are more approachable than generating and aligning a massive corpus for a low-resource language, the same apathy from larger research entities means that even these finer-grained datasets are lacking for some low-resource languages. We have observed that there is no free and open dictionary dataset for the low-resource language Sinhala. Thus, in this work, we introduce three parallel English-Sinhala word dictionaries (En-Si-dict-large, En-Si-dict-filtered, En-Si-dict-FastText) which help in multilingual Natural Language Processing (NLP) tasks related to the English and Sinhala languages. In this paper, we explain the dataset creation pipeline as well as the experimental results of the tests we have carried out to verify the quality of the datasets. The datasets and the related scripts are available at https://github.com/kasunw22/sinhala-para-dict

    Survey on Publicly Available Sinhala Natural Language Processing Tools and Research

    Sinhala is the native language of the Sinhalese people, who make up the largest ethnic group of Sri Lanka. The language belongs to the globe-spanning language tree, Indo-European. However, due to poverty in both linguistic and economic capital, Sinhala, from the perspective of Natural Language Processing tools and research, remains a resource-poor language which has neither the economic drive its cousin English has nor the sheer push of the law of numbers a language such as Chinese has. A number of research groups from Sri Lanka have noticed this dearth and the resultant dire need for proper tools and research for Sinhala natural language processing. However, due to various reasons, these attempts seem to lack coordination and awareness of each other. The objective of this paper is to fill that gap with a comprehensive literature survey of the publicly available Sinhala natural language tools and research, so that researchers working in this field can better utilize the contributions of their peers. As such, we shall be uploading this paper to arXiv and updating it periodically to reflect the advances made in the field.

    Identifying False Content and Hate Speech in Sinhala YouTube Videos by Analyzing the Audio

    YouTube faces a global crisis with the dissemination of false information and hate speech. To counter these issues, YouTube has implemented strict rules against uploading content that includes false information or promotes hate speech. While numerous studies have been conducted to reduce offensive English-language content, there is a significant lack of research on Sinhala content. This study aims to address this gap by proposing a solution to minimize the spread of violence and misinformation in Sinhala YouTube videos. The approach involves developing a rating system that assesses whether a video contains false information, by comparing the title and description with the audio content, and evaluates whether the video includes hate speech. The methodology encompasses several steps, including audio extraction using the Pytube library, audio transcription via a fine-tuned Whisper model, hate speech detection employing the distilroberta-base model and a text classification LSTM model, and text summarization through a fine-tuned BART-Large-XSUM model. Notably, the Whisper model achieved a 48.99% word error rate, while the distilroberta-base model demonstrated an F1 score of 0.856 and a recall of 0.861, whereas the LSTM model exhibited signs of overfitting.
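For readers unfamiliar with the word error rate quoted above, the following is a minimal sketch of how the metric is conventionally computed: the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the model's output, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the abstract does not say how the authors computed their figure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: tokenization is a plain whitespace split, which is
    an assumption and not taken from the study described above.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or a match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a word error rate of 48.99% means that roughly one edit is needed for every two reference words. Note that the value can exceed 100% when the hypothesis contains many inserted words.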

    Cultural and core borrowings reclassified: A corpus-based study of Sri Lankan English vocabulary

    World Englishes/Varieties of English show variation from British English (BrE) through distinct linguistic processes that highlight their uniqueness. Borrowing is one such process: it enhances the vocabulary of a distinct English variety used in a particular country through the effect of the local languages. The literature on borrowing proposes that borrowings can be classified as cultural or core borrowings. This classification encapsulates the reasons why users borrow words from a different language. The term cultural borrowings denotes words that are transferred from another language to fill a lexical gap, while the term core borrowings denotes words that duplicate words already existing in the language. This paper, part of an ongoing PhD study, explores whether this binary classification adequately accounts for the types of borrowings found in Sri Lankan English (SLE) as recorded in the Sri Lankan component of the International Corpus of English (ICE-SL). The study first extracted a word list using corpus analysis software, from which the borrowings were manually selected. This was followed by a Google search for the etymology of the words to ascertain the origin of the borrowings, which could help to identify whether they filled a lexical gap or duplicated words that already exist. The data indicated that words were borrowed from Sinhala and Tamil, the two official languages of Sri Lanka, as well as from other languages. Based on the analysis, this paper proposes that the binary categorization of core and cultural borrowings should be extended to four categories in order to capture the local and regional borrowings that exist within cultural borrowings, as well as to reflect the complexity of meanings identified within core borrowings. KEYWORDS: Borrowings, core borrowings, cultural borrowings, World Englishes, corpus linguistics