Opinion-Mining on Marglish and Devanagari Comments of YouTube Cookery Channels Using Parametric and Non-Parametric Learning Models
YouTube lets people educate, entertain, and express themselves on a wide range of topics. YouTube India currently has millions of active users, so the volume of data on the platform is correspondingly large. India is a very diverse country and many people are multilingual, so opinions are often expressed in code-mixed form, that is, a mix of two or more languages. Sentiment Analysis on code-mixed languages has become a necessity, as little research exists on Indian code-mixed language data. In this paper, Sentiment Analysis (SA) is carried out on Marglish (Marathi + English) and Devanagari Marathi comments extracted via the YouTube API from top Marathi channels. Several machine-learning models are applied to the dataset along with 3 different vectorizing techniques. Multilayer Perceptron (MLP) with Count vectorizer provides the best accuracy of 62.68% on the Marglish dataset, and Bernoulli Naïve Bayes with Count vectorizer gives the best accuracy of 60.60% on the Devanagari dataset. Multilayer Perceptron and Bernoulli Naïve Bayes are therefore considered the best-performing algorithms. 10-fold cross-validation and statistical testing were also carried out on the dataset to confirm the results.
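The pairing of a count-style vectorizer with Bernoulli Naïve Bayes mentioned in this abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the toy comments, the labels, the whitespace tokenizer, and the function names `train_bernoulli_nb` and `predict` are all assumptions made for the example, and cross-validation is omitted. Bernoulli Naïve Bayes only uses whether a token is present or absent, so counts are binarized.

```python
import math
from collections import defaultdict

# Hypothetical toy comments and labels, invented purely for illustration.
TOY_COMMENTS = [
    "khup chan recipe", "recipe mast ahe", "chan video thanks",
    "ajibat avadla nahi", "video bekar ahe", "recipe avadla nahi",
]
TOY_LABELS = ["pos", "pos", "pos", "neg", "neg", "neg"]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def train_bernoulli_nb(docs, labels, alpha=1.0):
    """Fit per-class token presence probabilities with Laplace smoothing."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    class_counts = defaultdict(int)
    doc_freq = defaultdict(lambda: defaultdict(int))
    for doc, label in zip(docs, labels):
        class_counts[label] += 1
        for tok in set(tokenize(doc)):  # presence only, not raw counts
            doc_freq[label][tok] += 1
    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    feature_prob = {
        c: {tok: (doc_freq[c][tok] + alpha) / (class_counts[c] + 2 * alpha)
            for tok in vocab}
        for c in class_counts
    }
    return vocab, log_prior, feature_prob


def predict(model, text):
    """Return the class with the highest log-posterior for one comment."""
    vocab, log_prior, feature_prob = model
    present = set(tokenize(text))
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for tok in vocab:
            p = feature_prob[c][tok]
            # Absent tokens also contribute, via log(1 - p).
            score += math.log(p if tok in present else 1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)


model = train_bernoulli_nb(TOY_COMMENTS, TOY_LABELS)
print(predict(model, "khup chan video"))
```

Unlike multinomial Naïve Bayes, the Bernoulli variant penalizes a class for vocabulary tokens that are missing from the comment, which can matter for short texts such as YouTube comments.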
Crisis translation: considering language needs in multilingual disaster settings
Purpose: The purpose of this conceptual paper is to highlight the role that language translation can play in disaster prevention and management and to make the case for increased attention to language translation in crisis communication.
Approach: The article draws on literature relating to disaster management to suggest that translation is a perennial issue in crisis communication.
Findings: Although communication with multicultural and multilinguistic communities is seen as being in urgent need of attention, we find that the role of translation in enabling this is underestimated, if not unrecognised.
Value: This article raises awareness of the need for scholars and practitioners to give urgent attention to the role of translation in crisis communication.
FinTech, blockchain and Islamic finance : an extensive literature review
Purpose: This paper reviews academic research on Islamic financial technology (FinTech). The literature is classified into three broad categories: Islamic FinTech opportunities and challenges, cryptocurrency/blockchain sharia compliance, and law/regulation. The study also identifies the opportunities and challenges that Islamic financial institutions can learn from conventional FinTech organizations across the world.
Approach/Methodology/Design: The study collected 133 research studies (50 from the Social Science Research Network (SSRN), 30 from ResearchGate, 33 from Google Scholar and 20 from other sources) in the area of Islamic financial technology and presents a systematic review of them.
Findings: The study classifies Islamic FinTech research into three broad categories, namely Islamic FinTech opportunities and challenges, cryptocurrency/blockchain sharia compliance, and law/regulation. It identifies sharia compliance of cryptocurrency/blockchain as the biggest challenge facing Islamic FinTech organizations. The review also finds that Islamic financial institutions (IFIs) should regard Islamic FinTech organizations as partners rather than competitors. If Islamic financial institutions want to increase efficiency, transparency and customer satisfaction, they have to adopt FinTech and partner with FinTech companies.
Practical Implications: The study will contribute positively to the understanding of Islamic FinTech among academia, industry, regulators, investors and other FinTech users.
Originality/Value: The study aims to contribute to the understanding of FinTech-based technologies such as cryptocurrency and blockchain from a sharia perspective.
Proceedings of the 17th Annual Conference of the European Association for Machine Translation
Proceedings of the 17th Annual Conference of the European Association for Machine Translation (EAMT)
Probing Multilingual BERT for Genetic and Typological Signals
We probe the layers in multilingual BERT (mBERT) for phylogenetic and
geographic language signals across 100 languages and compute language distances
based on the mBERT representations. We 1) employ the language distances to
infer and evaluate language trees, finding that they are close to the reference
family tree in terms of quartet tree distance, 2) perform distance matrix
regression analysis, finding that the language distances can be best explained
by phylogenetic and worst by structural factors and 3) present a novel measure
for measuring diachronic meaning stability (based on cross-lingual
representation variability) which correlates significantly with published
ranked lists based on linguistic approaches. Our results contribute to the
nascent field of typological interpretability of cross-lingual text
representations. Comment: COLING 202
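The idea of deriving language distances from model representations can be sketched as below. This is only an illustration under stated assumptions: the abstract does not say which distance is used, so cosine distance is an assumption here, and the three-dimensional per-language vectors are hypothetical stand-ins for pooled mBERT layer representations. The function names are likewise invented for the example.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(vectors):
    """Pairwise distances between languages, as a nested dict."""
    langs = sorted(vectors)
    return {a: {b: cosine_distance(vectors[a], vectors[b]) for b in langs}
            for a in langs}


def nearest_language(matrix, lang):
    """Closest other language to `lang` under the given distance matrix."""
    others = (b for b in matrix[lang] if b != lang)
    return min(others, key=lambda b: matrix[lang][b])


# Hypothetical toy vectors, NOT real mBERT representations.
TOY_VECTORS = {
    "es": [1.0, 0.2, 0.0],
    "pt": [0.9, 0.3, 0.1],
    "fi": [0.0, 0.1, 1.0],
    "et": [0.1, 0.0, 0.9],
}

matrix = distance_matrix(TOY_VECTORS)
print(nearest_language(matrix, "es"))
```

A distance matrix of this shape is the kind of input that tree-inference methods and distance matrix regression operate on.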
DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text
This paper describes the development of a multilingual, manually annotated
dataset for three under-resourced Dravidian languages generated from social
media comments. The dataset was annotated for sentiment analysis and offensive
language identification for a total of more than 60,000 YouTube comments. The
dataset consists of around 44,000 comments in Tamil-English, around 7,000
comments in Kannada-English, and around 20,000 comments in Malayalam-English.
The data was manually annotated by volunteer annotators and has a high
inter-annotator agreement in Krippendorff's alpha. The dataset contains all
types of code-mixing phenomena since it comprises user-generated content from a
multilingual country. We also present baseline experiments to establish
benchmarks on the dataset using machine learning methods. The dataset is
available on GitHub
(https://github.com/bharathichezhiyan/DravidianCodeMix-Dataset) and Zenodo
(https://zenodo.org/record/4750858#.YJtw0SYo_0M). Comment: 36 pages
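For readers unfamiliar with the agreement measure named in this abstract, Krippendorff's alpha for nominal labels can be computed as in the sketch below. The abstract does not state which variant or tooling was used, so the nominal metric, the function name `krippendorff_alpha_nominal`, and the toy ratings are assumptions for illustration. The measure is alpha = 1 - D_o / D_e, where D_o is the observed disagreement and D_e the disagreement expected by chance.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of per-item rating lists; use None for a missing rating.
    Items with fewer than two ratings are not pairable and are skipped.
    """
    coincidences = Counter()
    for ratings in units:
        values = [r for r in ratings if r is not None]
        m = len(values)
        if m < 2:
            continue
        # Each ordered pair of ratings from different raters counts 1/(m-1).
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    mismatched = sum(totals[a] * totals[b]
                     for a in totals for b in totals if a != b)
    if n < 2 or mismatched == 0:
        raise ValueError("alpha is undefined without variation in the labels")
    expected = mismatched / (n - 1)
    return 1.0 - observed / expected


# Hypothetical ratings from two annotators on four comments.
toy_units = [("a", "a"), ("a", "b"), ("b", "b"), ("b", "b")]
print(krippendorff_alpha_nominal(toy_units))
```

A value of 1 indicates perfect agreement, and a value of 0 indicates agreement no better than chance.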
Layer or representation space: what makes BERT-based evaluation metrics robust?
The evaluation of recent embedding-based evaluation metrics for text generation is primarily based on measuring their correlation with human evaluations on standard benchmarks. However, these benchmarks are mostly from domains similar to those used for pretraining word embeddings. This raises concerns about the (lack of) generalization of embedding-based metrics to new and noisy domains that contain a different vocabulary than the pretraining data. In this paper, we examine the robustness of BERTScore, one of the most popular embedding-based metrics for text generation. We show that (a) an embedding-based metric that has the highest correlation with human evaluations on a standard benchmark can have the lowest correlation if the amount of input noise or unknown tokens increases, (b) taking embeddings from the first layer of pretrained models improves the robustness of all metrics, and (c) the highest robustness is achieved when using character-level embeddings, instead of token-based embeddings, from the first layer of the pretrained model.
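The greedy matching at the core of BERTScore can be sketched as follows. This is a simplified illustration, not the reference implementation: it omits optional IDF weighting and baseline rescaling, the function name `greedy_match_scores` is invented for the example, and the two-dimensional token vectors are hypothetical placeholders for contextual embeddings taken from a chosen model layer.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


def greedy_match_scores(candidate, reference):
    """Precision, recall and F1 from greedy token-to-token matching.

    Each candidate token is matched to its most similar reference token
    (precision), and each reference token to its most similar candidate
    token (recall).
    """
    precision = sum(max(cosine(c, r) for r in reference)
                    for c in candidate) / len(candidate)
    recall = sum(max(cosine(r, c) for c in candidate)
                 for r in reference) / len(reference)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Hypothetical toy token embeddings, NOT real BERT outputs.
candidate_tokens = [[1.0, 0.0]]
reference_tokens = [[1.0, 0.0], [0.0, 1.0]]
print(greedy_match_scores(candidate_tokens, reference_tokens))
```

Because the score depends entirely on the token vectors passed in, swapping the embedding source (layer, or character-level versus token-level) changes the metric without changing this matching step.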