On the Reproducibility and Generalisation of the Linear Transformation of Word Embeddings
A linear transformation learns a linear mapping between two word embedding spaces, so that words in the two spaces can be semantically related. In this paper, we examine the reproducibility and generalisation of the linear transformation of word embeddings. Linear transformation is particularly useful for relating word embedding models trained on different languages, since it can capture the semantic relationships between the two models. We first reproduce two linear transformation approaches: a recent one using an orthogonal transformation and the original one using a simple matrix transformation. We re-examine previous findings on a machine translation task, validating that linear transformation is indeed an effective way to map word embedding models across languages. In particular, we show that the orthogonal transformation better relates the different embedding models. Having verified the previous findings, we then study the generalisation of linear transformation in a multi-language Twitter election classification task. We observe that the orthogonal transformation outperforms the matrix transformation; in particular, it significantly outperforms a random classifier by at least 10% under the F1 metric on both English and Spanish datasets. In addition, we provide best practices for using linear transformation in multi-language Twitter election classification.
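The two transformation variants compared above can be sketched in plain linear algebra (a minimal illustration on synthetic vectors, not the paper's actual setup): the unconstrained matrix transformation is a least-squares fit, while the orthogonal transformation is the classic Procrustes solution obtained via SVD.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "embedding" matrices: rows are word vectors in two spaces, aligned
# row-by-row (as a seed translation lexicon would align them).
d, n = 5, 50
X = rng.normal(size=(n, d))
Q_true, _ = np.linalg.qr(rng.normal(size=(d, d)))  # hidden orthogonal map
Y = X @ Q_true

# Unconstrained matrix transformation: least-squares solution of X W ~ Y.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Orthogonal transformation (Procrustes): SVD of X^T Y, then W = U V^T,
# which is the closest orthogonal matrix to the least-squares map.
U, _, Vt = np.linalg.svd(X.T @ Y)
W_orth = U @ Vt

print(np.allclose(X @ W, Y, atol=1e-6))       # both recover the map here
print(np.allclose(W_orth @ W_orth.T, np.eye(d), atol=1e-8))
```

On this noise-free toy data both methods recover the hidden map; the abstract's point is that on real, noisy embeddings the orthogonality constraint tends to generalise better.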
Multilingual twitter sentiment analysis using machine learning
Twitter sentiment analysis is one of the leading research fields. Most researchers have contributed to sentiment analysis of English tweets, while few have focused on multilingual Twitter sentiment analysis, which still poses several open challenges. This study presents an implementation of sentiment analysis on multilingual Twitter data and improves classification up to an adequate level of accuracy. Twitter is the sixth leading social networking site in the world, with 330 million monthly active users. People can tweet or retweet in their own languages, using emojis, abbreviations, contractions, misspellings, and shortcut words, which makes Twitter a rich platform for sentiment analysis. Multilingual tweets and data sparsity are the two main challenges, and the MLTSA algorithm proposed in this paper addresses both. MLTSA consists of two parts: the first detects non-English tweets and translates them into English using natural language processing (NLP); the second applies an appropriate NLP-supported pre-processing method that reduces data sparsity. MLTSA with SVM achieves good accuracy of up to 95%.
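The sparsity-reducing pre-processing step described above could look roughly like the following sketch (the function name, rules, and contraction table are illustrative, not taken from the paper): noisy surface forms such as elongations, URLs, mentions, and contractions are normalised so that fewer distinct tokens map to the same word.

```python
import re

# Sample contraction entries; a real system would use a fuller table.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "i'm": "i am"}

def preprocess_tweet(text: str) -> str:
    """Normalise a noisy tweet to reduce vocabulary sparsity."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)   # drop URLs
    text = re.sub(r"@\w+", "", text)           # drop user mentions
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"(.)\1{2,}", r"\1\1", text) # collapse elongations: soooo -> soo
    return " ".join(text.split())

print(preprocess_tweet("I'm soooo happy!!! @friend http://t.co/x"))
# -> i am soo happy!!
```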
Improving Sentiment Analysis over non-English Tweets using Multilingual Transformers and Automatic Translation for Data-Augmentation
Tweets are a specific kind of text data compared to general text. Although sentiment analysis over tweets has become very popular in the last decade for English, it is still difficult to find large annotated corpora for non-English languages. The recent rise of transformer models in Natural Language Processing makes it possible to achieve unparalleled performance in many tasks, but these models need a substantial quantity of text to adapt to the tweet domain. We propose the use of a multilingual transformer model that we pre-train on English tweets, applying data augmentation using automatic translation to adapt the model to non-English languages. Our experiments in French, Spanish, German and Italian suggest that the proposed technique is an efficient way to improve the results of transformers on small corpora of tweets in a non-English language.
Comment: Accepted to COLING202
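The data-augmentation idea above can be sketched as follows; `translate()` here is a hypothetical stub standing in for a real machine-translation system, which the paper uses but which is not reproduced here.

```python
def translate(text: str, target_lang: str) -> str:
    """Placeholder: a real implementation would call an MT model or service."""
    demo = {("great movie", "fr"): "super film"}  # illustrative lookup only
    return demo.get((text, target_lang), text)

def augment(corpus, target_lang):
    """Return the original labelled examples plus their translations,
    keeping each example's label unchanged."""
    augmented = list(corpus)
    augmented += [(translate(text, target_lang), label) for text, label in corpus]
    return augmented

corpus = [("great movie", "positive")]
print(augment(corpus, "fr"))
# -> [('great movie', 'positive'), ('super film', 'positive')]
```

Training on the union of original and translated examples is what lets the English-pretrained model adapt to the target language despite small in-language corpora.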
Machine Translation for Accessible Multi-Language Text Analysis
English is the international standard of social research, but scholars are increasingly conscious of their responsibility to meet the need for scholarly insight into communication processes globally. This tension is as true in computational methods as in any other area, with revolutionary advances in tools for English-language texts leaving most other languages far behind. In this paper, we aim to leverage those very advances to demonstrate that multi-language analysis is currently accessible to all computational scholars. We show that English-trained measures computed after translation to English have adequate-to-excellent accuracy compared to source-language measures computed on the original texts. We show this for three major analytics -- sentiment analysis, topic analysis, and word embeddings -- over 16 languages, including Spanish, Chinese, Hindi, and Arabic. We validate this claim by comparing predictions on original-language tweets and their backtranslations: double translations from the source language to English and back to the source language. Overall, our results suggest that Google Translate, a simple and widely accessible tool, is effective in preserving semantic content across languages and methods. Modern machine translation can thus help computational scholars make more inclusive and general claims about human communication.
Comment: 5000 words, 6 figures
Cross-Lingual Classification of Crisis Data
Many citizens nowadays flock to social media during crises to share or acquire the latest information about the event. Due to the sheer volume of data typically circulated during such events, it is necessary to efficiently filter out irrelevant posts, focusing attention on the posts that are truly relevant to the crisis. Current methods for classifying the relevance of posts to a crisis, or to a set of crises, typically struggle with posts in different languages, and it is not viable during rapidly evolving crisis situations to train new models for each language. In this paper we test statistical and semantic classification approaches on cross-lingual datasets from 30 crisis events, consisting of posts written mainly in English, Spanish, and Italian. We experiment with scenarios where the model is trained on one language and tested on another, and where the data is translated into a single language. We show that adding semantic features extracted from external knowledge bases improves accuracy over a purely statistical model.
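A minimal illustration of combining statistical and semantic features, as the abstract above describes; the feature definitions and the toy knowledge-base lookup are hypothetical, not the paper's actual feature set.

```python
def statistical_features(tokens, vocab):
    """Bag-of-words counts over a fixed vocabulary (the statistical part)."""
    return [tokens.count(w) for w in vocab]

def semantic_features(tokens, entity_types):
    """Counts of coarse entity types; entity_types stands in for a lookup
    against an external knowledge base (the semantic part)."""
    types = ["Place", "Organisation"]
    return [sum(1 for t in tokens if entity_types.get(t) == ty) for ty in types]

tokens = "flood hits madrid red cross responds".split()
vocab = ["flood", "fire"]
kb = {"madrid": "Place"}  # illustrative knowledge-base fragment

# Concatenating both views gives the classifier language-independent signal
# (entity types) on top of surface word counts.
print(statistical_features(tokens, vocab) + semantic_features(tokens, kb))
# -> [1, 0, 1, 0]
```

Because entity types generalise across languages better than surface words, such semantic features are what helps when training on one language and testing on another.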