Investigating the Effects of Word Substitution Errors on Sentence Embeddings
A key initial step in several natural language processing (NLP) tasks
involves embedding phrases of text to vectors of real numbers that preserve
semantic meaning. To that end, several methods have been recently proposed with
impressive results on semantic similarity tasks. However, all of these
approaches assume that perfect transcripts are available when generating the
embeddings. While this is a reasonable assumption for analysis of written text,
it is limiting for analysis of transcribed text. In this paper we investigate
the effects of word substitution errors, such as those introduced by automatic
speech recognition (ASR), on several state-of-the-art sentence embedding
methods. To do this, we propose a new simulator that allows the experimenter to
induce ASR-plausible word substitution errors in a corpus at a desired word
error rate. We use this simulator to evaluate the robustness of several
sentence embedding methods. Our results show that pre-trained neural sentence
encoders are both robust to ASR errors and perform well on textual similarity
tasks after errors are introduced. Meanwhile, unweighted averages of word
vectors perform well with perfect transcriptions, but their performance
degrades rapidly on textual similarity tasks for text with word substitution
errors.

Comment: 4 pages, 2 figures. Copyright IEEE 2019. Accepted and to appear in
the Proceedings of the 44th International Conference on Acoustics, Speech,
and Signal Processing 2019 (IEEE-ICASSP-2019), May 12-17 in Brighton, U.K.
Personal use of this material is permitted. However, permission to
reprint/republish this material must be obtained from the IEEE.
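The simulator described above can be illustrated with a minimal sketch. This is not the paper's method (which induces ASR-plausible errors); it is a simplified stand-in that substitutes words at a target word error rate using a hypothetical confusion table:

```python
import random

def induce_substitutions(words, target_wer, confusions, rng=None):
    """Substitute words at roughly the target word error rate.

    Simplified sketch: each word that has an entry in the (hypothetical)
    confusion table is replaced with probability `target_wer`, drawing a
    replacement from its confusion list. Returns the noisy token list
    and the realized word error rate.
    """
    rng = rng or random.Random(0)
    out, errors = [], 0
    for w in words:
        pool = confusions.get(w)
        if pool and rng.random() < target_wer:
            out.append(rng.choice(pool))
            errors += 1
        else:
            out.append(w)
    return out, errors / len(words)

# hypothetical phonetic confusion table and sentence
confusions = {"their": ["there"], "two": ["to", "too"], "sea": ["see"]}
sent = "their boat sailed two miles out to sea".split()
noisy, wer = induce_substitutions(sent, 1.0, confusions)
```

One could then embed `sent` and `noisy` with each sentence encoder under test and compare similarity scores across a sweep of `target_wer` values.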
Machine Translation for Accessible Multi-Language Text Analysis
English is the international standard of social research, but scholars are
increasingly conscious of their responsibility to meet the need for scholarly
insight into communication processes globally. This tension is as present in
computational methods as in any other area, with revolutionary advances in the
tools for English language texts leaving most other languages far behind. In
this paper, we aim to leverage those very advances to demonstrate that
multi-language analysis is currently accessible to all computational scholars.
We show that English-trained measures computed after translation to English
have adequate-to-excellent accuracy compared to source-language measures
computed on original texts. We show this for three major analytics -- sentiment
analysis, topic analysis, and word embeddings -- over 16 languages, including
Spanish, Chinese, Hindi, and Arabic. We validate this claim by comparing
predictions on original language tweets and their backtranslations: double
translations from their source language to English and back to the source
language. Overall, our results suggest that Google Translate, a simple and
widely accessible tool, is effective in preserving semantic content across
languages and methods. Modern machine translation can thus help computational
scholars make more inclusive and general claims about human communication.

Comment: 5000 words, 6 figures
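The backtranslation check described above can be sketched as a small validation loop. The `classify` and `translate` callables below are placeholders for a real sentiment model and MT system (e.g. Google Translate); the toy stand-ins exist only so the sketch runs:

```python
def backtranslation_agreement(texts, classify, translate):
    """Fraction of texts on which a classifier agrees with itself after
    a round trip source -> English -> source, mirroring the validation
    in the paper. `classify` and `translate` are injected dependencies,
    not real APIs.
    """
    agree = 0
    for text in texts:
        round_trip = translate(translate(text, to="en"), to="src")
        if classify(text) == classify(round_trip):
            agree += 1
    return agree / len(texts)

# toy stand-ins: identity "translation" and a length-based "sentiment"
texts = ["me gusta", "no me gusta nada", "genial"]
score = backtranslation_agreement(
    texts,
    classify=lambda t: "pos" if len(t) < 10 else "neg",
    translate=lambda t, to: t,
)
```

With an identity translator the agreement is trivially 1.0; plugging in a real MT system makes the score a measure of how much semantic content survives the round trip.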
Sentiment Analysis Using Aligned Word Embeddings for Uralic Languages
In this paper, we present an approach for translating word embeddings from a
majority language into 4 minority languages: Erzya, Moksha, Udmurt and
Komi-Zyrian. Furthermore, we align these word embeddings and present a novel
neural network model that is trained on English data to conduct sentiment
analysis and then applied to endangered language data through the aligned word
embeddings. To test our model, we annotated a small sentiment analysis corpus
for the 4 endangered languages and Finnish. Our method reached at least 56%
accuracy for each endangered language. The models and the sentiment corpus will
be released together with this paper. Our research shows that state-of-the-art
neural models can be used with endangered languages with the only requirement
being a dictionary between the endangered language and a majority language.

Comment: Proceedings of the Second Workshop on Resources and Representations
for Under-Resourced Languages and Domains (RESOURCEFUL-2023)
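The transfer setup described above can be sketched in miniature. The sketch assumes the alignment step has already been done upstream (e.g. via the bilingual dictionary), so minority-language words map directly to vectors living in the English embedding space; the vectors, the invented words, and the linear scorer's weights are all illustrative, not the paper's trained model:

```python
def sentence_vector(tokens, aligned_vecs):
    """Average the aligned word vectors of a sentence, skipping
    out-of-vocabulary tokens. `aligned_vecs` maps minority-language
    words to vectors already aligned to the English space.
    """
    vecs = [aligned_vecs[t] for t in tokens if t in aligned_vecs]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def classify(vec, weights, bias=0.0):
    """Linear sentiment scorer standing in for the English-trained
    neural model; the weights here are illustrative, not learned."""
    score = sum(w * x for w, x in zip(weights, vec)) + bias
    return "positive" if score > 0 else "negative"

# toy 2-d aligned vectors for a hypothetical minority-language sentence
aligned = {"pargo": [0.9, 0.1], "kema": [0.7, -0.2]}
vec = sentence_vector(["pargo", "kema", "unk"], aligned)
label = classify(vec, weights=[1.0, 1.0])
```

Because the classifier only ever sees vectors in the shared space, it never needs labeled data in the endangered language itself, which is the appeal of the dictionary-only requirement stated in the abstract.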