On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism
Barrón Cedeño, LA. (2012). On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/16012
The (Undesired) Attenuation of Human Biases by Multilinguality
Some human preferences are universal. The odor of vanilla is perceived as pleasant all around the world. We expect neural models trained on human texts to exhibit these kinds of preferences, i.e. biases, but we show that this is not always the case. We explore 16 static and contextual embedding models in 9 languages and, when possible, compare them under similar training conditions. We introduce and release CA-WEAT, multilingual culturally aware tests to quantify biases, and compare them to previous English-centric tests. Our experiments confirm that monolingual static embeddings do exhibit human biases, but values differ across languages, being far from universal. Biases are less evident in contextual models, to the point that the original human association might be reversed. Multilinguality proves to be another variable that attenuates and even reverses the effect of the bias, especially in contextual multilingual models. In order to explain this variance among models and languages, we examine the effect of asymmetries in the training corpus, departures from isomorphism in multilingual embedding spaces, and discrepancies in the testing measures between languages.
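For readers unfamiliar with WEAT-style tests, the sketch below shows the standard effect-size computation such tests are built on; the word lists and random vectors are placeholders for illustration, not the released CA-WEAT lists.

```python
# A minimal sketch of a WEAT-style effect size, the kind of association
# test CA-WEAT builds on. Embeddings and word lists are toy placeholders.
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def association(w, A, B, emb):
    """s(w, A, B): mean cosine to attribute set A minus mean cosine to B."""
    return (np.mean([cosine(emb[w], emb[a]) for a in A])
            - np.mean([cosine(emb[w], emb[b]) for b in B]))

def weat_effect_size(X, Y, A, B, emb):
    """Cohen's-d-style effect size over the two target sets X and Y."""
    s_X = [association(x, A, B, emb) for x in X]
    s_Y = [association(y, A, B, emb) for y in Y]
    return (np.mean(s_X) - np.mean(s_Y)) / np.std(s_X + s_Y, ddof=1)

# Toy demo with random vectors; real tests use e.g. flowers/insects as
# targets and pleasant/unpleasant words as attributes, per language.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in
       ["rose", "tulip", "ant", "wasp", "love", "peace", "hate", "pain"]}
d = weat_effect_size(["rose", "tulip"], ["ant", "wasp"],
                     ["love", "peace"], ["hate", "pain"], emb)
print(f"effect size d = {d:.2f}")
```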
An Empirical Analysis of NMT-Derived Interlingual Embeddings and their Use in Parallel Sentence Identification
End-to-end neural machine translation has overtaken statistical machine translation in terms of translation quality for some language pairs, especially those with large amounts of parallel data. Besides this palpable improvement, neural networks provide several new properties. A single system can be trained to translate between many languages at almost no additional cost other than training time. Furthermore, internal representations learned by the network serve as a new semantic representation of words or sentences which, unlike standard word embeddings, are learned in an essentially bilingual or even multilingual context. In view of these properties, the contribution of the present work is two-fold. First, we systematically study the NMT context vectors, i.e. the output of the encoder, and their power as an interlingua representation of a sentence. We assess their quality and effectiveness by measuring similarities across translations, as well as between semantically related and semantically unrelated sentence pairs. Second, as an extrinsic evaluation of the first point, we identify parallel sentences in comparable corpora, obtaining an F1 of 98.2% on data from a shared task when using only NMT context vectors. Using context vectors jointly with similarity measures, F1 reaches 98.9%.
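As a rough illustration of the second contribution, the sketch below pools per-token encoder states into sentence vectors and matches sentences across languages by cosine similarity; the pooling choice, threshold, and toy data are assumptions, not the paper's exact pipeline.

```python
# A minimal sketch of parallel sentence identification with sentence
# vectors, assuming each sentence has already been encoded (here, by
# mean-pooling a matrix of encoder context vectors).
import numpy as np

def mean_pool(context_vectors):
    """Collapse a (seq_len, dim) matrix of encoder states to one vector."""
    return context_vectors.mean(axis=0)

def find_parallel(src_vecs, tgt_vecs, threshold=0.9):
    """Return (i, j, sim) for cross-language pairs above the threshold."""
    # Normalize rows so the dot product equals cosine similarity.
    S = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    T = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = S @ T.T
    pairs = []
    for i in range(sims.shape[0]):
        j = int(sims[i].argmax())
        if sims[i, j] >= threshold:
            pairs.append((i, j, float(sims[i, j])))
    return pairs

# Toy demo: target sentence 0 is a noisy copy of source sentence 1,
# so it should be the only pair retrieved.
rng = np.random.default_rng(1)
src = rng.normal(size=(3, 8))
tgt = rng.normal(size=(3, 8))
tgt[0] = src[1] + 0.01 * rng.normal(size=8)
print(find_parallel(src, tgt, threshold=0.9))
```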
1949-2019: 70 Years of MT as Seen Through the Data It Used
Machine translation (MT) has undergone several changes from the 1940s to the present. As in many other fields of computer science and artificial intelligence, the field has moved from manually developed ad hoc resources to approaches increasingly based on pre-existing data. This contribution offers an overview of the different MT architectures and the data they require, starting from rule-based approaches and moving on to statistical, example-based, and neural architectures. Each of these changes has affected the type of data required to build MT engines. While the earliest approaches did not require aligned sentences, with statistical MT it became essential to rely on a large amount of parallel data. Today, thanks to the use of neural networks, it is possible to obtain good-quality translations even for language combinations for which no data are available in both languages.
UniBO @ AMI: A Multi-Class Approach to Misogyny and Aggressiveness Identification on Twitter Posts Using AlBERTo
We describe our participation in the EVALITA 2020 (Basile et al., 2020) shared task on Automatic Misogyny Identification. We focus on Task A, Misogyny and Aggressive Behaviour Identification, which aims at detecting whether a tweet in Italian is misogynous and, if so, whether it is aggressive. Rather than building two different models, one for misogyny and one for aggressiveness identification, we handle the problem as a single multi-class classification task with three classes: non-misogynous, non-aggressive misogynous, and aggressive misogynous. Our three-class supervised model, built on top of AlBERTo, obtains an overall F1 score of 0.7438 on the task test set (F1 = 0.8102 for the misogyny task and F1 = 0.6774 for the aggressiveness task), which outperforms the top submitted model (F1 = 0.7406).
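A minimal sketch of this single-head, three-class setup with the HuggingFace transformers API is shown below; the AlBERTo checkpoint id is an assumption (any AlBERTo-compatible checkpoint would do), and the snippet only runs a forward pass rather than the full fine-tuning.

```python
# A minimal sketch of three-way classification on top of a BERT-style
# Italian Twitter model; labels follow the abstract.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed checkpoint id, not confirmed by the paper.
MODEL = "m-polignano-uniba/bert_uncased_L-12_H-768_A-12_italian_alb3rt0"
LABELS = ["non-misogynous", "non-aggressive misogynous",
          "aggressive misogynous"]

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, num_labels=len(LABELS))  # one 3-way head, not two binary heads

batch = tokenizer(["esempio di tweet"], return_tensors="pt",
                  truncation=True, padding=True)
with torch.no_grad():
    logits = model(**batch).logits
print(LABELS[int(logits.argmax(dim=-1))])
```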
Findings of the NLP4IF-2019 Shared Task on Fine-Grained Propaganda Detection
We present the shared task on Fine-Grained Propaganda Detection, which was organized as part of the NLP4IF workshop at EMNLP-IJCNLP 2019. There were two subtasks. FLC is a fragment-level task that asks for the identification of propagandist text fragments in a news article and also for the prediction of the specific propaganda technique used in each such fragment (an 18-way classification task). SLC is a sentence-level binary classification task asking to detect the sentences that contain propaganda. A total of 12 teams submitted systems for the FLC task, 25 teams did so for the SLC task, and 14 teams eventually submitted a system description paper. For both subtasks, most systems managed to beat the baseline by a sizable margin. The leaderboard and the data from the competition are available at http://propaganda.qcri.org/nlp4if-shared-task/.
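To make the SLC format concrete, here is a hedged sketch of a simple binary sentence classifier; the inline examples and the TF-IDF plus logistic regression pipeline are illustrative and are not the shared-task data or an official baseline.

```python
# A minimal sketch of the sentence-level (SLC) subtask as binary
# classification. The tiny inline dataset is invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = [
    "The senator announced the budget vote on Tuesday.",
    "Only a fool would believe anything these traitors say!",
    "The committee published its annual report.",
    "They are destroying our country and everyone knows it!",
]
labels = [0, 1, 0, 1]  # 1 = sentence contains propaganda

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(sentences, labels)
print(clf.predict(["Wake up, the elites are lying to you!"]))
```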
Wikicardi: Towards the Extraction of Parallel Sentences from Wikipedia
One of the goals of the Tacardi project (TIN2012-38523-C02-00) is to extract parallel sentences from comparable corpora in order to enrich and adapt machine translation systems. In this research we use a subset of Wikipedia as a comparable corpus. This report describes our progress on the extraction of parallel fragments from Wikipedia. First, we discuss how we defined the three domains of interest (science, computer science, and sport) within the encyclopedia, and how we extracted the texts and other data needed to characterize the articles in the different languages. We then briefly discuss the models we will use to identify parallel sentences and give only a sample of some preliminary results. The data obtained so far suggest that it will be possible to extract parallel sentences from the domains of interest in the short term, although we do not yet have an estimate of their volume.
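As one concrete ingredient of such a pipeline, the sketch below pairs comparable articles across language editions through Wikipedia's interlanguage links (the MediaWiki langlinks API); it is an illustrative first step under assumed settings, not the project's actual extraction code.

```python
# A minimal sketch: find the English counterpart of an article in
# another Wikipedia language edition via interlanguage links.
import requests

def english_counterpart(title, lang="es"):
    """Return the English title linked from a `lang` Wikipedia article."""
    resp = requests.get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={"action": "query", "titles": title, "prop": "langlinks",
                "lllang": "en", "format": "json"},
        timeout=10,
    ).json()
    for page in resp["query"]["pages"].values():
        for link in page.get("langlinks", []):
            return link["*"]  # linked title in the target language
    return None

print(english_counterpart("Fútbol"))  # expected: "Association football"
```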
Answer Selection in Arabic Community Question Answering: A Feature-Rich Approach
The task of answer selection in community question answering consists of identifying pertinent answers from a pool of user-generated comments related to a question. The recent SemEval-2015 introduced a shared task on community question answering, providing a corpus and an evaluation scheme. In this paper we address the problem of answer selection in Arabic. Our proposed model includes a rich set of features, including lexical and semantic similarities, vector representations, and rankings. We investigate the contribution of each set of features in a supervised setting. We show that employing a feature combination by means of a linear support vector machine achieves a better performance than that of the competition winner (F1 of 79.25 compared to 78.55).
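The sketch below illustrates the general recipe: hand-crafted question-answer similarity features fed to a linear SVM. The two features and the toy data stand in for the much richer feature set used in the paper.

```python
# A minimal sketch of feature-rich answer selection with a linear SVM.
# Features and training pairs are illustrative placeholders.
import numpy as np
from sklearn.svm import LinearSVC

def features(question, answer):
    q, a = set(question.lower().split()), set(answer.lower().split())
    overlap = len(q & a) / max(len(q | a), 1)  # lexical (Jaccard) similarity
    length_ratio = min(len(a), len(q)) / max(len(a), len(q), 1)
    return [overlap, length_ratio]

pairs = [
    ("how do I renew my visa", "you can renew it at the immigration office", 1),
    ("how do I renew my visa", "I love the food in this city", 0),
    ("best area to live in doha", "the west bay area is popular with expats", 1),
    ("best area to live in doha", "my car broke down yesterday", 0),
]
X = np.array([features(q, a) for q, a, _ in pairs])
y = np.array([label for _, _, label in pairs])

clf = LinearSVC().fit(X, y)
print(clf.predict([features("how do I renew my visa",
                            "go to the immigration office with your passport")]))
```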
Proppy: A System to Unmask Propaganda in Online News
We present proppy, the first publicly available real-world, real-time propaganda detection system for online news, which aims at raising awareness, thus potentially limiting the impact of propaganda and helping fight disinformation. The system constantly monitors a number of news sources, deduplicates and clusters the news into events, and organizes the articles about an event on the basis of the likelihood that they contain propagandistic content. The system is trained on known propaganda sources using a variety of stylistic features. The evaluation results on a standard dataset show state-of-the-art results for propaganda detection.