Lessons learned from the evaluation of Spanish Language Models
Given the impact of language models on the field of Natural Language
Processing, a number of Spanish encoder-only masked language models (aka BERTs)
have been trained and released. These models were developed either within large
projects using very large private corpora or by means of smaller scale academic
efforts leveraging freely available data. In this paper we present a
comprehensive head-to-head comparison of language models for Spanish with the
following results: (i) Previously ignored multilingual models from large
companies fare better than monolingual models, substantially changing the
evaluation landscape of language models in Spanish; (ii) Results across the
monolingual models are not conclusive, with supposedly smaller and inferior
models performing competitively. Based on these empirical results, we argue
that more research is needed to understand the factors underlying them. In
this sense, the effects of corpus size, quality and pre-training techniques
need to be further investigated in order to obtain Spanish monolingual models
significantly better than the multilingual ones released by large private
companies, especially in the face of rapid ongoing progress in the field. The
recent activity in the development of language technology for Spanish is to be
welcomed, but our results show that building language models remains an open,
resource-heavy problem which requires marrying resources (monetary and/or
computational) with the best research expertise and practice.
Comment: 11 pages, three tables
T-Projection: High Quality Annotation Projection for Sequence Labeling Tasks
In the absence of readily available labeled data for a given sequence
labeling task and language, annotation projection has been proposed as one of
the possible strategies to automatically generate annotated data. Annotation
projection has often been formulated as the task of transporting, on parallel
corpora, the labels pertaining to a given span in the source language into its
corresponding span in the target language. In this paper we present
T-Projection, a novel approach for annotation projection that leverages large
pretrained text-to-text language models and state-of-the-art machine
translation technology. T-Projection decomposes the label projection task into
two subtasks: (i) a candidate generation step, in which a set of projection
candidates is generated using a multilingual T5 model, and (ii) a candidate
selection step, in which the generated candidates are ranked based on
translation probabilities. We conducted experiments on intrinsic and extrinsic
tasks in 5 Indo-European and 8 low-resource African languages. We demonstrate
that T-Projection outperforms previous annotation projection methods by a wide
margin. We believe that T-Projection can help to automatically alleviate the
lack of high-quality training data for sequence labeling tasks. Code and data
are publicly available.
Comment: Findings of the EMNLP 202
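The two-step pipeline described above can be illustrated with a minimal toy sketch. This is not the authors' implementation: the generation step (an mT5 model in the paper) is stubbed out as span enumeration, and the translation probability (a real MT model in the paper) is replaced with a stand-in lexical-overlap score; all function names are illustrative.

```python
# Toy sketch of T-Projection's two-step label projection (illustrative only).
# Step 1 (candidate generation): in the paper a multilingual T5 model proposes
# target-language spans; here we simply enumerate short sub-spans.
# Step 2 (candidate selection): in the paper candidates are ranked by machine
# translation probabilities; here a lexical-overlap score stands in for that.

def generate_candidates(target_sentence, max_len=3):
    """Stub for the mT5-based generation step: enumerate target sub-spans."""
    tokens = target_sentence.split()
    return [" ".join(tokens[i:j])
            for i in range(len(tokens))
            for j in range(i + 1, min(i + max_len, len(tokens)) + 1)]

def translation_score(source_span, candidate):
    """Stand-in for the MT-based translation probability (Jaccard overlap)."""
    src = set(source_span.lower().split())
    cand = set(candidate.lower().split())
    return len(src & cand) / max(len(src | cand), 1)

def project_label(source_span, target_sentence):
    """Project a labeled source span onto the best-scoring target candidate."""
    candidates = generate_candidates(target_sentence)
    return max(candidates, key=lambda c: translation_score(source_span, c))

# Project an English entity span onto a Spanish translation.
print(project_label("New York", "Vive en New York desde 2010"))  # → New York
```

The exact span "New York" wins because it maximizes overlap with the source span; in the real system the ranking signal is the translation probability assigned by an MT model rather than surface overlap.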
Model and Data Transfer for Cross-Lingual Sequence Labelling in Zero-Resource Settings
Zero-resource cross-lingual transfer approaches aim to apply supervised
models from a source language to unlabelled target languages. In this paper we
perform an in-depth study of the two main techniques employed so far for
cross-lingual zero-resource sequence labelling, based either on data or model
transfer. Although previous research has proposed translation and annotation
projection (data-based cross-lingual transfer) as an effective technique for
cross-lingual sequence labelling, in this paper we experimentally demonstrate
that high capacity multilingual language models applied in a zero-shot
(model-based cross-lingual transfer) setting consistently outperform data-based
cross-lingual transfer approaches. A detailed analysis of our results suggests
that this might be due to important differences in language use. More
specifically, machine translation often generates a textual signal which is
different to what the models are exposed to when using gold standard data,
which affects both the fine-tuning and evaluation processes. Our results also
indicate that data-based cross-lingual transfer approaches remain a competitive
option when high-capacity multilingual language models are not available.Comment: Findings of the Association for Computational Linguistics: EMNLP 202
Applying Deep Learning Techniques for Sentiment Analysis to Assess Sustainable Transport
Users voluntarily generate large amounts of textual content by expressing their opinions, in social media and specialized portals, on every possible issue, including transport and sustainability. In this work we have leveraged such User Generated Content to obtain a high-accuracy sentiment analysis model which automatically analyses the negative and positive opinions expressed in the transport domain. In order to develop such a model, we have semi-automatically generated an annotated corpus of opinions about transport, which has then been used to fine-tune a large pretrained language model based on recent deep learning techniques. Our empirical results demonstrate the robustness of our approach, which can be applied to automatically process massive amounts of opinions about transport. We believe that our method can help to complement data from official statistics and traditional surveys about transport sustainability. Finally, apart from the model and annotated dataset, we also provide a transport classification score with respect to the sustainability of the transport types found in the use case dataset.
This work has been partially funded by the Spanish Ministry of Science, Innovation and Universities (DeepReading RTI2018-096846-B-C21, MCIU/AEI/FEDER, UE), Ayudas Fundación BBVA a Equipos de Investigación Científica 2018 (BigKnowledge), DeepText (KK-2020/00088), funded by the Basque Government, and the COLAB19/19 project funded by the UPV/EHU. Rodrigo Agerri is also funded by the RYC-2017-23647 fellowship and acknowledges the donation of a Titan V GPU by the NVIDIA Corporation.
DeepReading @ SardiStance: Combining Textual, Social and Emotional Features
In this paper we describe our participation in the SardiStance shared task held at EVALITA 2020. We developed a set of classifiers that combined text features, such as the best-performing systems based on large pre-trained language models, together with user profile features, such as psychological traits and social media user interactions. The classification algorithms chosen for our models were various monolingual and multilingual Transformer models for text-only classification, and XGBoost for the non-textual features. The combination of the textual and contextual models was performed by a weighted voting ensemble learning system. Our approach obtained the best score for Task B, on Contextual Stance Detection.
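The weighted voting combination described above can be sketched minimally. This is not the competition code: the weights, label names and probability values are illustrative, standing in for the per-class probabilities produced by the Transformer text classifier and the XGBoost contextual classifier.

```python
# Minimal sketch of a weighted voting ensemble over two classifiers
# (illustrative weights and labels; not the shared-task submission code).

def weighted_vote(prob_text, prob_context, w_text=0.6, w_context=0.4):
    """Combine per-class probabilities from a textual and a contextual
    classifier with fixed weights, and return the highest-scoring class."""
    combined = {label: w_text * prob_text[label] + w_context * prob_context[label]
                for label in prob_text}
    return max(combined, key=combined.get)

# Example: the textual model leans AGAINST, the contextual model leans FAVOR;
# with these weights the textual signal dominates.
print(weighted_vote({"FAVOR": 0.3, "AGAINST": 0.7},
                    {"FAVOR": 0.6, "AGAINST": 0.4}))  # → AGAINST
```

In practice the weights would be tuned on a development set so that the stronger component (here assumed to be the text model) carries more of the final decision.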