Building a Sentiment Corpus of Tweets in Brazilian Portuguese
The large amount of data available in social media, forums and websites
motivates research in several areas of Natural Language Processing, such as
sentiment analysis. The subjective and semantic characteristics of the area
make it popular and motivate research on novel methods and approaches for
classification. Hence, there is high demand for datasets in different domains
and different languages. This paper introduces TweetSentBR, a sentiment corpus
for Brazilian Portuguese manually annotated with 15,000 sentences in the TV show
domain. The sentences were labeled in three classes (positive, neutral and
negative) by seven annotators, following literature guidelines for ensuring
reliability on the annotation. We also ran baseline experiments on polarity
classification using three machine learning methods, reaching 80.99% F-measure
and 82.06% accuracy on binary classification, and 59.85% F-measure and 64.62%
accuracy on three-point classification.
Comment: Accepted for publication at the 11th International Conference on Language Resources and Evaluation (LREC 2018).
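The abstract does not name the three baseline methods; as an illustrative sketch only, a bag-of-words Naive Bayes polarity classifier of the kind often used as such a baseline can look like this (the Portuguese examples and labels below are invented, not TweetSentBR data):

```python
from collections import Counter, defaultdict
import math

def train_nb(docs):
    """Train a multinomial Naive Bayes polarity classifier.
    docs: list of (token_list, label) pairs."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict(model, tokens):
    """Pick the class with the highest log-posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label, c in class_counts.items():
        lp = math.log(c / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:
            lp += math.log((word_counts[label][t] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Toy training set with the corpus's three classes.
train = [
    (["adorei", "o", "programa"], "positive"),
    (["programa", "excelente"], "positive"),
    (["odiei", "o", "episodio"], "negative"),
    (["episodio", "horrivel"], "negative"),
    (["comeca", "agora"], "neutral"),
]
model = train_nb(train)
```

A real baseline would of course be trained and evaluated on the annotated corpus itself.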
Towards Syntactic Iberian Polarity Classification
Lexicon-based methods using syntactic rules for polarity classification rely
on parsers that are dependent on the language and on treebank guidelines. Thus,
rules are also dependent and require adaptation, especially in multilingual
scenarios. We tackle this challenge in the context of the Iberian Peninsula,
releasing the first symbolic syntax-based Iberian system with rules shared
across five official languages: Basque, Catalan, Galician, Portuguese and
Spanish. The model is made available.
Comment: 7 pages, 5 tables. Contribution to the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA-2017) at EMNLP 2017.
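The shared syntactic rules themselves are not reproduced in the abstract; a minimal sketch of a lexicon-based polarity scorer with a negation-scope rule (a crude window-based stand-in for the dependency-based scope rules the paper describes; the lexicon and negator entries are invented):

```python
# Invented polarity lexicon and negators spanning the Iberian languages.
LEXICON = {"bom": 1, "excelente": 2, "mau": -1, "horrivel": -2}
NEGATORS = {"nao", "no", "non", "ez"}

def polarity(tokens, scope=3):
    """Sum lexicon scores; a negator flips the polarity of sentiment
    words within the next `scope` tokens (window-based scope rule)."""
    score, flip_left = 0, 0
    for tok in tokens:
        if tok in NEGATORS:
            flip_left = scope
            continue
        s = LEXICON.get(tok, 0)
        if s and flip_left:
            s = -s          # negation flips the word's polarity
        score += s
        if flip_left:
            flip_left -= 1
    return score
```

A parser-based system would compute the negation scope from the dependency tree instead of a fixed window, which is what makes the rules treebank-dependent.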
Using geolocated tweets for characterization of Twitter in Portugal and the Portuguese administrative regions
The information published by the millions of public social network users is an important source of knowledge that can be used in academic, socioeconomic or demographic studies (distribution of male and female population, age, marital status, birth), lifestyle analysis (interests, hobbies, social habits), or studies of online behavior (time spent online, interaction with friends, or discussion about brands, products or politics). This work uses a database of about 27 million Portuguese geolocated tweets, produced in Portugal by 97.8K users during a 1-year period, to extract information about the behavior of the geolocated Portuguese Twitter community, and shows that with this information it is possible to extract overall indicators such as: the daily periods of increased activity per region; predictions of the regions where the concentration of the population is higher or lower in certain periods of the year; how regional inhabitants feel about life; or what is talked about in each region. We also analyze the behavior of the geolocated Portuguese Twitter users based on the tweeted contents, and find indications that their behavior differs in certain relevant aspects from other Twitter communities, hypothesizing that this is in part due to the abnormally high percentage of young teenagers in the community. Finally, we present a small case study on Portuguese tourism in the Algarve region. To the best of our knowledge, this is the first study of geolocated Portuguese users' behavior on Twitter that focuses on regional geographic use.
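One of the indicators mentioned above, the daily period of increased activity per region, reduces to a simple aggregation over (region, timestamp) pairs; a minimal sketch (the sample tweets and region names are invented):

```python
from collections import Counter
from datetime import datetime

def peak_hours(tweets):
    """Return each region's busiest hour of day.
    tweets: iterable of (region, ISO-8601 timestamp) pairs."""
    counts = {}  # region -> Counter of hour-of-day
    for region, ts in tweets:
        hour = datetime.fromisoformat(ts).hour
        counts.setdefault(region, Counter())[hour] += 1
    return {r: c.most_common(1)[0][0] for r, c in counts.items()}

# Invented sample in place of the 27M-tweet database.
sample = [
    ("Lisboa", "2016-05-01T22:10:00"),
    ("Lisboa", "2016-05-01T22:45:00"),
    ("Lisboa", "2016-05-02T09:00:00"),
    ("Algarve", "2016-05-01T15:30:00"),
]
```

The study's other indicators (population concentration by season, regional topics) follow the same group-and-count pattern over different keys.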
Recognizing Emotions in Short Texts
Master's thesis, Cognitive Science, Universidade de Lisboa, Faculdade de Ciências, 2022.
Automatic emotion recognition from text is a task that mobilizes the areas of natural language processing
and affective computing, with the special contribution of Cognitive Science disciplines such as
Artificial Intelligence and Computer Science, Linguistics and Psychology. It aims at the detection and
interpretation of human emotions expressed in the written form by computational systems.
The interaction of affective and cognitive processes, the essential role that emotions play in
interpersonal interactions and the currently increasing use of written communication online make
automatic emotion recognition progressively important, namely in areas such as mental healthcare,
human-computer interaction, political science, or marketing.
The English language has been the main target of studies in emotion recognition in text and the
work developed for the Portuguese language is still scarce. Thus, there is a need to expand the work
developed for English to Portuguese.
The goal of this dissertation is to present and compare two distinct deep learning methods
resulting from the advances in Artificial Intelligence to automatically detect and classify discrete
emotional states in texts written in Portuguese.
For this, the classification approach of Polignano et al. (2019) based on deep learning networks
such as bidirectional Long Short-Term Memory and convolutional networks mediated by a self-attention
layer will be replicated for English and reproduced for Portuguese. For English, the
SemEval-2018 Task 1 dataset (Mohammad et al., 2018) will be used, as in the original experiment, and
it considers four discrete emotions: anger, fear, joy, and sadness. For Portuguese, considering the lack
of available emotionally annotated datasets, data will be collected from the social network Twitter using
hashtags associated with specific emotional content to determine the underlying emotion of the text from
the same four emotions present in the English dataset. According to experiments carried out by
Mohammad & Kiritchenko (2015), this method of data collection is consistent with the annotation of
trained human judges.
Considering the fast and continuous evolution of deep learning methods for natural language
processing and the state-of-the-art results achieved by recent methods in tasks in this area such as the
pre-trained language model BERT (Bidirectional Encoder Representations from Transformers)
(Devlin et al., 2019), this approach will also be applied to the task of emotion recognition for both
languages using the same datasets from the previous experiments. It is expected to draw conclusions
about the adequacy of these two presented approaches in emotion recognition and to contribute to the
state of the art in this task for the Portuguese language.
While the approach of Polignano et al. performed better in our experiments with English
data, by an F1-score margin of 0.02, for Portuguese we obtained the best result with BERT,
with a maximum F1 score of 0.6124.
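The hashtag-based distant labeling described above can be sketched as follows (the Portuguese hashtag-to-emotion map is illustrative; the thesis's actual hashtag list is not given in this abstract):

```python
# Hypothetical hashtag-to-emotion map for the four target emotions.
EMOTION_HASHTAGS = {"#raiva": "anger", "#medo": "fear",
                    "#alegria": "joy", "#tristeza": "sadness"}

def distant_label(tweet):
    """Return (clean_text, emotion) if the tweet ends with exactly one
    emotion hashtag, else None (unlabeled or ambiguous)."""
    tokens = tweet.strip().split()
    if not tokens:
        return None
    last = tokens[-1].lower()
    if last not in EMOTION_HASHTAGS:
        return None
    body = " ".join(tokens[:-1])
    # Discard tweets carrying a second emotion hashtag in the body.
    if any(h in body.lower() for h in EMOTION_HASHTAGS):
        return None
    return body, EMOTION_HASHTAGS[last]
```

Restricting labels to a single trailing hashtag is one common precaution with this method, since Mohammad & Kiritchenko's validation of hashtag labels against trained annotators assumes the hashtag unambiguously marks the emotion.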
Methods for improving entity linking and exploiting social media messages across crises
Entity Linking (EL) is the task of automatically identifying entity mentions in texts and resolving them to a corresponding entity in a reference knowledge base (KB). A large number of tools is available for different types of documents and domains; however, the literature on entity linking has shown that the quality of a tool varies across corpora and depends on specific characteristics of the corpus it is applied to. Moreover, the lack of precision on particularly ambiguous mentions often spoils the usefulness of automated disambiguation results in real-world applications.
In the first part of this thesis I explore an approximation of the difficulty of linking entity mentions and frame it as a supervised classification task. Classifying entity mentions that are difficult to disambiguate can help identify critical cases as part of a semi-automated system, while detecting latent corpus characteristics that affect entity linking performance. Moreover, despite the large number of entity linking tools proposed over the past years, some tools work better on short mentions while others perform better when there is more contextual information. To this end, I proposed a solution that exploits the results of distinct entity linking tools on the same corpus by leveraging their individual strengths on a per-mention basis. The proposed solution proved effective and outperformed the individual entity linking systems employed in a series of experiments.
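The per-mention combination idea can be sketched as a tiny selector that learns, from development data, which tool tends to be correct for each mention-length bucket (the tool names and the single length feature are illustrative; the thesis frames this as a richer supervised classifier):

```python
from collections import defaultdict

def learn_selector(dev):
    """dev: list of (mention, {tool: was_correct}) pairs.
    Learn which tool is right most often per length bucket."""
    wins = defaultdict(lambda: defaultdict(int))
    for mention, results in dev:
        bucket = "short" if len(mention.split()) == 1 else "long"
        for tool, correct in results.items():
            wins[bucket][tool] += int(correct)
    return {b: max(t, key=t.get) for b, t in wins.items()}

def select_tool(selector, mention):
    """Route a new mention to the tool favored for its bucket."""
    bucket = "short" if len(mention.split()) == 1 else "long"
    return selector[bucket]

# Invented development data for two hypothetical tools.
dev = [
    ("Paris", {"toolA": True, "toolB": False}),
    ("Lisbon", {"toolA": True, "toolB": False}),
    ("Paris Hilton", {"toolA": False, "toolB": True}),
]
selector = learn_selector(dev)
```

In practice one would use many more features (context length, mention ambiguity, candidate counts) and a proper classifier rather than a single length bucket.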
An important component of most entity linking tools is the probability that a mention links to an entity in a reference knowledge base, and this probability is usually computed over a static snapshot of a reference KB. However, an entity's popularity is temporally sensitive and may change due to short-term events. These changes may then be reflected in a KB, and EL tools can produce different results for a given mention at different times. I investigated how the prior probability changes over time, and the overall disambiguation performance when using KB snapshots from different time periods.
The second part of this thesis is mainly concerned with short texts. Social media has become an integral part of modern society. Twitter, for instance, is one of the most popular social media platforms in the world, enabling people to share their opinions and post short messages about any subject on a daily basis. First, I presented an
approach to identifying informative messages during catastrophic events using deep learning techniques. Automatically detecting informative messages posted by users during major events can enable professionals involved in crisis management to better estimate damages from the relevant information posted on social media channels, and to act immediately. I also performed an analysis of Twitter messages posted during the Covid-19 pandemic: I collected 4 million tweets posted in Portuguese since the beginning of the pandemic and provided an analysis of the debate around it, using topic modeling, sentiment analysis and hashtag recommendation techniques to provide insights into the online discussion of the Covid-19 pandemic.
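The mention-to-entity prior discussed above is commonly estimated from anchor-text counts in a KB snapshot; a minimal sketch (the counts and entity identifiers are invented):

```python
def link_prior(anchor_counts, mention):
    """Prior p(entity | mention): the fraction of anchors with this
    surface form that point to each entity in one KB snapshot."""
    counts = anchor_counts.get(mention, {})
    total = sum(counts.values())
    if not total:
        return {}
    return {entity: c / total for entity, c in counts.items()}

# Invented anchor statistics for the surface form "Paris".
anchors = {"Paris": {"Paris_(France)": 90, "Paris_Hilton": 10}}
```

Recomputing this table from snapshots taken at different times is exactly what exposes the temporal sensitivity the thesis investigates: a burst of news coverage can shift the distribution and change the top-ranked entity.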
Sabiá: Portuguese Large Language Models
As the capabilities of language models continue to advance, it is conceivable
that a "one-size-fits-all" model will remain the main paradigm. For instance,
given the vast number of languages worldwide, many of which are low-resource,
the prevalent practice is to pretrain a single model on multiple languages. In
this paper, we add to the growing body of evidence that challenges this
practice, demonstrating that monolingual pretraining on the target language
significantly improves models already extensively trained on diverse corpora.
More specifically, we further pretrain GPT-J and LLaMA models on Portuguese
texts using 3% or less of their original pretraining budget. Few-shot
evaluations on Poeta, a suite of 14 Portuguese datasets, reveal that our models
outperform English-centric and multilingual counterparts by a significant
margin. Our best model, Sabiá-65B, performs on par with GPT-3.5-turbo. By
evaluating on datasets originally conceived in the target language as well as
translated ones, we study the contributions of language-specific pretraining in
terms of 1) capturing linguistic nuances and structures inherent to the target
language, and 2) enriching the model's knowledge about a domain or culture. Our
results indicate that the majority of the benefits stem from the
domain-specific knowledge acquired through monolingual pretraining.
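The few-shot evaluation mentioned above reduces to assembling k demonstration examples plus the test input into one prompt; a minimal sketch (the template and field names are illustrative, not Poeta's actual format):

```python
def few_shot_prompt(demos, test_input, template="{x}\nResposta: {y}"):
    """Build a k-shot prompt: k solved demonstrations followed by the
    test input with an empty answer slot for the model to complete."""
    shots = [template.format(x=x, y=y) for x, y in demos]
    shots.append(template.format(x=test_input, y="").rstrip())
    return "\n\n".join(shots)
```

The model's continuation after the final "Resposta:" is then scored against the gold label, which is how suites like Poeta typically compare base models without fine-tuning.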