Tracking Dengue Epidemics using Twitter Content Classification and Topic Modelling
Detecting and preventing outbreaks of mosquito-borne diseases such as Dengue
and Zika in Brazil and other tropical regions has long been a priority for
governments in affected areas. Streaming social media content, such as Twitter,
is increasingly being used for health surveillance applications such as flu
detection. However, previous work has not addressed the complexity of drastic
seasonal changes in Twitter content across multiple epidemic outbreaks. In
order to address this gap, this paper contrasts two complementary approaches to
detecting Twitter content that is relevant for Dengue outbreak detection,
namely supervised classification and unsupervised clustering using topic
modelling. Each approach has benefits and shortcomings. Our classifier achieves
a prediction accuracy of about 80% based on a small training set of about
1,000 instances, but the need for manual annotation makes it hard to track
seasonal changes in the nature of the epidemics, such as the emergence of new
types of virus in certain geographical locations. In contrast, LDA-based topic
modelling scales well, generating cohesive and well-separated clusters from
larger samples. While clusters can easily be re-generated following changes in
the epidemics, this approach makes it hard to cleanly segregate relevant
tweets into well-defined clusters.

Comment: Procs. SoWeMine, co-located with ICWE 2016, Lugano, Switzerland.
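As a concrete illustration of the unsupervised side of the comparison above, a minimal LDA topic-modelling sketch in Python; the tiny corpus, the topic count, and the scikit-learn choices are illustrative assumptions, not the authors' setup:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = [
    "dengue outbreak reported in the city this week",
    "mosquito control teams spraying against dengue",
    "fever and joint pain after a mosquito bite",
    "traffic jam downtown again this morning",
    "new restaurant opened near the beach",
    "zika virus cases rising in the region",
]

# Bag-of-words counts are the input to LDA.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(tweets)

# Two latent topics, roughly "epidemic-related" vs. "everything else".
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one topic distribution per tweet

print(doc_topics.shape)  # (6, 2)
```

Because clustering needs no manual annotation, re-running this over a fresh sample is cheap, which is the scalability advantage the abstract contrasts with the supervised classifier.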
Automatic Analysis of Facebook Posts and Comments Written in Brazilian Portuguese
Social networks and media are becoming increasingly important sources for knowing people's opinions and sentiments on a wide variety of topics. The huge number of messages published daily in these media makes it impractical to analyze them without the help of natural language processing systems. This article presents an approach to cluster texts by similarity and to identify the sentiments expressed by comments on them (positive, negative and neutral, among others) in an integrated manner. Unlike most of the available studies, which focus on the English language and use Twitter as a data source, we treat Brazilian Portuguese posts and comments published on Facebook. The proposed approach employs an unsupervised learning algorithm to group posts and a supervised algorithm to identify the sentiments expressed in comments to posts. In an experimental evaluation, a system implementing the proposed approach showed accuracy similar to that of human evaluators in the clustering and sentiment analysis tasks, but performed them in much less time.
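The two-stage pipeline described above (unsupervised grouping of posts, then supervised sentiment classification of comments) can be sketched as follows. The toy English data and the KMeans/Naive Bayes choices are assumptions for illustration; the original work targets Brazilian Portuguese Facebook data and does not name these particular algorithms here:

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans
from sklearn.naive_bayes import MultinomialNB

# Stage 1: unsupervised clustering groups similar posts (no labels needed).
posts = [
    "new phone has a great camera",
    "phone camera quality is great",
    "new bike lanes open downtown",
    "downtown bike lanes extended",
]
tfidf = TfidfVectorizer()
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    tfidf.fit_transform(posts)
)

# Stage 2: a supervised model labels the sentiment of comments.
comments = ["love it", "great news", "terrible idea", "really bad move"]
labels = ["positive", "positive", "negative", "negative"]
bow = CountVectorizer()
clf = MultinomialNB().fit(bow.fit_transform(comments), labels)
pred = clf.predict(bow.transform(["bad idea"]))
```

The integration point is that each cluster of posts can then be summarized by the sentiment distribution of the comments attached to its posts.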
Sabiá: Portuguese Large Language Models
As the capabilities of language models continue to advance, it is conceivable
that a "one-size-fits-all" model will remain the main paradigm. For instance,
given the vast number of languages worldwide, many of which are low-resource,
the prevalent practice is to pretrain a single model on multiple languages. In
this paper, we add to the growing body of evidence that challenges this
practice, demonstrating that monolingual pretraining on the target language
significantly improves models already extensively trained on diverse corpora.
More specifically, we further pretrain GPT-J and LLaMA models on Portuguese
texts using 3% or less of their original pretraining budget. Few-shot
evaluations on Poeta, a suite of 14 Portuguese datasets, reveal that our models
outperform English-centric and multilingual counterparts by a significant
margin. Our best model, Sabiá-65B, performs on par with GPT-3.5-turbo. By
evaluating on datasets originally conceived in the target language as well as
translated ones, we study the contributions of language-specific pretraining in
terms of 1) capturing linguistic nuances and structures inherent to the target
language, and 2) enriching the model's knowledge about a domain or culture. Our
results indicate that the majority of the benefits stem from the
domain-specific knowledge acquired through monolingual pretraining.
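For scale, a back-of-the-envelope reading of the "3% or less" budget figure above, using the pretraining token counts publicly reported for the base models (approximately 402B tokens for GPT-J and 1.4T for LLaMA-65B); treat these as rough external figures, not numbers from this paper:

```python
# Reported pretraining token counts for the base models (approximate,
# from their public releases; not figures from this abstract).
reported_tokens = {"GPT-J": 402e9, "LLaMA-65B": 1.4e12}

# "3% or less of the original pretraining budget" in absolute tokens.
extra_budget = {name: 0.03 * t for name, t in reported_tokens.items()}
for name, t in sorted(extra_budget.items()):
    print(f"{name}: up to {t / 1e9:.1f}B further-pretraining tokens")
```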
Transportation in Social Media: an automatic classifier for travel-related tweets
In recent years, researchers in the field of intelligent transportation
systems have made several efforts to extract valuable information from social
media streams. However, collecting domain-specific data from any social media
is a challenging task demanding appropriate and robust classification methods.
In this work we focus on exploring geo-located tweets in order to create a
travel-related tweet classifier using a combination of bag-of-words and word
embeddings. The resulting classifier makes it possible to identify
interesting spatio-temporal relations in São Paulo and Rio de Janeiro.
Social Media Text Processing and Semantic Analysis for Smart Cities
With the rise of Social Media, people obtain and share information almost
instantly on a 24/7 basis. Many research areas have tried to gain valuable
insights from these large volumes of freely available user generated content.
With the goal of extracting knowledge from social media streams that might be
useful in the context of intelligent transportation systems and smart cities,
we designed and developed a framework that provides functionalities for
parallel collection of geo-located tweets from multiple pre-defined bounding
boxes (cities or regions), including filtering of non-complying tweets, text
pre-processing for Portuguese and English language, topic modeling, and
transportation-specific text classifiers, as well as aggregation and data
visualization.
We performed an exploratory data analysis of geo-located tweets in 5
different cities: Rio de Janeiro, São Paulo, New York City, London and
Melbourne, comprising a total of more than 43 million tweets in a period of 3
months. Furthermore, we performed a large scale topic modelling comparison
between Rio de Janeiro and São Paulo. Interestingly, most of the topics are
shared between the two cities, which, despite being in the same country, are
considered very different in terms of population, economy and lifestyle.
We take advantage of recent developments in word embeddings and train such
representations from the collections of geo-located tweets. We then use a
combination of bag-of-embeddings and traditional bag-of-words to train
travel-related classifiers in both Portuguese and English to filter
travel-related content from non-related. We created specific gold-standard data
to perform empirical evaluation of the resulting classifiers. Results are in
line with research work in other application areas by showing the robustness of
using word embeddings to learn word similarities that bag-of-words is not able
to capture.
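The parallel collection step described above hinges on pre-defined bounding boxes per city. A minimal sketch of that filtering, with illustrative placeholder coordinates rather than the authors' actual boxes:

```python
from dataclasses import dataclass

@dataclass
class BoundingBox:
    south: float
    west: float
    north: float
    east: float

    def contains(self, lat: float, lon: float) -> bool:
        """True if the point falls inside the box."""
        return self.south <= lat <= self.north and self.west <= lon <= self.east

# Illustrative box roughly around Rio de Janeiro (placeholder values).
boxes = {"rio": BoundingBox(south=-23.1, west=-43.8, north=-22.7, east=-43.1)}

tweets = [
    {"lat": -22.9, "lon": -43.2, "text": "praia hoje"},  # inside the box
    {"lat": 40.7, "lon": -74.0, "text": "nyc morning"},  # outside (New York)
]
kept = [t for t in tweets if boxes["rio"].contains(t["lat"], t["lon"])]
print(len(kept))  # 1
```

Running one such filter per city in parallel, plus language-specific pre-processing downstream, gives the shape of the framework the abstract describes.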
Knowledge-enhanced document embeddings for text classification
Accurate semantic representation models are essential in text mining applications. For a successful application of the text mining process, the text representation adopted must preserve the patterns of interest to be discovered. Although competitive results for automatic text classification may be achieved with a traditional bag of words, such a representation model cannot provide satisfactory classification performance in hard settings where richer text representations are required. In this paper, we present an approach to represent document collections based on embedded representations of words and word senses. We bring together the power of word sense disambiguation and the semantic richness of word- and word-sense embedded vectors to construct embedded representations of document collections. Our approach results in semantically enhanced and low-dimensional representations. We overcome the lack of interpretability of embedded vectors, which is a drawback of this kind of representation, with the use of word sense embedded vectors. Moreover, the experimental evaluation indicates that the use of the proposed representations provides stable classifiers with strong quantitative results, especially in semantically-complex classification scenarios.
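A minimal sketch of the core idea above: building a document embedding by averaging word-sense vectors, so that different senses of the same surface word stay separated in the representation. The tiny hand-made 2-d vectors below are stand-ins for trained word-sense embeddings, and the sense labels are hypothetical:

```python
import numpy as np

# Hand-made 2-d sense vectors standing in for trained sense embeddings.
word_vecs = {
    "bank_finance": np.array([1.0, 0.0]),
    "bank_river":   np.array([0.0, 1.0]),
    "money":        np.array([0.9, 0.1]),
    "water":        np.array([0.1, 0.9]),
}

def doc_embedding(senses):
    """Average the sense vectors of a disambiguated document."""
    return np.mean([word_vecs[s] for s in senses], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

d1 = doc_embedding(["bank_finance", "money"])  # finance document
d2 = doc_embedding(["bank_river", "water"])    # river document

# Sense-level vectors keep the two "bank" documents clearly apart.
print(cosine(d1, d2) < 0.2)  # True
```

With plain word vectors the single "bank" vector would pull both documents together; disambiguating to sense vectors first is what keeps the representation both low-dimensional and discriminative.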