PolyHope: Two-Level Hope Speech Detection from Tweets
Hope is characterized as openness of spirit toward the future, a desire,
expectation, and wish for something to happen or to be true that remarkably
affects a person's state of mind, emotions, behaviors, and decisions. Hope is
usually associated with concepts of desired expectations and
possibility/probability concerning the future. Despite its importance, hope has
rarely been studied as a social media analysis task. This paper presents a hope
speech dataset that classifies each tweet first into "Hope" and "Not Hope",
then into three fine-grained hope categories: "Generalized Hope", "Realistic
Hope", and "Unrealistic Hope" (along with "Not Hope"). English tweets in the
first half of 2022 were collected to build this dataset. Furthermore, we
describe our annotation process and guidelines in detail and discuss the
challenges of classifying hope and the limitations of the existing hope speech
detection corpora. In addition, we report several baselines based on
different learning approaches, such as traditional machine learning, deep
learning, and transformers, to benchmark our dataset. We evaluated our
baselines using weighted-averaged and macro-averaged F1-scores. Observations
show that a strict process for annotator selection and detailed annotation
guidelines enhanced the dataset's quality. This strict annotation process
resulted in promising performance for simple machine learning classifiers with
only bi-grams; however, binary and multiclass hope speech detection results
reveal that contextual embedding models perform better on this
dataset.
Comment: 20 pages, 9 figures
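The abstract above evaluates its baselines with macro-averaged and weighted-averaged F1-scores. As a minimal sketch of how the two averages differ, the following computes both from scratch. The function names and the toy labels are illustrative assumptions and are not taken from the PolyHope paper or its dataset.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label in y_true."""
    scores = {}
    for label in set(y_true):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by each class's support in y_true."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Hypothetical, imbalanced toy labels (not real PolyHope data).
y_true = ["not_hope"] * 8 + ["hope"] * 2
y_pred = ["not_hope"] * 8 + ["not_hope", "hope"]

print(round(macro_f1(y_true, y_pred), 3))     # 0.804
print(round(weighted_f1(y_true, y_pred), 3))  # 0.886
```

On imbalanced data the weighted average is pulled toward the majority class, while the macro average exposes weak performance on the minority class, which is presumably why both are reported.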
Automatic Translation of Hate Speech to Non-hate Speech in Social Media Texts
In this paper, we investigate the issue of hate speech by presenting a novel
task of translating hate speech into non-hate speech text while preserving its
meaning. As a case study, we use Spanish texts. We provide a dataset and
several baselines as a starting point for further research in the task. We
evaluated our baseline results using multiple metrics, including BLEU scores.
The aim of this study is to contribute to the development of more effective
methods for reducing the spread of hate speech in online communities.
Creation and Evaluation of an Emotion-Tagged, Weighted Dictionary for Spanish (Creación y evaluación de un diccionario marcado con emociones y ponderado para el español)
This article presents a method for creating dictionaries tagged with a specific value (for example, emotions or polarity) for use in various computer-based natural language processing tasks. In the resulting dictionary, the selected words are labeled with six basic emotions. To this end, the words were first analyzed (annotated) manually by multiple evaluators and then weighted automatically based on those annotations. The method was applied to Spanish. The words in the dictionary were labeled with the basic emotional categories: joy, anger, fear, sadness, surprise, and disgust. Unlike other machine-readable dictionaries, the proposed dictionary contains weightings: percentages expressing the probability that a word is used in an emotional sense. Each word was rated by multiple evaluators, and an agreement analysis was then carried out using the weighted kappa method, adapted for multiple evaluators. Based on the results, a measure was proposed that estimates how frequently a word is used affectively: the probability factor of affective use (FPA, from the Spanish "factor de probabilidad de uso afectivo"), which assigns a weight to potentially emotional words. The FPA can be included as information in automatic systems, for example, for sentiment detection in text. The FPA reflects the tendency of use of each word; it is not an absolute characteristic. It is thus useful for automatic systems.
UrduFake@FIRE2021: Shared Track on Fake News Identification in Urdu
This study reports on the second shared task, UrduFake@FIRE2021, on
fake news detection in the Urdu language. This is a binary
classification problem in which the task is to classify a given news article
into one of two classes: (i) real news, or (ii) fake news. In this shared task, 34
teams from 7 different countries (China, Egypt, Israel, India, Mexico,
Pakistan, and UAE) registered to participate; 18 teams
submitted their experimental results, and 11 teams submitted their technical
reports. The proposed systems were based on various count-based features and
used different classifiers as well as neural network architectures. The
stochastic gradient descent (SGD) algorithm outperformed other classifiers and
achieved an F-score of 0.679.
Overview of the Shared Task on Fake News Detection in Urdu at FIRE 2021
Automatic detection of fake news is a highly important task in the
contemporary world. This study reports on the 2nd shared task,
UrduFake@FIRE2021, on fake news detection in Urdu. The goal of the
shared task is to motivate the community to come up with efficient methods for
solving this vital problem, particularly for the Urdu language. The task is
posed as a binary classification problem to label a given news article as a
real or a fake news article. The organizers provide a dataset comprising news
in five domains: (i) Health, (ii) Sports, (iii) Showbiz, (iv) Technology, and
(v) Business, split into training and testing sets. The training set contains
1300 annotated news articles (750 real, 550 fake), while the testing
set contains 300 news articles (200 real, 100 fake). 34 teams from 7
different countries (China, Egypt, Israel, India, Mexico, Pakistan, and UAE)
registered to participate in the UrduFake@FIRE2021 shared task. Out of those,
18 teams submitted their experimental results, and 11 of those submitted their
technical reports, which is substantially higher compared to the UrduFake
shared task in 2020 when only 6 teams submitted their technical reports. The
technical reports submitted by the participants demonstrated different data
representation techniques ranging from count-based BoW features to word vector
embeddings as well as the use of numerous machine learning algorithms ranging
from traditional SVM to various neural network architectures including
Transformers such as BERT and RoBERTa. In this year's competition, the best
performing system obtained an F1-macro score of 0.679, which is lower than the
past year's best result of 0.907 F1-macro. Admittedly, while training sets from
the past and the current years overlap to a large extent, the testing set
provided this year is completely different.
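To put the reported F1-macro scores in context, the class counts given in the overview can be turned into a trivial reference point. The sketch below computes the class shares of each split and the F1-macro of a hypothetical classifier that always predicts "real" on the test set. This baseline is an illustration derived from the stated counts; it is not a result reported by the organizers.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Class counts as stated in the overview.
train = {"real": 750, "fake": 550}
test = {"real": 200, "fake": 100}

train_real_share = train["real"] / sum(train.values())  # 750 / 1300
test_real_share = test["real"] / sum(test.values())     # 200 / 300

# Hypothetical baseline: label every test article as "real".
f1_real = f1(tp=test["real"], fp=test["fake"], fn=0)  # 400 / 500 = 0.8
f1_fake = f1(tp=0, fp=0, fn=test["fake"])             # 0.0
majority_macro_f1 = (f1_real + f1_fake) / 2           # 0.4

print(f"train real share: {train_real_share:.3f}")  # 0.577
print(f"test real share:  {test_real_share:.3f}")   # 0.667
print(f"always-real F1-macro on test: {majority_macro_f1:.3f}")  # 0.400
```

Under these assumptions, an always-real predictor reaches an F1-macro of 0.4, so the best reported score of 0.679 sits well above that floor even though it is lower than the previous year's result.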
Enhancing Translation for Indigenous Languages: Experiments with Multilingual Models
This paper describes CIC NLP's submission to the AmericasNLP 2023 Shared Task
on machine translation systems for indigenous languages of the Americas. We
present the system descriptions for three methods. We used two multilingual
models, namely M2M-100 and mBART50, and one bilingual (one-to-one) model, the Helsinki
NLP Spanish-English translation model, and experimented with different transfer
learning setups. We experimented with 11 languages from the Americas and report the
setups we used as well as the results we achieved. Overall, the mBART setup was
able to improve upon the baseline for three out of the eleven languages.
Comment: Accepted to Third Workshop on NLP for Indigenous Languages of the Americas
Adapting Cross-Genre Author Profiling to Language and Corpus: Notebook for PAN at CLEF 2016
This paper presents our approach to the Author Profiling (AP) task at PAN 2016. The task aims at identifying the author's age and gender under cross-genre AP conditions in three languages: English, Spanish, and Dutch. Our preprocessing stage includes reducing non-textual features to their corresponding semantic classes. We exploit typed character n-grams, lexical features, and non-textual features (domain names). We experimented with various feature representations (binary, raw frequency, normalized frequency, second order attributes (SOA), tf-idf) and machine learning algorithms (liblinear and libSVM implementations of Support Vector Machines (SVM), multinomial naive Bayes, logistic regression). For textual feature selection, we applied the transition point technique, except when SOA was used. We found that the optimal configuration was different for different languages at each stage.
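Three of the feature representations named in the abstract (binary, raw frequency, normalized frequency) can be illustrated on character n-grams. This is a minimal sketch under simplifying assumptions: it uses plain, untyped character n-grams rather than the typed n-grams the authors describe, and the function names `char_ngrams` and `represent` are hypothetical. SOA and tf-idf are omitted because they depend on corpus-level statistics.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """All overlapping character n-grams of text (untyped, for illustration)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def represent(text, n=3, scheme="raw"):
    """Map text to a sparse {n-gram: value} feature dict under a given scheme."""
    counts = Counter(char_ngrams(text, n))
    total = sum(counts.values())
    if scheme == "binary":
        # 1 if the n-gram occurs at all, regardless of how often.
        return {gram: 1 for gram in counts}
    if scheme == "raw":
        # Absolute occurrence counts.
        return dict(counts)
    if scheme == "normalized":
        # Counts divided by the total number of n-grams in the text.
        return {gram: c / total for gram, c in counts.items()} if total else {}
    raise ValueError(f"unknown scheme: {scheme}")


print(represent("banana", scheme="raw"))         # {'ban': 1, 'ana': 2, 'nan': 1}
print(represent("banana", scheme="binary"))      # {'ban': 1, 'ana': 1, 'nan': 1}
print(represent("banana", scheme="normalized"))  # {'ban': 0.25, 'ana': 0.5, 'nan': 0.25}
```

Each dict could then be vectorized and passed to a classifier; which scheme works best is an empirical question, consistent with the abstract's finding that the optimal configuration varied by language.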