Listening between the Lines: Learning Personal Attributes from Conversations
Open-domain dialogue agents must be able to converse about many topics while
incorporating knowledge about the user into the conversation. In this work we
address the acquisition of such knowledge, for personalization in downstream
Web applications, by extracting personal attributes from conversations. This
problem is more challenging than the established task of information extraction
from scientific publications or Wikipedia articles, because dialogues often
give merely implicit cues about the speaker. We propose methods for inferring
personal attributes, such as profession, age or family status, from
conversations using deep learning. Specifically, we propose several Hidden
Attribute Models, which are neural networks leveraging attention mechanisms and
embeddings. Our methods are trained on a per-predicate basis to output rankings
of object values for a given subject-predicate combination (e.g., ranking the
doctor and nurse professions high when speakers talk about patients, emergency
rooms, etc.). Experiments with various conversational texts including Reddit
discussions, movie scripts and a collection of crowdsourced personal dialogues
demonstrate the viability of our methods and their superior performance
compared to state-of-the-art baselines.
Comment: published in WWW'1
Profiling hate speech spreaders on twitter task at PAN 2021
[EN] This overview presents the Author Profiling shared task at PAN 2021. The focus of this year's task is on determining whether or not the author of a Twitter feed is keen to spread hate speech. The main aim is to show the feasibility of automatically identifying potential hate speech spreaders on Twitter. For this purpose a corpus with Twitter data has been provided, covering the English and Spanish languages. Altogether, the approaches of 66 participants have been evaluated.
First of all, we thank the participants: again 66 this year, as the previous year on Profiling Fake
News Spreaders! We have to thank also Martin Potthast, Matti Wiegmann, Nikolay Kolyada, and
Magdalena Anna Wolska for their technical support with the TIRA platform. We thank Symanto
for sponsoring again the award for the best performing system at the author profiling shared
task. The work of Francisco Rangel was partially funded by the Centre for the Development
of Industrial Technology (CDTI) of the Spanish Ministry of Science and Innovation under the
research project IDI-20210776 on Proactive Profiling of Hate Speech Spreaders - PROHATER
(Perfilador Proactivo de Difusores de Mensajes de Odio). The work of the researchers from
Universitat Politècnica de València was partially funded by the Spanish MICINN under the
project MISMIS-FAKEnHATE on MISinformation and MIScommunication in social media: FAKE
news and HATE speech (PGC2018-096212-B-C31), and by the Generalitat Valenciana under
the project DeepPattern (PROMETEO/2019/121). This article is also based upon work from the
Dig-ForAsp COST Action 17124 on Digital Forensics: evidence analysis via intelligent systems
and practices, supported by European Cooperation in Science and Technology.
Rangel, F.; Peña-Sarracén, GLDL.; Chulvi-Ferriols, MA.; Fersini, E.; Rosso, P. (2021). Profiling hate speech spreaders on twitter task at PAN 2021. CEUR. 1772-1789. http://hdl.handle.net/10251/1906631772178
Using TF-IDF n-gram and word embedding cluster ensembles for author profiling: Notebook for PAN at CLEF 2017
This paper presents our approach and results for the 2017 PAN Author Profiling Shared Task. Language-specific corpora were provided for four languages: Spanish, English, Portuguese, and Arabic. Each corpus consisted of tweets authored by a number of Twitter users labeled with their gender and the specific variant of their language used in the documents (e.g. Brazilian or European Portuguese). The task was to develop a system to infer the same attributes for unseen Twitter users. Our system employs an ensemble of two probabilistic classifiers: a logistic regression classifier trained on TF-IDF transformed n-grams and a Gaussian Process classifier trained on word embedding clusters derived from an additional, external corpus of tweets.
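The two ingredients named in this abstract, TF-IDF weighted n-grams and an ensemble of probabilistic classifiers, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names (`word_ngrams`, `tfidf`, `average_probabilities`), the unsmoothed IDF formula, and the use of probability averaging (soft voting) to combine the two classifiers are all assumptions made for illustration; the abstract does not state how the ensemble is combined.

```python
import math
from collections import Counter

def word_ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max as space-joined strings."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf(docs, n_max=2):
    """Map each document to a {ngram: tf-idf weight} dict (unsmoothed idf, an assumption)."""
    counts = [Counter(word_ngrams(d, n_max)) for d in docs]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(g for c in counts for g in c)
    n_docs = len(docs)
    return [{g: (tf / sum(c.values())) * math.log(n_docs / df[g])
             for g, tf in c.items()} for c in counts]

def average_probabilities(p_a, p_b):
    """Soft voting (assumed combination rule): average two class-probability dicts."""
    return {label: (p_a[label] + p_b[label]) / 2 for label in p_a}
```

An n-gram that occurs in every document receives weight zero under this formula, which is why production systems usually smooth the IDF term. The averaged probabilities can be turned into a label with `max(avg, key=avg.get)`.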
GlobalTrait: Personality Alignment of Multilingual Word Embeddings
We propose a multilingual model to recognize Big Five Personality traits from
text data in four different languages: English, Spanish, Dutch and Italian. Our
analysis shows that words having a similar semantic meaning in different
languages do not necessarily correspond to the same personality traits.
Therefore, we propose a personality alignment method, GlobalTrait, which has a
mapping for each trait from the source language to the target language
(English), such that words that correlate positively to each trait are close
together in the multilingual vector space. Using these aligned embeddings for
training, we can transfer personality related training features from
high-resource languages such as English to other low-resource languages, and
get better multilingual results, when compared to using simple monolingual and
unaligned multilingual embeddings. We achieve an average F-score increase
(across all three languages except English) from 65 to 73.4 (+8.4), when
comparing our monolingual model to multilingual using CNN with personality
aligned embeddings. We also show relatively good performance in the regression
tasks, and better classification results when evaluating our model on a
separate Chinese dataset.
Comment: Submitted and accepted to AAAI 2019 conferenc
Overview of the 8th Author Profiling Task at PAN 2020: Profiling Fake News Spreaders on Twitter
[EN] This overview presents the Author Profiling shared task at
PAN 2020. The focus of this year's task is on determining whether or not
the author of a Twitter feed is keen to spread fake news. Two have been
the main aims: (i) to show the feasibility of automatically identifying
potential fake news spreaders in Twitter; and (ii) to show the difficulty
of identifying them when they do not limit themselves to just retweet
domain-specific news. For this purpose a corpus with Twitter data has
been provided, covering the English and Spanish languages. Altogether,
the approaches of 66 participants have been evaluated.
First of all we thank the participants: 66 this year, a record in terms of participants at PAN Lab since 2009! We have to thank also Martin Potthast, Matti
Wiegmann, and Nikolay Kolyada for their help with the 66 Virtual Machines in the
TIRA platform. We thank Symanto for sponsoring the ex aequo award for the two best performing systems at the author profiling shared task of this year. The
work of Paolo Rosso was partially funded by the Spanish MICINN under the
research project MISMIS-FAKEnHATE on Misinformation and Miscommunication in social media: FAKE news and HATE speech (PGC2018-096212-B-C31).
The work of Anastasia Giachanou is supported by the SNSF Early Postdoc
Mobility grant under the project Early Fake News Detection on Social Media,
Switzerland (P2TIP2 181441).
Rangel, F.; Giachanou, A.; Ghanem, BHH.; Rosso, P. (2020). Overview of the 8th Author Profiling Task at PAN 2020: Profiling Fake News Spreaders on Twitter. CEUR Workshop Proceedings. 2696:1-18. http://hdl.handle.net/10251/166528S118269
A Word Embeddings based Approach for Author Profiling: Gender and Age Prediction
Author Profiling (AP) is the task of identifying demographic attributes of an author, such as age, gender, location, native language and personality traits, by processing their written texts. AP techniques are used in applications such as literary research, marketing, forensics and security. Researchers have identified various differences in authors' writing styles by analysing various datasets, and these differences are represented as stylistic features. Researchers have extracted several style-based features (structural, content, word, character, syntactic, readability and semantic) to recognize author profiles, traditionally using various feature combinations to differentiate between profiles. Several existing works used Machine Learning (ML) methods to predict the characteristics of a new author, and achieved good accuracies by combining stylistic features with ML algorithms. Recently, with the advent of Deep Learning (DL) techniques, researchers have proposed DL-based approaches to author profiling. A few researchers found that deep learning techniques perform better at author profile prediction than style-based features. In this work, a word-embedding-based approach is proposed for gender and age prediction. The experiments were conducted with different word embedding models (Word2Vec, GloVe, FastText and BERT) to generate word vectors. Documents are converted to vectors using a document representation technique that uses the embeddings of their words. The document vectors are passed to three ML algorithms, Extreme Gradient Boosting (XGBoost), Random Forest (RF) and Logistic Regression (LR), to train the models.
These models are then used to predict age and gender. The XGBoost classifier with BERT word embeddings achieved better accuracies for age and gender prediction than the other word embeddings and ML algorithms. The experiments were run on the PAN 2014 competition Reviews dataset. The proposed approach attained higher accuracies for predicting age and gender than various existing approaches proposed for AP.
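The abstract does not say which document representation technique is used. One common choice, shown here only as an assumption, is to average the embeddings of the words in a document. In this minimal sketch, the function name `document_vector` and the toy two-dimensional embedding table are hypothetical and do not come from the paper.

```python
def document_vector(tokens, embeddings, dim):
    """Average the embeddings of known tokens; return a zero vector if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    # zip(*vectors) iterates over dimensions, so each column is averaged separately.
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Toy embedding table (hypothetical values, for illustration only).
toy_embeddings = {"good": [1.0, 0.0], "hotel": [0.0, 1.0]}
review_vector = document_vector(["good", "hotel", "unknownword"], toy_embeddings, dim=2)
```

The resulting fixed-length vector is what would be handed to a classifier such as XGBoost, Random Forest or Logistic Regression. Out-of-vocabulary tokens are simply skipped in this sketch.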
Phonetic Detection for Hate Speech Spreaders on Twitter
Nowadays, hate messages have become the object of study on social media. Efficient and effective
detection of hate profiles requires various scientific disciplines, such as computational linguistics and
sociology. Here, we illustrate how we used lexical and phonetic features to determine if the author
spreads hate speech. This article presents a novel strategy for the characterization of the Twitter profile
based on the generation of lexical and phonetic user features that serve as input to a set of classifiers.
The results are part of our participation in PAN 2021 at CLEF, in the task on Profiling Hate Speech
Spreaders on Twitter.