Multilingual Twitter Sentiment Classification: The Role of Human Annotators
What are the limits of automated Twitter sentiment classification? We analyze
a large set of manually labeled tweets in different languages, use them as
training data, and construct automated classification models. It turns out that
the quality of classification models depends much more on the quality and size
of training data than on the type of the model trained. Experimental results
indicate that there is no statistically significant difference between the
performance of the top classification models. We quantify the quality of
training data by applying various annotator agreement measures, and identify
the weakest points of different datasets. We show that the model performance
approaches the inter-annotator agreement when the size of the training set is
sufficiently large. However, it is crucial to regularly monitor the self- and
inter-annotator agreements since this improves the training datasets and
consequently the model performance. Finally, we show that there is strong evidence that humans perceive the sentiment classes (negative, neutral, and positive) as ordered.
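As a minimal sketch of the agreement measures mentioned above (not the paper's actual setup; the label encoding and example annotations are illustrative assumptions), inter-annotator agreement on ordered sentiment classes can be quantified with Cohen's kappa, including a weighted variant that respects the negative < neutral < positive ordering:

```python
# Sketch: self-/inter-annotator agreement on ordered sentiment labels.
# The example annotations are made up; only the measures are standard.
from sklearn.metrics import cohen_kappa_score

# Encode the ordered classes as integers: negative < neutral < positive.
LABELS = {"negative": 0, "neutral": 1, "positive": 2}

annotator_a = ["negative", "neutral", "positive", "positive", "neutral"]
annotator_b = ["negative", "positive", "positive", "neutral", "neutral"]

a = [LABELS[x] for x in annotator_a]
b = [LABELS[x] for x in annotator_b]

# Plain (nominal) kappa treats every disagreement as equally severe.
print("nominal kappa :", cohen_kappa_score(a, b))

# Linearly weighted kappa penalises negative/positive confusions more than
# neighbouring-class confusions, i.e. it treats the classes as ordered.
print("weighted kappa:", cohen_kappa_score(a, b, weights="linear"))
```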
A comparison of classification models to detect cyberbullying in the Peruvian Spanish language on Twitter
Cyberbullying is a social problem in which bullies’
actions are more harmful than in traditional forms of bullying as
they have the power to repeatedly humiliate the victim in front of
an entire community through social media. Nowadays, multiple
works aim at detecting acts of cyberbullying via the analysis of
texts in social media publications written in one or more
languages; however, few investigations target cyberbullying detection in Spanish. In this work, we compare the performance of four traditional supervised machine learning methods in detecting cyberbullying, via the identification of four cyberbullying-related categories, in Twitter posts written in Peruvian Spanish. Specifically, we trained and tested Naive Bayes, Multinomial Logistic Regression, Support Vector Machine, and Random Forest classifiers on a dataset manually annotated with the help of human participants. The results indicate that the best-performing classifier for the cyberbullying detection task was the Support Vector Machine.
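A hedged sketch of the comparison described above: the corpus, features, and evaluation protocol below are placeholder assumptions rather than the authors' exact pipeline, but they show how the four classifiers can be trained and scored on TF-IDF features with scikit-learn.

```python
# Sketch: compare Naive Bayes, multinomial logistic regression, SVM, and
# random forest on a manually annotated tweet dataset (placeholder data).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder corpus: (tweet text, cyberbullying category) pairs.
texts = ["ejemplo de tuit uno", "ejemplo de tuit dos", "otro tuit", "y otro mas"] * 25
labels = ["none", "insult", "none", "threat"] * 25

models = {
    "naive_bayes": MultinomialNB(),
    # LogisticRegression is multinomial by default for multi-class problems.
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": LinearSVC(),
    "random_forest": RandomForestClassifier(n_estimators=200),
}

for name, clf in models.items():
    pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    scores = cross_val_score(pipeline, texts, labels, cv=5, scoring="f1_macro")
    print(f"{name:>20}: macro-F1 = {scores.mean():.3f}")
```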
Computational Sociolinguistics: A Survey
Language is a social phenomenon and variation is inherent to its social
nature. Recently, there has been a surge of interest within the computational
linguistics (CL) community in the social dimension of language. In this article
we present a survey of the emerging field of "Computational Sociolinguistics"
that reflects this increased interest. We aim to provide a comprehensive
overview of CL research on sociolinguistic themes, featuring topics such as the
relation between language and social identity, language use in social
interaction and multilingual communication. Moreover, we demonstrate the
potential for synergy between the research communities involved, by showing how
the large-scale data-driven methods that are widely used in CL can complement
existing sociolinguistic studies, and how sociolinguistics can inform and
challenge the methods and assumptions employed in CL studies. We hope to convey
the possible benefits of a closer collaboration between the two communities and
conclude with a discussion of open challenges.Comment: To appear in Computational Linguistics. Accepted for publication:
18th February, 201
Assessing the impact of contextual information in hate speech detection
In recent years, hate speech has gained great relevance in social networks
and other virtual media because of its intensity and its relationship with
violent acts against members of protected groups. Due to the large amount of content generated by users, considerable effort has gone into the research and development of automatic tools to aid the analysis and moderation of this speech, at least in its most threatening forms. One of the limitations of current approaches to automatic hate speech detection is the lack of context. Most studies and resources work on data without context, that is, isolated messages with no conversational thread or topic of discussion. This restricts the information available for deciding whether a post on a social network is hateful. In this work, we provide a novel corpus for
contextualized hate speech detection based on user responses to news posts from
media outlets on Twitter. This corpus was collected in the Rioplatense
dialectal variety of Spanish and focuses on hate speech associated with the
COVID-19 pandemic. Classification experiments using state-of-the-art techniques
show evidence that adding contextual information improves hate speech detection
performance for two proposed tasks (binary and multi-label prediction). We make
our code, models, and corpus available for further research.
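One simple way to make the role of context concrete (not necessarily the setup used in the paper; the data and separator convention below are assumptions) is to compare a reply-only classifier against one that also sees the news post the reply responds to:

```python
# Sketch: does prepending the news post (context) to the reply help?
# Placeholder data; the real corpus is the Rioplatense Spanish one above.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

news = [f"titular {i} sobre la pandemia" for i in range(100)]
replies = [f"respuesta {i} del usuario" for i in range(100)]
labels = ([0] * 50) + ([1] * 50)   # 1 = hateful, 0 = not hateful (binary task)

def evaluate(texts):
    clf = make_pipeline(TfidfVectorizer(), LinearSVC())
    return cross_val_score(clf, texts, labels, cv=5, scoring="f1").mean()

# Context-unaware: the reply alone.
print("reply only        :", evaluate(replies))

# Context-aware: news post and reply concatenated with a separator token.
with_context = [f"{n} [SEP] {r}" for n, r in zip(news, replies)]
print("news post + reply :", evaluate(with_context))
```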
Multilingual Text Classification from Twitter during Emergencies
Social media such as Twitter are a valuable source of information due to their diffusion among citizens and their speed in sharing data worldwide. However, it is challenging to automatically extract information from such data, given the huge amount of useless content. We propose a multilingual tool that automatically categorizes tweets according to their information content. To achieve real-time classification while supporting any language, we apply a deep learning classifier using multilingual word embeddings. This allows our solution to be trained on one language and applied to any other language via zero-shot inference, with acceptable performance loss.
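The cross-lingual transfer described above can be approximated with off-the-shelf multilingual sentence embeddings: train a classifier on tweets in one language and apply it unchanged to another. The embedding model, data, and linear classifier below are illustrative assumptions, not the tool's actual architecture.

```python
# Sketch: train on English tweets, predict on an Italian one, via a shared
# multilingual embedding space (placeholder data and labels).
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

train_texts = ["Flooding reported downtown", "Lovely weather today"]
train_labels = ["informative", "not_informative"]

test_texts = ["Segnalata un'alluvione in centro"]  # Italian, unseen at training time

clf = LogisticRegression(max_iter=1000)
clf.fit(encoder.encode(train_texts), train_labels)

# Zero-shot inference: the Italian tweet is embedded into the same space.
print(clf.predict(encoder.encode(test_texts)))
```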
Detecting Portuguese and English Twitter users' gender
Existing social networking services provide means for people to communicate and express
their feelings in an easy way. Such user-generated content contains clues about users' behaviors and preferences, as well as other metadata that is now available for scientific research.
Twitter, in particular, has become a relevant source for social networking studies, mainly because:
it provides a simple way for users to express their feelings, ideas, and opinions; makes
the user-generated content and associated metadata available to the community; and furthermore provides easy-to-use web interfaces and application programming interfaces (APIs) to access
data. For many studies, the available information about a user is relevant. However, the gender
attribute is not provided when creating a Twitter account.
The main focus of this study is to infer the users’ gender from other available information.
We propose a methodology for gender detection of Twitter users, using unstructured information
found in the Twitter profile, user-generated content, and, later, the user's profile picture.
In previous studies, one of the challenges presented was the labor-intensive task of manually
labelling datasets. In this study, we propose a method for creating extended labelled datasets in
a semi-automatic fashion. With the extended labelled datasets, we associate the users' textual content with their gender and create gender models based on the users' generated content and profile information. We explore supervised and unsupervised classifiers and evaluate the results on both Portuguese and English Twitter user datasets. We obtained an accuracy of 93.2% for English users and 96.9% for Portuguese users. The proposed methodology is language-independent, but our focus was on Portuguese and English users.
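The semi-automatic labelling step could, for example, match the first token of a profile name against gendered name lists; this is an illustrative assumption about the heuristic, and NAME_LISTS is a stand-in for real name/gender dictionaries, not the study's actual resources.

```python
# Sketch: semi-automatically label Twitter users by first name (assumed
# heuristic); ambiguous or unknown names are left for manual review.
NAME_LISTS = {
    "female": {"ana", "maria", "joana", "alice"},
    "male": {"joao", "pedro", "john", "james"},
}

def label_user(profile_name: str):
    """Return 'female'/'male' if the first name is unambiguous, else None."""
    stripped = profile_name.strip()
    first = stripped.split()[0].lower() if stripped else ""
    matches = [g for g, names in NAME_LISTS.items() if first in names]
    return matches[0] if len(matches) == 1 else None

users = [("Maria Silva", "u1"), ("John Smith", "u2"), ("Alex Costa", "u3")]
labelled = {uid: g for name, uid in users if (g := label_user(name))}
print(labelled)  # e.g. {'u1': 'female', 'u2': 'male'}; 'u3' stays unlabelled
```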
Using word and phrase abbreviation patterns to extract age from Twitter microtexts
The wealth of texts available publicly online for analysis is ever increasing. Much work in computational linguistics focuses on syntactic, contextual, morphological, and phonetic analysis of written documents, vocal recordings, or texts on the internet. Twitter messages present a unique challenge for computational linguistic analysis due to their constrained size. The constraint of 140 characters often prompts users to abbreviate words and phrases. Additionally, as an informal writing medium, messages are not expected to adhere to grammatically or orthographically standard English. As such, Twitter messages are noisy and do not necessarily conform to the standard writing conventions of linguistic corpora, often requiring special pre-processing before advanced analysis can be done.

In the area of computational linguistics, there is an interest in determining latent attributes of an author. Attributes such as author gender can be determined with some success from many sources, using various methods, such as analysis of shallow linguistic patterns or topic. Author age is more difficult to determine, but previous research has been somewhat successful at classifying age as a binary (e.g. over or under 30), ternary, or even continuous variable using various techniques. Twitter messages present a difficult problem for latent user attribute analysis, due to the pre-processing necessary for many computational linguistics tasks. An added logistical challenge is that very few latent attributes are explicitly defined by users on Twitter. Twitter messages are part of an enormous data set, but the data set must be independently annotated for latent writer attributes not defined through the Twitter API before any classification on such attributes can be done. The actual classification problem is another particular challenge due to restrictions on tweet length.

Previous work has shown that word and phrase abbreviation patterns used on Twitter can be indicative of some latent user attributes, such as geographic region or the Twitter client (iPhone, Android, Twitter website, etc.) used to make posts. Language change has generally been posited as being driven by women. This study explores whether there are age-related patterns, or changes in those patterns over time, evident in Twitter posts from a variety of English authors. This work presents a growable data set annotated by Twitter users themselves for age and other useful attributes. The study also presents an extension of prior work on Twitter abbreviation patterns which shows that word and phrase abbreviation patterns can be used toward determining user age. Notable results include classification accuracy of up to 83%, which was 63% above the relative majority-class baseline (ZeroR in Weka) when classifying user ages into 6 equally sized age bins using a multilayer perceptron network classifier.
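To make the classification step concrete: abbreviation-pattern counts can be turned into a feature vector and fed to a multilayer perceptron that predicts one of six age bins. The feature set, abbreviation list, and bin assignments below are illustrative assumptions, not the study's exact features or data.

```python
# Sketch: MLP over simple abbreviation-pattern features (placeholder data).
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

ABBREVIATIONS = ["u", "ur", "lol", "omg", "pls", "thx", "gr8", "b4"]

def features(tweet: str) -> list:
    """Count assumed abbreviation tokens and add tweet length as a feature."""
    tokens = tweet.lower().split()
    return [tokens.count(a) for a in ABBREVIATIONS] + [len(tokens)]

tweets = ["omg u r gr8", "thank you for the update", "lol pls no", "see you before four"] * 30
age_bins = [0, 5, 1, 4] * 30   # six possible bins (0-5); made-up assignments

X = np.array([features(t) for t in tweets])
y = np.array(age_bins)

mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
print("accuracy:", cross_val_score(mlp, X, y, cv=5).mean())
```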
#Brexit: Leave or Remain? The Role of User's Community and Diachronic Evolution on Stance Detection
Interest has grown in recent years around the classification of the stance that users assume within online debates. Stance has usually been addressed by considering users' posts in isolation, while social studies highlight that social communities may contribute to influencing users' opinions. Furthermore, stance should be studied from a diachronic perspective, since this could help shed light on the opinion-shift dynamics that can be recorded during a debate. We analyzed the political discussion in the UK about the Brexit referendum on Twitter, proposing a novel approach and annotation schema for stance detection, with the main aim of investigating the role of features related to the social network community and diachronic stance evolution. Classification experiments show that such features provide very useful clues for detecting stance.
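A community-related feature of the kind investigated above can be sketched as follows: detect communities in the users' interaction graph and attach each user's community id as a categorical feature for the stance classifier. The toy graph is an assumption, and the Louvain method via networkx is used here for illustration, not necessarily the authors' implementation.

```python
# Sketch: derive a community-membership feature from a retweet/mention graph
# (toy edges) with the Louvain method, for use alongside textual features.
import networkx as nx

# Toy interaction graph: an edge means user A retweeted or mentioned user B.
edges = [("u1", "u2"), ("u2", "u3"), ("u1", "u3"),
         ("u4", "u5"), ("u5", "u6"), ("u4", "u6"),
         ("u3", "u4")]
G = nx.Graph(edges)

# Louvain community detection on the undirected interaction graph.
communities = nx.community.louvain_communities(G, seed=42)

# Map each user to a community id, usable as a categorical feature.
community_of = {user: cid for cid, members in enumerate(communities) for user in members}
print(community_of)   # e.g. {'u1': 0, 'u2': 0, 'u3': 0, 'u4': 1, ...}
```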