Multilingual Twitter Sentiment Classification: The Role of Human Annotators
What are the limits of automated Twitter sentiment classification? We analyze
a large set of manually labeled tweets in different languages, use them as
training data, and construct automated classification models. It turns out that
the quality of classification models depends much more on the quality and size
of training data than on the type of the model trained. Experimental results
indicate that there is no statistically significant difference between the
performance of the top classification models. We quantify the quality of
training data by applying various annotator agreement measures, and identify
the weakest points of different datasets. We show that the model performance
approaches the inter-annotator agreement when the size of the training set is
sufficiently large. However, it is crucial to regularly monitor the self- and
inter-annotator agreements since this improves the training datasets and
consequently the model performance. Finally, we show that there is strong evidence that humans perceive the sentiment classes (negative, neutral, and positive) as ordered.
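As a minimal sketch of the agreement measures mentioned above (not the paper's actual setup; the label encoding and example annotations are illustrative assumptions), inter-annotator agreement on ordered sentiment classes can be quantified with Cohen's kappa, including a weighted variant that respects the negative < neutral < positive ordering:

```python
# Sketch: self-/inter-annotator agreement on ordered sentiment labels.
# The example annotations are made up; only the measures are standard.
from sklearn.metrics import cohen_kappa_score

# Encode the ordered classes as integers: negative < neutral < positive.
LABELS = {"negative": 0, "neutral": 1, "positive": 2}

annotator_a = ["negative", "neutral", "positive", "positive", "neutral"]
annotator_b = ["negative", "positive", "positive", "neutral", "neutral"]

a = [LABELS[x] for x in annotator_a]
b = [LABELS[x] for x in annotator_b]

# Plain (nominal) kappa treats every disagreement as equally severe.
print("nominal kappa :", cohen_kappa_score(a, b))

# Linearly weighted kappa penalises negative/positive confusions more than
# neighbouring-class confusions, i.e. it treats the classes as ordered.
print("weighted kappa:", cohen_kappa_score(a, b, weights="linear"))
```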
A comparison of classification models to detect cyberbullying in the Peruvian Spanish language on Twitter
Cyberbullying is a social problem in which bullies’
actions are more harmful than in traditional forms of bullying as
they have the power to repeatedly humiliate the victim in front of
an entire community through social media. Nowadays, multiple
works aim at detecting acts of cyberbullying via the analysis of
texts in social media publications written in one or more
languages; however, few investigations target cyberbullying detection in Spanish. In this work, we compare the performance of four traditional supervised machine learning methods in detecting cyberbullying, via the identification of four cyberbullying-related categories, in Twitter posts written in Peruvian Spanish. Specifically, we trained and tested Naive Bayes, Multinomial Logistic Regression, Support Vector Machine, and Random Forest classifiers on a dataset manually annotated with the help of human participants. The results indicate that the best-performing classifier for the cyberbullying detection task was the Support Vector Machine.
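A hedged sketch of the comparison described above: the corpus, features, and evaluation protocol below are placeholder assumptions rather than the authors' exact pipeline, but they show how the four classifiers can be trained and scored on TF-IDF features with scikit-learn.

```python
# Sketch: compare Naive Bayes, multinomial logistic regression, SVM, and
# random forest on a manually annotated tweet dataset (placeholder data).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder corpus: (tweet text, cyberbullying category) pairs.
texts = ["ejemplo de tuit uno", "ejemplo de tuit dos", "otro tuit", "y otro mas"] * 25
labels = ["none", "insult", "none", "threat"] * 25

models = {
    "naive_bayes": MultinomialNB(),
    # LogisticRegression is multinomial by default for multi-class problems.
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": LinearSVC(),
    "random_forest": RandomForestClassifier(n_estimators=200),
}

for name, clf in models.items():
    pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    scores = cross_val_score(pipeline, texts, labels, cv=5, scoring="f1_macro")
    print(f"{name:>20}: macro-F1 = {scores.mean():.3f}")
```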
Computational Sociolinguistics: A Survey
Language is a social phenomenon and variation is inherent to its social
nature. Recently, there has been a surge of interest within the computational
linguistics (CL) community in the social dimension of language. In this article
we present a survey of the emerging field of "Computational Sociolinguistics"
that reflects this increased interest. We aim to provide a comprehensive
overview of CL research on sociolinguistic themes, featuring topics such as the
relation between language and social identity, language use in social
interaction and multilingual communication. Moreover, we demonstrate the
potential for synergy between the research communities involved, by showing how
the large-scale data-driven methods that are widely used in CL can complement
existing sociolinguistic studies, and how sociolinguistics can inform and
challenge the methods and assumptions employed in CL studies. We hope to convey
the possible benefits of a closer collaboration between the two communities and
conclude with a discussion of open challenges.Comment: To appear in Computational Linguistics. Accepted for publication:
18th February, 201
Assessing the impact of contextual information in hate speech detection
In recent years, hate speech has gained great relevance in social networks
and other virtual media because of its intensity and its relationship with
violent acts against members of protected groups. Due to the large amount of content generated by users, considerable effort has gone into the research and development of automatic tools to aid the analysis and moderation of this speech, at least in its most threatening forms. One of the limitations of current approaches to automatic hate speech detection is the lack of context. Most studies and resources work on data without context, that is, isolated messages with no conversational thread or topic of discussion. This restricts the information available for deciding whether a post on a social network is hateful. In this work, we provide a novel corpus for
contextualized hate speech detection based on user responses to news posts from
media outlets on Twitter. This corpus was collected in the Rioplatense
dialectal variety of Spanish and focuses on hate speech associated with the
COVID-19 pandemic. Classification experiments using state-of-the-art techniques
show evidence that adding contextual information improves hate speech detection
performance for two proposed tasks (binary and multi-label prediction). We make
our code, models, and corpus available for further research.
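One simple way to make the role of context concrete (not necessarily the setup used in the paper; the data and separator convention below are assumptions) is to compare a reply-only classifier against one that also sees the news post the reply responds to:

```python
# Sketch: does prepending the news post (context) to the reply help?
# Placeholder data; the real corpus is the Rioplatense Spanish one above.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

news = [f"titular {i} sobre la pandemia" for i in range(100)]
replies = [f"respuesta {i} del usuario" for i in range(100)]
labels = ([0] * 50) + ([1] * 50)   # 1 = hateful, 0 = not hateful (binary task)

def evaluate(texts):
    clf = make_pipeline(TfidfVectorizer(), LinearSVC())
    return cross_val_score(clf, texts, labels, cv=5, scoring="f1").mean()

# Context-unaware: the reply alone.
print("reply only        :", evaluate(replies))

# Context-aware: news post and reply concatenated with a separator token.
with_context = [f"{n} [SEP] {r}" for n, r in zip(news, replies)]
print("news post + reply :", evaluate(with_context))
```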
Multilingual Text Classification from Twitter during Emergencies
Social media such as Twitter are a valuable source of information due to their diffusion among citizens and their speed in sharing data worldwide. However, it is challenging to automatically extract information from such data, given the huge amount of useless content. We propose a multilingual tool that automatically categorizes tweets according to their information content. To achieve real-time classification while supporting any language, we apply a deep learning classifier using multilingual word embeddings. This allows our solution to be trained on one language and applied to any other language via zero-shot inference, with acceptable performance loss.
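The cross-lingual transfer described above can be approximated with off-the-shelf multilingual sentence embeddings: train a classifier on tweets in one language and apply it unchanged to another. The embedding model, data, and linear classifier below are illustrative assumptions, not the tool's actual architecture.

```python
# Sketch: train on English tweets, predict on an Italian one, via a shared
# multilingual embedding space (placeholder data and labels).
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

train_texts = ["Flooding reported downtown", "Lovely weather today"]
train_labels = ["informative", "not_informative"]

test_texts = ["Segnalata un'alluvione in centro"]  # Italian, unseen at training time

clf = LogisticRegression(max_iter=1000)
clf.fit(encoder.encode(train_texts), train_labels)

# Zero-shot inference: the Italian tweet is embedded into the same space.
print(clf.predict(encoder.encode(test_texts)))
```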
Detecting Portuguese and English Twitter users' gender
Existing social networking services provide means for people to communicate and express
their feelings in an easy way. Such user-generated content contains clues about users' behaviors and preferences, as well as other metadata that is now available for scientific research.
Twitter, in particular, has become a relevant source for social networking studies, mainly because:
it provides a simple way for users to express their feelings, ideas, and opinions; makes
the user-generated content and associated metadata available to the community; and furthermore provides easy-to-use web interfaces and application programming interfaces (APIs) to access
data. For many studies, the available information about a user is relevant. However, the gender
attribute is not provided when creating a Twitter account.
The main focus of this study is to infer the users’ gender from other available information.
We propose a methodology for gender detection of Twitter users, using unstructured information
found in the Twitter profile, user-generated content, and, later, the user's profile picture.
In previous studies, one of the challenges presented was the labor-intensive task of manually
labelling datasets. In this study, we propose a method for creating extended labelled datasets in
a semi-automatic fashion. With the extended labelled datasets, we associate the users' textual content with their gender and create gender models based on the users' generated content and profile information. We explore supervised and unsupervised classifiers and evaluate the results on both Portuguese and English Twitter user datasets. We obtained an accuracy of 93.2% for English users and 96.9% for Portuguese users. The proposed methodology is language-independent, but our focus was on Portuguese and English users.
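The semi-automatic labelling step could, for example, match the first token of a profile name against gendered name lists; this is an illustrative assumption about the heuristic, and NAME_LISTS is a stand-in for real name/gender dictionaries, not the study's actual resources.

```python
# Sketch: semi-automatically label Twitter users by first name (assumed
# heuristic); ambiguous or unknown names are left for manual review.
NAME_LISTS = {
    "female": {"ana", "maria", "joana", "alice"},
    "male": {"joao", "pedro", "john", "james"},
}

def label_user(profile_name: str):
    """Return 'female'/'male' if the first name is unambiguous, else None."""
    stripped = profile_name.strip()
    first = stripped.split()[0].lower() if stripped else ""
    matches = [g for g, names in NAME_LISTS.items() if first in names]
    return matches[0] if len(matches) == 1 else None

users = [("Maria Silva", "u1"), ("John Smith", "u2"), ("Alex Costa", "u3")]
labelled = {uid: g for name, uid in users if (g := label_user(name))}
print(labelled)  # e.g. {'u1': 'female', 'u2': 'male'}; 'u3' stays unlabelled
```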
Using word and phrase abbreviation patterns to extract age from Twitter microtexts
The wealth of texts available publicly online for analysis is ever increasing. Much work in computational linguistics focuses on syntactic, contextual, morphological, and phonetic analysis of written documents, vocal recordings, or texts on the internet. Twitter messages present a unique challenge for computational linguistic analysis due to their constrained size. The constraint of 140 characters often prompts users to abbreviate words and phrases. Additionally, as an informal writing medium, messages are not expected to adhere to grammatically or orthographically standard English. As such, Twitter messages are noisy and do not necessarily conform to the standard writing conventions of linguistic corpora, often requiring special pre-processing before advanced analysis can be done.

In the area of computational linguistics, there is an interest in determining latent attributes of an author. Attributes such as author gender can be determined with some success from many sources, using various methods, such as analysis of shallow linguistic patterns or topic. Author age is more difficult to determine, but previous research has been somewhat successful at classifying age as a binary (e.g. over or under 30), ternary, or even continuous variable using various techniques. Twitter messages present a difficult problem for latent user attribute analysis, due to the pre-processing necessary for many computational linguistics tasks. An added logistical challenge is that very few latent attributes are explicitly defined by users on Twitter. Twitter messages are part of an enormous data set, but the data set must be independently annotated for latent writer attributes not defined through the Twitter API before any classification on such attributes can be done. The actual classification problem is another particular challenge due to restrictions on tweet length.

Previous work has shown that word and phrase abbreviation patterns used on Twitter can be indicative of some latent user attributes, such as geographic region or the Twitter client (iPhone, Android, Twitter website, etc.) used to make posts. Language change has generally been posited as being driven by women. This study explores whether there are age-related patterns, or changes in those patterns over time, evident in Twitter posts from a variety of English authors. This work presents a growable data set annotated by Twitter users themselves for age and other useful attributes. The study also presents an extension of prior work on Twitter abbreviation patterns which shows that word and phrase abbreviation patterns can be used toward determining user age. Notable results include classification accuracy of up to 83%, which was 63% above the relative majority-class baseline (ZeroR in Weka) when classifying user ages into 6 equally sized age bins using a multilayer perceptron network classifier.
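To make the classification step concrete: abbreviation-pattern counts can be turned into a feature vector and fed to a multilayer perceptron that predicts one of six age bins. The feature set, abbreviation list, and bin assignments below are illustrative assumptions, not the study's exact features or data.

```python
# Sketch: MLP over simple abbreviation-pattern features (placeholder data).
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

ABBREVIATIONS = ["u", "ur", "lol", "omg", "pls", "thx", "gr8", "b4"]

def features(tweet: str) -> list:
    """Count assumed abbreviation tokens and add tweet length as a feature."""
    tokens = tweet.lower().split()
    return [tokens.count(a) for a in ABBREVIATIONS] + [len(tokens)]

tweets = ["omg u r gr8", "thank you for the update", "lol pls no", "see you before four"] * 30
age_bins = [0, 5, 1, 4] * 30   # six possible bins (0-5); made-up assignments

X = np.array([features(t) for t in tweets])
y = np.array(age_bins)

mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
print("accuracy:", cross_val_score(mlp, X, y, cv=5).mean())
```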
#Brexit: Leave or Remain? The Role of User's Community and Diachronic Evolution on Stance Detection
Interest has grown in recent years around the classification of the stance that users assume within online debates. Stance has usually been addressed by considering users' posts in isolation, while social studies highlight that social communities may contribute to influencing users' opinions. Furthermore, stance should be studied from a diachronic perspective, since this could help shed light on the opinion-shift dynamics that can be recorded during a debate. We analyzed the political discussion in the UK about the Brexit referendum on Twitter, proposing a novel approach and annotation schema for stance detection, with the main aim of investigating the role of features related to the social network community and diachronic stance evolution. Classification experiments show that such features provide very useful clues for detecting stance.
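A community-related feature of the kind investigated above can be sketched as follows: detect communities in the users' interaction graph and attach each user's community id as a categorical feature for the stance classifier. The toy graph is an assumption, and the Louvain method via networkx is used here for illustration, not necessarily the authors' implementation.

```python
# Sketch: derive a community-membership feature from a retweet/mention graph
# (toy edges) with the Louvain method, for use alongside textual features.
import networkx as nx

# Toy interaction graph: an edge means user A retweeted or mentioned user B.
edges = [("u1", "u2"), ("u2", "u3"), ("u1", "u3"),
         ("u4", "u5"), ("u5", "u6"), ("u4", "u6"),
         ("u3", "u4")]
G = nx.Graph(edges)

# Louvain community detection on the undirected interaction graph.
communities = nx.community.louvain_communities(G, seed=42)

# Map each user to a community id, usable as a categorical feature.
community_of = {user: cid for cid, members in enumerate(communities) for user in members}
print(community_of)   # e.g. {'u1': 0, 'u2': 0, 'u3': 0, 'u4': 1, ...}
```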