725 research outputs found

    Spanish Corpora of tweets about COVID-19 vaccination for automatic stance detection

    The paper presents new annotated corpora for performing stance detection on Spanish Twitter data, most notably health-related tweets. The objectives of this research are threefold: (1) to develop a manually annotated benchmark corpus for emotion recognition taking into account different variants of Spanish in social posts; (2) to evaluate the efficiency of semi-supervised models for extending such a corpus with unlabelled posts; and (3) to describe such short-text corpora via specialised topic modelling. A corpus of 2,801 tweets about COVID-19 vaccination was annotated by three native speakers as in favour (904), against (674) or neither (1,223), with a Fleiss’ kappa score of 0.725. Results show that the self-training method with an SVM base estimator can alleviate annotation work while ensuring high model performance. The self-training model outperformed the other approaches and produced a corpus of 11,204 tweets with a macro-averaged F1 score of 0.94. The combination of sentence-level deep learning embeddings and density-based clustering was applied to explore the contents of both corpora. Topic quality was measured in terms of the trustworthiness and the validation index.
    Funding: Agencia Estatal de Investigación (Ref. PID2020–113673RB-I00); Xunta de Galicia (Ref. ED431C2018/55); Fundação para a Ciência e a Tecnologia (Ref. UIDB/04469/2020). Open-access publication funded by Universidade de Vigo/CISU
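The abstract above reports inter-annotator agreement as a Fleiss' kappa score. As a rough illustration of how such a score is computed, the following is a minimal sketch of the standard Fleiss' kappa formula. The function name, the toy ratings and the label strings are assumptions made for this example; they are not taken from the paper or its corpus.

```python
from collections import Counter

# Hypothetical label set mirroring the three stance classes in the abstract.
LABELS = ["favour", "against", "neither"]


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of items, each rated by the same number of raters.

    ratings: list of lists; ratings[i] holds the labels given to item i.
    Note: the score is undefined (division by zero) if every rating
    falls into a single category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][c] = number of raters who assigned category c to item i
    counts = [Counter(item) for item in ratings]
    # Observed agreement per item, then averaged over items
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items
    # Expected agreement from the overall category proportions
    p_cats = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_exp = sum(p ** 2 for p in p_cats)
    return (p_bar - p_exp) / (1 - p_exp)


# Toy data (made up): four tweets, three annotators each.
toy_ratings = [
    ["favour", "favour", "favour"],
    ["against", "against", "against"],
    ["neither", "neither", "neither"],
    ["favour", "favour", "against"],
]
print(round(fleiss_kappa(toy_ratings, LABELS), 3))  # 0.745
```

With three unanimous items and one 2-to-1 split, the toy data yields 35/47, about 0.745.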

    Multilingual Stance Detection in Social Media Political Debates

    Stance Detection is the task of automatically determining whether the author of a text is in favor, against, or neutral towards a given target. In this paper we investigate the portability of tools performing this task across different languages, by analyzing the results achieved by a Stance Detection system (i.e. MultiTACOS) trained and tested in a multilingual setting. First, a set of resources on political topics for English, French, Italian, Spanish and Catalan is provided, which includes novel corpora collected for the purpose of this study and benchmark corpora used in Stance Detection tasks and evaluation exercises known in the literature. We focus in particular on the novel corpora by describing their development and by comparing them with the benchmarks. Second, MultiTACOS is applied with different sets of features especially designed for Stance Detection, with a specific focus on exploring and combining both features based on the textual content of the tweet (e.g., style and affective load) and features based on contextual information that do not emerge directly from the text. Finally, to better highlight the contribution of the features that most positively affect system performance in the multilingual setting, a feature analysis is provided, together with a qualitative analysis of the misclassified tweets for each of the observed languages, intended to reflect on the open challenges.
    Cristina Bosco and Viviana Patti are partially supported by Progetto di Ateneo/CSP 2016 (Immigrants, Hate and Prejudice in Social Media, S1618_L2_BOSC_01). The work of Paolo Rosso was partially funded by the Spanish MICINN under the research project MISMIS-FAKEnHATE on MISinformation and MIScommunication in social media: FAKE news and HATE speech (PGC2018096212-B-C31).
    Lai, M.; Cignarella, AT.; Hernandez-Farias, DI.; Bosco, C.; Patti, V.; Rosso, P. (2020). Multilingual Stance Detection in Social Media Political Debates. Computer Speech & Language. 63:1-27. https://doi.org/10.1016/j.csl.2020.101075

    MGTAB: A Multi-Relational Graph-Based Twitter Account Detection Benchmark

    The development of social media user stance detection and bot detection methods relies heavily on large-scale and high-quality benchmarks. However, in addition to low annotation quality, existing benchmarks generally have incomplete user relationships, suppressing graph-based account detection research. To address these issues, we propose a Multi-Relational Graph-Based Twitter Account Detection Benchmark (MGTAB), the first standardized graph-based benchmark for account detection. To our knowledge, MGTAB was built based on the largest original data in the field, with over 1.55 million users and 130 million tweets. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diversified relations. In MGTAB, we extracted the 20 user property features with the greatest information gain and user tweet features as the user features. In addition, we performed a thorough evaluation of MGTAB and other public datasets. Our experiments found that graph-based approaches are generally more effective than feature-based approaches and perform better when introducing multiple relations. By analyzing experiment results, we identify effective approaches for account detection and provide potential future research directions in this field. Our benchmark and standardized evaluation procedures are freely available at: https://github.com/GraphDetec/MGTAB.
    Comment: 14 pages, 7 figures
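The MGTAB abstract mentions keeping the user property features with the greatest information gain. The sketch below shows one common way to rank discrete features by entropy-based information gain. It is an illustration under stated assumptions, not the authors' procedure: the feature names (`verified`, `has_bio`), the labels and the helper functions are hypothetical, and real-valued features would first need to be binned.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def top_k_features(features, labels, k):
    """Return the names of the k features with the highest information gain."""
    scores = {
        name: information_gain(values, labels) for name, values in features.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data (made up): four accounts, two hypothetical binary properties.
toy_labels = ["bot", "bot", "human", "human"]
toy_features = {
    "verified": [0, 0, 1, 1],  # separates the classes perfectly
    "has_bio": [0, 1, 0, 1],   # carries no information about the class
}
print(top_k_features(toy_features, toy_labels, k=1))  # ['verified']
```

In this toy case `verified` has an information gain of 1 bit and `has_bio` of 0 bits, so only `verified` is kept when `k=1`.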

    Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate Label Spaces

    We combine multi-task learning and semi-supervised learning by inducing a joint embedding space between disparate label spaces and learning transfer functions between label embeddings, enabling us to jointly leverage unlabelled data and auxiliary, annotated datasets. We evaluate our approach on a variety of sequence classification tasks with disparate label spaces. We outperform strong single and multi-task baselines and achieve a new state-of-the-art for topic-based sentiment analysis.
    Comment: To appear at NAACL 2018 (long