
    Methods for improving entity linking and exploiting social media messages across crises

    Entity Linking (EL) is the task of automatically identifying entity mentions in text and resolving them to a corresponding entity in a reference knowledge base (KB). A large number of tools are available for different types of documents and domains; however, the entity linking literature has shown that a tool's quality varies across corpora and depends on specific characteristics of the corpus it is applied to. Moreover, a lack of precision on particularly ambiguous mentions often spoils the usefulness of automated disambiguation results in real-world applications. In the first part of this thesis I explore an approximation of the difficulty of linking entity mentions and frame it as a supervised classification task. Classifying difficult-to-disambiguate entity mentions can help identify critical cases in a semi-automated system, while detecting latent corpus characteristics that affect entity linking performance. Moreover, despite the large number of entity linking tools proposed over the past years, some tools work better on short mentions while others perform better when there is more contextual information. To this end, I propose a solution that exploits results from distinct entity linking tools on the same corpus by leveraging their individual strengths on a per-mention basis. The proposed solution proved effective and outperformed the individual entity linking systems employed in a series of experiments. An important component of most entity linking tools is the probability that a mention links to a given entity in a reference knowledge base, and this probability is usually computed over a static snapshot of the KB. However, an entity's popularity is temporally sensitive and may change due to short-term events. These changes may then be reflected in the KB, and EL tools can produce different results for a given mention at different times.
I investigated how this prior probability changes over time and measured the overall disambiguation performance using KBs from different time periods. The second part of this thesis is mainly concerned with short texts. Social media has become an integral part of modern society. Twitter, for instance, is one of the most popular social media platforms in the world, enabling people to share their opinions and post short messages about any subject on a daily basis. First, I present an approach to identifying informative messages during catastrophic events using deep learning techniques. Automatically detecting informative messages posted by users during major events can enable professionals involved in crisis management to better estimate damage using only relevant information posted on social media channels, as well as to act immediately. I also performed an analysis of Twitter messages posted during the COVID-19 pandemic: I collected 4 million tweets posted in Portuguese since the beginning of the pandemic and analysed the debate around it, using topic modeling, sentiment analysis, and hashtag recommendation techniques to provide insights into the online discussion of the pandemic.
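    The mention-to-entity prior discussed above is commonly estimated from anchor-text statistics in a KB snapshot, which is why two snapshots taken at different times can yield different disambiguation results. A minimal sketch of that estimation, assuming anchor pairs as input (the `build_link_priors` helper and the toy snapshot are illustrative, not the thesis's actual implementation):

    ```python
    from collections import Counter, defaultdict

    def build_link_priors(anchor_pairs):
        """Estimate P(entity | mention) from (mention, entity) anchor pairs,
        e.g. hyperlink anchors extracted from a KB snapshot."""
        counts = defaultdict(Counter)
        for mention, entity in anchor_pairs:
            counts[mention.lower()][entity] += 1
        priors = {}
        for mention, entity_counts in counts.items():
            total = sum(entity_counts.values())
            priors[mention] = {e: c / total for e, c in entity_counts.items()}
        return priors

    # Toy snapshot: anchors observed at a different crawl date would yield
    # different priors, which is the temporal effect studied in the thesis.
    snapshot = [("Paris", "Paris,_France"), ("paris", "Paris,_France"),
                ("Paris", "Paris_Hilton")]
    priors = build_link_priors(snapshot)
    # priors["paris"]["Paris,_France"] == 2/3
    ```

    Recomputing the same counts over snapshots from different periods makes the temporal drift in the prior directly observable.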

    On the development of an information system for monitoring user opinion and its role for the public

    Social media services and analytics platforms are growing rapidly. A large number of events happen almost every day, and the role of social media monitoring tools is increasing accordingly. Social networks are widely used for managing and promoting brands and services; thus, the most popular social analytics platforms target business purposes, while the monitoring of social, economic, and political problems remains underrepresented and not covered by thorough research. Moreover, most of these platforms focus on resource-rich languages such as English, whereas texts and comments in low-resource languages, such as Russian and Kazakh, are not well represented on social media. This work is therefore devoted to developing and applying an information system called the OMSystem for analyzing users' opinions on news portals, blogs, and social networks in Kazakhstan. The system uses sentiment dictionaries for the Russian and Kazakh languages and machine learning algorithms to determine the sentiment of social media texts. The structure and functionality of the system are also presented. The experimental part is devoted to building machine learning models for sentiment analysis on Russian and Kazakh datasets. The performance of the models is then evaluated with accuracy, precision, recall, and F1-score metrics, and the models with the highest scores are selected for implementation in the OMSystem. The OMSystem's social analytics module is then used to thoroughly analyze the healthcare, political, and social aspects of the most relevant topics connected with vaccination against the coronavirus disease. The analysis allowed us to discover the public social mood in the cities of Almaty and Nur-Sultan and other large regional cities of Kazakhstan. The study covered two extensive periods: 10-01-2021 to 30-05-2021 and 01-07-2021 to 12-08-2021.
In the obtained results, people's moods and attitudes towards the Government's policies and actions were studied using social network indicators such as the level of topic discussion activity in society, the level of interest in the topic, and the mood level of society. These indicators, calculated by the OMSystem, allowed careful identification of alarming public factors (negative attitudes towards government regulations and vaccination policies, low trust in vaccination, etc.) and assessment of the social mood.
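    The model-selection metrics named above (precision, recall, F1-score) can be computed per sentiment class directly from raw predictions. A minimal, dependency-free sketch, assuming string class labels (the `prf1` helper and the toy labels are hypothetical, not the OMSystem's code):

    ```python
    def prf1(y_true, y_pred, positive):
        """Per-class precision, recall, and F1 for one target label."""
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    # Toy evaluation: four gold labels vs. four model predictions.
    y_true = ["pos", "neg", "pos", "neg"]
    y_pred = ["pos", "pos", "pos", "neg"]
    p, r, f = prf1(y_true, y_pred, "pos")
    # p == 2/3, r == 1.0, f == 0.8
    ```

    Averaging these per-class scores (macro-averaging) gives a single number for ranking candidate models, as done when choosing which models to deploy.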

    Acquiring and Exploiting Lexical Knowledge for Twitter Sentiment Analysis

    The most popular sentiment analysis task on Twitter is the automatic classification of tweets into sentiment categories such as positive, negative, and neutral. State-of-the-art solutions to this problem are based on supervised machine learning models trained from manually annotated examples. These models are affected by label sparsity, because the manual annotation of tweets is labour-intensive and time-consuming. This thesis addresses the label sparsity problem for Twitter polarity classification by automatically building two types of resources that can be exploited when labelled data is scarce: opinion lexicons, which are lists of words labelled by sentiment, and synthetically labelled tweets. In the first part of the thesis, we induce Twitter-specific opinion lexicons by training word-level classifiers using representations that exploit different sources of information: (a) the morphological information conveyed by part-of-speech (POS) tags, (b) associations between words and the sentiment expressed in the tweets that contain them, and (c) distributional representations calculated from unlabelled tweets. Experimental results show that the induced lexicons produce significant improvements over existing manually annotated lexicons for tweet-level polarity classification. In the second part of the thesis, we develop distant supervision methods for generating synthetic training data for Twitter polarity classification by exploiting unlabelled tweets and prior lexical knowledge. Positive and negative training instances are generated by averaging unlabelled tweets annotated according to a given polarity lexicon. We study different mechanisms for selecting the candidate tweets to be averaged. Our experimental results show that the training data generated by the proposed models produce classifiers that perform significantly better than classifiers trained from tweets annotated with emoticons, a popular distant supervision approach for Twitter sentiment analysis.
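    The distant supervision idea described above, generating synthetic training instances by averaging unlabelled tweets that a polarity lexicon agrees on, can be sketched as follows (the `synth_instances` helper, the toy lexicon, and the tweet vectors are illustrative assumptions, not the thesis's implementation):

    ```python
    import random

    def synth_instances(tweets, vecs, lexicon, k=2, n=3):
        """Create synthetic polarity training instances by averaging the
        vectors of up to k unlabelled tweets sharing a lexicon label."""
        # Annotate each tweet with the majority polarity of its lexicon words.
        pools = {"positive": [], "negative": []}
        for tweet, vec in zip(tweets, vecs):
            pos = sum(1 for w in tweet.split() if lexicon.get(w) == "positive")
            neg = sum(1 for w in tweet.split() if lexicon.get(w) == "negative")
            if pos > neg:
                pools["positive"].append(vec)
            elif neg > pos:
                pools["negative"].append(vec)
        # Average random same-label groups into n synthetic instances per class.
        data = []
        for label, pool in pools.items():
            if not pool:
                continue
            for _ in range(n):
                sample = random.sample(pool, min(k, len(pool)))
                avg = [sum(dim) / len(sample) for dim in zip(*sample)]
                data.append((avg, label))
        return data

    lexicon = {"good": "positive", "great": "positive", "bad": "negative"}
    tweets = ["good day", "great stuff", "bad luck"]
    vecs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    data = synth_instances(tweets, vecs, lexicon, k=2, n=2)
    ```

    The random sampling step is where the thesis's different candidate-selection mechanisms would plug in; averaging smooths out the noise of any single lexicon-annotated tweet.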

    Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

    Peer reviewed

    Challenges and perspectives of hate speech research

    This book is the result of a conference that could not take place. It is a collection of 26 texts that address and discuss the latest developments in international hate speech research from a wide range of disciplinary perspectives. This includes case studies from Brazil, Lebanon, Poland, Nigeria, and India; theoretical introductions to the concepts of hate speech, dangerous speech, incivility, toxicity, extreme speech, and dark participation; as well as reflections on methodological challenges such as scraping, annotation, datafication, implicitness, explainability, and machine learning. As such, it provides a much-needed forum for cross-national and cross-disciplinary conversations in what is currently a very vibrant field of research.

    A multi-disciplinary co-design approach to social media sensemaking with text mining

    This thesis presents the development of a bespoke social media analytics platform called Sentinel using an event-driven co-design approach. The performance and outputs of this system, along with its integration into the routine research methodology of its users, were used to evaluate how an event-driven co-design approach to system design improves the degree to which Social Web data can be converted into actionable intelligence, with respect to robustness, agility, and usability. The thesis includes a systematic review of the state-of-the-art technology that can support real-time text analysis of social media data, used to position the text analysis elements of the Sentinel Pipeline. This is followed by research chapters that focus on combinations of robustness, agility, and usability as themes, covering the iterative development of the system through the event-driven co-design lifecycle. Robustness and agility are covered during initial infrastructure design and early prototyping of bottom-up and top-down semantic enrichment. Robustness and usability are then considered during the development of the Semantic Search component of the Sentinel Platform, which exploits the semantic enrichment developed in the prototype, alpha, and beta systems. Finally, agility and usability guide the work of building on the Semantic Search functionality to produce a data download capability for rapidly collecting corpora for further qualitative research. These iterations are evaluated through a number of case studies undertaken in conjunction with a wider research programme, in the field of crime and security, that the Sentinel platform was designed to support. The findings from these case studies feed back into the co-design process to inform how development should evolve.
As part of this research programme, the Sentinel platform has supported the production of a number of research papers authored by stakeholders, highlighting the impact the system has had in the field of crime and security research.

    Directions in abusive language training data, a systematic review: Garbage in, garbage out

    Data-driven and machine learning based approaches for detecting, categorising, and measuring abusive content such as hate speech and harassment have gained traction due to their scalability, robustness, and increasingly high performance. Making effective detection systems for abusive content relies on having the right training datasets, reflecting a widely accepted mantra in computer science: Garbage In, Garbage Out. However, creating training datasets which are large, varied, theoretically informed, and that minimize biases is difficult, laborious, and requires deep expertise. This paper systematically reviews 63 publicly available training datasets which have been created to train abusive language classifiers. It also reports on the creation of hatespeechdata.com, a dedicated website for cataloguing abusive language datasets. We discuss the challenges and opportunities of open science in this field, and argue that although more dataset sharing would bring many benefits, it also poses social and ethical risks which need careful consideration. Finally, we provide evidence-based recommendations for practitioners creating new abusive content training datasets.