6 research outputs found

    TweetsCOV19 -- A Knowledge Base of Semantically Annotated Tweets about the COVID-19 Pandemic

    Full text link
    Publicly available social media archives facilitate research in the social sciences and provide corpora for training and testing a wide range of machine learning and natural language processing methods. With respect to the recent outbreak of the Coronavirus disease 2019 (COVID-19), online discourse on Twitter reflects public opinion and perception related to the pandemic itself as well as mitigating measures and their societal impact. Understanding such discourse, its evolution, and interdependencies with real-world events or (mis)information can foster valuable insights. On the other hand, such corpora are crucial facilitators for computational methods addressing tasks such as sentiment analysis, event detection, or entity recognition. However, obtaining, archiving, and semantically annotating large amounts of tweets is costly. In this paper, we describe TweetsCOV19, a publicly available knowledge base of currently more than 8 million tweets, spanning October 2019 - April 2020. Metadata about the tweets as well as extracted entities, hashtags, user mentions, sentiments, and URLs are exposed using established RDF/S vocabularies, providing an unprecedented knowledge base for a range of knowledge discovery tasks. Next to a description of the dataset and its extraction and annotation process, we present an initial analysis and use cases of the corpus

    Apollo: Twitter stream analyzer of trending hashtags: A case-study of #COVID-19

    Get PDF
    This poster introduces a new tool named Apollo which analyzes textual information in the geo-tagged twitter streams of trending hashtags using sliding time window. It performs sentiment analysis as well as emotion detection of the opinions of the masses about a trending world wide topic such as #COVID-19, #ClimateChange, #Black-LivesMatter, etc. based on Knowledge Graphs. Apollo currently pro- vides an interactive visualization of the analysis of the trending hashtag #COVID-19

    Analyzing and Visualizing Twitter Streams based on Trending Hashtags

    Get PDF

    Strategies for the analysis of large social media corpora: sampling and keyword extraction methods

    Get PDF
    In the context of the COVID-19 pandemic, social media platforms such as Twitter have been of great importance for users to exchange news, ideas, and perceptions. Researchers from fields such as discourse analysis and the social sciences have resorted to this content to explore public opinion and stance on this topic, and they have tried to gather information through the compilation of large-scale corpora. However, the size of such corpora is both an advantage and a drawback, as simple text retrieval techniques and tools may prove to be impractical or altogether incapable of handling such masses of data. This study provides methodological and practical cues on how to manage the contents of a large-scale social media corpus such as Chen et al. (JMIR Public Health Surveill 6(2):e19273, 2020) COVID-19 corpus. We compare and evaluate, in terms of efficiency and efficacy, available methods to handle such a large corpus. First, we compare different sample sizes to assess whether it is possible to achieve similar results despite the size difference and evaluate sampling methods following a specific data management approach to storing the original corpus. Second, we examine two keyword extraction methodologies commonly used to obtain a compact representation of the main subject and topics of a text: the traditional method used in corpus linguistics, which compares word frequencies using a reference corpus, and graph-based techniques as developed in Natural Language Processing tasks. The methods and strategies discussed in this study enable valuable quantitative and qualitative analyses of an otherwise intractable mass of social media data.Funding for open access publishing: Universidad de Málaga/CBUA. This work was funded by the Spanish Ministry of Science and Innovation [Grant No. PID2020-115310RB-I00], the Regional Govvernment of Andalusia [Grant No. UMA18-FEDERJA-158] and the Spanish Ministry of Education and Vocational Training [Grant No. FPU 19/04880]. Funding for open access charge: Universidad de Málaga / CBU
    corecore