44 research outputs found

    A Large-Scale COVID-19 Twitter Chatter Dataset for Open Scientific Research-An International Collaboration

    Funding: This work was partially supported by the National Institute of Aging through Stanford University's Stanford Aging and Ethnogeriatrics Transdisciplinary Collaborative Center (SAGE) (award 3P30AG059307-02S1). The work on the collection of Russian tweets was performed by Elena Tutubalina and supported by the Russian Science Foundation (grant number 18-11-00284). As the COVID-19 pandemic continues to spread worldwide, an unprecedented amount of open data is being generated for medical, genetics, and epidemiological research. The unparalleled rate at which many research groups around the world are releasing data and publications on the ongoing pandemic is allowing other scientists to learn from local experiences and data generated on the front lines of the COVID-19 pandemic. However, there is a need to integrate additional data sources that map and measure the role of social dynamics of such a unique worldwide event in biomedical, biological, and epidemiological analyses. For this purpose, we present a large-scale curated dataset of over 1.12 billion tweets, growing daily, related to COVID-19 chatter generated from 1 January 2020 to 27 June 2021 at the time of writing. This dataset is a freely available additional data source that researchers worldwide can use for a wide range of research projects, such as epidemiological analyses, emotional and mental responses to social distancing measures, the identification of sources of misinformation, and stratified measurement of sentiment towards the pandemic in near real time, among many others

    When Silver Is As Good As Gold: Using Weak Supervision to Train Machine Learning Models on Social Media Data

    Over the last decade, advances in machine learning have led to an exponential growth in artificial intelligence, i.e., machine learning models capable of learning from vast amounts of data to perform tasks such as text classification, regression, machine translation, speech recognition, and many others. While massive volumes of data are available, the manual curation involved in generating training datasets means that only a fraction of the data is used to train machine learning models. Labeling data with a ground-truth value is extremely tedious and expensive, and it is the major bottleneck of supervised learning. To address this, the theory of noisy learning can be employed: data labeled through heuristics, knowledge bases, and weak classifiers is used for training instead of data obtained through manual annotation. The assumption here is that a large volume of training data, which contains noise and is acquired through an automated process, can compensate for the lack of manual labels. In this study, we utilize heuristic-based approaches to create noisy silver standard datasets. We extensively tested the theory of noisy learning on four different applications by training several machine learning models using the silver standard dataset with several sample sizes and class imbalances, and tested the performance using a gold standard dataset. Our evaluations on the four applications indicate the success of silver standard datasets in identifying a gold standard dataset. We conclude the study with evidence that noisy social media data can be utilized for weak supervision
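The heuristic labeling idea in this abstract can be illustrated with a minimal sketch. The abstract does not list the study's heuristics, so the three rules below, the label meanings (1 = personal report, 0 = not), and the majority-vote combination are all hypothetical assumptions chosen only to show how a noisy silver label might be produced without manual annotation.

```python
from collections import Counter

ABSTAIN = None  # a heuristic returns this when it has no opinion

# Hypothetical heuristics; the study's actual rules are not given in the abstract.
def lf_first_person(text):
    t = text.lower()
    return 1 if ("i took" in t or "i am taking" in t) else ABSTAIN

def lf_contains_link(text):
    return 0 if "http" in text.lower() else ABSTAIN

def lf_is_question(text):
    return 0 if text.strip().endswith("?") else ABSTAIN

LABELING_FUNCTIONS = [lf_first_person, lf_contains_link, lf_is_question]

def silver_label(text):
    """Combine heuristic votes by simple majority; abstain on no votes or a tie."""
    votes = [v for v in (lf(text) for lf in LABELING_FUNCTIONS) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting heuristics with equal support
    return ranked[0][0]

tweets = [
    "I took ibuprofen this morning",
    "Does ibuprofen help?",
    "Nice weather today",
]
silver = [(t, silver_label(t)) for t in tweets]
```

Tweets that receive a label form the silver standard training set; a model trained on them would then be evaluated against a manually annotated gold standard, as the abstract describes.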

    Leveraging Large Language Models and Weak Supervision for Social Media data annotation: an evaluation using COVID-19 self-reported vaccination tweets

    The COVID-19 pandemic has presented significant challenges to the healthcare industry and society as a whole. With the rapid development of COVID-19 vaccines, social media platforms have become a popular medium for discussions on vaccine-related topics. Identifying vaccine-related tweets and analyzing them can provide valuable insights for public health researchers and policymakers. However, manual annotation of a large number of tweets is time-consuming and expensive. In this study, we evaluate the usage of Large Language Models, in this case GPT-4 (March 23 version), and weak supervision, to identify COVID-19 vaccine-related tweets, with the purpose of comparing performance against human annotators. We leveraged a manually curated gold-standard dataset and used GPT-4 to provide labels without any additional fine-tuning or instructing, in a single-shot mode (no additional prompting)

    When Infodemic Meets Epidemic: a Systematic Literature Review

    Epidemics and outbreaks present arduous challenges requiring both individual and communal efforts. Social media offer significant amounts of data that can be leveraged for bio-surveillance. They also provide a platform to quickly and efficiently reach a sizeable percentage of the population, hence their potential impact on various aspects of epidemic mitigation. The general objective of this systematic literature review is to provide a methodical overview of the integration of social media in different epidemic-related contexts. Three research questions were conceptualized for this review, resulting in over 10,000 publications collected in the first PRISMA stage, 129 of which were selected for inclusion. A thematic method-oriented synthesis was undertaken and identified 5 main themes related to social media-enabled epidemic surveillance, misinformation management, and mental health. Findings uncover a need for more robust applications of the lessons learned from epidemic post-mortem documentation. A vast gap exists between retrospective analysis of epidemic management and result integration in prospective studies. Harnessing the full potential of social media in epidemic-related tasks requires streamlining the results of epidemic forecasting, public opinion understanding, and misinformation propagation, all while keeping abreast of potential mental health implications. Pro-active prevention has thus become vital for epidemic curtailment and containment

    MODELING TWITTER SENTIMENT AS A FUNCTION OF PARTICULATE MATTER 2.5 FOR COMMUNITIES IMPACTED BY WILDFIRE ACROSS MONTANA AND IDAHO

    Fine particulate matter (PM2.5) is a known pollutant with clinically detrimental physiological and behavioral effects. We consider Twitter sentiment as a potential indicator for well-being in communities impacted by wildfire-associated PM2.5 across Montana and Idaho spanning 5 years (2014-2018). From these geospatial air quality data and geo-tagged tweets, we trained county-level models to examine the power of Twitter sentiment as a function of PM2.5. For all 24 counties sampled, we found between 1 and 8 affective dimensions where a positive R² was detected with a significant F-statistic (p < 0.05). Specifically, we show that sentiment for anticipation in the wildfire-prone county of Missoula, MT yielded respective training/test set R² of 0.0958 and 0.0686 with a p-value for the F-statistic of 3.09E-07. These analyses support social media sentiment as a potential public health metric by showing one of the first observations of a relationship between PM2.5 and Twitter sentiment
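For readers unfamiliar with the reported metrics, the following is a minimal sketch of how a coefficient of determination and an F-statistic can be computed for a one-predictor least-squares fit. The abstract does not specify the model family used in the study, so simple linear regression is an assumption here, and the numbers are made-up toy values, not data from the paper.

```python
def fit_simple_ols(x, y):
    """Least-squares slope and intercept for y ~ intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(x, y, slope, intercept):
    """1 - SS_res / SS_tot; can be negative on held-out data."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def f_statistic(r2, n, k=1):
    """Overall F-statistic for a model with k predictors fitted on n points."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# Toy values standing in for a pollutant level (x) and a sentiment score (y).
pm25 = [0.0, 1.0, 2.0, 3.0]
sentiment = [1.0, 3.0, 4.0, 7.0]

slope, intercept = fit_simple_ols(pm25, sentiment)
train_r2 = r_squared(pm25, sentiment, slope, intercept)
train_f = f_statistic(train_r2, n=len(pm25))
```

The p-value quoted in the abstract would come from comparing the F-statistic against an F distribution with (k, n - k - 1) degrees of freedom, which is omitted here to keep the sketch dependency-free.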

    Studying Public Perception about Vaccination: A Sentiment Analysis of Tweets

    Text analysis has been used by scholars to research attitudes toward vaccination and is particularly timely due to the rise of medical misinformation via social media. This study uses a sample of 9581 vaccine-related tweets in the period 1 January 2019 to 5 April 2019. The time period is of the essence because during this time, a measles outbreak was prevalent throughout the United States and a public debate was raging. Sentiment analysis is applied to the sample, clustering the data into topics using the term frequency–inverse document frequency (TF-IDF) technique. The analyses suggest that most (about 77%) of the tweets focused on the search for new/better vaccines for diseases such as the Ebola virus, human papillomavirus (HPV), and the flu. Of the remainder, about half concerned the recent measles outbreak in the United States, and about half were part of ongoing debates between supporters and opponents of vaccination against measles in particular. While these numbers currently suggest a relatively small role for vaccine misinformation, the concept of herd immunity puts that role in context. Nevertheless, going forward, health experts should consider the potential for the increasing spread of falsehoods that may get firmly entrenched in the public mind
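As a brief illustration of the TF-IDF weighting mentioned in this abstract, here is a minimal sketch. It assumes one common variant (term frequency normalized by document length, inverse document frequency as log(N / df)); the abstract does not say which variant or tooling the study used, and the toy tokens are invented for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Toy tokenized tweets (hypothetical, not from the study's sample).
docs = [
    ["measles", "vaccine"],
    ["flu", "vaccine"],
]
weights = tf_idf(docs)
```

Under this variant a term that appears in every document (here "vaccine") gets weight zero, while terms that distinguish a document (here "measles" or "flu") get positive weight, which is what makes the vectors useful for grouping tweets into topics.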

    The Propaganda Conundrum: How to Control This Scourge on Democracy

    60 pages. Propaganda is playing an unprecedented role in global political life. With frightening reach and ambition, political and corporate actors are using propaganda to undermine the democratic ideals of truth and transparency. Because freedom of speech is a basic right that enjoys widespread public support, and as meaningful restrictions on noxious propaganda present legal difficulties, propaganda continues to flourish as a subtle and increasingly pervasive disease, undermining the core assumptions of democratic governance. Political choices citizens make in a democratic society mean little in the absence of true and accurate information; propaganda subverts the vital link between political understanding and political choice

    Quantitative intersectional data (QUINTA): a #metoo case study

    This research began as an investigation of the #metoo movement, with the initial impetus to illuminate the voices located on the margins, those who often go unheard or are never recognized. This work aimed to understand the intersectional aspects of how variations of the hashtag #metoo (e.g., #metoomosque, #churchtoo, #metoodisable, #metooqueer, #metoochina) reveal the inequities of the #metoo movement on Twitter. The proliferation of these hashtag variations has often been ignored by scholars, and therefore absorbed into the larger #metoo movement conversation on Twitter. Therefore, the term 'hashtag derivative' was created to describe the variation on the theme of its original hashtag, strongly reflecting its composition. Moreover, a critical theory such as Intersectionality is well-equipped to explore how overlapping identities encounter and structure social reality in relationship to power. Amid a pandemic and racial unrest, the true capabilities of Intersectionality to describe inequities and injustices beyond the singular social position of race and gender are not widely understood. Data science is not absolved of its role in inequities and injustices merely by dint of being a quantitative field that claims "objectivity". Social scientists have illuminated the racism, sexism, ableism, transphobia, homophobia, prejudice, bigotry, and bias embedded in data science's technology, tools, and algorithms. This has grave consequences, direct and indirect, for entire communities as a whole as well as for marginalized communities. The application of Intersectionality to a quantitative field can provide researchers a formal structure to be more conscientious about how to critique, develop, and design their data science processes, while also reckoning with their own positioning in relationship to the data. In this way, Intersectionality is inclusive in terms of data equity yet adds an additional layer of accountability to the researcher.
    This research leads to the three critical contributions of this work: (1) creating a more concise terminology to describe the phenomenon of hashtag variation, known as hashtag derivatives, (2) defining the historical context of Intersectionality and building a formal case for this to be properly contextualized in the Computer Science field (in particular Data Science), and (3) developing the Quantitative Intersectional Data (QUINTA) Framework, which data scientists and scholars can use to be more equitable, inclusive, and accountable for their role in the data science process