LLM-Assisted Content Analysis: Using Large Language Models to Support Deductive Coding
Deductive coding is a widely used qualitative research method for determining
the prevalence of themes across documents. While useful, deductive coding is
often burdensome and time-consuming, since it requires researchers to read,
interpret, and reliably categorize a large body of unstructured text documents.
Large language models (LLMs), like ChatGPT, are a class of quickly evolving AI
tools that can perform a range of natural language processing and reasoning
tasks. In this study, we explore the use of LLMs to reduce the time required
for deductive coding while retaining the flexibility of a traditional content
analysis. We outline the proposed approach, called LLM-assisted content
analysis (LACA), along with an in-depth case study using GPT-3.5 for LACA on a
publicly available deductive coding data set. Additionally, we conduct an
empirical benchmark using LACA on 4 publicly available data sets to assess the
broader question of how well GPT-3.5 performs across a range of deductive
coding tasks. Overall, we find that GPT-3.5 can often perform deductive coding
at levels of agreement comparable to human coders. Additionally, we demonstrate
that LACA can help refine prompts for deductive coding, identify codes for
which an LLM is randomly guessing, and help assess when to use LLMs vs. human
coders for deductive coding. We conclude with several implications for the
future practice of deductive coding and related research methods.
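As a concrete illustration of the workflow the abstract describes, the sketch
below codes documents with an LLM and checks agreement against a human coder.
It is a minimal sketch, not the authors' exact protocol: the codebook, prompt
wording, and labels are invented for illustration, and it assumes the OpenAI
Python client and scikit-learn.

```python
# Minimal LACA-style sketch: ask an LLM to apply a deductive code to each
# document, then measure agreement with a human coder via Cohen's kappa.
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical two-label codebook, for illustration only.
CODEBOOK_PROMPT = (
    "You are a qualitative coder. Code the document below as 'relevant' if it "
    "discusses vaccine side effects, otherwise 'not_relevant'. "
    "Answer with exactly one label.\n\nDocument: {doc}"
)

def llm_code(doc: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # keep the coding as deterministic as possible
        messages=[{"role": "user", "content": CODEBOOK_PROMPT.format(doc=doc)}],
    )
    return resp.choices[0].message.content.strip().lower()

docs = ["I felt feverish after the second dose.", "Great weather today."]
human_labels = ["relevant", "not_relevant"]  # gold codes from a human coder
llm_labels = [llm_code(d) for d in docs]

print("Cohen's kappa:", cohen_kappa_score(human_labels, llm_labels))
```

In this framing, a kappa near zero for a particular code is one signal that
the model is effectively guessing on that code and a human coder is needed.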
Massive Multi-Agent Data-Driven Simulations of the GitHub Ecosystem
Simulating and predicting planetary-scale techno-social systems poses heavy
computational and modeling challenges. The DARPA SocialSim program set the
challenge to model the evolution of GitHub, a large collaborative
software-development ecosystem, using massive multi-agent simulations. We
describe our best performing models and our agent-based simulation framework,
which we are currently extending to allow simulating other planetary-scale
techno-social systems. The challenge problem measured participants' ability,
given 30 months of metadata on user activity on GitHub, to predict the next
months' activity as measured by a broad range of metrics applied to ground
truth, using agent-based simulation. The challenge required scaling to a
simulation of roughly 3 million agents producing a combined 30 million actions,
acting on 6 million repositories with commodity hardware. It was also important
to use the data optimally to predict the agents' next moves. We describe the
agent framework and the data analysis employed by one of the winning teams in
the challenge. Six different agent models were tested based on a variety of
machine learning and statistical methods. While no single method proved the
most accurate on every metric, the most broadly successful models sampled from a
stationary probability distribution of actions and repositories for each agent.
These agents succeeded for two reasons: they characterized each agent
individually, and GitHub users change their behavior relatively slowly.
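The winning agent design lends itself to a compact sketch: estimate, per
agent, a stationary empirical distribution over (action, repository) pairs
from the historical metadata, then sample next-period activity from it. The
event tuples and names below are illustrative, not the SocialSim data format.

```python
# Sketch: each agent draws its next actions from a stationary empirical
# distribution of (action, repo) pairs estimated from its own history.
import random
from collections import Counter, defaultdict

# Toy stand-in for the 30 months of GitHub event metadata.
history = [
    ("alice", "PushEvent", "repo1"),
    ("alice", "PushEvent", "repo1"),
    ("alice", "IssuesEvent", "repo2"),
    ("bob", "ForkEvent", "repo3"),
]

# Estimate one stationary distribution per agent.
per_agent = defaultdict(Counter)
for user, action, repo in history:
    per_agent[user][(action, repo)] += 1

def simulate_agent(user: str, n_actions: int) -> list:
    counts = per_agent[user]
    pairs, weights = zip(*counts.items())
    # Sample with replacement, proportionally to each pair's past frequency.
    return random.choices(pairs, weights=weights, k=n_actions)

# Draw one agent's next moves; the challenge scaled this kind of loop to
# roughly 3 million agents producing 30 million actions.
print(simulate_agent("alice", 5))
```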
Stat5 Synergizes with T Cell Receptor/Antigen Stimulation in the Development of Lymphoblastic Lymphoma
Signal transducer and activator of transcription (STAT) proteins are latent transcription factors that mediate a wide range of actions induced by cytokines, interferons, and growth factors. We report the development of thymic T cell lymphoblastic lymphomas in transgenic mice in which Stat5a or Stat5b is overexpressed within the lymphoid compartment. The rate of lymphoma induction was markedly enhanced by immunization or by the introduction of TCR transgenes. Remarkably, the Stat5 transgene potently induced the development of CD8+ T cells, even in mice expressing a class II–restricted TCR transgene, with resulting CD8+ T cell lymphomas. These data demonstrate the oncogenic potential of dysregulated expression of a STAT protein that is not constitutively activated, and show that TCR stimulation can contribute to this process.
CoVaxxy Tweet IDs data set
A collection of Tweet IDs related to COVID-19 vaccines, gathered from Twitter since January 4, 2021. Please see https://arxiv.org/abs/2101.07694 for more information.
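Because Twitter's data-sharing terms permit distributing only tweet IDs, the
dataset must be "hydrated" back into full tweets before analysis. A minimal
sketch, assuming the twarc library and a file with one ID per line (the file
name and bearer token are placeholders):

```python
# Hedged sketch: hydrating the Tweet IDs with the twarc library
# (https://twarc-project.readthedocs.io/). Requires a Twitter API bearer token.
from twarc import Twarc2

client = Twarc2(bearer_token="YOUR_BEARER_TOKEN")

with open("covaxxy_tweet_ids.txt") as f:  # one tweet ID per line (assumed)
    tweet_ids = [line.strip() for line in f if line.strip()]

# tweet_lookup yields response pages; deleted or private tweets are absent.
for page in client.tweet_lookup(tweet_ids):
    for tweet in page.get("data", []):
        print(tweet["id"], tweet["text"][:80])
```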
CoVaxxy: A Collection of English-Language Twitter Posts About COVID-19 Vaccines
With a substantial proportion of the population currently hesitant to take the COVID-19 vaccine, it is important that people have access to accurate information. However, a large amount of low-credibility information about vaccines is spreading on social media. In this paper, we present the CoVaxxy dataset, a growing collection of English-language Twitter posts about COVID-19 vaccines. Using one week of data, we provide statistics on the number of tweets over time, the hashtags used, and the websites shared. We also illustrate how these data might be utilized by analyzing the prevalence over time of high- and low-credibility sources, topic groups of hashtags, and geographical distributions. Additionally, we develop and present the CoVaxxy dashboard, which allows people to visualize the relationship between COVID-19 vaccine adoption and U.S. geolocated posts in our dataset. This dataset can be used to study the impact of online information on COVID-19 health outcomes (e.g., vaccine uptake), and our dashboard can help with exploration of the data.
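The kinds of descriptive statistics the paper reports (tweet volume, hashtags
used, websites shared) can be approximated from hydrated tweets with a few
counters. The sketch below assumes one JSON tweet object per line with Twitter
API v2 entity fields; the file name is a placeholder.

```python
# Illustrative sketch: hashtag frequencies and shared-website domains from
# hydrated tweets stored as JSONL (one Twitter v2 tweet object per line).
import json
from collections import Counter
from urllib.parse import urlparse

hashtags, domains = Counter(), Counter()

with open("covaxxy_tweets.jsonl") as f:
    for line in f:
        tweet = json.loads(line)
        entities = tweet.get("entities", {})
        for tag in entities.get("hashtags", []):
            hashtags[tag["tag"].lower()] += 1
        for url in entities.get("urls", []):
            # Expanded URLs attribute shares to source websites, e.g. to
            # contrast high- vs. low-credibility domains over time.
            domains[urlparse(url.get("expanded_url", "")).netloc] += 1

print(hashtags.most_common(10))
print(domains.most_common(10))
```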