Entity-based Claim Representation Improves Fact-Checking of Medical Content in Tweets
False medical information on social media poses harm to people's health.
While the need for biomedical fact-checking has been recognized in recent
years, user-generated medical content has received comparably little attention.
At the same time, models for other text genres might not be reusable, because
the claims they have been trained with are substantially different. For
instance, claims in the SciFact dataset are short and focused: "Side effects
associated with antidepressants increases risk of stroke". In contrast, social
media holds naturally occurring claims, often embedded in additional context:
"'If you take antidepressants like SSRIs, you could be at risk of a condition
called serotonin syndrome' Serotonin syndrome nearly killed me in 2010. Had
symptoms of stroke and seizure." This showcases the mismatch between real-world
medical claims and the input that existing fact-checking systems expect. To
make user-generated content checkable by existing models, we propose to
reformulate the social-media input in such a way that the resulting claim
mimics the claim characteristics in established datasets. To accomplish this,
our method condenses the claim with the help of relational entity information
and either compiles the claim out of an entity-relation-entity triple or
extracts the shortest phrase that contains these elements. We show that the
reformulated input improves the performance of various fact-checking models as
opposed to checking the tweet text in its entirety.
Comment: Accepted at The 9th Workshop on Argument Mining
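The two reformulation strategies named in the abstract above (compiling an entity-relation-entity triple, or extracting the shortest phrase that contains these elements) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Span` type, the function names, the token indices, and the relation label "cause" are all assumptions made for illustration, and the entity spans are supplied by hand rather than predicted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Token span: `start` is the first token, `end` is one past the last."""
    start: int
    end: int


def triple_claim(tokens, subj, rel_label, obj):
    """Compile a claim of the form 'subject relation object'."""
    subj_text = " ".join(tokens[subj.start:subj.end])
    obj_text = " ".join(tokens[obj.start:obj.end])
    return f"{subj_text} {rel_label} {obj_text}"


def shortest_covering_phrase(tokens, spans):
    """Return the shortest contiguous token sequence containing all spans."""
    start = min(s.start for s in spans)
    end = max(s.end for s in spans)
    return " ".join(tokens[start:end])


# Whitespace-tokenized excerpt of the tweet quoted in the abstract.
tweet = ("If you take antidepressants like SSRIs , you could be at risk "
         "of a condition called serotonin syndrome").split()

subj = Span(3, 4)    # "antidepressants" (hand-annotated, hypothetical)
obj = Span(16, 18)   # "serotonin syndrome" (hand-annotated, hypothetical)

triple = triple_claim(tweet, subj, "cause", obj)
phrase = shortest_covering_phrase(tweet, [subj, obj])
print(triple)
print(phrase)
```

Either string could then be passed to a fact-checking model in place of the full tweet; which variant works better is an empirical question that the sketch does not answer.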
Recovering Patient Journeys: A Corpus of Biomedical Entities and Relations on Twitter (BEAR)
Text mining and information extraction for the medical domain have focused on
scientific text generated by researchers. However, researchers' direct access to
individual patient experiences or patient-doctor interactions can be limited.
Information provided on social media, e.g., by patients and their relatives,
complements the knowledge in scientific text. It reflects the patient's journey
and their subjective perspective on the process of developing symptoms, being
diagnosed and offered a treatment, being cured or learning to live with a
medical condition. The value of this type of data is therefore twofold:
Firstly, it offers direct access to people's perspectives. Secondly, it might
cover information that is not available elsewhere, including self-treatment or
self-diagnoses. Named entity recognition and relation extraction are methods to
structure information that is available in unstructured text. However, existing
medical social media corpora focused on a comparably small set of entities and
relations and particular domains, rather than putting the patient into the
center of analyses. With this paper we contribute a corpus with a rich set of
annotation layers following the motivation to uncover and model patients'
journeys and experiences in more detail. We label 14 entity classes (incl.
environmental factors, diagnostics, biochemical processes, patients'
quality-of-life descriptions, pathogens, medical conditions, and treatments)
and 20 relation classes (e.g., prevents, influences, interactions, causes) most
of which have not been considered before for social media data. The publicly
available dataset consists of 2,100 tweets with approx. 6,000 entity and 3,000
relation annotations. In a corpus analysis we find that over 80 % of documents
contain relevant entities. Over 50 % of tweets express relations which we
consider essential for uncovering patients' narratives about their journeys.
Comment: Accepted at LREC 202
An Entity-based Claim Extraction Pipeline for Real-world Biomedical Fact-checking
Existing fact-checking models for biomedical claims are typically trained on synthetic or well-worded data and hardly transfer to social media content. This mismatch can be mitigated by adapting the social media input to mimic the focused nature of common training claims. To do so, Wührl and Klinger (2022a) propose to extract concise claims based on medical entities in the text. However, their study has two limitations: First, it relies on gold-annotated entities. Therefore, its feasibility for a real-world application cannot be assessed since this requires detecting relevant entities automatically. Second, they represent claim entities with the original tokens. This constitutes a terminology mismatch which potentially limits the fact-checking performance. To understand both challenges, we propose a claim extraction pipeline for medical tweets that incorporates named entity recognition and terminology normalization via entity linking. We show that automatic NER does lead to a performance drop in comparison to using gold annotations but the fact-checking performance still improves considerably over inputting the unchanged tweets. Normalizing entities to their canonical forms does, however, not improve the performance.
An Entity-based Claim Extraction Pipeline for Real-world Biomedical Fact-checking
Existing fact-checking models for biomedical claims are typically trained on
synthetic or well-worded data and hardly transfer to social media content. This
mismatch can be mitigated by adapting the social media input to mimic the
focused nature of common training claims. To do so, Wuehrl & Klinger (2022)
propose to extract concise claims based on medical entities in the text.
However, their study has two limitations: First, it relies on gold-annotated
entities. Therefore, its feasibility for a real-world application cannot be
assessed since this requires detecting relevant entities automatically. Second,
they represent claim entities with the original tokens. This constitutes a
terminology mismatch which potentially limits the fact-checking performance. To
understand both challenges, we propose a claim extraction pipeline for medical
tweets that incorporates named entity recognition and terminology normalization
via entity linking. We show that automatic NER does lead to a performance drop
in comparison to using gold annotations but the fact-checking performance still
improves considerably over inputting the unchanged tweets. Normalizing entities
to their canonical forms does, however, not improve the performance.
Comment: Accepted at The Sixth FEVER Workshop
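The pipeline described in the abstract above (automatic entity detection, optional normalization to canonical forms, then claim extraction) can be sketched roughly as follows. Everything here is an assumption for illustration: a toy lexicon stands in for both the trained NER model and the entity linker, the example tweet is invented, and the function names are hypothetical. It is not the pipeline from the paper.

```python
# Toy lexicon: lowercased surface form -> canonical name.
# Hypothetical stand-in for a trained NER model plus an entity linker.
LEXICON = {
    "ssris": "selective serotonin reuptake inhibitors",
    "serotonin syndrome": "serotonin syndrome",
}


def detect_entities(text):
    """Return (start, end, surface) for each lexicon match, ordered by position."""
    lowered = text.lower()
    found = []
    for surface in LEXICON:
        pos = lowered.find(surface)
        if pos != -1:
            end = pos + len(surface)
            found.append((pos, end, text[pos:end]))
    return sorted(found)


def extract_claim(text, normalize=False):
    """Cut the text down to the span from the first to the last detected entity."""
    entities = detect_entities(text)
    if len(entities) < 2:
        return text  # fall back to the unchanged tweet
    claim = text[entities[0][0]:entities[-1][1]]
    if normalize:
        # Replace each surface form with its canonical name.
        for _, _, surface in entities:
            claim = claim.replace(surface, LEXICON[surface.lower()])
    return claim


# Invented example tweet, for illustration only.
tweet = "Heads up: SSRIs can cause serotonin syndrome, happened to me in 2010."

raw_claim = extract_claim(tweet)
normalized_claim = extract_claim(tweet, normalize=True)
print(raw_claim)
print(normalized_claim)
```

Keeping normalization behind a flag mirrors the comparison in the abstract, where claims with original tokens and claims with canonical entity names are evaluated separately.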
CoVERT: A Corpus of Fact-checked Biomedical COVID-19 Tweets
During the first two years of the COVID-19 pandemic, large volumes of biomedical information concerning this new disease have been published on social media. Some of this information can pose a real danger, particularly when false information is shared, for instance, recommendations on how to treat diseases without professional medical advice. Therefore, automatic fact-checking resources and systems developed specifically for the medical domain are crucial. While existing fact-checking resources cover COVID-19 related information in news or quantify the amount of misinformation in tweets, there is no dataset providing fact-checked COVID-19 related Twitter posts with detailed annotations for biomedical entities, relations and relevant evidence. We contribute CoVERT, a fact-checked corpus of tweets with a focus on the domain of biomedicine and COVID-19 related (mis)information. The corpus consists of 300 tweets, each annotated with named entities and relations. We employ a novel crowdsourcing methodology to annotate all tweets with fact-checking labels and supporting evidence, which crowdworkers search for online. This methodology results in substantial inter-annotator agreement. Furthermore, we use the retrieved evidence extracts as part of a fact-checking pipeline, finding that the real-world evidence is more useful than the knowledge directly available in pretrained language models.
Understanding Fine-grained Distortions in Reports of Scientific Findings
Distorted science communication harms individuals and society as it can lead
to unhealthy behavior change and decrease trust in scientific institutions.
Given the rapidly increasing volume of science communication in recent years, a
fine-grained understanding of how findings from scientific publications are
reported to the general public, and methods to detect distortions from the
original work automatically, are crucial. Prior work focused on individual
aspects of distortions or worked with unpaired data. In this work, we make
three foundational contributions towards addressing this problem: (1)
annotating 1,600 instances of scientific findings from academic papers paired
with corresponding findings as reported in news articles and tweets with respect
to four characteristics: causality, certainty, generality and sensationalism; (2)
establishing baselines for automatically detecting these characteristics; and
(3) analyzing the prevalence of changes in these characteristics in both
human-annotated and large-scale unlabeled data. Our results show that
scientific findings frequently undergo subtle distortions when reported. Tweets
distort findings more often than science news reports. Detecting fine-grained
distortions automatically poses a challenging task. In our experiments,
fine-tuned task-specific models consistently outperform few-shot LLM prompting.
Recovering Patient Journeys: A Corpus of Biomedical Entities and Relations on Twitter (BEAR)
Text mining and information extraction for the medical domain have focused on scientific text generated by researchers. However, researchers' access to individual patient experiences or patient-doctor interactions is limited. On social media, doctors, patients and their relatives also discuss medical information. Individual information provided by laypeople complements the knowledge available in scientific text. It reflects the patient’s journey, making the value of this type of data twofold: it offers direct access to people’s perspectives, and it might cover information that is not available elsewhere, including self-treatment or self-diagnosis. Named entity recognition and relation extraction are methods to structure information that is available in unstructured text. However, existing medical social media corpora focused on a comparably small set of entities and relations. In contrast, we provide rich annotation layers to model patients’ experiences in detail. The corpus consists of medical tweets annotated with a fine-grained set of medical entities and relations between them, namely 14 entity classes (incl. environmental factors, diagnostics, biochemical processes, patients’ quality-of-life descriptions, pathogens, medical conditions, and treatments) and 20 relation classes (incl. prevents, influences, interactions, causes). The dataset consists of 2,100 tweets with approx. 6,000 entities and 2,200 relations.
Emotion Recognition under Consideration of the Emotion Component Process Model
Emotion classification in text is typically performed with neural network
models which learn to associate linguistic units with emotions. While this
often leads to good predictive performance, it helps only to a limited
degree in understanding how emotions are communicated in various domains. The
emotion component process model (CPM) by Scherer (2005) is an interesting
approach to explain emotion communication. It states that emotions are a
coordinated process of various subcomponents, in reaction to an event, namely
the subjective feeling, the cognitive appraisal, the expression, a
physiological bodily reaction, and a motivational action tendency. We
hypothesize that these components are associated with linguistic realizations:
an emotion can be expressed by describing a physiological bodily reaction ("he
was trembling"), or the expression ("she smiled"), etc. We annotate existing
literature and Twitter emotion corpora with emotion component classes and find
that emotions on Twitter are predominantly expressed by event descriptions or
subjective reports of the feeling, while in literature, authors prefer to
describe what characters do, and leave the interpretation to the reader. We
further include the CPM in a multitask learning model and find that this
supports the emotion categorization. The annotated corpora are available at
https://www.ims.uni-stuttgart.de/data/emotion.
Comment: KONVENS 2021, published at https://aclanthology.org/2021.konvens-1.5/
Please cite via https://aclanthology.org/2021.konvens-1.5.bi