
    Z-Index at CheckThat! Lab 2022: Check-Worthiness Identification on Tweet Text

    The wide use of social media and digital technologies facilitates the sharing of news and information about events and activities. Alongside useful content, however, misleading and false information also spreads on social media. There have been efforts to identify such misleading information, both manually by human experts and with automatic tools. Manual effort does not scale well given the high volume of content containing factual claims that appears online, so automatically identifying check-worthy claims can be very useful for human experts. In this study, we describe our participation in Subtask-1A: Check-worthiness of tweets (English, Dutch and Spanish) of the CheckThat! lab at CLEF 2022. We performed standard preprocessing steps and applied different models to identify whether a given text is worthy of fact-checking or not. We used oversampling to balance the dataset and applied SVM and Random Forest (RF) classifiers with TF-IDF representations. We also used the multilingual BERT (BERT-m) and XLM-RoBERTa-base pre-trained models in our experiments. We used BERT-m for the official submissions, and our systems ranked 3rd, 5th, and 12th in Spanish, Dutch, and English, respectively. In further experiments, our evaluation shows that the transformer models (BERT-m and XLM-RoBERTa-base) outperform SVM and RF for Dutch and English, whereas a different pattern is observed for Spanish. (Comment: Accepted at CLEF 2022.)
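    The classical baseline described here (TF-IDF features, oversampling, then an SVM or RF classifier) can be sketched roughly as follows; the sketch assumes scikit-learn and imbalanced-learn, and the tiny in-memory dataset and vectorizer settings are illustrative rather than the authors' exact configuration.

```python
# Sketch of a TF-IDF + SVM check-worthiness baseline with random oversampling.
# `train_texts`/`train_labels` are placeholders for the Subtask-1A tweet text
# and binary check-worthiness labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score
from imblearn.over_sampling import RandomOverSampler

train_texts = ["claim-like tweet ...", "chit-chat tweet ..."]   # placeholder data
train_labels = [1, 0]                                           # 1 = check-worthy
test_texts, test_labels = ["another tweet ..."], [0]

# TF-IDF representation of the (preprocessed) tweet text.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Oversample the minority (check-worthy) class before training.
X_train, y_train = RandomOverSampler(random_state=0).fit_resample(X_train, train_labels)

clf = LinearSVC().fit(X_train, y_train)
print(f1_score(test_labels, clf.predict(X_test)))
```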

    Automated fact-checking: A survey

    As online false information continues to grow, automated fact-checking has gained an increasing amount of attention in recent years. Researchers in the field of Natural Language Processing (NLP) have contributed to the task by building fact-checking datasets, devising automated fact-checking pipelines and proposing NLP methods to further research in the development of different components. This article reviews relevant research on automated fact-checking, covering both the claim detection and claim validation components.

    University of Copenhagen Participation in TREC Health Misinformation Track 2020

    In this paper, we describe our participation in the TREC Health Misinformation Track 2020. We submitted 11 runs to the Total Recall Task and 13 runs to the Ad Hoc task. Our approach consists of three steps: (1) we create an initial run with BM25 and RM3; (2) we estimate credibility and misinformation scores for the documents in the initial run; (3) we merge the relevance, credibility and misinformation scores to re-rank the documents in the initial run. To estimate credibility scores, we implement a classifier that exploits features based on the content and the popularity of a document. To compute the misinformation score, we apply a stance detection approach with a pre-trained Transformer language model. Finally, we use different approaches to merge the scores: weighted average, the distance among score vectors, and rank fusion.
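    Step (3) merges relevance, credibility and misinformation scores. A minimal sketch of two of the merging strategies named above, weighted average and rank fusion, is given below; the weights and the rank-fusion constant are illustrative assumptions, not values from the paper.

```python
# Sketch of merging relevance, credibility and misinformation scores for re-ranking.
# The weights and the reciprocal-rank-fusion constant are illustrative assumptions.

def weighted_average(scores, weights=(0.6, 0.2, 0.2)):
    """scores: {doc_id: (relevance, credibility, misinformation)} -> fused score per doc."""
    return {d: sum(w * s for w, s in zip(weights, triple)) for d, triple in scores.items()}

def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of doc_id lists (best first) -> fused score per doc."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return fused

scores = {"d1": (12.3, 0.9, 0.8), "d2": (14.1, 0.4, 0.2)}
print(sorted(weighted_average(scores).items(), key=lambda x: -x[1]))
print(reciprocal_rank_fusion([["d1", "d2"], ["d2", "d1"]]))
```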

    Few-shot Claim Verification for Automated Fact Checking

    In an era characterized by the rapid expansion of online information and the widespread dissemination of misinformation, automated fact-checking has emerged as an essential area of research. As digital platforms continue to proliferate, the necessity for accurate and efficient fact-checking mechanisms is attracting increasing interest. Automated fact-checking systems address two main tasks: claim detection and claim validation. Claim detection involves identifying sentences or text snippets containing assertions or claims potentially subject to fact-checking. Claim validation, a multifaceted endeavor, encompasses evidence retrieval and claim verification. During evidence retrieval, relevant information or evidence that may support or refute a given claim is obtained. Claim verification, on the other hand, entails assessing the veracity of a claim by comparing it against available evidence. Typically framed as a natural language inference (NLI) problem, claim verification requires the model to determine whether a claim is supported, refuted, or there is not enough information to reach a verdict.

    In this thesis, we explore challenges inherent in claim verification, with a focus on few-shot scenarios where limited labeled data and computational resources pose significant constraints. We introduce three innovative methods tailored to tackle these challenges: Semantic Embedding Element-wise Difference (SEED), Micro Analysis of Pairwise Language Evolution (MAPLE), and Active learning with Pattern Exploiting Training models (Active PETs).

    SEED, a novel vector-based approach, leverages semantic differences in claim-evidence pairs to perform claim verification in few-shot scenarios. By creating class representative vectors, SEED enables efficient claim verification even with limited training data. Comparative evaluations against previous state-of-the-art methods demonstrate SEED's consistent improvements in few-shot settings.

    MAPLE is another pioneering approach to few-shot claim verification, harnessing a small seq2seq model and a novel semantic measure to explore the alignment between claims and evidence. Utilizing micro analysis of pairwise language evolution, MAPLE achieves significant performance improvements over state-of-the-art baselines across multiple automated fact-checking datasets.

    Active PETs presents a novel ensemble-based active learning approach for data annotation prioritization in few-shot claim verification. By utilizing an ensemble of Pattern Exploiting Training (PET) models based on various pre-trained language models, Active PETs effectively selects unlabelled data for annotation, consistently outperforming baseline active learning methods. Its integrated oversampling strategy further enhances performance, demonstrating the potential of active learning techniques in optimizing claim verification workflows.

    Together, these methods represent significant advancements in claim verification research, offering scalable and practical solutions. Through extensive experimentation and comparative analysis, this thesis evaluates the effectiveness of each method on various dataset configurations and provides valuable insights into their strengths and weaknesses. Furthermore, by identifying potential extensions and areas for refinement, the thesis lays the groundwork for future research endeavors in this critical field of artificial intelligence.
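    As a rough illustration of the SEED idea described above (class representative vectors built from element-wise differences of claim-evidence embeddings), consider the following sketch; `embed()` is a hypothetical stand-in for a pretrained sentence encoder, and the nearest-vector decision rule is an assumption about the details rather than the thesis's exact formulation.

```python
# Rough sketch of the SEED idea: represent each (claim, evidence) pair by the
# element-wise difference of their embeddings, average the differences per class
# to obtain class representative vectors, and classify a new pair by the nearest
# class vector. `embed()` is a placeholder for any pretrained sentence encoder.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder encoder; in practice a pretrained sentence encoder would be used.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def pair_vector(claim: str, evidence: str) -> np.ndarray:
    return embed(claim) - embed(evidence)

def build_class_vectors(labelled_pairs):
    """labelled_pairs: list of (claim, evidence, label); a few shots per label suffice."""
    by_label = {}
    for claim, evidence, label in labelled_pairs:
        by_label.setdefault(label, []).append(pair_vector(claim, evidence))
    return {label: np.mean(vecs, axis=0) for label, vecs in by_label.items()}

def classify(claim, evidence, class_vectors):
    v = pair_vector(claim, evidence)
    return min(class_vectors, key=lambda lbl: np.linalg.norm(v - class_vectors[lbl]))
```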

    Overview of the CLAIMSCAN-2023: Uncovering Truth in Social Media through Claim Detection and Identification of Claim Spans

    The rapid development of online social media platforms has enabled a significant, and largely beneficial, increase in content creation and information exchange. However, these platforms have also become a haven for those who disseminate false information, propaganda, and fake news. Claims are essential in forming our perceptions of the world, but sadly, they are frequently used by those who spread false information to trick people. To address this problem, social media giants employ content moderators to filter out fake news, but the sheer volume of information makes it difficult to identify fake news effectively. Therefore, it has become crucial to automatically identify social media posts that make such claims, check their veracity, and differentiate between credible and false claims. In response, we presented CLAIMSCAN at the 2023 Forum for Information Retrieval Evaluation (FIRE'2023). The primary objectives centered on two crucial tasks: Task A, determining whether a social media post constitutes a claim, and Task B, precisely identifying the words or phrases within the post that form the claim. Task A received 40 registrations, demonstrating strong interest and engagement in this timely challenge. Meanwhile, Task B attracted participation from 28 teams, highlighting its significance in the digital era of misinformation.
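    Task B, identifying the words or phrases that form the claim, is commonly framed as BIO-style token tagging. The sketch below only shows how a gold claim span could be converted into per-token BIO labels for such a tagger; the example post and span indices are invented for illustration, not taken from the shared task data.

```python
# Sketch: converting a claim-span annotation into BIO tags for token classification.
# The example tokens and the claim span are invented for illustration.

def bio_tags(tokens, span_indices):
    """Mark tokens inside the claim span with B-CLAIM / I-CLAIM, all others with O."""
    tags, inside = [], False
    for i, _tok in enumerate(tokens):
        if i in span_indices:
            tags.append("I-CLAIM" if inside else "B-CLAIM")
            inside = True
        else:
            tags.append("O")
            inside = False
    return tags

tokens = ["They", "say", "vitamin", "X", "cures", "flu", "!"]
claim_span = {2, 3, 4, 5}          # token indices of the claim phrase
print(list(zip(tokens, bio_tags(tokens, claim_span))))
```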

    MythQA: Query-Based Large-Scale Check-Worthy Claim Detection through Multi-Answer Open-Domain Question Answering

    Check-worthy claim detection aims at providing plausible misinformation to downstream fact-checking systems or human experts to check. This is a crucial step toward accelerating the fact-checking process. Many efforts have been put into identifying check-worthy claims from a small scale of pre-collected claims, but how to efficiently detect check-worthy claims directly from a large-scale information source, such as Twitter, remains underexplored. To fill this gap, we introduce MythQA, a new multi-answer open-domain question answering (QA) task that involves contradictory stance mining for query-based large-scale check-worthy claim detection. The idea behind this is that contradictory claims are a strong indicator of misinformation that merits scrutiny by the appropriate authorities. To study this task, we construct TweetMythQA, an evaluation dataset containing 522 factoid multi-answer questions based on controversial topics. Each question is annotated with multiple answers. Moreover, we collect relevant tweets for each distinct answer and classify them into three categories: "Supporting", "Refuting", and "Neutral". In total, we annotated 5.3K tweets. Contradictory evidence is collected for all answers in the dataset. Finally, we present a baseline system for MythQA and evaluate existing NLP models for each system component using the TweetMythQA dataset. We provide initial benchmarks and identify key challenges for future models to improve upon. Code and data are available at: https://github.com/TonyBY/Myth-QA (Comment: Accepted by SIGIR 2023.)
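    The contradictory-stance signal at the heart of MythQA can be expressed as a simple heuristic: an answer that attracts both "Supporting" and "Refuting" tweets is flagged as check-worthy. The sketch below assumes stance labels are already produced by an upstream stance-detection component, which is not shown.

```python
# Sketch of the contradictory-stance heuristic: an answer backed by both
# "Supporting" and "Refuting" tweets is flagged as a check-worthy claim.
# Stance labels are assumed to come from an upstream stance-detection model.
from collections import defaultdict

def checkworthy_answers(tweet_stances):
    """tweet_stances: list of (answer, stance), stance in {Supporting, Refuting, Neutral}."""
    stances_per_answer = defaultdict(set)
    for answer, stance in tweet_stances:
        stances_per_answer[answer].add(stance)
    return [a for a, s in stances_per_answer.items() if {"Supporting", "Refuting"} <= s]

labels = [("5G towers", "Supporting"), ("5G towers", "Refuting"), ("bats", "Supporting")]
print(checkworthy_answers(labels))   # -> ['5G towers']
```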

    SCUoL at CheckThat! 2021: An AraBERT model for check-worthiness of Arabic tweets

    Many people nowadays tend to explore social media to obtain news and find information about various events and activities. However, an abundance of misleading and false information is spread every day for many purposes, dramatically impacting societies. It is therefore vitally important to identify false information on social media to help individuals distinguish the truth and to protect communities from the harmful effects of false information. For this reason, determining which information has priority to be scrutinized is a significant preliminary step that several studies have considered. In this paper, we address Subtask-1A (Arabic) of the CLEF 2021 CheckThat! Lab. We did this in two steps. The first involved pre-processing the provided dataset with text segmentation and tokenization. In the second step, we applied different models to the Arabic tweets in order to classify whether a given tweet is worth being considered for fact-checking or not. We mainly compared two versions of the pre-trained AraBERT model with some traditional word encoding methods, including a Linear SVC model with TF-IDF features. The results indicate that AraBERTv2 outperforms the other models. Consequently, we used it for our final submission, and we were ranked third among eight other participating teams.
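    A minimal sketch of the transformer side of this comparison, fine-tuning an AraBERT checkpoint for binary check-worthiness classification with Hugging Face Transformers, is given below; the checkpoint name, the toy in-memory dataset and the hyperparameters are assumptions for illustration, not the authors' exact setup.

```python
# Sketch of fine-tuning an AraBERT checkpoint for binary check-worthiness
# classification with Hugging Face Transformers. The checkpoint name, the tiny
# in-memory dataset and the hyperparameters are illustrative assumptions.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
from datasets import Dataset

model_name = "aubmindlab/bert-base-arabertv2"     # assumed AraBERTv2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

data = Dataset.from_dict({"text": ["...tweet 1...", "...tweet 2..."], "label": [1, 0]})
data = data.map(lambda x: tokenizer(x["text"], truncation=True, padding="max_length",
                                    max_length=128), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="arabert-checkworthy", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data,
)
trainer.train()
```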

    Automatic fake news detection on Twitter

    Nowadays, information is easily accessible online, from articles by reliable news agencies to reports from independent reporters, to extreme views published by unknown individuals. Moreover, social media platforms are becoming increasingly important in everyday life, where users can obtain the latest news and updates, share links to any information they want to spread, and post their own opinions. Such information may create difficulties for information consumers as they try to distinguish fake news from genuine news. Indeed, users may not necessarily be aware that the information they encounter is false, and may not have the time and effort to fact-check all the claims and information they encounter online. With the amount of information created and shared daily, it is also not feasible for journalists to manually fact-check every published news article, sentence or tweet. Therefore, an automatic fact-checking system that identifies check-worthy claims and tweets, and then fact-checks them, can help inform the public of fake news circulating online.

    Existing fake news detection systems mostly rely on the computational power of machine learning models to automatically identify fake news. Some researchers have focused on extracting the semantic and contextual meaning from news articles, statements, and tweets. These methods aim to identify fake news by analysing the differences in writing style between fake news and factual news. Other researchers have investigated using social network information to detect fake news accurately. These methods aim to distinguish fake news from factual news based on the spreading pattern of the news and statistical information about the users engaging with the propagated news.

    In this thesis, we propose a novel end-to-end fake news detection framework that leverages both textual features and social network features, which can be extracted from news, tweets, and their engaging users. Specifically, our proposed end-to-end framework is able to process a Twitter feed, identify check-worthy tweets and sentences using textual features and embedded entity features, and fact-check the claims using previously unexplored information, such as existing fake news collections and user network embeddings. Our ultimate aim is to rank tweets and claims based on their check-worthiness, so as to focus the available computational power on fact-checking the tweets and claims that are important and potentially fake. In particular, we leverage existing fake news collections to identify recurring fake news, while we explore Twitter users' engagement with the check-worthy news to identify fake news that is spreading on Twitter.

    To identify fake news effectively, we first propose the fake news detection framework (FNDF), which consists of a check-worthiness identification phase and a fact-checking phase. These two phases are divided into three tasks: Phase 1, Task 1: check-worthiness identification; Phase 2, Task 2: recurring fake news identification; and Phase 2, Task 3: social network structure-assisted fake news detection. We conduct experiments on two large publicly available datasets, namely the MM-COVID and stance detection (SD) datasets. The experimental results show that our proposed framework, FNDF, can indeed identify fake news more effectively than the existing SOTA models, with significant increases of 23.2% and 4.0% in F1 scores on the two tested datasets, respectively.
    To identify check-worthy tweets and claims effectively, we incorporate embedded entities with language representations to form a vector representation of a given text and determine whether the text is check-worthy or not. We conduct experiments using three publicly available datasets, namely the CLEF 2019 and CLEF 2020 CheckThat! Lab check-worthy sentence detection datasets, and the CLEF 2021 CheckThat! Lab check-worthy tweet detection dataset. The experimental results show that combining entity representations with language model representations enhances the language model's performance in identifying check-worthy tweets and sentences. Specifically, combining embedded entities with the language model results in as much as a 177.6% increase in MAP on ranking check-worthy tweets, and a 92.9% increase in ranking check-worthy sentences. Moreover, we conduct an ablation study on the proposed end-to-end framework, FNDF, and show that including a model for identifying check-worthy tweets and claims in our end-to-end framework can significantly increase the F1 score, by as much as 14.7%, compared to not including this model in our framework.

    To identify recurring fake news effectively, we propose an ensemble of BM25 scores and the BERT language model. Experiments were conducted on two datasets, namely the WSDM Cup 2019 Fake News Challenge dataset and the MM-COVID dataset. The experimental results show that enriching the BERT language model with BM25 scores helps the BERT model identify fake news significantly more accurately, by 4.4%. Moreover, the ablation study on the end-to-end fake news detection framework, FNDF, shows that including the recurring fake news identification model in our proposed framework results in a significant increase in F1 score, by as much as 15.5%, compared to not including this task in our framework.

    To leverage the user network structure in detecting fake news, we first obtain user embeddings from unsupervised user network embeddings based on their friendship or follower connections on Twitter. Next, we use the embeddings of the users who engaged with the news to represent a check-worthy tweet/claim, and thus predict whether it is fake news. Our results show that using user network embeddings to represent check-worthy tweets/sentences significantly outperforms the SOTA model, which uses language models to represent the tweets/sentences and complex networks requiring handcrafted features, by 12.0% in terms of F1 score. Furthermore, including the user-network-assisted fake news detection model in our end-to-end framework, FNDF, significantly increases the F1 score by as much as 29.3%.

    Overall, this thesis shows that an end-to-end fake news detection framework, FNDF, which identifies check-worthy tweets and claims and then fact-checks them by identifying recurring fake news and leveraging the social network users' connections, can effectively identify fake news online.
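    As one illustration of the recurring fake news component, a claim can be matched against a collection of previously debunked claims by interpolating a lexical BM25 score with a semantic score. The sketch below uses the rank_bm25 library, replaces the BERT matcher with a placeholder `semantic_score` function, and the interpolation weight is an illustrative assumption (in practice the two scores would also be normalised before mixing).

```python
# Sketch of recurring-fake-news matching: score an incoming claim against a
# collection of previously debunked claims with BM25, and combine that with a
# semantic similarity score. `semantic_score` is a placeholder for a BERT-based
# matcher; the interpolation weight alpha is an illustrative assumption.
from rank_bm25 import BM25Okapi

debunked = ["drinking bleach cures covid", "5g towers spread the virus"]
bm25 = BM25Okapi([d.split() for d in debunked])

def semantic_score(claim: str, candidate: str) -> float:
    # Placeholder; in practice a fine-tuned BERT model would score this pair.
    claim_tokens, cand_tokens = set(claim.split()), set(candidate.split())
    return len(claim_tokens & cand_tokens) / max(len(claim_tokens), 1)

def recurring_fake_news_scores(claim: str, alpha: float = 0.5):
    lexical = bm25.get_scores(claim.split())
    return [alpha * lex + (1 - alpha) * semantic_score(claim, cand)
            for lex, cand in zip(lexical, debunked)]

print(recurring_fake_news_scores("new claim that bleach cures covid"))
```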