
    MultiClaim: Multilingual Previously Fact-Checked Claim Retrieval

    MultiClaim: Multilingual Previously Fact-Checked Claim Retrieval is a dataset that can be used to train and test models for combating disinformation. The dataset consists of 206k claims fact-checked by professional fact-checkers and 28k social media posts gathered from the wild. Each social media post has at least one claim assigned. The main idea is to develop information retrieval models that assign the appropriate claims to each post.

    GitHub repository: https://github.com/kinit-sk/multiclaim

    Contents

    fact_check_post_mapping.csv - Mapping between fact-checks and social media posts:
        fact_check_id
        post_id

    fact_checks.csv - Data about fact-checks:
        fact_check_id
        claim - The translated text (see below) of the fact-check claim.
        instances - Instances of the fact-check: a list of timestamps and URLs.
        title - The translated text (see below) of the fact-check title.

    posts.csv - Data about social media posts:
        post_id
        instances - Instances of the post: a list of timestamps and the social media platforms it appeared on.
        ocr - A list of translated texts (see below) of the OCR transcripts of the images attached to the post.
        verdicts - A list of verdicts attached by Meta (e.g., False information).
        text - The translated text (see below) of the text written by the user.

    What is a translated text?

    A tuple of the original text, its translation to English, and the detected languages. In the sample below we have an original Croatian text, its English translation, and finally the predicted language composition (hbs = Serbo-Croatian):

        (
            '"...bolnice su pune ? tišina, muk...upravo sada, bolnica Rebro..tragično smešno',
            '"...hospitals are full? silence, silence... right now, Rebro hospital... tragically funny',
            [('hbs', 1.0)]
        )

    More details TB
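The file layout and the translated-text format described above suggest a simple way to pair each post with its assigned claims. The following is a minimal sketch using only the Python standard library, not the authors' own loading code. It assumes that each translated-text cell is stored in the CSV as a Python-literal tuple, as in the sample, and that empty cells are possible. The helper names (`parse_translated_text`, `posts_with_fact_checks`, `_toy_csv`) and the toy rows are invented for illustration; only the file and column names come from the description.

```python
import ast
import csv
import io


def parse_translated_text(cell):
    """Parse one translated-text cell into (original, english, languages).

    Assumption: the cell holds a Python-literal tuple like the sample in the
    description. An empty cell is returned as None.
    """
    if not cell:
        return None
    original, english, languages = ast.literal_eval(cell)
    return original, english, languages


def posts_with_fact_checks(mapping_file, fact_checks_file, posts_file):
    """Yield (post_id, english_post_text, [english_claim, ...]) per post."""
    # English claim text keyed by fact_check_id.
    claims = {}
    for row in csv.DictReader(fact_checks_file):
        parsed = parse_translated_text(row["claim"])
        claims[row["fact_check_id"]] = parsed[1] if parsed else None

    # A post may have several fact-checks assigned, so collect them in lists.
    assigned = {}
    for row in csv.DictReader(mapping_file):
        assigned.setdefault(row["post_id"], []).append(row["fact_check_id"])

    for row in csv.DictReader(posts_file):
        parsed = parse_translated_text(row["text"])
        text_en = parsed[1] if parsed else None
        ids = assigned.get(row["post_id"], [])
        yield row["post_id"], text_en, [claims.get(i) for i in ids]


def _toy_csv(header, rows):
    """Build an in-memory CSV so the sketch runs without the real files."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    buf.seek(0)
    return buf


# Invented toy rows, only to show the shapes involved (not real dataset rows).
toy_claim = repr(("Hospitals are empty", "Hospitals are empty", [("eng", 1.0)]))
toy_post = repr(("hospitals are full?", "hospitals are full?", [("eng", 1.0)]))

mapping = _toy_csv(["fact_check_id", "post_id"], [["1", "10"]])
fact_checks = _toy_csv(
    ["fact_check_id", "claim", "instances", "title"],
    [["1", toy_claim, "[]", ""]],
)
posts = _toy_csv(
    ["post_id", "instances", "ocr", "verdicts", "text"],
    [["10", "[]", "[]", "[]", toy_post]],
)

results = list(posts_with_fact_checks(mapping, fact_checks, posts))
```

With the real files, the in-memory buffers would be replaced by file handles opened with `newline=""` and `encoding="utf-8"`. The `ocr` and `verdicts` columns hold lists rather than single tuples, so they would need their own parsing step, which this sketch leaves out.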