RED-DOT: Multimodal Fact-checking via Relevant Evidence Detection
Online misinformation is often multimodal in nature, i.e., it arises from misleading associations between texts and accompanying images. To support the fact-checking process, researchers have recently been developing automatic multimodal methods that gather and analyze external information, or evidence, related to the image-text pairs under examination. However, prior works have assumed that all external information collected from the web is relevant. In this study, we introduce a "Relevant Evidence Detection" (RED) module to discern whether each piece of evidence is relevant, i.e., whether it supports or refutes the claim.
Specifically, we develop the "Relevant Evidence Detection Directed Transformer"
(RED-DOT) and explore multiple architectural variants (e.g., single or
dual-stage) and mechanisms (e.g., "guided attention"). Extensive ablation and
comparative experiments demonstrate that RED-DOT achieves significant
improvements over the state-of-the-art (SotA) on the VERITE benchmark by up to
33.7%. Furthermore, our evidence re-ranking and element-wise modality fusion led to RED-DOT surpassing the SotA on NewsCLIPpings+ by up to 3% without the need for numerous evidence items or multiple backbone encoders. We release our code
at: https://github.com/stevejpapad/relevant-evidence-detectio
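As a rough illustration of the kind of relevance-gated evidence fusion described above, the following PyTorch sketch scores each retrieved evidence item, pools the evidence by its predicted relevance, and combines it with an element-wise fusion of the claim's image and text embeddings. The class name, dimensions, gating scheme and heads are assumptions for illustration only; the authors' actual architecture is available in the linked repository.

```python
import torch
import torch.nn as nn

class RelevanceGatedVerifier(nn.Module):
    """Illustrative sketch of relevance-gated evidence fusion.

    Each retrieved evidence item is scored for relevance, weighted
    accordingly, and combined with an element-wise fusion of the
    claim's image and text embeddings before classification.
    """

    def __init__(self, dim: int = 768, heads: int = 8, layers: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.relevance_head = nn.Linear(dim, 1)   # per-evidence relevance score
        self.classifier = nn.Linear(5 * dim, 1)   # truthful vs. misleading

    def forward(self, img_emb, txt_emb, evidence_emb):
        # img_emb, txt_emb: (B, D) claim embeddings from a frozen backbone
        # evidence_emb:     (B, N, D) embeddings of N retrieved evidence items
        tokens = torch.cat([img_emb.unsqueeze(1),
                            txt_emb.unsqueeze(1),
                            evidence_emb], dim=1)             # (B, N+2, D)
        hidden = self.encoder(tokens)
        ev_hidden = hidden[:, 2:]                              # evidence tokens
        relevance = torch.sigmoid(self.relevance_head(ev_hidden))  # (B, N, 1)
        # pool evidence weighted by its predicted relevance
        pooled_ev = (relevance * ev_hidden).sum(1) / (relevance.sum(1) + 1e-6)
        # element-wise modality fusion of the claim pair
        fused = torch.cat([img_emb, txt_emb,
                           img_emb * txt_emb,
                           torch.abs(img_emb - txt_emb)], dim=-1)
        logit = self.classifier(torch.cat([fused, pooled_ev], dim=-1))
        return logit.squeeze(-1), relevance.squeeze(-1)
```

In practice, the relevance head could also receive auxiliary supervision indicating which evidence items support or refute the claim, rather than being trained only through the final verification loss.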
Credible, Unreliable or Leaked?: Evidence Verification for Enhanced Automated Fact-checking
Automated fact-checking (AFC) is garnering increasing attention from researchers aiming to help fact-checkers combat the growing spread of misinformation online. While many existing AFC methods incorporate external
information from the Web to help examine the veracity of claims, they often
overlook the importance of verifying the source and quality of collected
"evidence". One overlooked challenge involves the reliance on "leaked
evidence", information gathered directly from fact-checking websites and used
to train AFC systems, resulting in an unrealistic setting for early
misinformation detection. Similarly, the inclusion of information from
unreliable sources can undermine the effectiveness of AFC systems. To address
these challenges, we present a comprehensive approach to evidence verification
and filtering. We create the "CREDible, Unreliable or LEaked" (CREDULE)
dataset, which consists of 91,632 articles classified as Credible, Unreliable, and Fact-checked (Leaked). Additionally, we introduce the EVidence VERification
Network (EVVER-Net), trained on CREDULE to detect leaked and unreliable
evidence in both short and long texts. EVVER-Net can be used to filter evidence
collected from the Web, thus enhancing the robustness of end-to-end AFC
systems. We experiment with various language models and show that EVVER-Net achieves up to 91.5% and 94.4% accuracy when leveraging domain credibility scores together with short or long texts, respectively. Finally, we assess the evidence provided by widely-used fact-checking datasets, including LIAR-PLUS, MOCHEG, FACTIFY, NewsCLIPpings+ and VERITE, some of which exhibit concerning rates of leaked and unreliable evidence.
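As a rough sketch of how such an evidence filter could sit in front of an end-to-end AFC system, the snippet below combines a sentence embedding of each evidence text with a scalar domain credibility score and keeps only the items predicted as credible. The three-way label set follows the abstract; the architecture, the way the credibility score is injected, and the helpers `embed_fn` and `score_fn` are hypothetical placeholders rather than the actual EVVER-Net implementation.

```python
import torch
import torch.nn as nn

LABELS = ("credible", "unreliable", "leaked")

class EvidenceVerifier(nn.Module):
    """Illustrative 3-way evidence classifier: text embedding plus a
    domain credibility score -> credible / unreliable / leaked."""

    def __init__(self, text_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + 1, hidden),   # +1 for the credibility score
            nn.ReLU(),
            nn.Linear(hidden, len(LABELS)),
        )

    def forward(self, text_emb, domain_score):
        # text_emb: (B, D) sentence embedding of the evidence text
        # domain_score: (B, 1) credibility score of the source domain in [0, 1]
        return self.mlp(torch.cat([text_emb, domain_score], dim=-1))

def filter_evidence(model, items, embed_fn, score_fn):
    """Keep only evidence items predicted as 'credible'.

    embed_fn / score_fn stand in for whatever sentence encoder and
    domain-credibility lookup the surrounding pipeline provides.
    """
    kept = []
    model.eval()
    with torch.no_grad():
        for item in items:
            emb = embed_fn(item["text"]).unsqueeze(0)            # (1, D)
            score = torch.tensor([[score_fn(item["domain"])]])   # (1, 1)
            pred = model(emb, score).argmax(dim=-1).item()
            if LABELS[pred] == "credible":
                kept.append(item)
    return kept
```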
VERITE: A Robust Benchmark for Multimodal Misinformation Detection Accounting for Unimodal Bias
Multimedia content has become ubiquitous on social media platforms, leading
to the rise of multimodal misinformation (MM) and the urgent need for effective
strategies to detect and prevent its spread. In recent years, the challenge of
multimodal misinformation detection (MMD) has garnered significant attention from
researchers and has mainly involved the creation of annotated, weakly
annotated, or synthetically generated training datasets, along with the
development of various deep learning MMD models. However, the problem of
unimodal bias in MMD benchmarks -- where biased or unimodal methods outperform
their multimodal counterparts on an inherently multimodal task -- has been
overlooked. In this study, we systematically investigate and identify the
presence of unimodal bias in widely-used MMD benchmarks (VMU-Twitter, COSMOS),
raising concerns about their suitability for reliable evaluation. To address
this issue, we introduce the "VERification of Image-TExt pairs" (VERITE)
benchmark for MMD which incorporates real-world data, excludes "asymmetric
multimodal misinformation" and utilizes "modality balancing". We conduct an
extensive comparative study with a Transformer-based architecture that shows
the ability of VERITE to effectively address unimodal bias, rendering it a
robust evaluation framework for MMD. Furthermore, we introduce a new method --
termed Crossmodal HArd Synthetic MisAlignment (CHASMA) -- for generating
realistic synthetic training data that preserve crossmodal relations between
legitimate images and false human-written captions. By leveraging CHASMA in the
training process, we observe consistent and notable improvements in predictive
performance on VERITE, including a 9.2% increase in accuracy. We release our code
at: https://github.com/stevejpapad/image-text-verificatio
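A minimal sketch of how "hard" synthetic misalignments in the spirit of CHASMA could be mined from precomputed image and caption embeddings (e.g., from CLIP): each image is paired with the most similar caption that was written for a different image, so the resulting false pair still preserves plausible crossmodal relations. The similarity threshold and function name are illustrative assumptions, not the authors' exact procedure.

```python
import torch
import torch.nn.functional as F

def mine_hard_misalignments(image_embs, caption_embs, sim_threshold=0.3):
    """Pair each image with the closest caption from a *different* pair.

    image_embs, caption_embs: (N, D) aligned embeddings, where row i of
    both tensors belongs to the same truthful image-caption pair.
    Returns a list of (image_index, caption_index) hard negative pairs.
    """
    img = F.normalize(image_embs, dim=-1)
    cap = F.normalize(caption_embs, dim=-1)
    sim = img @ cap.T                       # (N, N) cosine similarities
    sim.fill_diagonal_(-1.0)                # exclude each image's true caption
    best_sim, best_idx = sim.max(dim=1)
    # keep only sufficiently "hard" (i.e., plausible) misalignments
    return [(i, j.item())
            for i, (s, j) in enumerate(zip(best_sim, best_idx))
            if s.item() >= sim_threshold]
```

The mined pairs would then be labeled as misinformation and mixed with truthful pairs when training an MMD model.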
VICTOR: Visual Incompatibility Detection with Transformers and Fashion-specific contrastive pre-training
In order for fashion outfits to be considered aesthetically pleasing, the garments that constitute them need to be compatible in terms of visual aspects such as style, category and color. With the advent and omnipresence of computer vision deep learning models, increased interest has also emerged in the task of visual compatibility detection, with the aim of developing quality fashion outfit recommendation systems. Previous works have defined visual compatibility as a binary classification task, with the items in an outfit considered either fully compatible or fully incompatible. However, this is not applicable to Outfit Maker applications, where users create their own outfits and need to know which specific items may be incompatible with the rest of the outfit. To address this, we propose the Visual InCompatibility TransfORmer (VICTOR), which is optimized for two tasks: 1) overall compatibility as regression and 2) the detection of mismatching items. Unlike previous works that rely either on feature extraction from ImageNet-pretrained models or on end-to-end fine-tuning, we utilize fashion-specific contrastive language-image pre-training to fine-tune computer vision neural networks on fashion imagery. Moreover, we build upon the Polyvore outfit benchmark to generate partially mismatching outfits, creating a new dataset termed Polyvore-MISFITs, which is used to train VICTOR. A series of ablation and comparative analyses show that the proposed architecture can compete with and even surpass the current state-of-the-art on Polyvore datasets while reducing instance-wise floating-point operations by 88%, striking a balance between high performance and efficiency.
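A minimal sketch of a two-headed outfit transformer along the lines described above: a [CLS]-style token drives the outfit-level compatibility regression, while each garment token feeds a per-item mismatch detector. Dimensions, pooling and naming are assumptions for illustration, not the released VICTOR implementation.

```python
import torch
import torch.nn as nn

class OutfitCompatibilityModel(nn.Module):
    """Illustrative two-task model over per-garment image embeddings:
    (1) overall outfit compatibility as a regression score in [0, 1];
    (2) a per-item probability of being a mismatching garment."""

    def __init__(self, dim: int = 512, heads: int = 8, layers: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.compatibility_head = nn.Linear(dim, 1)  # outfit-level regression
        self.mismatch_head = nn.Linear(dim, 1)       # item-level detection

    def forward(self, item_embs, padding_mask=None):
        # item_embs: (B, N, D) embeddings of the N garments in each outfit,
        # e.g. from a fashion-specific contrastively pre-trained image encoder
        B = item_embs.size(0)
        cls = self.cls_token.expand(B, -1, -1)
        tokens = torch.cat([cls, item_embs], dim=1)          # (B, N+1, D)
        if padding_mask is not None:                          # (B, N) True = pad
            padding_mask = torch.cat(
                [torch.zeros(B, 1, dtype=torch.bool,
                             device=padding_mask.device), padding_mask], dim=1)
        hidden = self.encoder(tokens, src_key_padding_mask=padding_mask)
        compatibility = torch.sigmoid(
            self.compatibility_head(hidden[:, 0])).squeeze(-1)    # (B,)
        mismatch_probs = torch.sigmoid(
            self.mismatch_head(hidden[:, 1:])).squeeze(-1)        # (B, N)
        return compatibility, mismatch_probs
```

During training, the compatibility head would be supervised with graded targets derived from partially mismatching outfits (as in Polyvore-MISFITs) and the mismatch head with per-item binary labels.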