Multimodal image-text classification plays a critical role in applications such as content moderation, news recommendation, and multimedia understanding. Despite recent advances, the visual modality poses greater representation-learning complexity than the textual modality during semantic extraction, which often leads to a semantic gap between visual and textual representations. In addition, conventional fusion strategies introduce cross-modal redundancy, further limiting classification performance. To address these issues, we propose MD-MLLM, a novel image-text classification framework that leverages multimodal large language models (MLLMs) to generate semantically enhanced visual representations.
To mitigate the redundancy introduced by direct MLLM feature integration, we introduce a hierarchical disentanglement mechanism based on the Hilbert-Schmidt Independence Criterion (HSIC) and orthogonality constraints, which explicitly separates modality-specific and shared representations. Furthermore, a hierarchical fusion strategy combines original unimodal features with the disentangled shared semantics, promoting discriminative feature learning and cross-modal complementarity. Extensive experiments on two benchmark datasets, N24News and Food101, show that MD-MLLM achieves consistent improvements in classification accuracy and performs competitively against a range of representative multimodal baselines. The framework also demonstrates good generalization ability and robustness across different multimodal scenarios. The code is available at https://github.com/xiaohaochen0308/MD-MLLM.
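As a rough illustration of the kind of disentanglement objective described above, the sketch below computes an empirical HSIC term (linear kernels) between modality-specific and shared projections, plus a simple orthogonality penalty. This is a minimal, hypothetical PyTorch sketch, not the paper's actual implementation: the function names (`hsic`, `disentangle_loss`), the choice of linear kernels, the per-sample orthogonality penalty, and the equal weighting of the two terms are all assumptions for illustration only.

```python
import torch

def _center(K: torch.Tensor) -> torch.Tensor:
    """Double-center a kernel matrix: H K H with H = I - (1/n) 11^T."""
    n = K.size(0)
    H = torch.eye(n, device=K.device) - torch.full((n, n), 1.0 / n, device=K.device)
    return H @ K @ H

def hsic(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Empirical HSIC with linear kernels: tr(HKxH HKyH) / (n - 1)^2.
    Smaller values indicate the two feature sets are closer to statistically independent."""
    n = x.size(0)
    Kx = x @ x.t()  # linear kernel over the batch
    Ky = y @ y.t()
    return torch.trace(_center(Kx) @ _center(Ky)) / (n - 1) ** 2

def disentangle_loss(spec_v, shared_v, spec_t, shared_t):
    """Hypothetical disentanglement objective (illustrative only):
    - HSIC pushes modality-specific and shared parts toward independence;
    - an orthogonality penalty discourages overlap between the two subspaces."""
    l_hsic = hsic(spec_v, shared_v) + hsic(spec_t, shared_t)
    l_orth = (spec_v * shared_v).sum(dim=1).pow(2).mean() \
           + (spec_t * shared_t).sum(dim=1).pow(2).mean()
    return l_hsic + l_orth

# Toy usage: a batch of 8 samples with 256-d specific/shared projections per modality.
if __name__ == "__main__":
    b, d = 8, 256
    parts = [torch.randn(b, d, requires_grad=True) for _ in range(4)]
    loss = disentangle_loss(*parts)
    loss.backward()
    print(float(loss))
```

In practice such a term would be added, with tuned weights, to the classification loss so that the shared components carry cross-modal semantics while the specific components retain modality-unique cues; the exact weighting and kernel choice in MD-MLLM are not specified in this abstract.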