Multimodal image-text classification plays a critical role in applications such as content moderation, news recommendation, and multimedia understanding. Despite recent advances, the visual modality poses greater representation-learning complexity than the textual modality during semantic extraction, which often leads to a semantic gap between visual and textual representations. In addition, conventional fusion strategies introduce cross-modal redundancy, further limiting classification performance. To address these issues, we propose MD-MLLM, a novel image-text classification framework that leverages multimodal large language models (MLLMs) to generate semantically enhanced visual representations.
To mitigate the redundancy introduced by direct MLLM feature integration, we introduce a hierarchical disentanglement mechanism based on the Hilbert-Schmidt Independence Criterion (HSIC) and orthogonality constraints, which explicitly separates modality-specific and shared representations. Furthermore, a hierarchical fusion strategy combines original unimodal features with the disentangled shared semantics, promoting discriminative feature learning and cross-modal complementarity. Extensive experiments on two benchmark datasets, N24News and Food101, show that MD-MLLM achieves consistent improvements in classification accuracy and performs competitively against a range of representative multimodal baselines. The framework also demonstrates good generalization ability and robustness across different multimodal scenarios. The code is available at https://github.com/xiaohaochen0308/MD-MLLM.
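As a rough illustration of the kind of disentanglement objective described above, the sketch below computes an empirical HSIC term (linear kernels) between modality-specific and shared projections, plus a simple orthogonality penalty. This is a minimal, hypothetical PyTorch sketch, not the paper's actual implementation: the function names (`hsic`, `disentangle_loss`), the choice of linear kernels, the per-sample orthogonality penalty, and the equal weighting of the two terms are all assumptions for illustration only.

```python
import torch

def _center(K: torch.Tensor) -> torch.Tensor:
    """Double-center a kernel matrix: H K H with H = I - (1/n) 11^T."""
    n = K.size(0)
    H = torch.eye(n, device=K.device) - torch.full((n, n), 1.0 / n, device=K.device)
    return H @ K @ H

def hsic(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Empirical HSIC with linear kernels: tr(HKxH HKyH) / (n - 1)^2.
    Smaller values indicate the two feature sets are closer to statistically independent."""
    n = x.size(0)
    Kx = x @ x.t()  # linear kernel over the batch
    Ky = y @ y.t()
    return torch.trace(_center(Kx) @ _center(Ky)) / (n - 1) ** 2

def disentangle_loss(spec_v, shared_v, spec_t, shared_t):
    """Hypothetical disentanglement objective (illustrative only):
    - HSIC pushes modality-specific and shared parts toward independence;
    - an orthogonality penalty discourages overlap between the two subspaces."""
    l_hsic = hsic(spec_v, shared_v) + hsic(spec_t, shared_t)
    l_orth = (spec_v * shared_v).sum(dim=1).pow(2).mean() \
           + (spec_t * shared_t).sum(dim=1).pow(2).mean()
    return l_hsic + l_orth

# Toy usage: a batch of 8 samples with 256-d specific/shared projections per modality.
if __name__ == "__main__":
    b, d = 8, 256
    parts = [torch.randn(b, d, requires_grad=True) for _ in range(4)]
    loss = disentangle_loss(*parts)
    loss.backward()
    print(float(loss))
```

In practice such a term would be added, with tuned weights, to the classification loss so that the shared components carry cross-modal semantics while the specific components retain modality-unique cues; the exact weighting and kernel choice in MD-MLLM are not specified in this abstract.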