This paper investigates a challenging problem of zero-shot learning in the
multi-label scenario (MLZSL), wherein, the model is trained to recognize
multiple unseen classes within a sample (e.g., an image) based on seen classes
and auxiliary knowledge, e.g., semantic information. Existing methods usually
resort to analyzing the relationship of various seen classes residing in a
sample from the dimension of spatial or semantic characteristics, and transfer
the learned model to unseen ones. But they ignore the effective integration of
local and global features. That is, in the process of inferring unseen classes,
global features represent the principal direction of the image in the feature
space, while local features should maintain uniqueness within a certain range.
This integrated neglect will make the model lose its grasp of the main
components of the image. Relying only on the local existence of seen classes
during the inference stage introduces unavoidable bias. In this paper, we
propose a novel and effective group bi-enhancement framework for MLZSL, dubbed
GBE-MLZSL, to fully make use of such properties and enable a more accurate and
robust visual-semantic projection. Specifically, we split the feature maps into
several feature groups, of which each feature group can be trained
independently with the Local Information Distinguishing Module (LID) to ensure
uniqueness. Meanwhile, a Global Enhancement Module (GEM) is designed to
preserve the principal direction. Besides, a static graph structure is designed
to construct the correlation of local features. Experiments on large-scale
MLZSL benchmark datasets NUS-WIDE and Open-Images-v4 demonstrate that the
proposed GBE-MLZSL outperforms other state-of-the-art methods with large
margins.Comment: 11 pages, 8 figure