Transductive Multi-view Embedding for Zero-Shot Recognition and Annotation

Abstract

Most existing zero-shot learning approaches exploit transfer learning via an intermediate-level semantic representation such as visual attributes or semantic word vectors. Such a semantic representation is shared between an annotated auxiliary dataset and a target dataset with no annotation. A projection from a low-level feature space to the semantic space is learned from the auxiliary dataset and is applied without adaptation to the target dataset. In this paper we identify an inherent limitation with this approach. That is, due to having disjoint and potentially unrelated classes, the projection functions learned from the auxiliary dataset/domain are biased when applied directly to the target dataset/domain. We call this problem the projection domain shift problem and propose a novel framework, transductive multi-view embedding, to solve it. It is ‘transductive’ in that unlabelled target data points are explored for projection adaptation, and ‘multi-view’ in that both low-level feature (view) and multiple semantic representations (views) are embedded to rectify the projection shift. We demonstrate through extensive experiments that our framework (1) rectifies the projection shift between the auxiliary and target domains, (2) exploits the complementarity of multiple semantic representations, (3) achieves state-of-the-art recognition results on image and video benchmark datasets, and (4) enables novel cross-view annotation tasks.
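The abstract describes learning a feature-to-semantic projection on the auxiliary domain and then transductively re-aligning views of the unlabelled target data in a shared embedding space. The sketch below is a minimal, simplified illustration of that idea, not the authors' exact algorithm: it uses a single semantic view and plain two-view CCA instead of the paper's multi-view embedding, and all data, dimensions, and class prototypes are hypothetical placeholders.

```python
# Minimal sketch (assumed setup, not the paper's exact method): learn a
# feature->semantic projection on labelled auxiliary data, then align the
# low-level feature view and the predicted semantic view of *unlabelled*
# target data via CCA, and classify by nearest class prototype.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical auxiliary (labelled) and target (unlabelled) data.
d_feat, d_sem = 128, 16                      # low-level / semantic dims
X_aux = rng.normal(size=(500, d_feat))       # auxiliary features
S_aux = rng.normal(size=(500, d_sem))        # auxiliary attribute annotations
X_tgt = rng.normal(size=(200, d_feat))       # unlabelled target features
protos = rng.normal(size=(10, d_sem))        # target class attribute prototypes

# 1) Learn a feature->semantic projection on the auxiliary domain only.
proj = Ridge(alpha=1.0).fit(X_aux, S_aux)
S_tgt_pred = proj.predict(X_tgt)             # suffers from projection shift

# 2) Transductive step: embed the two views of the target data
#    (low-level features vs. predicted semantics) into a common CCA space.
cca = CCA(n_components=10, scale=False)
cca.fit(X_tgt, S_tgt_pred)
Z_tgt, _ = cca.transform(X_tgt, S_tgt_pred)  # embedded target points

# 3) Embed class prototypes through the semantic-view rotation,
#    centring with the same mean used when fitting that view.
Z_proto = (protos - S_tgt_pred.mean(axis=0)) @ cca.y_rotations_

# 4) Zero-shot prediction: nearest prototype in the embedded space.
dists = np.linalg.norm(Z_tgt[:, None, :] - Z_proto[None, :, :], axis=-1)
y_pred = dists.argmin(axis=1)
print(y_pred[:10])
```

In the full framework the embedding combines the feature view with several semantic views (e.g. attributes and word vectors), which is what lets it exploit their complementarity; the two-view version above only conveys the transductive alignment step.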
