Obtaining annotated training data for supervised learning is a bottleneck in many
contemporary machine learning applications. The increasing prevalence of multi-modal
and multi-view data creates both new opportunities for circumventing this issue and
new application challenges. In this thesis we explore several approaches to alleviating
annotation issues in multi-view scenarios.
We start by studying the problem of zero-shot learning (ZSL) for image recognition,
where class-level annotation requirements are eliminated by transferring information
from the text modality instead. We next look at cross-modal matching, where
paired instances across views provide the supervised label information for learning. We
develop methodology for unsupervised and semi-supervised learning of the pairing, thus
eliminating the need for pairing annotation.
We first apply these ideas to unsupervised multi-view matching in the context of
bilingual dictionary induction (BLI), where instances are words in two languages and
finding a correspondence between the words produces a cross-lingual word translation
model. We then return to vision and language and look at learning unsupervised pairing
between images and text. We show that this can be seen as a limiting case of ZSL
where text-image pairing annotation requirements are completely eliminated.
Overall, these contributions in multi-view learning provide a suite of methods for
reducing annotation requirements, both in conventional classification and cross-view
matching settings.