Weakly supervised learning from referring expression: Challenge and directions

Abstract

We explore methods for weakly supervised learning from referring expressions. Unlike traditional fully supervised semantic segmentation or object recognition tasks, in which a small set of discrete classes is provided, the referring expression task is conditioned on a natural language phrase, e.g. “the dude on the dolphin”. Previous approaches combine an LSTM with a fully convolutional network and achieve fairly good results in the fully supervised setting. However, the fully supervised setting is limited by the manual labeling of segmentation masks, which requires a significant amount of human labor. We therefore work on an approach that performs segmentation using only image-level language descriptions. In our weakly supervised setting, we are given only the input images and their corresponding sentence descriptions, without pixel-level ground-truth labels for each image. To obtain supervision from the language descriptions alone, we use a multiple instance learning loss. We first develop an end-to-end model that localizes the image content corresponding to a language expression. In this model, we use GloVe and ELMo sentence embeddings to obtain a vector representation of each sentence and combine it with image features from a fully convolutional network. However, the sentence-level model is hard to interpret, so we also study a more fundamental problem: weakly supervised object localization from referring expressions. We compare the performance of the sentence-level model on this task with that of an alternative word-level model. Our investigation suggests that breaking the referring expression localization problem into smaller, more manageable components is promising.
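As a rough illustration of how a multiple instance learning loss can supervise localization from image-level sentence labels alone, the following is a minimal sketch and not the exact formulation used in this work. It assumes that the image features and the sentence embedding have already been projected to a common dimension, that relevance is scored by a dot product at each spatial location, and that the image is treated as a bag whose score is the maximum over locations. The function names `score_map` and `mil_loss` are hypothetical.

```python
import numpy as np


def score_map(image_features, sentence_embedding):
    """Relevance of each spatial location to the sentence (assumed: dot product).

    image_features: array of shape (H, W, D), e.g. from a fully convolutional network.
    sentence_embedding: array of shape (D,), e.g. a pooled GloVe or ELMo vector.
    Returns an (H, W) array of logits.
    """
    return image_features @ sentence_embedding


def mil_loss(scores, label):
    """Binary cross-entropy on the max-pooled logit.

    Only an image-level label is needed: 1.0 if the sentence describes the
    image, 0.0 if it was paired with an unrelated image.
    """
    bag_logit = float(scores.max())  # the bag is positive if any location matches
    # Numerically stable BCE with logits: max(x, 0) - x*y + log(1 + exp(-|x|))
    return max(bag_logit, 0.0) - bag_logit * label + np.log1p(np.exp(-abs(bag_logit)))


# Toy example with random features (illustrative data only).
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 4, 8))
sentence = rng.normal(size=8)
scores = score_map(features, sentence)
loss_matched = mil_loss(scores, 1.0)
loss_mismatched = mil_loss(scores, 0.0)
```

Under these assumptions, the per-location logits in `scores` can be read as a coarse localization map once training is done, even though no pixel-level labels were used. Max pooling is only one choice of bag aggregation; smoother alternatives such as log-sum-exp pooling are also common.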
