12 research outputs found
Deep Reinforcement Learning-based Image Captioning with Embedding Reward
Image captioning is a challenging problem owing to the complexity in
understanding the image content and diverse ways of describing it in natural
language. Recent advances in deep neural networks have substantially improved
the performance of this task. Most state-of-the-art approaches follow an
encoder-decoder framework, which generates captions using a sequential
recurrent prediction model. However, in this paper, we introduce a novel
decision-making framework for image captioning. We utilize a "policy network"
and a "value network" to collaboratively generate captions. The policy network
serves as a local guidance by providing the confidence of predicting the next
word according to the current state. Additionally, the value network serves as
a global and lookahead guidance by evaluating all possible extensions of the
current state. In essence, it adjusts the goal of predicting the correct words
towards the goal of generating captions similar to the ground truth captions.
We train both networks using an actor-critic reinforcement learning model, with
a novel reward defined by visual-semantic embedding. Extensive experiments and
analyses on the Microsoft COCO dataset show that the proposed framework
outperforms state-of-the-art approaches across different evaluation metrics
Multi-modal embedding for main product detection in fashion
© 20xx IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Best Paper Award a la 2017 IEEE International Conference on Computer Vision WorkshopsWe present an approach to detect the main product in fashion images by exploiting the textual metadata associated with each image. Our approach is based on a Convolutional Neural Network and learns a joint embedding of object proposals and textual metadata to predict the main product in the image. We additionally use several complementary classification and overlap losses in order to improve training stability and performance. Our tests on a large-scale dataset taken from eight e-commerce sites show that our approach outperforms strong baselines and is able to accurately detect the main product in a wide diversity of challenging fashion images.Peer ReviewedAward-winningPostprint (author's final draft