COMIC: Towards A Compact Image Captioning Model with Attention
Recent works in image captioning have shown very promising raw performance. However, most of these encoder-decoder networks with attention do not scale naturally to large vocabulary sizes, making them difficult to deploy on embedded systems with limited hardware resources. This is because the sizes of the word and output embedding matrices grow proportionally with the vocabulary size, adversely affecting the compactness of these networks. To address this limitation, this paper introduces a new idea to the domain of image captioning: we tackle the hitherto unexplored problem of the compactness of image captioning models. We show that our proposed model, named COMIC for COMpact Image Captioning, achieves results comparable to state-of-the-art approaches on five common evaluation metrics on both the MS-COCO and InstaPIC-1.1M datasets, despite having an embedding vocabulary size that is 39x - 99x smaller. The source code and models are available at:
https://github.com/jiahuei/COMIC-Compact-Image-Captioning-with-Attention
Comment: Added source code link and new results in Table
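The bottleneck the abstract describes can be made concrete with a small sketch (not the authors' code; all dimensions are illustrative assumptions): the word-embedding lookup table and the output softmax projection each contain a number of parameters proportional to the vocabulary size, so shrinking the embedding vocabulary shrinks both matrices directly.

```python
# Illustrative sketch: parameter count of the input word-embedding
# matrix (V x E) plus the output projection (H x V) grows linearly
# with vocabulary size V. Dimensions below are assumptions, not
# values from the paper.

def embedding_params(vocab_size, embed_dim, hidden_dim):
    """Parameters in the word embedding plus the output projection."""
    word_embedding = vocab_size * embed_dim      # V x E lookup table
    output_projection = hidden_dim * vocab_size  # H x V softmax layer
    return word_embedding + output_projection

# A typical full vocabulary vs. a ~39x smaller compact one.
full = embedding_params(vocab_size=10_000, embed_dim=512, hidden_dim=512)
compact = embedding_params(vocab_size=256, embed_dim=512, hidden_dim=512)
print(full, compact, full // compact)  # → 10240000 262144 39
```

At these (assumed) dimensions, cutting the vocabulary from 10,000 to 256 tokens removes roughly 10M parameters from the embedding layers alone, which is where the compactness gain comes from.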
Fine-grained Image Classification via Combining Vision and Language
Fine-grained image classification is a challenging task due to large intra-class variance and small inter-class variance, aiming to recognize hundreds of sub-categories belonging to the same basic-level category. Most existing fine-grained image classification methods learn part detection models to obtain semantic parts for better classification accuracy. Despite achieving promising results, these methods have two main limitations: (1) not all the parts obtained through the part detection models are beneficial and indispensable for classification, and (2) fine-grained image classification requires more detailed visual descriptions than the part locations or attribute annotations can provide. To address these two limitations, this paper proposes a two-stream model combining vision and language (CVL) for learning latent semantic representations. The vision stream learns deep representations from the original visual information via a deep convolutional neural network. The language stream utilizes natural language descriptions, which can point out the discriminative parts or characteristics of each image, and provides a flexible and compact way of encoding the salient visual aspects that distinguish sub-categories. Since the two streams are complementary, combining them further improves classification accuracy. Compared with 12 state-of-the-art methods on the widely used CUB-200-2011 dataset for fine-grained image classification, the experimental results demonstrate that our CVL approach achieves the best performance.
Comment: 9 pages, to appear in CVPR 201
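The two-stream combination described above can be sketched as a late-fusion classifier: each stream produces class scores from its own feature, and the scores are merged. This is a minimal hypothetical sketch, not the paper's implementation; the feature dimensions, linear heads, and averaging rule are all assumptions.

```python
# Hypothetical late-fusion sketch in the spirit of CVL: a vision
# feature and a language feature each feed a linear classifier head,
# and the per-class scores are averaged. Dimensions and the fusion
# rule are assumptions, not the authors' design.
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES = 200  # CUB-200-2011 has 200 bird sub-categories

def stream_scores(feature, weight):
    """Class scores from one stream's linear classifier head."""
    return feature @ weight

vision_feat = rng.standard_normal(2048)    # e.g. pooled CNN feature
language_feat = rng.standard_normal(300)   # e.g. encoded text description

w_vision = rng.standard_normal((2048, NUM_CLASSES)) * 0.01
w_language = rng.standard_normal((300, NUM_CLASSES)) * 0.01

# Late fusion: average the two streams' class scores, then predict.
fused = 0.5 * stream_scores(vision_feat, w_vision) \
      + 0.5 * stream_scores(language_feat, w_language)
pred = int(np.argmax(fused))
print(fused.shape, pred)
```

Averaging scores keeps each stream's classifier independent, which matches the abstract's point that the streams are complementary: either stream alone yields a prediction, and fusion only has to reconcile the two score vectors.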