Channel and spatial attention mechanism for fashion image captioning
Image captioning aims to automatically generate one or more descriptive sentences for a given input image. Most existing captioning methods use an encoder-decoder model that mainly focuses on recognizing objects in the input image and capturing the relationships between them. However, when generating captions for fashion images, it is important not only to describe the items and their relationships, but also to mention the attributes of the clothes (shape, texture, style, fabric, and more). In this study, a novel model is proposed for the fashion image captioning task that captures not only the items and their relationships, but also their attribute features. Two attention mechanisms (spatial attention and channel-wise attention) are incorporated into the traditional encoder-decoder model, which dynamically interprets the caption sentence in the multi-layer feature map in addition to the depth dimension of the feature map. We evaluate the proposed architecture on Fashion-Gen using three metrics (CIDEr, ROUGE-L, and BLEU-1) and achieve scores of 89.7, 50.6, and 45.6, respectively. In our experiments, the proposed method shows a significant performance improvement on fashion image captioning and outperforms other state-of-the-art image captioning methods.
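To make the two attention types concrete, here is a minimal NumPy sketch of channel-wise and spatial attention over a single feature map. It is an illustration under stated assumptions, not the architecture from the paper: the scoring functions, the projection matrices `w_c` and `w_s`, and the decoder state `hidden` are all hypothetical stand-ins, since the abstract does not specify them.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def channel_attention(features, hidden, w_c):
    """Reweight each channel of a (C, H, W) feature map.

    hidden is a (D,) decoder state and w_c a (C, D) projection;
    both are hypothetical, chosen only for illustration.
    """
    pooled = features.mean(axis=(1, 2))      # (C,) average response per channel
    scores = pooled * (w_c @ hidden)         # (C,) assumed scoring function
    beta = softmax(scores)                   # channel weights, sum to 1
    return features * beta[:, None, None], beta


def spatial_attention(features, hidden, w_s):
    """Pool a (C, H, W) feature map into a (C,) context vector.

    w_s is a hypothetical (C, D) projection of the decoder state.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)        # (C, H*W) one column per location
    scores = (w_s @ hidden) @ flat           # (H*W,) assumed dot-product score
    alpha = softmax(scores)                  # location weights, sum to 1
    return flat @ alpha, alpha.reshape(h, w)


rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 4, 4))           # toy feature map: 8 channels, 4x4 grid
state = rng.normal(size=16)                  # toy decoder hidden state
weighted, beta = channel_attention(feats, state, rng.normal(size=(8, 16)))
context, alpha = spatial_attention(weighted, state, rng.normal(size=(8, 16)))
print(beta.shape, alpha.shape, context.shape)
```

Channel-wise attention decides which feature channels (the depth dimension) matter for the next word, while spatial attention decides which image locations matter; applying one after the other, as in the toy usage above, is just one possible ordering.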
Improving Domain Generalization by Learning without Forgetting: Application in Retail Checkout
Designing an automatic checkout system for retail stores with human-level
accuracy is challenging because products can look alike and appear in various
poses. This paper addresses the problem by proposing a two-stage
pipeline. The first stage detects class-agnostic items, and the second
classifies their product categories. We also track the objects across
video frames to avoid duplicate counting. One major challenge is the domain
gap, because the models are trained on synthetic data but tested on real
images. To reduce the error gap, we adopt domain generalization methods for the
first-stage detector. In addition, a model ensemble is used to enhance the
robustness of the second-stage classifier. The method is evaluated on the AI City
challenge 2022 -- Track 4 and gets the F1 score on the test A set. Code
is released at the link https://github.com/cybercore-co-ltd/aicity22-track4
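The role of tracking in avoiding duplicate counts can be sketched as follows. This is a minimal illustration, not the released code: it assumes a tracker has already assigned a stable `track_id` to each detection, and the helper `count_products` and the majority-vote rule for picking each track's label are hypothetical choices made for the example.

```python
from collections import Counter, defaultdict


def count_products(frames):
    """Count each tracked object once, however many frames it appears in.

    frames is a list of per-frame detection lists, each detection being a
    (track_id, predicted_label) pair. The input format is an assumption.
    """
    votes = defaultdict(Counter)
    for detections in frames:
        for track_id, label in detections:
            votes[track_id][label] += 1      # collect per-frame predictions

    counts = Counter()
    for label_votes in votes.values():
        # One count per track, using the most frequent label for that track.
        counts[label_votes.most_common(1)[0][0]] += 1
    return counts


frames = [
    [(1, "soda"), (2, "chips")],
    [(1, "soda"), (2, "chips")],
    [(1, "soda"), (3, "soda")],
]
print(count_products(frames))
```

Without the track IDs, the five "soda" detections above would be counted five times; grouping by track reduces them to two distinct items.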