Feature Fusion Vision Transformer for Fine-Grained Visual Categorization
The core of tackling fine-grained visual categorization (FGVC) is to
learn subtle yet discriminative features. Most previous works achieve this by
explicitly selecting discriminative parts or integrating an attention
mechanism into CNN-based approaches. However, these methods increase the
computational complexity and make the model dominated by the regions containing
most of the object. Recently, the vision transformer (ViT) has achieved SOTA
performance on general image recognition tasks. The self-attention mechanism
aggregates and weights the information from all patches into the classification
token, making it perfectly suitable for FGVC. Nonetheless, the classification
token in the deep layers pays more attention to global information and lacks
the local and low-level features that are essential for FGVC. In this work, we
propose a novel pure transformer-based framework, Feature Fusion Vision
Transformer (FFVT), where we aggregate the important tokens from each transformer
layer to compensate for the local, low-level, and middle-level information. We design
a novel token selection module called mutual attention weight selection (MAWS)
to guide the network effectively and efficiently towards selecting
discriminative tokens without introducing extra parameters. We verify the
effectiveness of FFVT on three benchmarks, where FFVT achieves
state-of-the-art performance. Comment: 9 pages, 2 figures, 3 tables
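The token-selection idea described in this abstract can be sketched numerically: score each patch token by the product of the attention the classification token pays to it and the attention it pays back, then keep the top-k. Below is a minimal NumPy sketch of that mutual-attention scoring, assuming a single-head, row-stochastic attention matrix with the [CLS] token at index 0; the function name and exact scoring rule are illustrative, not the paper's implementation.

```python
import numpy as np

def mutual_attention_weight_selection(attn, k):
    """Select top-k discriminative patch tokens from one transformer layer.

    Illustrative sketch of a MAWS-style score: the mutual weight of patch
    token i is the product of attn[0, i] (how much [CLS] attends to it)
    and attn[i, 0] (how much it attends back to [CLS]).

    attn: (n, n) attention matrix; row/column 0 corresponds to [CLS].
    Returns the indices (into the original token sequence, 1..n-1)
    of the k highest-scoring patch tokens.
    """
    cls_to_tokens = attn[0, 1:]   # attention from [CLS] to each patch
    tokens_to_cls = attn[1:, 0]   # attention from each patch back to [CLS]
    mutual = cls_to_tokens * tokens_to_cls
    # sort descending, keep top-k, shift by 1 to index the full token sequence
    return np.argsort(mutual)[::-1][:k] + 1

# toy example: [CLS] plus 4 patch tokens, softmax-normalized random attention
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 5))
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
selected = mutual_attention_weight_selection(attn, k=2)
```

In the full framework these selected tokens from every layer would be fed, together with [CLS], into a final fusion layer; the sketch covers only the per-layer selection step.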
Learning Attentive Pairwise Interaction for Fine-Grained Classification
Fine-grained classification is a challenging problem due to subtle
differences among highly confused categories. Most approaches address this
difficulty by learning discriminative representations of individual input images.
On the other hand, humans can effectively identify contrastive clues by
comparing image pairs. Inspired by this fact, this paper proposes a simple but
effective Attentive Pairwise Interaction Network (API-Net), which can
progressively recognize a pair of fine-grained images by interaction.
Specifically, API-Net first learns a mutual feature vector to capture semantic
differences in the input pair. It then compares this mutual vector with
individual vectors to generate gates for each input image. These distinct gate
vectors inherit mutual context on semantic differences, which allow API-Net to
attentively capture contrastive clues by pairwise interaction between two
images. Additionally, we train API-Net in an end-to-end manner with a score-ranking
regularization, which further generalizes API-Net by taking feature
priorities into account. We conduct extensive experiments on five popular
fine-grained classification benchmarks, where API-Net outperforms recent SOTA
methods: CUB-200-2011 (90.0%), Aircraft (93.9%), Stanford Cars (95.3%),
Stanford Dogs (90.3%), and NABirds (88.1%). Comment: Accepted at AAAI-2020
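The gating step this abstract describes can be illustrated with a minimal NumPy sketch. Here the mutual vector is a simple element-wise mean standing in for the paper's learned mapping, and the function name and residual form are assumptions for illustration, not API-Net's exact implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pairwise_interaction(x1, x2):
    """Illustrative sketch of attentive pairwise interaction.

    A mutual vector summarizing both inputs (here an element-wise mean,
    standing in for a learned mapping) is compared with each individual
    feature via element-wise product + sigmoid to form a gate per image.
    Each feature is then enhanced both by its own gate (self view) and by
    its partner's gate (contrastive view), residual style.
    """
    x_m = (x1 + x2) / 2.0          # mutual vector for the pair
    g1 = sigmoid(x_m * x1)         # gate highlighting image 1's salient dims
    g2 = sigmoid(x_m * x2)         # gate highlighting image 2's salient dims
    x1_self, x1_other = x1 + x1 * g1, x1 + x1 * g2
    x2_self, x2_other = x2 + x2 * g2, x2 + x2 * g1
    return x1_self, x1_other, x2_self, x2_other

# toy pair of 3-dimensional feature vectors
f1 = np.array([1.0, -2.0, 0.5])
f2 = np.array([0.5, 1.0, -1.0])
features = pairwise_interaction(f1, f2)
```

In the full model the four resulting feature views would each be classified, with the score-ranking regularization encouraging the self-gated view to score higher than the cross-gated one; the sketch covers only the interaction itself.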