Learning Attentive Pairwise Interaction for Fine-Grained Classification
Fine-grained classification is a challenging problem due to subtle
differences among highly confused categories. Most approaches address this
difficulty by learning discriminative representations of individual images.
On the other hand, humans can effectively identify contrastive clues by
comparing image pairs. Inspired by this fact, this paper proposes a simple but
effective Attentive Pairwise Interaction Network (API-Net), which can
progressively recognize a pair of fine-grained images by interaction.
Specifically, API-Net first learns a mutual feature vector to capture semantic
differences in the input pair. It then compares this mutual vector with
individual vectors to generate gates for each input image. These distinct gate
vectors inherit mutual context on semantic differences, which allow API-Net to
attentively capture contrastive clues by pairwise interaction between two
images. Additionally, we train API-Net in an end-to-end manner with a score
ranking regularization, which can further generalize API-Net by taking feature
priorities into account. We conduct extensive experiments on five popular
benchmarks in fine-grained classification, where API-Net outperforms recent SOTA
methods: CUB-200-2011 (90.0%), Aircraft (93.9%), Stanford Cars (95.3%),
Stanford Dogs (90.3%), and NABirds (88.1%). Comment: Accepted at AAAI 2020
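As a concrete illustration of the pairwise interaction described above, below is a minimal PyTorch sketch of the mutual-vector-and-gates idea. The MLP shape, sigmoid gating, and residual form are assumptions for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class PairwiseInteraction(nn.Module):
    """Minimal sketch of attentive pairwise interaction (hypothetical sizes)."""
    def __init__(self, dim: int):
        super().__init__()
        # Maps the concatenated pair features to a mutual vector.
        self.mutual_mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, x1, x2):
        # Mutual feature vector capturing semantic differences of the pair.
        x_m = self.mutual_mlp(torch.cat([x1, x2], dim=-1))
        # Gates: compare the mutual vector with each individual vector.
        g1 = torch.sigmoid(x_m * x1)
        g2 = torch.sigmoid(x_m * x2)
        # Each image is attended by its own gate and by its partner's gate,
        # yielding four enhanced feature views for classification.
        return x1 + g1 * x1, x1 + g2 * x1, x2 + g2 * x2, x2 + g1 * x2
```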
Feature Fusion Vision Transformer for Fine-Grained Visual Categorization
The core of tackling fine-grained visual categorization (FGVC) is to
learn subtle yet discriminative features. Most previous works achieve this by
explicitly selecting the discriminative parts or integrating an attention
mechanism via CNN-based approaches. However, these methods increase the
computational complexity and make the model dominated by the regions containing
most of the objects. Recently, the vision transformer (ViT) has achieved SOTA
performance on general image recognition tasks. The self-attention mechanism
aggregates and weights the information from all patches into the classification
token, making it well suited for FGVC. Nonetheless, the classification
token in the deep layers pays more attention to global information and lacks
the local and low-level features that are essential for FGVC. In this work, we
propose a novel pure transformer-based framework, Feature Fusion Vision
Transformer (FFVT), where we aggregate the important tokens from each transformer
layer to compensate for the local, low-level, and middle-level information. We design
a novel token selection module called mutual attention weight selection (MAWS)
to guide the network effectively and efficiently towards selecting
discriminative tokens without introducing extra parameters. We verify the
effectiveness of FFVT on three benchmarks, where FFVT achieves
state-of-the-art performance. Comment: 9 pages, 2 figures, 3 tables
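The token-selection step can be pictured with a short sketch. Assuming one layer's attention map of shape (heads, tokens, tokens) with the classification token at index 0, a MAWS-style score combines the attention the classification token pays to each patch with the attention each patch pays back; the head-averaging and top-k choice below are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def mutual_attention_weights(attn: torch.Tensor, k: int) -> torch.Tensor:
    """Sketch of MAWS-style token selection (simplified, hypothetical)."""
    cls_to_patch = attn[:, 0, 1:].mean(dim=0)  # attention CLS pays to each patch
    patch_to_cls = attn[:, 1:, 0].mean(dim=0)  # attention each patch pays to CLS
    scores = cls_to_patch * patch_to_cls       # mutual attention weight per token
    return torch.topk(scores, k).indices + 1   # +1 skips the CLS token index
```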
Local Style Awareness of Font Images
When we compare fonts, we often pay attention to styles of local parts, such
as serifs and curvatures. This paper proposes an attention mechanism that finds
such important local parts: the parts receiving larger attention are
considered more style-relevant. The proposed mechanism can be trained in a
quasi-self-supervised manner that requires no manual annotation other than
knowing that a set of character images is from the same font, such as
Helvetica. After confirming that the trained attention mechanism can find
style-relevant local parts, we utilize the resulting attention for local
style-aware font generation. Specifically, we design a new reconstruction loss
function to put more weight on the local parts with larger attention for
generating character images with more accurate style realization. This loss
function has the merit of being applicable to various font generation models. Our
experimental results show that the proposed loss function improves the quality
of character images generated by several few-shot font generation models. Comment: Accepted at ICDAR WML 2023
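The reconstruction loss described above admits a compact sketch: a plain pixel-wise L1 term re-weighted by the attention map so that style-relevant regions such as serifs count more. The additive weighting and the `alpha` parameter are assumptions for illustration, not the paper's exact loss.

```python
import torch

def local_style_aware_recon_loss(generated, target, attention, alpha=1.0):
    """Sketch: L1 reconstruction re-weighted by a local style attention map.

    generated, target: (B, C, H, W) character images
    attention:         (B, 1, H, W) map in [0, 1] from the attention module
    alpha:             illustrative strength of the style re-weighting
    """
    weight = 1.0 + alpha * attention  # up-weight high-attention local parts
    return (weight * (generated - target).abs()).mean()
```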
Your "Flamingo" is My "Bird": Fine-Grained, or Not
Whether what you see in Figure 1 is a "flamingo" or a "bird" is the question
we ask in this paper. While fine-grained visual classification (FGVC) strives
to arrive at the former, for the majority of us non-experts just "bird" would
probably suffice. The real question is therefore -- how can we tailor for
different fine-grained definitions under divergent levels of expertise. For
that, we re-envisage the traditional setting of FGVC, from single-label
classification, to that of top-down traversal of a pre-defined coarse-to-fine
label hierarchy -- so that our answer becomes
"bird"-->"Phoenicopteriformes"-->"Phoenicopteridae"-->"flamingo". To approach
this new problem, we first conduct a comprehensive human study where we confirm
that most participants prefer multi-granularity labels, regardless of whether
they consider themselves experts. We then discover the key intuition that
coarse-level label prediction exacerbates fine-grained feature learning, yet
fine-level features better the learning of the coarse-level classifier. This
discovery enables us to design a very simple albeit surprisingly effective
solution to our new problem, where we (i) leverage level-specific
classification heads to disentangle coarse-level features from fine-grained
ones, and (ii) allow finer-grained features to participate in coarser-grained
label predictions, which in turn helps with better disentanglement. Experiments
show that our method achieves superior performance in the new FGVC setting, and
performs better than the state of the art on the traditional single-label FGVC
problem as well. Thanks to its simplicity, our method can be easily implemented
on top of any existing FGVC framework and is parameter-free. Comment: Accepted as an oral at CVPR 2021. Code:
https://github.com/PRIS-CV/Fine-Grained-or-Not
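A minimal sketch of the two design points, (i) level-specific heads on disentangled features and (ii) finer-grained features participating in coarser-grained predictions, might look as follows. Splitting the backbone feature into equal per-level chunks is an illustrative choice, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class HierarchicalHeads(nn.Module):
    """Sketch: one head per hierarchy level; coarse heads also see fine chunks."""
    def __init__(self, feat_dim: int, level_sizes: list):
        # level_sizes: class counts ordered coarse-to-fine, e.g. [13, 38, 200];
        # feat_dim is assumed divisible by the number of levels.
        super().__init__()
        n = len(level_sizes)
        self.chunk = feat_dim // n
        # Level i uses its own chunk plus every finer chunk (design point ii).
        self.heads = nn.ModuleList(
            nn.Linear(self.chunk * (n - i), c) for i, c in enumerate(level_sizes)
        )

    def forward(self, feat):                     # feat: (B, feat_dim)
        chunks = feat.split(self.chunk, dim=-1)  # level-specific features (i)
        return [
            head(torch.cat(chunks[i:], dim=-1))  # coarse logits get fine chunks
            for i, head in enumerate(self.heads)
        ]                                        # coarse-to-fine predictions
```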
Truck model recognition for an automatic overload detection system based on the improved MMAL-Net
Efficient and reliable transportation of goods by truck is crucial for road logistics. However, truck overloading poses serious challenges to road infrastructure and traffic safety, so detecting and preventing it is essential for maintaining road conditions and protecting both road users and the goods transported. This paper introduces a novel method for detecting truck overloading that uses an improved MMAL-Net for truck model recognition. Vehicle identification uses frontal and side truck images, and APPM is applied to locally segment the side image and recognize individual parts. The method analyzes the captured images to precisely identify the models of trucks passing through automatic weighing stations on the highway. The improved MMAL-Net achieved an accuracy of 95.03% on the competitive benchmark dataset Stanford Cars, demonstrating its superiority over other established methods. Our method also performed well on a small-scale dataset: in our experimental evaluation it achieved a recognition accuracy of 85% with a training set of 20 sets of photos, rising to 100% as the training set grew to 50 sets of samples. By integrating this recognition system with the weight data and license plate information obtained at weighing stations, the method enables real-time assessment of truck overloading, which is of practical importance for road traffic safety.
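The final integration step reduces to a simple decision rule: look up the load limit for the recognized model and compare it with the weigh-station reading. The table contents and function below are hypothetical, sketching only the fusion logic the abstract describes.

```python
# Hypothetical per-model load limits (kg); real values would come from a
# vehicle registration database keyed by the recognized truck model.
MAX_LOAD_KG = {"model-a": 25_000, "model-b": 31_000}

def is_overloaded(truck_model: str, measured_weight_kg: float) -> bool:
    """Flag a truck whose weighed mass exceeds the limit for its model."""
    limit = MAX_LOAD_KG.get(truck_model)
    return limit is not None and measured_weight_kg > limit
```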
Exploring multi-subset learning techniques for fine-grained food image classification
Fine-grained image recognition (FGIR) is a fundamental and challenging problem in computer vision that involves analyzing visual objects from subordinate categories, such as bird species or car models. Applications of FGIR abound in both industry and research, ranging from automatic biodiversity monitoring to intelligent transportation, and recent advances in deep learning have enabled significant progress in the field. A recently proposed method is FGFR, a food-centered fine-grained recognition method that leverages a multitask architecture in which different heads or tasks specialize in discriminating between classes of automatically detected subsets of hard-to-distinguish classes. In this work, we provide an in-depth analysis of the behavior of FGFR and propose an improved version, FGFR+, which addresses the limitations we identify in our study of the original method. While we show that FGFR generalizes to other non-food domains and to different types of backbone architectures, we also observe that the method does not take full advantage of its specialized multi-head structure. We find that a series of conceptually simple modifications significantly boosts the method's performance by capitalizing on the fine-grained knowledge provided by the heads. FGFR+ achieves 94.2% top-1 validation accuracy on the Food-101 dataset, effectively ranking third on the corresponding benchmark. Being compatible with a wide range of deep learning computer vision backbones, FGFR+ has the potential to boost the performance of many computer vision classification tasks.
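To illustrate the multi-head specialization idea, here is a simplified PyTorch sketch in which each extra head refines the logits of one subset of confusable classes. How FGFR(+) actually detects subsets and fuses head outputs differs, so treat this only as a sketch of the concept.

```python
import torch
import torch.nn as nn

class MultiSubsetClassifier(nn.Module):
    """Sketch: a global head plus specialist heads for confusable subsets."""
    def __init__(self, feat_dim: int, num_classes: int, subsets: list):
        super().__init__()
        self.global_head = nn.Linear(feat_dim, num_classes)
        self.subsets = subsets  # list of class-index tensors, one per subset
        self.subset_heads = nn.ModuleList(
            nn.Linear(feat_dim, len(s)) for s in subsets
        )

    def forward(self, feat):                        # feat: (B, feat_dim)
        logits = self.global_head(feat).clone()
        # Each specialist adds a correction to its own confusable classes.
        for head, idx in zip(self.subset_heads, self.subsets):
            logits[:, idx] += head(feat)
        return logits
```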