Learning Attentive Pairwise Interaction for Fine-Grained Classification
Fine-grained classification is a challenging problem due to subtle
differences among highly confused categories. Most approaches address this
difficulty by learning discriminative representations of individual images.
On the other hand, humans can effectively identify contrastive clues by
comparing image pairs. Inspired by this fact, this paper proposes a simple but
effective Attentive Pairwise Interaction Network (API-Net), which can
progressively recognize a pair of fine-grained images by interaction.
Specifically, API-Net first learns a mutual feature vector to capture semantic
differences in the input pair. It then compares this mutual vector with
individual vectors to generate gates for each input image. These distinct gate
vectors inherit mutual context on semantic differences, which allow API-Net to
attentively capture contrastive clues by pairwise interaction between two
images. Additionally, we train API-Net in an end-to-end manner with a score
ranking regularization, which can further generalize API-Net by taking feature
priorities into account. We conduct extensive experiments on five popular
benchmarks in fine-grained classification, where API-Net outperforms recent SOTA
methods: CUB-200-2011 (90.0%), Aircraft (93.9%), Stanford Cars (95.3%),
Stanford Dogs (90.3%), and NABirds (88.1%). Comment: Accepted at AAAI 2020
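As a concrete illustration of the pairwise interaction described above, below is a minimal PyTorch sketch of the mutual-vector-and-gates idea. The MLP shape, sigmoid gating, and residual form are assumptions for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class PairwiseInteraction(nn.Module):
    """Minimal sketch of attentive pairwise interaction (hypothetical sizes)."""
    def __init__(self, dim: int):
        super().__init__()
        # Maps the concatenated pair features to a mutual vector.
        self.mutual_mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, x1, x2):
        # Mutual feature vector capturing semantic differences of the pair.
        x_m = self.mutual_mlp(torch.cat([x1, x2], dim=-1))
        # Gates: compare the mutual vector with each individual vector.
        g1 = torch.sigmoid(x_m * x1)
        g2 = torch.sigmoid(x_m * x2)
        # Each image is attended by its own gate and by its partner's gate,
        # yielding four enhanced feature views for classification.
        return x1 + g1 * x1, x1 + g2 * x1, x2 + g2 * x2, x2 + g1 * x2
```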
Feature Fusion Vision Transformer for Fine-Grained Visual Categorization
The core of tackling fine-grained visual categorization (FGVC) is to
learn subtle yet discriminative features. Most previous works achieve this by
explicitly selecting the discriminative parts or integrating an attention
mechanism via CNN-based approaches. However, these methods increase the
computational complexity and make the model dominated by the regions containing
most of the objects. Recently, the vision transformer (ViT) has achieved SOTA
performance on general image recognition tasks. The self-attention mechanism
aggregates and weights the information from all patches into the classification
token, making it well suited for FGVC. Nonetheless, the classification
token in the deep layers pays more attention to global information and lacks
the local and low-level features that are essential for FGVC. In this work, we
propose a novel pure transformer-based framework, Feature Fusion Vision
Transformer (FFVT), where we aggregate the important tokens from each transformer
layer to compensate for the local, low-level, and middle-level information. We design
a novel token selection module called mutual attention weight selection (MAWS)
to guide the network effectively and efficiently towards selecting
discriminative tokens without introducing extra parameters. We verify the
effectiveness of FFVT on three benchmarks, where FFVT achieves
state-of-the-art performance. Comment: 9 pages, 2 figures, 3 tables
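The token-selection step can be pictured with a short sketch. Assuming one layer's attention map of shape (heads, tokens, tokens) with the classification token at index 0, a MAWS-style score combines the attention the classification token pays to each patch with the attention each patch pays back; the head-averaging and top-k choice below are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def mutual_attention_weights(attn: torch.Tensor, k: int) -> torch.Tensor:
    """Sketch of MAWS-style token selection (simplified, hypothetical)."""
    cls_to_patch = attn[:, 0, 1:].mean(dim=0)  # attention CLS pays to each patch
    patch_to_cls = attn[:, 1:, 0].mean(dim=0)  # attention each patch pays to CLS
    scores = cls_to_patch * patch_to_cls       # mutual attention weight per token
    return torch.topk(scores, k).indices + 1   # +1 skips the CLS token index
```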
Local Style Awareness of Font Images
When we compare fonts, we often pay attention to styles of local parts, such
as serifs and curvatures. This paper proposes an attention mechanism that finds
such important local parts: the parts receiving larger attention are
considered more style-relevant. The proposed mechanism can be trained in a
quasi-self-supervised manner that requires no manual annotation other than
knowing that a set of character images is from the same font, such as
Helvetica. After confirming that the trained attention mechanism can find
style-relevant local parts, we utilize the resulting attention for local
style-aware font generation. Specifically, we design a new reconstruction loss
function to put more weight on the local parts with larger attention for
generating character images with more accurate style realization. This loss
function has the merit of being applicable to various font generation models. Our
experimental results show that the proposed loss function improves the quality
of character images generated by several few-shot font generation models. Comment: Accepted at ICDAR WML 2023
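The reconstruction loss described above admits a compact sketch: a plain pixel-wise L1 term re-weighted by the attention map so that style-relevant regions such as serifs count more. The additive weighting and the `alpha` parameter are assumptions for illustration, not the paper's exact loss.

```python
import torch

def local_style_aware_recon_loss(generated, target, attention, alpha=1.0):
    """Sketch: L1 reconstruction re-weighted by a local style attention map.

    generated, target: (B, C, H, W) character images
    attention:         (B, 1, H, W) map in [0, 1] from the attention module
    alpha:             illustrative strength of the style re-weighting
    """
    weight = 1.0 + alpha * attention  # up-weight high-attention local parts
    return (weight * (generated - target).abs()).mean()
```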
Your "Flamingo" is My "Bird": Fine-Grained, or Not
Whether what you see in Figure 1 is a "flamingo" or a "bird" is the question
we ask in this paper. While fine-grained visual classification (FGVC) strives
to arrive at the former, for the majority of us non-experts just "bird" would
probably suffice. The real question is therefore -- how can we tailor for
different fine-grained definitions under divergent levels of expertise. For
that, we re-envisage the traditional setting of FGVC, from single-label
classification, to that of top-down traversal of a pre-defined coarse-to-fine
label hierarchy -- so that our answer becomes
"bird"-->"Phoenicopteriformes"-->"Phoenicopteridae"-->"flamingo". To approach
this new problem, we first conduct a comprehensive human study where we confirm
that most participants prefer multi-granularity labels, regardless of whether
they consider themselves experts. We then discover the key intuition that
coarse-level label prediction exacerbates fine-grained feature learning, yet
fine-level features better the learning of the coarse-level classifier. This
discovery enables us to design a very simple albeit surprisingly effective
solution to our new problem, where we (i) leverage level-specific
classification heads to disentangle coarse-level features from fine-grained
ones, and (ii) allow finer-grained features to participate in coarser-grained
label predictions, which in turn helps with better disentanglement. Experiments
show that our method achieves superior performance in the new FGVC setting, and
performs better than the state of the art on the traditional single-label FGVC
problem as well. Thanks to its simplicity, our method can be easily implemented
on top of any existing FGVC framework and is parameter-free. Comment: Accepted as an oral at CVPR 2021. Code:
https://github.com/PRIS-CV/Fine-Grained-or-Not
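A minimal sketch of the two design points, (i) level-specific heads on disentangled features and (ii) finer-grained features participating in coarser-grained predictions, might look as follows. Splitting the backbone feature into equal per-level chunks is an illustrative choice, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class HierarchicalHeads(nn.Module):
    """Sketch: one head per hierarchy level; coarse heads also see fine chunks."""
    def __init__(self, feat_dim: int, level_sizes: list):
        # level_sizes: class counts ordered coarse-to-fine, e.g. [13, 38, 200];
        # feat_dim is assumed divisible by the number of levels.
        super().__init__()
        n = len(level_sizes)
        self.chunk = feat_dim // n
        # Level i uses its own chunk plus every finer chunk (design point ii).
        self.heads = nn.ModuleList(
            nn.Linear(self.chunk * (n - i), c) for i, c in enumerate(level_sizes)
        )

    def forward(self, feat):                     # feat: (B, feat_dim)
        chunks = feat.split(self.chunk, dim=-1)  # level-specific features (i)
        return [
            head(torch.cat(chunks[i:], dim=-1))  # coarse logits get fine chunks
            for i, head in enumerate(self.heads)
        ]                                        # coarse-to-fine predictions
```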
Truck model recognition for an automatic overload detection system based on the improved MMAL-Net
Efficient and reliable transportation of goods by truck is crucial for road logistics. However, truck overloading poses serious challenges to road infrastructure and traffic safety, so detecting and preventing it is essential for maintaining road conditions and protecting both road users and the goods transported. This paper introduces a novel method for detecting truck overloading that uses an improved MMAL-Net for truck model recognition. Vehicle identification uses frontal and side truck images, and APPM is applied to locally segment the side image and recognize individual parts. The method analyzes the captured images to precisely identify the models of trucks passing through automatic weighing stations on the highway. The improved MMAL-Net achieved an accuracy of 95.03% on the competitive benchmark dataset Stanford Cars, demonstrating its superiority over other established methods. Our method also performed well on a small-scale dataset: in our experimental evaluation it achieved a recognition accuracy of 85% with a training set of 20 sets of photos, rising to 100% as the training set grew to 50 sets of samples. By integrating this recognition system with the weight data and license plate information obtained at weighing stations, the method enables real-time assessment of truck overloading, which is of practical importance for road traffic safety.
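The final integration step reduces to a simple decision rule: look up the load limit for the recognized model and compare it with the weigh-station reading. The table contents and function below are hypothetical, sketching only the fusion logic the abstract describes.

```python
# Hypothetical per-model load limits (kg); real values would come from a
# vehicle registration database keyed by the recognized truck model.
MAX_LOAD_KG = {"model-a": 25_000, "model-b": 31_000}

def is_overloaded(truck_model: str, measured_weight_kg: float) -> bool:
    """Flag a truck whose weighed mass exceeds the limit for its model."""
    limit = MAX_LOAD_KG.get(truck_model)
    return limit is not None and measured_weight_kg > limit
```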
Exploring multi-subset learning techniques for fine-grained food image classification
Fine-grained image recognition (FGIR) is a fundamental and challenging problem in computer vision that involves analyzing visual objects from subordinate categories, such as bird species or car models. Applications of FGIR abound in both industry and research, ranging from automatic biodiversity monitoring to intelligent transportation, and recent advances in deep learning have enabled significant progress in the field. A recently proposed method is FGFR, a food-centered fine-grained recognition method that leverages a multitask architecture in which different heads or tasks specialize in discriminating between classes of automatically detected subsets of hard-to-distinguish classes. In this work, we provide an in-depth analysis of the behavior of FGFR and propose an improved version, FGFR+, which addresses the limitations we identify in our study of the original method. While we show that FGFR generalizes to other non-food domains and to different types of backbone architectures, we also observe that the method does not take full advantage of its specialized multi-head structure. We find that a series of conceptually simple modifications significantly boosts the method's performance by capitalizing on the fine-grained knowledge provided by the heads. FGFR+ achieves 94.2% top-1 validation accuracy on the Food-101 dataset, effectively ranking third on the corresponding benchmark. Being compatible with a wide range of deep learning computer vision backbones, FGFR+ has the potential to boost the performance of many computer vision classification tasks.
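To illustrate the multi-head specialization idea, here is a simplified PyTorch sketch in which each extra head refines the logits of one subset of confusable classes. How FGFR(+) actually detects subsets and fuses head outputs differs, so treat this only as a sketch of the concept.

```python
import torch
import torch.nn as nn

class MultiSubsetClassifier(nn.Module):
    """Sketch: a global head plus specialist heads for confusable subsets."""
    def __init__(self, feat_dim: int, num_classes: int, subsets: list):
        super().__init__()
        self.global_head = nn.Linear(feat_dim, num_classes)
        self.subsets = subsets  # list of class-index tensors, one per subset
        self.subset_heads = nn.ModuleList(
            nn.Linear(feat_dim, len(s)) for s in subsets
        )

    def forward(self, feat):                        # feat: (B, feat_dim)
        logits = self.global_head(feat).clone()
        # Each specialist adds a correction to its own confusable classes.
        for head, idx in zip(self.subset_heads, self.subsets):
            logits[:, idx] += head(feat)
        return logits
```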