Attend and Guide (AG-Net): A Keypoints-driven Attention-based Deep Network for Image Recognition
This paper presents a novel keypoints-based attention mechanism for visual
recognition in still images. Deep Convolutional Neural Networks (CNNs) for
recognizing images with distinctive classes have shown great success, but their
performance in discriminating fine-grained changes is not at the same level. We
address this by proposing an end-to-end CNN model, which learns meaningful
features linking fine-grained changes using our novel attention mechanism. It
captures the spatial structure of images by identifying semantic regions (SRs)
and their spatial distributions, which proves key to modelling subtle changes
in images. We identify these SRs automatically by grouping the keypoints
detected in a given image. The "usefulness" of each SR for image recognition
is then measured by our attention mechanism, which focuses on the parts of the
image most relevant to a given task. This framework
applies to traditional and fine-grained image recognition tasks and does not
require manually annotated regions (e.g. bounding-box of body parts, objects,
etc.) for learning and prediction. Moreover, the proposed keypoints-driven
attention mechanism can be easily integrated into the existing CNN models. The
framework is evaluated on six diverse benchmark datasets. The model outperforms
state-of-the-art approaches by a considerable margin on the Distracted Driver
V1 (Acc: 3.39%), Distracted Driver V2 (Acc: 6.58%), Stanford-40 Actions (mAP:
2.15%), People Playing Musical Instruments (mAP: 16.05%), Food-101 (Acc: 6.30%)
and Caltech-256 (Acc: 2.59%) datasets.
Comment: Published in IEEE Transactions on Image Processing 2021, Vol. 30, pp. 3691 - 370
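The keypoint-grouping and attention-pooling steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's actual components: the k-means grouping, the dot-product "usefulness" scoring, and the names `group_keypoints` and `attend` are all assumptions standing in for the learned modules.

```python
import numpy as np

def group_keypoints(kps, k=3, iters=10, seed=0):
    """Group detected 2-D keypoints (N, 2) into k candidate semantic
    regions with plain k-means (a stand-in for the learned grouping)."""
    rng = np.random.default_rng(seed)
    centers = kps[rng.choice(len(kps), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each keypoint to its nearest region center.
        labels = np.argmin(((kps[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = kps[labels == j].mean(axis=0)
    return labels, centers

def attend(region_feats, query):
    """Score each region's 'usefulness' for the task embedding `query`
    and pool the region features (R, D) by softmax attention."""
    scores = region_feats @ query / np.sqrt(region_feats.shape[1])
    w = np.exp(scores - scores.max())
    w /= w.sum()                      # attention weights sum to 1
    return w @ region_feats, w
```

In the real model the region features would come from a CNN backbone and the query would be learned end-to-end; here both are placeholders.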
SR-GNN: Spatial Relation-aware Graph Neural Network for Fine-Grained Image Categorization
Over the past few years, significant progress has been made in deep
convolutional neural network (CNN)-based image recognition, mainly owing to
such networks' strong ability to mine discriminative object pose and part
information from texture and shape. These cues are often insufficient for
fine-grained visual classification (FGVC), which exhibits high intra-class and
low inter-class variance due to occlusion, deformation, illumination, etc.
Thus, an expressive feature representation describing global structural
information is key to characterizing an object/scene. To this end, we propose
a method that effectively captures subtle changes by aggregating context-aware
features from the most relevant image regions and their importance in
discriminating fine-grained categories, while avoiding bounding-box and/or
distinguishable part annotations. Our approach is inspired by recent advances
in self-attention and graph neural network (GNN) approaches, combining a
simple yet effective relation-aware feature transformation with its refinement
by a context-aware attention mechanism to boost the discriminability of the
transformed features in an end-to-end learning process. Our model is evaluated
on eight benchmark datasets consisting of fine-grained objects and
human-object interactions, and outperforms state-of-the-art approaches by a
significant margin in recognition accuracy.
Comment: Accepted manuscript - IEEE Transactions on Image Processing
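One relation-aware aggregation step of the kind the abstract describes can be sketched as below: each region attends to every other region, with appearance affinity biased by spatial proximity. This is a generic graph-attention sketch under assumed inputs (region features plus 2-D region coordinates), not SR-GNN's exact formulation.

```python
import numpy as np

def relation_aware_refine(feats, coords):
    """Refine region features (N, D) with one spatial-relation-aware
    attention step; coords (N, 2) are region centers. Nearby, similar
    regions contribute most to each other's context-aware feature."""
    n, d = feats.shape
    affinity = feats @ feats.T / np.sqrt(d)                 # appearance similarity
    dist = ((coords[:, None] - coords[None]) ** 2).sum(-1)  # squared spatial distance
    scores = affinity - dist                                # near + similar -> high score
    scores -= scores.max(axis=1, keepdims=True)             # numerical stability
    att = np.exp(scores)
    att /= att.sum(axis=1, keepdims=True)                   # row-wise softmax
    return att @ feats                                      # context-aware features
```

The subtraction of squared distance is one simple way to inject spatial relations into the attention scores; the published model learns this relation transformation rather than fixing it.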
Rail Infrastructure Defect Detection Through Video Analytics
University of Technology Sydney. Faculty of Engineering and Information Technology.
Compared with the traditional railway infrastructure maintenance process, which relies on manual inspection by professional maintenance engineers, inspection through automatic video analytics can significantly improve working efficiency and eliminate potential safety concerns by reducing physical contact between maintenance engineers and infrastructure facilities. However, defects do not always have a stable appearance and involve many uncertainties in cluttered environments. Moreover, various brands of the same device, with diverse physical forms, are widely used on the railway. This poses many challenges for existing computer vision algorithms for defect detection. In this thesis, two key challenges in video/image analytics for railway infrastructure defect detection are identified: fine-grained defect recognition and learning with limited labels (few-shot learning). This thesis summarizes the work conducted on solving these two challenges with different methods.
The first challenge is fine-grained defect recognition. In railway infrastructure inspection, defects from damaged or worn equipment are usually confined to small parts; that is, the differences between defective and standard equipment are fine-grained, and finding these subtle defects is a fine-grained recognition problem. This thesis proposes a bilinear CNN model to tackle the defect detection problem, which effectively captures an invariant representation of the dataset and learns high-order discriminative features for fine-grained defect recognition. The second challenge is limited labelled data. In many scenarios, obtaining abundant labelled samples is laborious. In industrial defect detection, for example, most defects belong to a few common categories, while the remaining categories contain only a small portion of defects. Moreover, annotating a large-scale defect dataset is labour-intensive and requires deep expertise in railway maintenance. Thus, how to obtain an effective model from sparse labelled samples remains an open problem. To address this issue, this thesis proposes a framework that simultaneously reduces intra-class variance and enlarges inter-class discrimination, for both fine-grained defect recognition and general fine-grained recognition under the few-shot setting. Three models are designed according to this framework, and comprehensive experimental analyses validate their effectiveness. This thesis further studies the few-shot learning problem by mining unlabelled information to boost the few-shot learner for defect/general object recognition, and proposes a Poisson Transfer Model that maximizes the value of extra unlabelled data through robust classifier construction and self-supervised representation learning.
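The bilinear pooling at the heart of a bilinear CNN can be sketched in a few lines: average the outer product of two feature maps over spatial locations, then apply the customary signed square-root and L2 normalisation. This is a minimal sketch of the standard bilinear-pooling operation, not the thesis's specific architecture; the feature maps here are stand-ins for CNN activations.

```python
import numpy as np

def bilinear_pool(fa, fb):
    """Bilinear pooling of two feature maps flattened to (H*W, Da) and
    (H*W, Db): average per-location outer products into second-order
    statistics, then apply signed sqrt and L2 normalisation."""
    b = (fa.T @ fb / fa.shape[0]).ravel()   # (Da*Db,) pooled outer products
    b = np.sign(b) * np.sqrt(np.abs(b))     # signed sqrt dampens bursty values
    return b / (np.linalg.norm(b) + 1e-12)  # L2-normalised descriptor
```

The resulting high-order descriptor feeds a linear classifier; its sensitivity to pairwise feature interactions is what makes it effective for subtle, fine-grained differences.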
Representation Learning for Shape Decomposition, By Shape Decomposition
The ability to parse 3D objects into their constituent parts is essential for humans to understand and interact with the surrounding world. Imparting this skill to machines is important for various computer graphics, computer vision, and robotics tasks. Machines endowed with this skill can better interact with their surroundings and perform shape editing, texturing, recomposing, tracking, and animation. In this thesis, we ask two questions. First, how can machines decompose 3D shapes into their fundamental parts? Second, does the ability to decompose a 3D shape into these parts help learn useful 3D shape representations?
In this thesis, we focus on parsing shapes into compact representations, such as parametric surface patches and Constructive Solid Geometry (CSG) primitives, which are widely used representations in 3D modeling in computer graphics. Inspired by advances in neural networks for 3D shape processing, we develop neural network approaches to tackle shape decomposition. First, we present CSGNet, a network architecture that parses shapes into CSG programs, trained using a combination of supervised and reinforcement learning. Second, we present ParSeNet, a network architecture that decomposes a shape into parametric surface patches (B-splines) and geometric primitives (plane, cone, cylinder, and sphere), trained on a large set of CAD models using supervised learning.
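A CSG program of the kind CSGNet predicts is a tree of primitives combined by Boolean operations. One common way to evaluate such a program is with signed distance functions, where union, intersection, and difference become min/max operations; this is an illustrative assumption about evaluation, not CSGNet's own internal representation.

```python
import numpy as np

# Signed-distance primitive: negative inside, positive outside.
def sphere(p, center, r):
    return float(np.linalg.norm(np.asarray(p, dtype=float) - center) - r)

# Boolean CSG operations compose signed distances.
def union(a, b):     return min(a, b)
def intersect(a, b): return max(a, b)
def subtract(a, b):  return max(a, -b)

# A tiny CSG "program": a unit sphere with a smaller sphere carved out.
def model(p):
    return subtract(sphere(p, np.zeros(3), 1.0),
                    sphere(p, np.array([0.8, 0.0, 0.0]), 0.5))
```

Evaluating `model` at a query point tells you whether it lies inside the solid (negative) or outside (positive), which is how a predicted program can be rendered or compared against a target shape.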
Training deep neural network architectures for 3D recognition and generation tasks requires large amounts of labeled data. We explore ways to alleviate this problem by relying on shape decomposition methods to guide the learning process. Towards that end, we first study the use of freely available, albeit inconsistent, metadata from shape repositories to learn 3D shape features. We then show that learning to decompose a 3D shape into geometric primitives also helps in learning shape representations useful for semantic segmentation tasks. Finally, since most 3D shapes encountered in real life are textured and consist of several fine-grained semantic parts, we propose a method to learn fine-grained representations for textured 3D shapes in a self-supervised manner by incorporating 3D geometric priors.
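Self-supervised representation learning of this kind typically optimises a contrastive objective between two views of the same shape. The sketch below is a generic InfoNCE-style loss, an assumption standing in for the thesis's actual objective; the two embedding matrices would come from an encoder applied to two augmented views.

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """InfoNCE loss between two views' embeddings, each (N, D): row i of
    z1 should match row i of z2 against the other N-1 rows (negatives)."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)   # unit-normalise
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                              # cosine similarities / temperature
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_prob)))             # -log p(correct match)
```

The loss is low when matched pairs are far more similar than mismatched ones, which drives the encoder toward representations that separate fine-grained parts without labels.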