
    Attend and Guide (AG-Net): A Keypoints-driven Attention-based Deep Network for Image Recognition

    This paper presents a novel keypoints-based attention mechanism for visual recognition in still images. Deep Convolutional Neural Networks (CNNs) have shown great success in recognizing images with distinctive classes, but their performance in discriminating fine-grained changes is not at the same level. We address this by proposing an end-to-end CNN model that learns meaningful features linking fine-grained changes using our novel attention mechanism. It captures the spatial structure of images by identifying semantic regions (SRs) and their spatial distributions, which proves to be the key to modelling subtle changes in images. We identify these SRs automatically by grouping the keypoints detected in a given image. The "usefulness" of these SRs for image recognition is measured by our attention mechanism, which focuses on the parts of the image most relevant to a given task. The framework applies to both traditional and fine-grained image recognition tasks and does not require manually annotated regions (e.g. bounding boxes of body parts, objects, etc.) for learning or prediction. Moreover, the proposed keypoints-driven attention mechanism can be easily integrated into existing CNN models. The framework is evaluated on six diverse benchmark datasets, on which the model outperforms state-of-the-art approaches by a considerable margin: Distracted Driver V1 (Acc: 3.39%), Distracted Driver V2 (Acc: 6.58%), Stanford-40 Actions (mAP: 2.15%), People Playing Musical Instruments (mAP: 16.05%), Food-101 (Acc: 6.30%) and Caltech-256 (Acc: 2.59%).
    Comment: Published in IEEE Transactions on Image Processing 2021, Vol. 30, pp. 3691 - 370
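    The core idea — pool features over keypoint-derived semantic regions, then softmax-weight them by task relevance — can be sketched in a minimal numpy toy. All names, shapes, and the dot-product scoring are illustrative assumptions, not the paper's actual architecture:

    ```python
    import numpy as np

    def region_attention(region_feats, query):
        # region_feats: (R, D) features pooled over R semantic regions,
        # each region obtained by grouping detected keypoints.
        # query: (D,) task embedding; scores measure region "usefulness".
        scores = region_feats @ query                  # (R,) relevance scores
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                       # softmax attention weights
        return weights @ region_feats, weights         # attended feature, weights

    rng = np.random.default_rng(0)
    feats = rng.normal(size=(4, 8))                    # 4 regions, 8-dim features
    attended, w = region_attention(feats, rng.normal(size=8))
    ```

    In the full model the attended feature would feed the classification head, so regions irrelevant to the task are suppressed end-to-end.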

    SR-GNN: Spatial Relation-aware Graph Neural Network for Fine-Grained Image Categorization

    Over the past few years, significant progress has been made in image recognition based on deep convolutional neural networks (CNNs), largely due to their ability to mine discriminative object pose and part information from texture and shape. Such cues are often insufficient for fine-grained visual classification (FGVC), however, which exhibits high intra-class and low inter-class variance due to occlusions, deformations, illumination changes, etc. Thus, an expressive feature representation describing global structural information is key to characterizing an object/scene. To this end, we propose a method that effectively captures subtle changes by aggregating context-aware features from the most relevant image regions, weighted by their importance in discriminating fine-grained categories, without requiring bounding-box or part annotations. Inspired by recent advances in self-attention and graph neural networks (GNNs), our approach combines a simple yet effective relation-aware feature transformation with its refinement by a context-aware attention mechanism, boosting the discriminability of the transformed features in an end-to-end learning process. Our model is evaluated on eight benchmark datasets of fine-grained objects and human-object interactions, where it outperforms state-of-the-art approaches by a significant margin in recognition accuracy.
    Comment: Accepted manuscript - IEEE Transactions on Image Processing
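    A hedged numpy sketch of the two stages named above — one GCN-style propagation step over a relation graph built from pairwise region similarities, followed by attention against a global context vector. This is a stand-in under assumed shapes, not SR-GNN's actual formulation:

    ```python
    import numpy as np

    def relation_aware_transform(X, W):
        # X: (N, D) features of N image regions; W: (D, D) learnable weights.
        A = X @ X.T                                    # pairwise relation scores
        A = np.exp(A - A.max(axis=1, keepdims=True))
        A /= A.sum(axis=1, keepdims=True)              # row-normalised adjacency
        H = np.maximum(A @ X @ W, 0.0)                 # propagate + ReLU
        # context-aware attention: score each region against the global context
        ctx = H.mean(axis=0)                           # (D,) context vector
        s = H @ ctx
        a = np.exp(s - s.max()); a /= a.sum()          # attention over regions
        return (a[:, None] * H).sum(axis=0)            # (D,) image descriptor

    rng = np.random.default_rng(1)
    X = rng.normal(size=(5, 8))                        # 5 regions, 8-dim features
    W = rng.normal(size=(8, 8))
    desc = relation_aware_transform(X, W)
    ```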

    Rail Infrastructure Defect Detection Through Video Analytics

    University of Technology Sydney. Faculty of Engineering and Information Technology.
    Compared with the traditional railway infrastructure maintenance process, which relies on manual inspection by professional maintenance engineers, inspection through automatic video analytics significantly improves working efficiency and removes a potential safety concern by reducing physical contact between maintenance engineers and infrastructure facilities. However, defects do not always have a stable appearance and involve many uncertainties in cluttered environments. Moreover, various brands of the same device, with diverse physical designs, are widely used across the railway. This creates many challenges for existing computer vision algorithms for defect detection. This thesis identifies two key challenges in video/image analytics for railway infrastructure defect detection — fine-grained defect recognition and learning with limited labels (few-shot learning) — and summarizes the work conducted to address them.

    The first challenge is fine-grained defect recognition. In railway infrastructure inspection, damaged or worn equipment is usually found in small parts, so the differences between defective and standard components are fine-grained, and finding these subtle defects is a fine-grained recognition problem. This thesis proposes a bilinear CNN model that effectively captures an invariant representation of the dataset and learns high-order discriminative features for fine-grained defect recognition.

    The second challenge is limited labelled data. In many scenarios, obtaining abundant labelled samples is laborious: in industrial defect detection, most defects belong to a few common categories, while the remaining categories contain only a small number of defective samples. Moreover, annotating a large-scale defect dataset is labour-intensive and requires high expertise in railway maintenance, so obtaining an effective model from sparse labelled samples remains an open problem. To address this issue, the thesis proposes a framework that simultaneously reduces intra-class variance and enlarges inter-class discrimination for both fine-grained defect recognition and general fine-grained recognition under the few-shot setting. Three models are designed according to this framework, and comprehensive experimental analyses validate their effectiveness. The thesis further studies the few-shot learning problem by mining unlabelled information to boost the few-shot learner for defect/general object recognition, and proposes a Poisson Transfer Model that maximizes the value of extra unlabelled data through robust classifier construction and self-supervised representation learning.
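    Bilinear pooling, the core of the bilinear CNN mentioned above, is a standard technique: the outer product of two feature streams captures second-order interactions that help separate fine-grained classes. A minimal numpy sketch (shapes and the signed-sqrt/L2 normalisation follow common practice; details of the thesis model are assumed, not reproduced):

    ```python
    import numpy as np

    def bilinear_pool(fa, fb):
        # fa: (H*W, Da), fb: (H*W, Db) — spatially flattened feature maps
        # from two CNN streams over the same image.
        B = fa.T @ fb / fa.shape[0]                  # (Da, Db) bilinear matrix
        z = B.reshape(-1)                            # flatten to a descriptor
        z = np.sign(z) * np.sqrt(np.abs(z))          # signed square-root scaling
        return z / (np.linalg.norm(z) + 1e-12)       # L2 normalisation

    rng = np.random.default_rng(2)
    fa = rng.normal(size=(49, 16))                   # e.g. a 7x7 map, 16 channels
    fb = rng.normal(size=(49, 16))
    z = bilinear_pool(fa, fb)                        # (256,) descriptor
    ```

    The resulting descriptor is then fed to a linear classifier; the second-order statistics are what make subtle part-level differences separable.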

    Representation Learning for Shape Decomposition, By Shape Decomposition

    The ability to parse 3D objects into their constituent parts is essential for humans to understand and interact with the surrounding world, and imparting this skill to machines matters for various computer graphics, computer vision, and robotics tasks. Machines endowed with this skill can better interact with their surroundings and perform shape editing, texturing, recomposition, tracking, and animation. In this thesis, we ask two questions. First, how can machines decompose 3D shapes into their fundamental parts? Second, does the ability to decompose a 3D shape into these parts help learn useful 3D shape representations? We focus on parsing shapes into compact representations, such as parametric surface patches and Constructive Solid Geometry (CSG) primitives, which are widely used in 3D modeling for computer graphics. Inspired by advances in neural networks for 3D shape processing, we develop neural network approaches to shape decomposition. First, we present CSGNet, a network architecture that parses shapes into CSG programs, trained using a combination of supervised and reinforcement learning. Second, we present ParSeNet, a network architecture that decomposes a shape into parametric surface patches (B-splines) and geometric primitives (planes, cones, cylinders, and spheres), trained on a large set of CAD models using supervised learning. Training deep neural networks for 3D recognition and generation requires large labeled datasets; we explore ways to alleviate this by relying on shape decomposition to guide the learning process. Towards that end, we first study the use of freely available, albeit inconsistent, metadata from shape repositories to learn 3D shape features. We then show that learning to decompose a 3D shape into geometric primitives also helps learn shape representations useful for semantic segmentation tasks.
    Finally, since most 3D shapes encountered in real life are textured and consist of several fine-grained semantic parts, we propose a method to learn fine-grained representations for textured 3D shapes in a self-supervised manner by incorporating 3D geometric priors.
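    The CSG programs that CSGNet predicts are compositions of primitives under boolean operations. One standard, compact way to make such a program executable is via signed distance functions, where union/intersection/difference become min/max operations. A small illustrative sketch (not CSGNet itself, which predicts the program with a neural network):

    ```python
    import numpy as np

    def sphere(center, radius):
        # Signed distance: negative inside the sphere, positive outside.
        return lambda p: np.linalg.norm(p - center, axis=-1) - radius

    def union(a, b):     return lambda p: np.minimum(a(p), b(p))
    def intersect(a, b): return lambda p: np.maximum(a(p), b(p))
    def subtract(a, b):  return lambda p: np.maximum(a(p), -b(p))

    # A tiny CSG "program": a unit sphere with a smaller sphere carved out.
    shape = subtract(sphere(np.zeros(3), 1.0),
                     sphere(np.array([0.8, 0.0, 0.0]), 0.5))
    pts = np.array([[-0.5, 0.0, 0.0],    # deep inside the remaining body
                    [ 0.9, 0.0, 0.0]])   # inside the carved-out region
    inside = shape(pts) < 0              # → [True, False]
    ```

    Evaluating a predicted program this way is also how one checks it reconstructs the target shape — the execution is differentiable-friendly min/max algebra over primitive distances.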