16 research outputs found

Compositional Convolutional Neural Networks: A Deep Architecture with Innate Robustness to Partial Occlusion

Author: He Ju
Kortylewski Adam
Liu Qing
Yuille Alan
Publication date: 17/04/2020

Recent findings show that deep convolutional neural networks (DCNNs) do not generalize well under partial occlusion. Inspired by the success of compositional models at classifying partially occluded objects, we propose to integrate compositional models and DCNNs into a unified deep model with innate robustness to partial occlusion. We term this architecture Compositional Convolutional Neural Network. In particular, we propose to replace the fully connected classification head of a DCNN with a differentiable compositional model. The generative nature of the compositional model enables it to localize occluders and subsequently focus on the non-occluded parts of the object. We conduct classification experiments on artificially occluded images as well as real images of partially occluded objects from the MS-COCO dataset. The results show that DCNNs do not classify occluded objects robustly, even when trained with data that is strongly augmented with partial occlusions. Our proposed model outperforms standard DCNNs by a large margin at classifying partially occluded objects, even when it has not been exposed to occluded objects during training. Additional experiments demonstrate that CompositionalNets can also localize the occluders accurately, despite being trained with class labels only. The code used in this work is publicly available.

Comment: CVPR 2020; Code is available at https://github.com/AdamKortylewski/CompositionalNets; Supplementary material: https://adamkortylewski.com/data/compnet_supp.pd

arXiv.org e-Print Archive

Crossref

Occlusion Coherence: Detecting and Localizing Occluded Faces

Author: Fowlkes Charless C.
Ghiasi Golnaz
Publication date: 24/08/2016

The presence of occluders significantly impacts object recognition accuracy. However, occlusion is typically treated as an unstructured source of noise and explicit models for occluders have lagged behind those for object appearance and shape. In this paper we describe a hierarchical deformable part model for face detection and landmark localization that explicitly models part occlusion. The proposed model structure makes it possible to augment positive training data with large numbers of synthetically occluded instances. This allows us to easily incorporate the statistics of occlusion patterns in a discriminatively trained model. We test the model on several benchmarks for landmark localization and detection including challenging new data sets featuring significant occlusion. We find that the addition of an explicit occlusion model yields a detection system that outperforms existing approaches for occluded instances while maintaining competitive accuracy in detection and landmark localization for unoccluded instances.

arXiv.org e-Print Archive

CiteSeerX

Neural Feature Fusion Fields: 3D Distillation of Self-Supervised 2D Image Representations

Author: Laina Iro
Larlus Diane
Tschernezki Vadim
Vedaldi Andrea
Publication date: 07/09/2022

We present Neural Feature Fusion Fields (N3F), a method that improves dense 2D image feature extractors when the latter are applied to the analysis of multiple images reconstructible as a 3D scene. Given an image feature extractor, for example pre-trained using self-supervision, N3F uses it as a teacher to learn a student network defined in 3D space. The 3D student network is similar to a neural radiance field that distills said features and can be trained with the usual differentiable rendering machinery. As a consequence, N3F is readily applicable to most neural rendering formulations, including vanilla NeRF and its extensions to complex dynamic scenes. We show that our method not only enables semantic understanding in the context of scene-specific neural fields without the use of manual labels, but also consistently improves over the self-supervised 2D baselines. This is demonstrated by considering various tasks, such as 2D object retrieval, 3D segmentation, and scene editing, in diverse sequences, including long egocentric videos in the EPIC-KITCHENS benchmark.

Comment: 3DV2022, Oral. Project page: https://www.robots.ox.ac.uk/~vadim/n3f
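The teacher-student distillation summarized in this abstract can be illustrated with a small sketch. It assumes the standard NeRF-style alpha compositing along a ray and a squared-error loss against the 2D teacher feature; the abstract only says that the usual differentiable rendering machinery is used, so the function names and the exact loss here are hypothetical and not the authors' implementation.

```python
def render_feature(alphas, features):
    """Alpha-composite per-sample feature vectors along one ray.

    alphas:   opacity in [0, 1] of each sample, ordered front to back
    features: one feature vector (list of floats) per sample
    """
    dim = len(features[0])
    rendered = [0.0] * dim
    transmittance = 1.0  # fraction of the ray not yet absorbed
    for alpha, feat in zip(alphas, features):
        weight = transmittance * alpha
        for d in range(dim):
            rendered[d] += weight * feat[d]
        transmittance *= 1.0 - alpha
    return rendered


def distillation_loss(rendered, teacher):
    """Squared error between the rendered 3D feature and the 2D teacher feature (assumed loss)."""
    return sum((r - t) ** 2 for r, t in zip(rendered, teacher))


# Two half-transparent samples along one ray, one-dimensional features.
student_feature = render_feature([0.5, 0.5], [[2.0], [4.0]])
loss = distillation_loss(student_feature, [3.0])
```

In a real system the per-sample features and opacities would come from a trainable 3D network and the loss would be minimized by gradient descent; plain Python lists are used here only to keep the arithmetic visible.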

arXiv.org e-Print Archive

Robust Deep Learning Frameworks for Recognizing and Localizing Objects Accurately and Reliably

Author: Zhang Zhishuai
Publication venue: 'The Busan Gyeongnam Mathematical Society'
Publication date: 16/02/2021

Detection is an important task in computer vision. It requires recognizing targets inside images and localizing them. The images can be 2D or 3D, and can be represented by dense pixels or sparse point clouds. With the recent emergence and development of deep neural networks, many deep learning based detection frameworks have been proposed. They provide promising performance for many targets, e.g. natural objects, object parts, pedestrians and faces, and are thus widely used in many applications, including surveillance, autonomous driving and medical image analysis. However, robust object detection is still challenging. Ideal detectors should be able to handle objects with unknown occluders, different scales/movements, long-tailed difficult objects, and low-contrast radiology inputs. Recent detectors are not designed with deliberate consideration of those challenges, and may have degraded performance. In this dissertation, we investigate those challenges, and propose novel detection frameworks to mitigate them. The aforementioned challenges are addressed in different aspects. (i) We address occlusion by proposing end-to-end voting mechanisms for vehicle part detection. These detect targets by accumulating cues relevant to the target. Occlusions eliminate some of the cues, but the remaining cues are still able to detect the targets. (ii) We combine semantic segmentation with object detection, to enrich the detection features in multi-layer single-stage detectors. The enriched features capture both low-level details and high-level semantics, thus the quality of detection is significantly improved for both small and large objects due to stronger detection features. (iii) We investigate the issue of long-tailed hard examples and propose a hard image mining strategy. It dynamically identifies hard images and devotes more effort to them during the training phase. This leads to models robust to long-tailed hard examples.
(iv) For low-contrast multi-slice medical images, we design hybrid detectors to combine 2D and 3D information. Based on a stack of 2D CNNs for each image slice, we design 3D fusion modules to bridge context information from different 2D CNNs. (v) For objects moving in sequences, we design temporal region proposals to model their movements and interactions. We model the moving objects with spatial-temporal-interactive features for detecting them through past, current and future
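The voting idea in point (i) of this abstract, where occlusion removes some cues but the remaining ones still locate the target, can be illustrated with a toy accumulator. This is a minimal sketch under assumed inputs (named part detections and fixed part-to-center offsets); the dissertation describes end-to-end learned voting, which this hand-written version does not reproduce.

```python
from collections import Counter


def vote_for_center(part_detections, offsets, cell=8):
    """Accumulate votes for an object center from detected parts.

    part_detections: list of (part_name, x, y) in integer pixel coordinates
    offsets:         part_name -> (dx, dy), displacement from part to object center
    cell:            side length of the square voting bins, in pixels
    Returns (estimated_center, vote_count), or (None, 0) if nothing was detected.
    """
    votes = Counter()
    for name, x, y in part_detections:
        dx, dy = offsets[name]
        cx, cy = x + dx, y + dy
        votes[(cx // cell, cy // cell)] += 1
    if not votes:
        return None, 0
    (gx, gy), count = votes.most_common(1)[0]
    # Report the middle of the winning bin as the center estimate.
    return (gx * cell + cell // 2, gy * cell + cell // 2), count


# Hypothetical part layout for a vehicle whose center is at (100, 100).
offsets = {"front_wheel": (40, -20), "rear_wheel": (-40, -20), "roof": (0, 30)}
all_parts = [("front_wheel", 60, 120), ("rear_wheel", 140, 120), ("roof", 100, 70)]
full_view = vote_for_center(all_parts, offsets)
occluded_view = vote_for_center(all_parts[:1] + all_parts[2:], offsets)  # rear wheel hidden
```

With the rear wheel hidden, the peak keeps its location and only its vote count drops, which is the robustness property the abstract describes.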

JScholarship

Context-driven Object Detection and Segmentation with Auxiliary Information

Author: Wang Tao

One fundamental problem in computer vision and robotics is to localize objects of interest in an image. The task can either be formulated as an object detection problem if the objects are described by a set of pose parameters, or an object segmentation one if we recover object boundary precisely. A key issue in object detection and segmentation concerns exploiting the spatial context, as local evidence is often insufficient to determine object pose in the presence of heavy occlusions or large object appearance variations. This thesis addresses the object detection and segmentation problem in such adverse conditions with auxiliary depth data provided by RGBD cameras. We focus on four main issues in context-aware object detection and segmentation: 1) what are the effective context representations? 2) how can we work with limited and imperfect depth data? 3) how to design depth-aware features and integrate depth cues into conventional visual inference tasks? 4) how to make use of unlabeled data to relax the labeling requirements for training data? We discuss three object detection and segmentation scenarios based on varying amounts of available auxiliary information. In the first case, depth data are available for model training but not available for testing. We propose a structured Hough voting method for detecting objects with heavy occlusion in indoor environments, in which we extend the Hough hypothesis space to include both the object's location, and its visibility pattern. We design a new score function that accumulates votes for object detection and occlusion prediction. In addition, we explore the correlation between objects and their environment, building a depth-encoded object-context model based on RGBD data. In the second case, we address the problem of localizing glass objects with noisy and incomplete depth data. 
Our method integrates the intensity and depth information from a single viewpoint, and builds a Markov Random Field that predicts glass boundary and region jointly. In addition, we propose a nonparametric, data-driven label transfer scheme for local glass boundary estimation. A weighted voting scheme based on a joint feature manifold is adopted to integrate depth and appearance cues, and we learn a distance metric on the depth-encoded feature manifold. In the third case, we make use of unlabeled data to relax the annotation requirements for object detection and segmentation, and propose a novel data-dependent margin distribution learning criterion for boosting, which utilizes the intrinsic geometric structure of datasets. One key aspect of this method is that it can seamlessly incorporate unlabeled data by including a graph Laplacian regularizer. We demonstrate the performance of our models and compare with baseline methods on several real-world object detection and segmentation tasks, including indoor object detection, glass object segmentation and foreground segmentation in video.
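The graph Laplacian regularizer mentioned in this abstract can be written out for a small graph. The sketch below computes the usual penalty f^T L f with L = D - W, which for symmetric weights equals half the sum of W_ij (f_i - f_j)^2 over all pairs; how the thesis combines this term with its boosting criterion is not stated in the abstract, so only the penalty itself is shown and the function name is hypothetical.

```python
def laplacian_penalty(weights, scores):
    """Compute f^T L f for the unnormalized graph Laplacian L = D - W.

    weights: symmetric affinity matrix W (list of lists) over labeled and unlabeled samples
    scores:  classifier outputs f, one per sample
    A small value means that strongly connected samples receive similar scores.
    """
    n = len(scores)
    degree = [sum(row) for row in weights]
    total = 0.0
    for i in range(n):
        for j in range(n):
            l_ij = (degree[i] if i == j else 0.0) - weights[i][j]
            total += scores[i] * l_ij * scores[j]
    return total


# Chain graph 0 - 1 - 2 with unit edge weights.
chain = [[0, 1, 0],
         [1, 0, 1],
         [0, 1, 0]]
smooth = laplacian_penalty(chain, [1.0, 1.0, 1.0])   # identical scores on all nodes
rough = laplacian_penalty(chain, [1.0, 0.0, -1.0])   # scores change across both edges
```

Because the penalty needs only the scores and the affinities, unlabeled samples can contribute to it without any annotation, which is why such a term is a common way to bring unlabeled data into training.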

The Australian National University

Pixel- and Frame-level Video Labeling using Spatial and Temporal Convolutional Networks

Author: Lei Peng
Publication venue: 'Oregon State University'

This dissertation addresses the problem of video labeling at both the frame and pixel levels using deep learning. For pixel-level video labeling, we have studied two problems: i) Spatiotemporal video segmentation and ii) Boundary detection and boundary flow estimation. For the problem of spatiotemporal video segmentation, we have developed recurrent temporal deep field (RTDF). RTDF is a conditional random field (CRF) that combines a deconvolution neural network and a recurrent temporal restricted Boltzmann machine (RTRBM), which can be jointly trained end-to-end. We have derived a mean-field inference algorithm to jointly predict all latent variables in both RTRBM and CRF. For the problem of boundary detection and boundary flow estimation, we have proposed a fully convolutional Siamese network (FCSN). The FCSN first estimates object boundaries in two consecutive frames, and then predicts boundary correspondences in the two frames. For frame-level video labeling, we have specified a temporal deformable residual network (TDRN) for temporal action segmentation. TDRN computes two parallel temporal processes: i) Residual stream that analyzes video information at its full temporal resolution, and ii) Pooling/unpooling stream that captures long-range visual cues. The former facilitates local, fine-scale action segmentation, and the latter uses multiscale context for improving the accuracy of frame classification. All of our networks have been empirically evaluated on challenging benchmark datasets and compared with the state of the art. Each of the above approaches has outperformed the state of the art at the time of our evaluation.

ScholarsArchive@OSU

Text Detection and Recognition in the Wild

Author: Raisi Zobeir
Publication venue: 'University of Waterloo'
Publication date: 07/07/2022

Text detection and recognition (TDR) in highly structured environments with a clean background and consistent fonts (e.g., office documents, postal addresses and bank cheques) is a well-understood problem (i.e., OCR); however, this is not the case for unstructured environments. The main objective of scene text detection is to locate text within images captured in the wild. For scene text recognition, the techniques map each detected or cropped word image to a string. Nowadays, convolutional neural network (CNN) and recurrent neural network (RNN) deep learning architectures dominate most of the recent state-of-the-art (SOTA) scene TDR methods. Most of the reported accuracies of current SOTA TDR methods are in the range of 80% to 90% on benchmark datasets with regular and clear text instances. However, detection and recognition results deteriorate drastically, by 10% and 30% in terms of detection F-measure and word recognition accuracy respectively, on irregular or occluded text images. Transformers and their variations are new deep learning architectures that mitigate the above-mentioned issues for CNN- and RNN-based pipelines. Unlike RNNs, transformers are models that learn how to encode and decode data by looking not only backward but also forward in order to extract relevant information from a whole sequence. This thesis utilizes the transformer architecture to address the irregular (multi-oriented and arbitrarily shaped) and occluded text challenges in images captured in the wild.
Our main contributions are as follows: (1) We first targeted solving the irregular TDR in two separate architectures as follows: In Chapter 4, unlike the SOTA text detection frameworks that have complex pipelines and use many hand-designed components and post-processing stages, we design a conceptually more straightforward and trainable end-to-end architecture of transformer-based detector for multi-oriented scene text detection, which can directly predict the set of detections (i.e., text and box regions) of the input image. A central contribution to our work is introducing a loss function tailored to the rotated text detection problem that leverages a rotated version of a generalized intersection over union score to capture the rotated text instances adequately. In Chapter 5, we extend our previous architecture to arbitrary shaped scene text detection. We design a new text detection technique that aims to better infer n-vertices of a polygon or the degree of a Bezier curve to represent irregular-text instances. We also propose a loss function that combines a generalized-split-intersection-over union loss defined over the piece-wise polygons. In Chapter 6, we show that our transformer-based architecture without rectifying the input curved text instances is more suitable than SOTA RNN-based frameworks equipped with rectification modules for irregular text recognition in the wild images. Our main contribution to this chapter is leveraging a 2D Learnable Sinusoidal frequencies Positional Encoding (2LSPE) with a modified feed-forward neural network to better encode the 2D spatial dependencies of characters in the irregular text instances. (2) Since TDR tasks encounter the same challenging problems (e.g., irregular text, illumination variations, low-resolution text, etc.), we present a new transformer model that can detect and recognize individual characters of text instances in an end-to-end manner. 
Reading individual characters later makes a robust occlusion and arbitrarily shaped text spotting model without needing polygon annotation or multiple stages of detection and recognition modules used in SOTA text spotting architectures. In Chapter 7, unlike SOTA methods that combine two different pipelines of detection and recognition modules for a complete text reading, we utilize our text detection framework by leveraging a recent transformer-based technique, namely Deformable Patch-based Transformer (DPT), as a feature extracting backbone, to robustly read the class and box coordinates of irregular characters in the wild images. (3) Finally, we address the occlusion problem by using a multi-task end-to-end scene text spotting framework. In Chapter 8, we leverage a recent transformer-based framework in deep learning, namely Masked Auto Encoder (MAE), as a backbone for scene text recognition and end-to-end scene text spotting pipelines to overcome the partial occlusion limitation. We design a new multitask End-to-End transformer network that directly outputs characters, word instances, and their bounding box representations, saving the computational overhead as it eliminates multiple processing steps. The unified proposed framework can also detect and recognize arbitrarily shaped text instances without using polygon annotations.
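The detection loss described for Chapter 4 builds on the generalized intersection over union (GIoU) score. As a point of reference, the sketch below computes the standard GIoU for axis-aligned boxes only; the rotated variant used in the thesis requires polygon intersection and is not shown, and the function names are hypothetical.

```python
def giou(box_a, box_b):
    """Generalized IoU of two axis-aligned boxes given as (x1, y1, x2, y2).

    Assumes both boxes have positive area. The result lies in (-1, 1].
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection rectangle; width or height is zero when the boxes do not overlap.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter

    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    iou = inter / union
    return iou - (enclose - union) / enclose


def giou_loss(box_a, box_b):
    """Loss that is 0 for identical boxes and grows as the boxes move apart."""
    return 1.0 - giou(box_a, box_b)


same = giou((0, 0, 2, 2), (0, 0, 2, 2))
apart = giou((0, 0, 1, 1), (2, 0, 3, 1))
```

Unlike plain IoU, the score stays informative for non-overlapping boxes because the enclosing-box term still changes with their distance, which is what makes it usable as a regression loss.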

University of Waterloo's Institutional Repository

Face Alignment in the Wild.

Author: Yang Heng
Publication venue: 'Queen Mary University of London'
Publication date: 17/06/2016

PhD

Face alignment on a face image is a crucial step in many computer vision applications such as face recognition, verification and facial expression recognition. In this thesis we present a collection of methods for face alignment in real-world scenarios where the acquisition of the face images cannot be controlled. We first investigate local based random regression forest methods that work in a voting fashion. We focus on building better quality random trees, first, by using privileged information and second, in contrast to using explicit shape models, by incorporating spatial shape constraints within the forests. We also propose a fine-tuning scheme that sieves and/or aggregates regression forest votes before accumulating them into the Hough space. We then investigate holistic methods and propose two schemes, namely the cascaded regression forests and the random subspace supervised descent method (RSSDM). The former uses a regression forest as the primitive regressor instead of random ferns and an intelligent initialization scheme. The RSSDM improves the accuracy and generalization capacity of the popular SDM by using several linear regressions in random subspaces. We also propose a Cascaded Pose Regression framework for face alignment in different modalities, that is RGB and sketch images, based on a sketch synthesis scheme. Finally, we introduce the concept of mirrorability which describes how an object alignment method behaves on mirror images in comparison to how it behaves on the original ones. We define a measure called mirror error to quantitatively analyse the mirrorability and show two applications, namely difficult samples selection and cascaded face alignment feedback that aids a re-initialisation scheme. The methods proposed in this thesis perform better than or comparably to state of the art methods. We also demonstrate the generality by applying them on similar problems such as car alignment.

China Scholarship Counci
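The mirror error described in the face alignment abstract compares predictions on an image with predictions on its horizontal mirror. A minimal sketch of one plausible definition follows: mirrored predictions are mapped back to the original frame and the mean landmark distance is reported. The thesis's exact definition, including any normalization such as by inter-ocular distance, is not given in the abstract, so the function name and details here are assumptions.

```python
import math


def mirror_error(pred_original, pred_mirrored, image_width, flip_pairs):
    """Mean distance between landmarks predicted on an image and on its horizontal mirror.

    pred_original: list of (x, y) landmarks predicted on the original image
    pred_mirrored: list of (x, y) landmarks predicted on the mirrored image
    image_width:   width of the image in pixels
    flip_pairs:    index pairs that exchange roles under mirroring (e.g. left eye, right eye)
    """
    n = len(pred_original)
    # Landmark k of the original corresponds to landmark swap[k] of the mirrored image.
    swap = list(range(n))
    for i, j in flip_pairs:
        swap[i], swap[j] = j, i

    total = 0.0
    for k in range(n):
        mx, my = pred_mirrored[swap[k]]
        back_x = image_width - 1 - mx  # undo the horizontal flip
        ox, oy = pred_original[k]
        total += math.hypot(ox - back_x, oy - my)
    return total / n


# Three landmarks on a 100-pixel-wide image: left eye, right eye, nose tip.
original = [(30, 40), (70, 40), (50, 60)]
consistent = [(29, 40), (69, 40), (49, 60)]    # exact mirror of the original prediction
inconsistent = [(29, 40), (69, 40), (49, 63)]  # nose tip drifts 3 px on the mirrored image
```

Here `mirror_error(original, consistent, 100, [(0, 1)])` is zero, while the drifting nose tip in `inconsistent` raises it. A large value flags a sample on which the method behaves differently under mirroring, which the abstract suggests using for difficult sample selection.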

Queen Mary Research Online