Compositional Convolutional Neural Networks: A Deep Architecture with Innate Robustness to Partial Occlusion
Recent findings show that deep convolutional neural networks (DCNNs) do not
generalize well under partial occlusion. Inspired by the success of
compositional models at classifying partially occluded objects, we propose to
integrate compositional models and DCNNs into a unified deep model with innate
robustness to partial occlusion. We term this architecture Compositional
Convolutional Neural Network. In particular, we propose to replace the fully
connected classification head of a DCNN with a differentiable compositional
model. The generative nature of the compositional model enables it to localize
occluders and subsequently focus on the non-occluded parts of the object. We
conduct classification experiments on artificially occluded images as well as
real images of partially occluded objects from the MS-COCO dataset. The results
show that DCNNs do not classify occluded objects robustly, even when trained
with data that is strongly augmented with partial occlusions. Our proposed
model outperforms standard DCNNs by a large margin at classifying partially
occluded objects, even when it has not been exposed to occluded objects during
training. Additional experiments demonstrate that CompositionalNets can also
localize the occluders accurately, despite being trained with class labels
only. The code used in this work is publicly available.
Comment: CVPR 2020; Code is available at
https://github.com/AdamKortylewski/CompositionalNets; Supplementary material:
https://adamkortylewski.com/data/compnet_supp.pd
Occlusion Coherence: Detecting and Localizing Occluded Faces
The presence of occluders significantly impacts object recognition accuracy.
However, occlusion is typically treated as an unstructured source of noise and
explicit models for occluders have lagged behind those for object appearance
and shape. In this paper we describe a hierarchical deformable part model for
face detection and landmark localization that explicitly models part occlusion.
The proposed model structure makes it possible to augment positive training
data with large numbers of synthetically occluded instances. This allows us to
easily incorporate the statistics of occlusion patterns in a discriminatively
trained model. We test the model on several benchmarks for landmark
localization and detection including challenging new data sets featuring
significant occlusion. We find that the addition of an explicit occlusion model
yields a detection system that outperforms existing approaches for occluded
instances while maintaining competitive accuracy in detection and landmark
localization for unoccluded instances.
Neural Feature Fusion Fields: 3D Distillation of Self-Supervised 2D Image Representations
We present Neural Feature Fusion Fields (N3F), a method that improves dense
2D image feature extractors when the latter are applied to the analysis of
multiple images reconstructible as a 3D scene. Given an image feature
extractor, for example pre-trained using self-supervision, N3F uses it as a
teacher to learn a student network defined in 3D space. The 3D student network
is similar to a neural radiance field that distills said features and can be
trained with the usual differentiable rendering machinery. As a consequence,
N3F is readily applicable to most neural rendering formulations, including
vanilla NeRF and its extensions to complex dynamic scenes. We show that our
method not only enables semantic understanding in the context of scene-specific
neural fields without the use of manual labels, but also consistently improves
over the self-supervised 2D baselines. This is demonstrated by considering
various tasks, such as 2D object retrieval, 3D segmentation, and scene editing,
in diverse sequences, including long egocentric videos in the EPIC-KITCHENS
benchmark.
Comment: 3DV 2022, Oral. Project page: https://www.robots.ox.ac.uk/~vadim/n3f
Robust Deep Learning Frameworks for Recognizing and Localizing Objects Accurately and Reliably
Detection is an important task in computer vision. It requires recognizing targets inside images and localizing them. The images can be 2D or 3D, and can be represented by dense pixels or sparse point clouds. With the recent emergence and development of deep neural networks, many deep learning based detection frameworks have been proposed. They provide promising performance for many targets, e.g. natural objects, object parts, pedestrians and faces, and are thus widely used in many applications, including surveillance, autonomous driving and medical image analysis. However, robust object detection is still challenging. Ideal detectors should be able to handle objects with unknown occluders, different scales/movements, long-tailed difficult objects, and low-contrast radiology inputs. Recent detectors are not designed with deliberate consideration of those challenges, and their performance may degrade. In this dissertation, we investigate those challenges and propose novel detection frameworks to mitigate them.
The aforementioned challenges are addressed from different aspects. (i) We address occlusion by proposing end-to-end voting mechanisms for vehicle part detection. These mechanisms detect targets by accumulating cues relevant to the target. Occlusions eliminate some of the cues, but the remaining cues are still able to detect the targets. (ii) We combine semantic segmentation with object detection to enrich the detection features in multi-layer single-stage detectors. The enriched features capture both low-level details and high-level semantics, so detection quality improves significantly for both small and large objects. (iii) We investigate the issue of long-tailed hard examples and propose a hard image mining strategy. It dynamically identifies hard images and devotes more training effort to them during the training phase. This leads to models robust to long-tailed hard examples. (iv) For low-contrast multi-slice medical images, we design hybrid detectors to combine 2D and 3D information. Based on a stack of 2D CNNs for each image slice, we design 3D fusion modules to bridge context information from different 2D CNNs. (v) For objects moving in sequences, we design temporal region proposals to model their movements and interactions. We model the moving objects with spatial-temporal-interactive features for detecting them through past, current and future
Context-driven Object Detection and Segmentation with Auxiliary Information
One fundamental problem in computer vision and robotics is to
localize objects of interest in an image. The task can either be
formulated as an object detection problem if the objects are
described by a set of pose parameters, or an object segmentation
one if we recover the object boundary precisely. A key issue in
object detection and segmentation concerns exploiting the spatial
context, as local evidence is often insufficient to determine
object pose in the presence of heavy occlusions or large object
appearance variations. This thesis addresses the object detection
and segmentation problem in such adverse conditions with
auxiliary depth data provided by RGBD cameras. We focus on four
main issues in context-aware object detection and segmentation:
1) what are the effective context representations? 2) how can we
work with limited and imperfect depth data? 3) how can we design
depth-aware features and integrate depth cues into conventional
visual inference tasks? 4) how can we make use of unlabeled data to
relax the labeling requirements for training data?
We discuss three object detection and segmentation scenarios
based on varying amounts of available auxiliary information. In
the first case, depth data are available for model training but
not available for testing. We propose a structured Hough voting
method for detecting objects with heavy occlusion in indoor
environments, in which we extend the Hough hypothesis space to
include both the object's location, and its visibility pattern.
We design a new score function that accumulates votes for object
detection and occlusion prediction. In addition, we explore the
correlation between objects and their environment, building a
depth-encoded object-context model based on RGBD data. In the
second case, we address the problem of localizing glass objects
with noisy and incomplete depth data. Our method integrates the
intensity and depth information from a single viewpoint, and
builds a Markov Random Field that predicts glass boundary and
region jointly. In addition, we propose a nonparametric,
data-driven label transfer scheme for local glass boundary
estimation. A weighted voting scheme based on a joint feature
manifold is adopted to integrate depth and appearance cues, and
we learn a distance metric on the depth-encoded feature manifold.
In the third case, we make use of unlabeled data to relax the
annotation requirements for object detection and segmentation,
and propose a novel data-dependent margin distribution learning
criterion for boosting, which utilizes the intrinsic geometric
structure of datasets. One key aspect of this method is that it
can seamlessly incorporate unlabeled data by including a graph
Laplacian regularizer. We demonstrate the performance of our
models and compare with baseline methods on several real-world
object detection and segmentation tasks, including indoor object
detection, glass object segmentation and foreground segmentation
in video.
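The graph Laplacian regularizer mentioned above can be illustrated with a minimal sketch. It assumes the most common form of the penalty, f^T L f with L = D - W, which equals half the weighted sum of squared score differences over all ordered pairs of samples. The function name, the dense weight matrix, and the toy chain graph are illustrative assumptions; the thesis's exact formulation and how it enters the boosting criterion are not given in this abstract.

```python
def laplacian_penalty(weights, scores):
    """Smoothness penalty f^T L f for the graph Laplacian L = D - W.

    weights: symmetric n x n list of lists with non-negative edge weights
             (an assumed dense similarity graph over labeled and unlabeled samples).
    scores:  list of n real-valued classifier outputs, one per sample.
    """
    n = len(scores)
    total = 0.0
    for i in range(n):
        for j in range(n):
            # Strongly connected samples are penalized for disagreeing scores.
            total += weights[i][j] * (scores[i] - scores[j]) ** 2
    # Each undirected edge is visited twice, hence the factor 0.5.
    return 0.5 * total


# Toy chain graph 0 - 1 - 2 with unit edge weights.
chain = [[0, 1, 0],
         [1, 0, 1],
         [0, 1, 0]]
smooth = laplacian_penalty(chain, [1.0, 1.0, 1.0])
rough = laplacian_penalty(chain, [0.0, 1.0, 2.0])
```

Because the penalty depends only on the scores and the similarity graph, not on labels, unlabeled samples can contribute to it directly.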
Pixel- and Frame-level Video Labeling using Spatial and Temporal Convolutional Networks
This dissertation addresses the problem of video labeling at both the frame and pixel levels using deep learning. For pixel-level video labeling, we have studied two problems: i) Spatiotemporal video segmentation and ii) Boundary detection and boundary flow estimation. For the problem of spatiotemporal video segmentation, we have developed recurrent temporal deep field (RTDF). RTDF is a conditional random field (CRF) that combines a deconvolution neural network and a recurrent temporal restricted Boltzmann machine (RTRBM), which can be jointly trained end-to-end. We have derived a mean-field inference algorithm to jointly predict all latent variables in both RTRBM and CRF. For the problem of boundary detection and boundary flow estimation, we have proposed a fully convolutional Siamese network (FCSN). The FCSN first estimates object boundaries in two consecutive frames, and then predicts boundary correspondences in the two frames. For frame-level video labeling, we have specified a temporal deformable residual network (TDRN) for temporal action segmentation. TDRN computes two parallel temporal processes: i) Residual stream that analyzes video information at its full temporal resolution, and ii) Pooling/unpooling stream that captures long-range visual cues. The former facilitates local, fine-scale action segmentation, and the latter uses multiscale context for improving the accuracy of frame classification. All of our networks have been empirically evaluated on challenging benchmark datasets and compared with the state of the art. Each of the above approaches has outperformed the state of the art at the time of our evaluation.
Text Detection and Recognition in the Wild
Text detection and recognition (TDR) in highly structured environments with a clean background and consistent fonts (e.g., office documents, postal addresses and bank cheques) is a well understood problem (i.e., OCR); however, this is not the case for unstructured environments.
The main objective for scene text detection is to locate text within images captured in the wild.
For scene text recognition, the techniques map each detected or cropped word image into a string.
Nowadays, deep learning architectures based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs) dominate most of the recent state-of-the-art (SOTA) scene TDR methods.
Most of the reported accuracies of current SOTA TDR methods are in the range of 80% to 90% on benchmark datasets with regular and clear text instances. However, those detection and recognition results deteriorate drastically, by 10% and 30% in terms of detection F-measure and word recognition accuracy respectively, on irregular or occluded text images.
Transformers and their variations are new deep learning architectures that mitigate the above-mentioned issues for CNN and RNN-based pipelines. Unlike RNNs,
transformers are models that learn how to encode and decode data by looking not only backward but also forward in order to extract relevant information from a whole sequence.
This thesis utilizes the transformer architecture to address the irregular (multi-oriented and arbitrarily shaped) and occluded text challenges in images captured in the wild. Our main contributions are as follows:
(1) We first address irregular TDR with two separate architectures, as follows:
In Chapter 4, unlike the SOTA text detection frameworks that have complex pipelines and use many hand-designed components and post-processing stages, we design a conceptually more straightforward and trainable end-to-end architecture of transformer-based detector for multi-oriented scene text detection, which can directly predict the set of detections (i.e., text and box regions) of the input image. A central contribution of our work is a loss function tailored to the rotated text detection problem, which leverages a rotated version of the generalized intersection over union score to capture rotated text instances adequately.
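As a point of reference for the loss described above, here is a minimal sketch of the standard generalized intersection over union (GIoU) score for axis-aligned boxes. This is the simpler, unrotated case and not the thesis's rotated variant, which would require intersecting rotated rectangles and taking their enclosing hull instead. The function name and the (x1, y1, x2, y2) box convention are assumptions made for illustration.

```python
def giou(box_a, box_b):
    """Generalized IoU of two axis-aligned boxes given as (x1, y1, x2, y2).

    Assumes x1 < x2 and y1 < y2 for both boxes (non-degenerate boxes).
    Returns a value in (-1, 1]; 1 means the boxes coincide.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection rectangle, clamped to zero when the boxes do not overlap.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    # Plain IoU minus the fraction of the hull not covered by the union.
    return inter / union - (hull - union) / hull


same = giou((0, 0, 2, 2), (0, 0, 2, 2))
disjoint = giou((0, 0, 1, 1), (2, 0, 3, 1))
```

Unlike plain IoU, the score stays informative for non-overlapping boxes: it becomes more negative as they move apart, so a loss such as 1 - GIoU still provides a training signal in that case.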
In Chapter 5, we extend our previous architecture to arbitrarily shaped scene text detection.
We design a new text detection technique that aims to better infer n-vertices of a polygon or the degree of a Bezier curve to represent irregular-text instances.
We also propose a loss function that incorporates a generalized-split-intersection-over-union loss defined over the piecewise polygons.
In Chapter 6, we show that our transformer-based architecture, without rectifying the input curved text instances, is more suitable than SOTA RNN-based frameworks equipped with rectification modules for irregular text recognition in images captured in the wild.
Our main contribution in this chapter is leveraging a 2D Learnable Sinusoidal frequencies Positional Encoding (2LSPE) with a modified feed-forward neural network to better encode the 2D spatial dependencies of characters in the irregular text instances.
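To make the idea of a 2D sinusoidal positional encoding concrete, the sketch below builds a fixed (non-learnable) version: half of each position's channels encode the row index and the other half the column index, using the usual sine/cosine pairs. It is only a baseline illustration. In 2LSPE the frequencies are learned, and the thesis's exact parameterization is not given in this abstract; the function names, the base of 10000, and the channel layout here are assumptions.

```python
import math


def sinusoidal_1d(position, dim, base=10000.0):
    """Fixed sinusoidal code of length `dim` (must be even) for one integer position."""
    code = []
    for k in range(dim // 2):
        freq = 1.0 / (base ** (2 * k / dim))  # geometrically decreasing frequencies
        code.append(math.sin(position * freq))
        code.append(math.cos(position * freq))
    return code


def positional_encoding_2d(height, width, dim):
    """Return a height x width grid of `dim`-length codes.

    The first dim/2 channels depend only on the row, the last dim/2 only on the column.
    """
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    half = dim // 2
    return [
        [sinusoidal_1d(y, half) + sinusoidal_1d(x, half) for x in range(width)]
        for y in range(height)
    ]


grid = positional_encoding_2d(height=2, width=3, dim=8)
```

A learnable variant would replace the fixed `freq` values with trainable parameters, which is the direction the 2LSPE name suggests.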
(2) Since TDR tasks encounter the same challenging problems (e.g., irregular text, illumination variations, low-resolution text, etc.), we present a new transformer model that can detect and recognize individual characters of text instances in an end-to-end manner. Reading individual characters in turn yields a text spotting model that is robust to occlusion and arbitrarily shaped text, without needing polygon annotations or the multiple stages of detection and recognition modules used in SOTA text spotting architectures.
In Chapter 7, unlike SOTA methods that combine two different pipelines of detection and recognition modules for a complete text reading, we utilize our text detection framework by leveraging a recent transformer-based technique, namely Deformable Patch-based Transformer (DPT), as a feature extracting backbone, to robustly read the class and box coordinates of irregular characters in images captured in the wild.
(3) Finally, we address the occlusion problem by using a multi-task end-to-end scene text spotting framework.
In Chapter 8, we leverage a recent transformer-based framework in deep learning, namely Masked Auto Encoder (MAE), as a backbone for scene text recognition and end-to-end scene text spotting pipelines to overcome the partial occlusion limitation. We design a new multitask End-to-End transformer network that directly outputs characters, word instances, and their bounding box representations, reducing computational overhead by eliminating multiple processing steps. The unified proposed framework can also detect and recognize arbitrarily shaped text instances without using polygon annotations.
Face Alignment in the Wild.
PhD
Face alignment on a face image is a crucial step in many computer vision applications such
as face recognition, verification and facial expression recognition. In this thesis we present
a collection of methods for face alignment in real-world scenarios where the acquisition
of the face images cannot be controlled. We first investigate local based random regression
forest methods that work in a voting fashion. We focus on building better quality
random trees, first, by using privileged information and second, in contrast to using explicit
shape models, by incorporating spatial shape constraints within the forests. We also
propose a fine-tuning scheme that sieves and/or aggregates regression forest votes before
accumulating them into the Hough space. We then investigate holistic methods and propose
two schemes, namely the cascaded regression forests and the random subspace supervised
descent method (RSSDM). The former uses a regression forest as the primitive regressor
instead of random ferns and an intelligent initialization scheme. The RSSDM improves the
accuracy and generalization capacity of the popular SDM by using several linear regressions
in random subspaces. We also propose a Cascaded Pose Regression framework for
face alignment in different modalities, that is RGB and sketch images, based on a sketch
synthesis scheme. Finally, we introduce the concept of mirrorability which describes how
an object alignment method behaves on mirror images in comparison to how it behaves on
the original ones. We define a measure called mirror error to quantitatively analyse the mirrorability
and show two applications, namely difficult samples selection and cascaded face
alignment feedback that aids a re-initialisation scheme. The methods proposed in this thesis
perform better than or comparably to state-of-the-art methods. We also demonstrate the generality
by applying them to similar problems such as car alignment.
China Scholarship Counci
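One plausible reading of the mirror error described in the abstract above can be sketched as follows: run the alignment method on an image and on its horizontal mirror, map the mirrored prediction back into the original frame, and average the landmark distances. The function name, the pixel-coordinate flip `width - 1 - x`, the left/right `swap` permutation, and the absence of any normalization are assumptions; the thesis's exact definition is not given in this abstract.

```python
import math


def mirror_error(pred, pred_on_mirror, width, swap):
    """Mean distance between a prediction and the un-mirrored prediction on the flipped image.

    pred:           list of (x, y) landmarks predicted on the original image.
    pred_on_mirror: list of (x, y) landmarks predicted on the horizontally flipped image.
    width:          image width in pixels, used to flip x-coordinates back.
    swap:           swap[i] is the index of landmark i's left/right counterpart
                    (equal to i for landmarks on the symmetry axis).
    """
    total = 0.0
    for i, (x, y) in enumerate(pred):
        mx, my = pred_on_mirror[swap[i]]
        back_x = (width - 1) - mx  # undo the horizontal flip
        total += math.hypot(x - back_x, y - my)
    return total / len(pred)


# Two landmarks (e.g. the two eyes) on a 10-pixel-wide image.
original = [(2.0, 5.0), (7.0, 5.0)]
consistent = mirror_error(original, [(2.0, 5.0), (7.0, 5.0)], width=10, swap=[1, 0])
inconsistent = mirror_error(original, [(2.0, 5.0), (8.0, 5.0)], width=10, swap=[1, 0])
```

The measure needs no ground-truth annotations, which is consistent with the abstract's use of it for selecting difficult samples and for triggering re-initialisation.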