Detecting Semantic Parts on Partially Occluded Objects
In this paper, we address the task of detecting semantic parts on partially
occluded objects. We consider a scenario where the model is trained using
non-occluded images but tested on occluded images. The motivation is that there
are infinitely many occlusion patterns in the real world, which cannot be fully
covered by the training data, so models should be inherently robust and
adaptive to occlusions rather than fitting the occlusion patterns seen in
the training data. Our approach detects semantic parts by accumulating the
confidence of local visual cues. Specifically, the method uses a simple voting
method, based on log-likelihood ratio tests and spatial constraints, to combine
the evidence of local cues. These cues are called visual concepts, which are
derived by clustering the internal states of deep networks. We evaluate our
voting scheme on the VehicleSemanticPart dataset with dense part annotations.
We randomly place two, three or four irrelevant objects onto the target object
to generate testing images with various occlusions. Experiments show that our
algorithm outperforms several competitors in semantic part detection when
occlusions are present. Comment: Accepted to BMVC 2017 (13 pages, 3 figures).
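The voting scheme described above can be sketched in a few lines: each local cue that survives occlusion contributes the log-likelihood ratio of appearing on the part versus on background, and the evidence is summed. The cue names and probabilities below are illustrative stand-ins, not values from the paper.

```python
import math

# Hypothetical probabilities of observing each local visual cue given that the
# semantic part is present vs. given background (illustrative numbers only).
p_cue_given_part = {"spoke": 0.8, "rim": 0.7, "hub": 0.6}
p_cue_given_bg = {"spoke": 0.1, "rim": 0.2, "hub": 0.3}

def vote_score(observed_cues):
    """Accumulate log-likelihood ratio evidence from the cues that were observed."""
    return sum(
        math.log(p_cue_given_part[c] / p_cue_given_bg[c]) for c in observed_cues
    )

# With all cues visible the score is high; occluding one cue lowers the score
# gracefully instead of destroying the detection.
full = vote_score(["spoke", "rim", "hub"])
occluded = vote_score(["spoke", "rim"])  # "hub" hidden by an occluder
```

This additive structure is what makes the detector degrade gracefully: a missing cue removes one term from the sum rather than corrupting the whole score.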
DeepVoting: A Robust and Explainable Deep Network for Semantic Part Detection under Partial Occlusion
In this paper, we study the task of detecting semantic parts of an object,
e.g., a wheel of a car, under partial occlusion. We propose that all models
should be trained without seeing occlusions while being able to transfer the
learned knowledge to deal with occlusions. This setting alleviates the
difficulty of collecting an exponentially large dataset to cover all occlusion
patterns and is closer to real-world conditions. In this scenario, proposal-based deep
networks, like RCNN-series, often produce unsatisfactory results, because both
the proposal extraction and classification stages may be confused by the
irrelevant occluders. To address this, [25] proposed a voting mechanism that
combines multiple local visual cues to detect semantic parts. The semantic
parts can still be detected even though some visual cues are missing due to
occlusions. However, this method is manually designed and therefore hard to
optimize in an end-to-end manner.
In this paper, we present DeepVoting, which incorporates the robustness shown
by [25] into a deep network, so that the whole pipeline can be jointly
optimized. Specifically, it adds two layers after the intermediate features of
a deep network, e.g., the pool-4 layer of VGGNet. The first layer extracts the
evidence of local visual cues, and the second layer performs a voting mechanism
by utilizing the spatial relationship between visual cues and semantic parts.
We also propose an improved version, DeepVoting+, which additionally learns
visual cues from the context outside objects. In experiments, DeepVoting achieves significantly
better performance than several baseline methods, including Faster-RCNN, for
semantic part detection under occlusion. In addition, DeepVoting enjoys
explainability, as the detection results can be diagnosed by looking up the
voting cues.
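A minimal sketch of the two added layers, using NumPy in place of a deep-learning framework: a 1x1 convolution scores cue evidence on intermediate features, and a spatial kernel accumulates the cues' votes into a part heatmap. All shapes, the random weights, and the shared averaging kernel are illustrative assumptions, not the paper's learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for intermediate CNN features (e.g. the pool-4 layer): H x W x C.
H, W, C, K = 8, 8, 16, 4  # K visual-concept cues
features = rng.standard_normal((H, W, C))

# Layer 1: a 1x1 convolution scoring the evidence for each visual cue at every
# spatial position (weights here are random placeholders, not learned ones).
cue_weights = rng.standard_normal((C, K))
cue_evidence = np.maximum(features @ cue_weights, 0.0)  # ReLU, shape H x W x K

def vote(evidence, kernel):
    """Convolve each cue map with a spatial voting kernel and sum over cues.
    A real model would keep one learned kernel per (cue, part) pair; a single
    shared kernel keeps the sketch short."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(evidence, ((ph, ph), (pw, pw), (0, 0)))
    h, w, _ = evidence.shape
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel[:, :, None])
    return out

# Layer 2: accumulate votes through a fixed 3x3 averaging kernel.
kernel = np.full((3, 3), 1.0 / 9.0)
part_heatmap = vote(cue_evidence, kernel)
```

Because both layers are plain (convolution-like) linear maps followed by a ReLU, the whole pipeline is differentiable, which is what allows DeepVoting to be trained end-to-end, unlike the hand-designed voting it builds on.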
Visual Concepts and Compositional Voting
It is very attractive to formulate vision in terms of pattern theory
\cite{Mumford2010pattern}, where patterns are defined hierarchically by
compositions of elementary building blocks. But applying pattern theory to real
world images is currently less successful than discriminative methods such as
deep networks. Deep networks, however, are black-boxes which are hard to
interpret and can easily be fooled by adding occluding objects. It is natural
to wonder whether by better understanding deep networks we can extract building
blocks which can be used to develop pattern theoretic models. This motivates us
to study the internal representations of a deep network using vehicle images
from the PASCAL3D+ dataset. We use clustering algorithms to study the
population activities of the features and extract a set of visual concepts
which we show are visually tight and correspond to semantic parts of vehicles.
To analyze this we annotate these vehicles by their semantic parts to create a
new dataset, VehicleSemanticParts, and evaluate visual concepts as unsupervised
part detectors. We show that visual concepts perform fairly well but are
outperformed by supervised discriminative methods such as Support Vector
Machines (SVM). We next give a more detailed analysis of visual concepts and
how they relate to semantic parts. Following this, we use the visual concepts
as building blocks for a simple pattern theoretical model, which we call
compositional voting. In this model several visual concepts combine to detect
semantic parts. We show that this approach is significantly better than
discriminative methods like SVM and deep networks trained specifically for
semantic part detection. Finally, we return to studying occlusion by creating
an annotated dataset with occlusion, called VehicleOcclusion, and show that
compositional voting outperforms even deep networks when the amount of
occlusion becomes large. Comment: Accepted by Annals of Mathematical Sciences and Applications.
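The visual-concept extraction step described above, clustering the population activity of intermediate deep-network features, can be sketched with a tiny k-means on synthetic feature vectors. The two well-separated synthetic clusters below stand in for concepts that would be discovered from real pool-4 features; the dimensions and cluster count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for feature vectors collected over many vehicle images: two
# synthetic clusters play the role of two "visual concepts".
feats = np.vstack([
    rng.standard_normal((50, 8)) + 4.0,
    rng.standard_normal((50, 8)) - 4.0,
])

def kmeans(x, k, iters=20):
    """Minimal k-means; each converged centroid is a candidate visual concept."""
    # Deterministic init: pick k evenly spaced samples as starting centroids.
    centers = x[np.linspace(0, len(x) - 1, k, dtype=int)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        centers = np.array([x[labels == j].mean(axis=0) for j in range(k)])
    return centers, labels

concepts, assignment = kmeans(feats, k=2)
```

Each centroid summarizes a tight mode of feature activity; in the paper these modes turn out to align with semantic parts, which is what makes them usable as unsupervised part detectors and as building blocks for compositional voting.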
Grid Loss: Detecting Occluded Faces
Detection of partially occluded objects is a challenging computer vision
problem. Standard Convolutional Neural Network (CNN) detectors fail if parts of
the detection window are occluded, since not every sub-part of the window is
discriminative on its own. To address this issue, we propose a novel loss layer
for CNNs, named grid loss, which minimizes the error rate on sub-blocks of a
convolution layer independently rather than over the whole feature map. This
results in parts being more discriminative on their own, enabling the detector
to recover if the detection window is partially occluded. By mapping our loss
layer back to a regular fully connected layer, no additional computational cost
is incurred at runtime compared to standard CNNs. We demonstrate our method for
face detection on several public face detection benchmarks and show that our
method outperforms regular CNNs, is suitable for realtime applications and
achieves state-of-the-art performance. Comment: Accepted to ECCV 201
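A hedged sketch of the idea behind grid loss: the detection loss is applied to non-overlapping sub-blocks of the template independently, in addition to the holistic loss, so every block must be discriminative on its own. The logistic loss, the 2x2 block layout, and the weighting `lam` below are illustrative simplifications, not the paper's exact formulation.

```python
import numpy as np

def logistic_loss(score, label):
    """Binary logistic loss for a single detection score; label is +1 or -1."""
    return np.log1p(np.exp(-label * score))

def grid_loss(feature_map, weights, label, grid=2, lam=1.0):
    """Holistic loss plus independent losses over non-overlapping sub-blocks,
    so each part of the detection window is trained to be discriminative
    on its own (robustness to partial occlusion)."""
    holistic = logistic_loss(np.sum(feature_map * weights), label)
    h, w = feature_map.shape
    bh, bw = h // grid, w // grid
    part_losses = []
    for i in range(grid):
        for j in range(grid):
            block_f = feature_map[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            block_w = weights[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            part_losses.append(logistic_loss(np.sum(block_f * block_w), label))
    return holistic + lam * sum(part_losses)

rng = np.random.default_rng(2)
fmap = rng.standard_normal((4, 4))
w = rng.standard_normal((4, 4))
loss_pos = grid_loss(fmap, w, label=1)
```

Since the per-block terms reuse the same weights as the holistic term, the trained model can be collapsed back into a single linear (fully connected) layer, which is why the method adds no runtime cost.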
Object Detection in 20 Years: A Survey
Object detection, as one of the most fundamental and challenging problems in
computer vision, has received great attention in recent years. Its development
in the past two decades can be regarded as an epitome of computer vision
history. If we think of today's object detection as a technical aesthetic
under the power of deep learning, then turning the clock back 20 years we would
witness the wisdom of the cold-weapon era. This paper extensively reviews 400+
papers of object detection in the light of its technical evolution, spanning
over a quarter-century's time (from the 1990s to 2019). A number of topics have
been covered in this paper, including the milestone detectors in history,
detection datasets, metrics, fundamental building blocks of the detection
system, speed up techniques, and the recent state of the art detection methods.
This paper also reviews some important detection applications, such as
pedestrian detection, face detection, text detection, etc., and makes an in-depth
analysis of their challenges as well as technical improvements in recent years. Comment: This work has been submitted to the IEEE TPAMI for possible
publication.