2,561 research outputs found
Deep Structured Models for Large Scale Object Co-detection and Segmentation
Structured decisions are often required for a large variety of
image and scene understanding tasks in computer vision, with few
of them being object detection, localization, semantic
segmentation and many more. Structured prediction deals with
learning inherent structure by incorporating contextual
information from several images and multiple tasks. However, it
is very challenging when dealing with large scale image datasets
where performance is limited by high computational costs and
expressive power of the underlying representation learning
techniques. In this thesis,
we present efficient and effective deep structured models for
context-aware object detection, co-localization and
instance-level semantic segmentation.
First, we introduce a principled formulation for object
co-detection using a fully-connected conditional random field
(CRF). We build an explicit graph whose vertices represent object
candidates (instead of pixel values) and edges encode the object
similarity via simple, yet effective pairwise potentials. More
specifically, we design a weighted mixture of Gaussian kernels
for class-specific object similarity, and formulate kernel
weights estimation as a least-squares regression problem. Its
solution can therefore be obtained in closed-form. Furthermore,
in contrast with traditional co-detection approaches, it has been
shown that inference in such fully-connected CRFs can be
performed efficiently using an approximate mean-field method with
high-dimensional Gaussian filtering. This lets us effectively
leverage information in multiple images.
Next, we extend our class-specific co-detection framework to
multiple object categories. We model object candidates with rich,
high-dimensional features learned using a deep convolutional
neural network. In particular, our max-margin and directloss
structural boosting algorithms enable us to learn the most
suitable features that best encode pairwise similarity
relationships within our CRF framework. Furthermore, it
guarantees that the time and space complexity is O(n t) where n
is the total number of candidate boxes in the pool and t the
number of mean-field iterations.
Moreover, our experiments evidence the importance of learning
rich similarity measures to account for the contextual relations
across object classes and instances. However, all these methods
are based on precomputed object candidates (or proposals), thus
localization performance is limited by the quality of
bounding-boxes.
To address this, we present an efficient object proposal
co-generation technique that leverages the collective power of
multiple images. In particular, we design a deep neural network
layer that takes unary and pairwise features as input, builds a
fully-connected CRF and produces mean-field marginals as output.
It also lets us backpropagate the gradient through entire network
by unrolling the iterations of CRF inference. Furthermore, this
layer simplifies the end-to-end learning, thus effectively
benefiting from multiple candidates to co-generate high-quality
object proposals.
Finally, we develop a multi-task strategy to jointly learn object
detection, localization and instance-level semantic segmentation
in a single network. In particular, we introduce a novel
representation based on the distance transform of the object
masks. To this end, we design a new residual-deconvolution
architecture that infers such a representation and decodes it
into the final binary object mask. We show that the predicted
masks can go beyond the scope of the bounding boxes and that the
multiple tasks can benefit from each other.
In summary, in this thesis, we exploit the joint power of
multiple images as well as multiple tasks to improve
generalization performance of structured learning. Our novel deep
structured models, similarity learning techniques and
residual-deconvolution architecture can be used to make accurate
and reliable inference for key vision tasks. Furthermore, our
quantitative and qualitative experiments on large scale
challenging image datasets demonstrate the superiority of the
proposed approaches over the state-of-the-art methods
DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs
In this work we address the task of semantic image segmentation with Deep
Learning and make three main contributions that are experimentally shown to
have substantial practical merit. First, we highlight convolution with
upsampled filters, or 'atrous convolution', as a powerful tool in dense
prediction tasks. Atrous convolution allows us to explicitly control the
resolution at which feature responses are computed within Deep Convolutional
Neural Networks. It also allows us to effectively enlarge the field of view of
filters to incorporate larger context without increasing the number of
parameters or the amount of computation. Second, we propose atrous spatial
pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP
probes an incoming convolutional feature layer with filters at multiple
sampling rates and effective fields-of-views, thus capturing objects as well as
image context at multiple scales. Third, we improve the localization of object
boundaries by combining methods from DCNNs and probabilistic graphical models.
The commonly deployed combination of max-pooling and downsampling in DCNNs
achieves invariance but has a toll on localization accuracy. We overcome this
by combining the responses at the final DCNN layer with a fully connected
Conditional Random Field (CRF), which is shown both qualitatively and
quantitatively to improve localization performance. Our proposed "DeepLab"
system sets the new state-of-art at the PASCAL VOC-2012 semantic image
segmentation task, reaching 79.7% mIOU in the test set, and advances the
results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and
Cityscapes. All of our code is made publicly available online.Comment: Accepted by TPAM
Pixelwise Instance Segmentation with a Dynamically Instantiated Network
Semantic segmentation and object detection research have recently achieved
rapid progress. However, the former task has no notion of different instances
of the same object, and the latter operates at a coarse, bounding-box level. We
propose an Instance Segmentation system that produces a segmentation map where
each pixel is assigned an object class and instance identity label. Most
approaches adapt object detectors to produce segments instead of boxes. In
contrast, our method is based on an initial semantic segmentation module, which
feeds into an instance subnetwork. This subnetwork uses the initial
category-level segmentation, along with cues from the output of an object
detector, within an end-to-end CRF to predict instances. This part of our model
is dynamically instantiated to produce a variable number of instances per
image. Our end-to-end approach requires no post-processing and considers the
image holistically, instead of processing independent proposals. Therefore,
unlike some related work, a pixel cannot belong to multiple instances.
Furthermore, far more precise segmentations are achieved, as shown by our
state-of-the-art results (particularly at high IoU thresholds) on the Pascal
VOC and Cityscapes datasets.Comment: CVPR 201
Person re-identification via efficient inference in fully connected CRF
In this paper, we address the problem of person re-identification problem,
i.e., retrieving instances from gallery which are generated by the same person
as the given probe image. This is very challenging because the person's
appearance usually undergoes significant variations due to changes in
illumination, camera angle and view, background clutter, and occlusion over the
camera network. In this paper, we assume that the matched gallery images should
not only be similar to the probe, but also be similar to each other, under
suitable metric. We express this assumption with a fully connected CRF model in
which each node corresponds to a gallery and every pair of nodes are connected
by an edge. A label variable is associated with each node to indicate whether
the corresponding image is from target person. We define unary potential for
each node using existing feature calculation and matching techniques, which
reflect the similarity between probe and gallery image, and define pairwise
potential for each edge in terms of a weighed combination of Gaussian kernels,
which encode appearance similarity between pair of gallery images. The specific
form of pairwise potential allows us to exploit an efficient inference
algorithm to calculate the marginal distribution of each label variable for
this dense connected CRF. We show the superiority of our method by applying it
to public datasets and comparing with the state of the art.Comment: 7 pages, 4 figure
- …