9,079 research outputs found
Joint Object and Part Segmentation using Deep Learned Potentials
Segmenting semantic objects from images and parsing them into their
respective semantic parts are fundamental steps towards detailed object
understanding in computer vision. In this paper, we propose a joint solution
that tackles semantic object and part segmentation simultaneously, in which
higher object-level context is provided to guide part segmentation, and more
detailed part-level localization is utilized to refine object segmentation.
Specifically, we first introduce the concept of semantic compositional parts
(SCP) in which similar semantic parts are grouped and shared among different
objects. A two-channel fully convolutional network (FCN) is then trained to
provide the SCP and object potentials at each pixel. At the same time, a
compact set of segments can also be obtained from the SCP predictions of the
network. Given the potentials and the generated segments, in order to explore
long-range context, we finally construct an efficient fully connected
conditional random field (FCRF) to jointly predict the final object and part
labels. Extensive evaluation on three different datasets shows that our
approach can mutually enhance the performance of object and part segmentation,
and outperforms the current state-of-the-art on both tasks
Structured Learning of Tree Potentials in CRF for Image Segmentation
We propose a new approach to image segmentation, which exploits the
advantages of both conditional random fields (CRFs) and decision trees. In the
literature, the potential functions of CRFs are mostly defined as a linear
combination of some pre-defined parametric models, and then methods like
structured support vector machines (SSVMs) are applied to learn those linear
coefficients. We instead formulate the unary and pairwise potentials as
nonparametric forests---ensembles of decision trees, and learn the ensemble
parameters and the trees in a unified optimization problem within the
large-margin framework. In this fashion, we easily achieve nonlinear learning
of potential functions on both unary and pairwise terms in CRFs. Moreover, we
learn class-wise decision trees for each object that appears in the image. Due
to the rich structure and flexibility of decision trees, our approach is
powerful in modelling complex data likelihoods and label relationships. The
resulting optimization problem is very challenging because it can have
exponentially many variables and constraints. We show that this challenging
optimization can be efficiently solved by combining a modified column
generation and cutting-planes techniques. Experimental results on both binary
(Graz-02, Weizmann horse, Oxford flower) and multi-class (MSRC-21, PASCAL VOC
2012) segmentation datasets demonstrate the power of the learned nonlinear
nonparametric potentials.Comment: 10 pages. Appearing in IEEE Transactions on Neural Networks and
Learning System
A Multi-cut Formulation for Joint Segmentation and Tracking of Multiple Objects
Recently, Minimum Cost Multicut Formulations have been proposed and proven to
be successful in both motion trajectory segmentation and multi-target tracking
scenarios. Both tasks benefit from decomposing a graphical model into an
optimal number of connected components based on attractive and repulsive
pairwise terms. The two tasks are formulated on different levels of granularity
and, accordingly, leverage mostly local information for motion segmentation and
mostly high-level information for multi-target tracking. In this paper we argue
that point trajectories and their local relationships can contribute to the
high-level task of multi-target tracking and also argue that high-level cues
from object detection and tracking are helpful to solve motion segmentation. We
propose a joint graphical model for point trajectories and object detections
whose Multicuts are solutions to motion segmentation {\it and} multi-target
tracking problems at once. Results on the FBMS59 motion segmentation benchmark
as well as on pedestrian tracking sequences from the 2D MOT 2015 benchmark
demonstrate the promise of this joint approach
SEGCloud: Semantic Segmentation of 3D Point Clouds
3D semantic scene labeling is fundamental to agents operating in the real
world. In particular, labeling raw 3D point sets from sensors provides
fine-grained semantics. Recent works leverage the capabilities of Neural
Networks (NNs), but are limited to coarse voxel predictions and do not
explicitly enforce global consistency. We present SEGCloud, an end-to-end
framework to obtain 3D point-level segmentation that combines the advantages of
NNs, trilinear interpolation(TI) and fully connected Conditional Random Fields
(FC-CRF). Coarse voxel predictions from a 3D Fully Convolutional NN are
transferred back to the raw 3D points via trilinear interpolation. Then the
FC-CRF enforces global consistency and provides fine-grained semantics on the
points. We implement the latter as a differentiable Recurrent NN to allow joint
optimization. We evaluate the framework on two indoor and two outdoor 3D
datasets (NYU V2, S3DIS, KITTI, Semantic3D.net), and show performance
comparable or superior to the state-of-the-art on all datasets.Comment: Accepted as a spotlight at the International Conference of 3D Vision
(3DV 2017
Joint Multi-Person Pose Estimation and Semantic Part Segmentation
Human pose estimation and semantic part segmentation are two complementary
tasks in computer vision. In this paper, we propose to solve the two tasks
jointly for natural multi-person images, in which the estimated pose provides
object-level shape prior to regularize part segments while the part-level
segments constrain the variation of pose locations. Specifically, we first
train two fully convolutional neural networks (FCNs), namely Pose FCN and Part
FCN, to provide initial estimation of pose joint potential and semantic part
potential. Then, to refine pose joint location, the two types of potentials are
fused with a fully-connected conditional random field (FCRF), where a novel
segment-joint smoothness term is used to encourage semantic and spatial
consistency between parts and joints. To refine part segments, the refined pose
and the original part potential are integrated through a Part FCN, where the
skeleton feature from pose serves as additional regularization cues for part
segments. Finally, to reduce the complexity of the FCRF, we induce human
detection boxes and infer the graph inside each box, making the inference forty
times faster.
Since there's no dataset that contains both part segments and pose labels, we
extend the PASCAL VOC part dataset with human pose joints and perform extensive
experiments to compare our method against several most recent strategies. We
show that on this dataset our algorithm surpasses competing methods by a large
margin in both tasks.Comment: This paper has been accepted by CVPR 201
Deeply Learning the Messages in Message Passing Inference
Deep structured output learning shows great promise in tasks like semantic
image segmentation. We proffer a new, efficient deep structured model learning
scheme, in which we show how deep Convolutional Neural Networks (CNNs) can be
used to estimate the messages in message passing inference for structured
prediction with Conditional Random Fields (CRFs). With such CNN message
estimators, we obviate the need to learn or evaluate potential functions for
message calculation. This confers significant efficiency for learning, since
otherwise when performing structured learning for a CRF with CNN potentials it
is necessary to undertake expensive inference for every stochastic gradient
iteration. The network output dimension for message estimation is the same as
the number of classes, in contrast to the network output for general CNN
potential functions in CRFs, which is exponential in the order of the
potentials. Hence CNN message learning has fewer network parameters and is more
scalable for cases that a large number of classes are involved. We apply our
method to semantic image segmentation on the PASCAL VOC 2012 dataset. We
achieve an intersection-over-union score of 73.4 on its test set, which is the
best reported result for methods using the VOC training images alone. This
impressive performance demonstrates the effectiveness and usefulness of our CNN
message learning method.Comment: 11 pages. Appearing in Proc. The Twenty-ninth Annual Conference on
Neural Information Processing Systems (NIPS), 2015, Montreal, Canad
A Projected Gradient Descent Method for CRF Inference allowing End-To-End Training of Arbitrary Pairwise Potentials
Are we using the right potential functions in the Conditional Random Field
models that are popular in the Vision community? Semantic segmentation and
other pixel-level labelling tasks have made significant progress recently due
to the deep learning paradigm. However, most state-of-the-art structured
prediction methods also include a random field model with a hand-crafted
Gaussian potential to model spatial priors, label consistencies and
feature-based image conditioning.
In this paper, we challenge this view by developing a new inference and
learning framework which can learn pairwise CRF potentials restricted only by
their dependence on the image pixel values and the size of the support. Both
standard spatial and high-dimensional bilateral kernels are considered. Our
framework is based on the observation that CRF inference can be achieved via
projected gradient descent and consequently, can easily be integrated in deep
neural networks to allow for end-to-end training. It is empirically
demonstrated that such learned potentials can improve segmentation accuracy and
that certain label class interactions are indeed better modelled by a
non-Gaussian potential. In addition, we compare our inference method to the
commonly used mean-field algorithm. Our framework is evaluated on several
public benchmarks for semantic segmentation with improved performance compared
to previous state-of-the-art CNN+CRF models.Comment: Presented at EMMCVPR 2017 conferenc
- …