5 research outputs found
Error Correction for Dense Semantic Image Labeling
Pixelwise semantic image labeling is an important, yet challenging, task with
many applications. Typical approaches to tackle this problem involve either the
training of deep networks on vast amounts of images to directly infer the
labels or the use of probabilistic graphical models to jointly model the
dependencies of the input (i.e. images) and output (i.e. labels). Yet, the
former approaches do not capture the structure of the output labels, which is
crucial for the performance of dense labeling, and the latter rely on carefully
hand-designed priors that require costly parameter tuning via optimization
techniques, which in turn leads to long inference times. To alleviate these
restrictions, we explore how to arrive at dense semantic pixel labels given
both the input image and an initial estimate of the output labels. We propose a
parallel architecture that: 1) exploits the context information through a
LabelPropagation network to propagate correct labels from nearby pixels to
improve the object boundaries, 2) uses a LabelReplacement network to directly
replace possibly erroneous, initial labels with new ones, and 3) combines the
different intermediate results via a Fusion network to obtain the final
per-pixel label. We experimentally validate our approach on two different
datasets for the semantic segmentation and face parsing tasks respectively,
where we show improvements over the state-of-the-art. We also provide both a
quantitative and qualitative analysis of the generated results
Learning to Refine Human Pose Estimation
Multi-person pose estimation in images and videos is an important yet
challenging task with many applications. Despite the large improvements in
human pose estimation enabled by the development of convolutional neural
networks, there still exist a lot of difficult cases where even the
state-of-the-art models fail to correctly localize all body joints. This
motivates the need for an additional refinement step that addresses these
challenging cases and can be easily applied on top of any existing method. In
this work, we introduce a pose refinement network (PoseRefiner) which takes as
input both the image and a given pose estimate and learns to directly predict a
refined pose by jointly reasoning about the input-output space. In order for
the network to learn to refine incorrect body joint predictions, we employ a
novel data augmentation scheme for training, where we model "hard" human pose
cases. We evaluate our approach on four popular large-scale pose estimation
benchmarks such as MPII Single- and Multi-Person Pose Estimation, PoseTrack
Pose Estimation, and PoseTrack Pose Tracking, and report systematic improvement
over the state of the art.Comment: To appear in CVPRW (2018). Workshop: Visual Understanding of Humans
in Crowd Scene and the 2nd Look Into Person Challenge (VUHCS-LIP
Error Correction for Dense Semantic Image Labeling
© 2018 IEEE. Pixel-wise semantic image labeling is an important, yet challenging task with many applications. Especially in autonomous driving systems, it allows for a full understanding of the system's surroundings, which is crucial for trajectory planning. Typical approaches to tackle this problem involve either the training of deep networks on vast amounts of images to directly infer the labels or the use of probabilistic graphical models to jointly model the dependencies of the input (i.e. images) and output (i.e. labels). Yet, the former approaches do not capture the structure of the output labels, which is crucial for the performance of dense labeling, and the latter rely on carefully hand-designed priors that require costly parameter tuning via optimization techniques, which in turn leads to long inference times. To alleviate these restrictions, we explore how to arrive at dense semantic pixel labels given both the input image and an initial estimate of the output labels. We propose a parallel architecture that: 1) exploits the context information through a LabelPropagation network to propagate correct labels from nearby pixels to improve the object boundaries, 2) uses a LabelReplacement network to directly replace possibly erroneous, initial labels with new ones, and 3) combines the different intermediate results via a Fusion network to obtain the final per-pixel label. We experimentally validate our approach on two different datasets for semantic segmentation, where we show improvements over the state-of-the-art. We also provide both a quantitative and qualitative analysis of the generated results.status: publishe