Multi-Person Pose Estimation via Column Generation
We study the problem of multi-person pose estimation in natural images. A
pose estimate describes the spatial position and identity (head, foot, knee,
etc.) of every non-occluded body part of a person. Pose estimation is difficult
due to issues such as deformation and variation in body configurations and
occlusion of parts, while multi-person settings add complications such as an
unknown number of people, with unknown appearance and possible interactions in
their poses and part locations. We give a novel integer program formulation of
the multi-person pose estimation problem, in which variables correspond to
assignments of parts in the image to poses in a two-tier, hierarchical way.
This enables us to develop an efficient custom optimization procedure based on
column generation, where columns are produced by exact optimization of very
small scale integer programs. We demonstrate improved accuracy and speed for
our method on the MPII multi-person pose estimation benchmark.
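To make the column-generation loop concrete, here is a minimal Python sketch of a restricted-master/pricing iteration for a set-packing-style LP relaxation. The part costs, the additive pose-cost model, and the separable pricing rule are toy assumptions; the paper instead prices columns by solving very small exact integer programs.

```python
# Toy column generation: restricted master LP over "pose" columns, with a
# separable pricing rule. All costs are invented for illustration.
import numpy as np
from scipy.optimize import linprog

n_parts = 6
rng = np.random.default_rng(0)
part_cost = rng.normal(size=n_parts)   # per-part unary costs (assumed)
pose_bias = 0.5                        # per-pose cost offset (assumed)

columns = [np.eye(n_parts)[i] for i in range(n_parts)]  # start: singletons

for _ in range(20):
    costs = np.array([pose_bias + part_cost @ c for c in columns])
    A = np.stack(columns, axis=1)      # parts x columns incidence matrix
    # Restricted master LP: min c^T x  s.t.  A x <= 1, x >= 0.
    res = linprog(costs, A_ub=A, b_ub=np.ones(n_parts),
                  bounds=[(0, None)] * len(columns), method="highs")
    y = res.ineqlin.marginals          # duals of the packing constraints
    # Pricing (separable toy model): include part i iff its adjusted cost
    # part_cost[i] - y[i] is negative; this minimizes the reduced cost.
    new_col = ((part_cost - y) < 0).astype(float)
    reduced = pose_bias + (part_cost - y) @ new_col
    if reduced >= -1e-9 or any((new_col == c).all() for c in columns):
        break                          # no improving column: LP optimal
    columns.append(new_col)

print(f"{len(columns)} columns, LP value {res.fun:.3f}")
```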
Efficient Multi-Person Pose Estimation with Provable Guarantees
Multi-person pose estimation (MPPE) in natural images is key to the
meaningful use of visual data in many fields including movement science,
security, and rehabilitation. In this paper we tackle MPPE with a bottom-up
approach, starting with candidate detections of body parts from a convolutional
neural network (CNN) and grouping them into people. We formulate the grouping
of body part detections into people as a minimum-weight set packing (MWSP)
problem where the set of potential people is the power set of body part
detections. We model the quality of a person hypothesis (a set in the MWSP) with an augmented tree-structured Markov random field in which variables correspond to body parts and each variable's state space is the power set of the detections for that part.
We describe a novel algorithm that combines efficiency with provable bounds
on this MWSP problem. We employ an implicit column generation strategy where
the pricing problem is formulated as a dynamic program. To efficiently solve
this dynamic program we exploit the problem structure utilizing a nested
Benders decomposition (NBD) exact inference strategy, which we speed up by recycling Benders rows between calls to the pricing problem.
We test our approach on the MPII-Multiperson dataset, showing that it obtains results comparable to the state-of-the-art algorithm for joint node labeling and grouping problems, and that NBD achieves considerable speed-ups relative to a naive dynamic programming approach. Typical algorithms that solve joint node labeling and grouping problems use heuristics and thus cannot obtain proofs of optimality. Our approach, in contrast, proves that for over 99 percent of problem instances we find the globally optimal solution, and otherwise provides upper/lower bounds.
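As a rough illustration of the kind of exact inference the pricing problem requires, the sketch below runs max-score dynamic programming (Viterbi) on a chain of parts whose state spaces are power sets of per-part detections. The three-part skeleton and random scores are invented; the paper's nested Benders decomposition is a strategy for accelerating exactly this style of computation.

```python
# Exact DP on a chain-shaped "skeleton" whose states are power sets of
# per-part detections. Scores are random placeholders.
import itertools
import numpy as np

rng = np.random.default_rng(1)
parts = ["head", "neck", "shoulder"]            # toy chain of parts
detections = {"head": 2, "neck": 3, "shoulder": 2}

def powerset(n):
    return [frozenset(s) for r in range(n + 1)
            for s in itertools.combinations(range(n), r)]

states = {p: powerset(detections[p]) for p in parts}
unary = {p: {s: rng.normal() for s in states[p]} for p in parts}
pair = {(a, b): {(s, t): rng.normal() for s in states[a] for t in states[b]}
        for a, b in zip(parts, parts[1:])}

def chain_viterbi(parts, states, unary, pair):
    """Exact max-score assignment over power-set states on a chain."""
    best = {s: unary[parts[0]][s] for s in states[parts[0]]}
    back = []
    for a, b in zip(parts, parts[1:]):
        nxt, ptr = {}, {}
        for t in states[b]:
            s = max(best, key=lambda s: best[s] + pair[(a, b)][(s, t)])
            nxt[t] = best[s] + pair[(a, b)][(s, t)] + unary[b][t]
            ptr[t] = s
        best, back = nxt, back + [ptr]
    t = max(best, key=best.get)                 # backtrack from best state
    assignment = [t]
    for ptr in reversed(back):
        assignment.append(ptr[assignment[-1]])
    return best[t], list(reversed(assignment))

score, assign = chain_viterbi(parts, states, unary, pair)
print(f"best hypothesis score: {score:.3f}")
```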
Exploiting skeletal structure in computer vision annotation with Benders decomposition
Many annotation problems in computer vision can be phrased as integer linear
programs (ILPs). The use of standard industrial solvers does not exploit the underlying structure of such problems, e.g., the skeleton in pose estimation. Leveraging the underlying structure in conjunction with industrial solvers promises increases in both speed and accuracy. Such structure can be exploited using Benders decomposition, a technique from operations research that solves
complex ILPs or mixed integer linear programs by decomposing them into
sub-problems that communicate via a master problem. The intuition is that
conditioned on a small subset of the variables the solution to the remaining
variables can be computed easily by taking advantage of properties of the ILP
constraint matrix such as block structure. In this paper we apply Benders
decomposition to a typical problem in computer vision where we have many
sub-ILPs (e.g., partitioning of detections, body-parts) coupled to a master ILP (e.g., constructing skeletons). Dividing inference problems into a master problem
and sub-problems motivates the development of a plethora of novel models and inference approaches for the field of computer vision.
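The toy loop below illustrates the master/sub-problem communication just described: a small binary master accumulates optimality cuts derived from an LP sub-problem's dual. The instance (costs, identity coupling matrix) is invented, and SciPy's milp/linprog stand in for an industrial solver; this is a sketch of the decomposition pattern, not any particular vision model.

```python
# Benders decomposition on a toy MILP: min c^T y + Q(y) with binary y,
# where Q(y) = min { d^T x : x >= h - T y, x >= 0 }. All data is made up.
import numpy as np
from scipy.optimize import linprog, milp, LinearConstraint, Bounds

c = np.array([2.0, 3.0])             # costs of the binary master variables y
d = np.array([1.0, 1.0])             # costs of the sub-problem variables x
T = np.eye(2)                        # coupling matrix (W = I for simplicity)
h = np.array([1.0, 1.0])

cuts = []                            # dual extreme points -> optimality cuts
for _ in range(10):
    n = len(c)
    cons = None
    if cuts:
        # Each cut: theta + (u^T T) y >= u^T h, on variables [y, theta].
        A = np.array([np.concatenate([u @ T, [1.0]]) for u in cuts])
        cons = LinearConstraint(A, lb=np.array([u @ h for u in cuts]))
    res = milp(c=np.concatenate([c, [1.0]]),
               constraints=cons,
               integrality=np.concatenate([np.ones(n), [0]]),
               bounds=Bounds(lb=np.concatenate([np.zeros(n), [-100.0]]),
                             ub=np.concatenate([np.ones(n), [np.inf]])))
    y, theta = res.x[:n], res.x[n]
    # Sub-problem dual: max u^T (h - T y)  s.t.  0 <= u <= d (since W = I).
    sub = linprog(-(h - T @ y), bounds=[(0, di) for di in d], method="highs")
    u, sub_val = sub.x, sub.x @ (h - T @ y)
    if theta >= sub_val - 1e-6:      # master's estimate matches: optimal
        break
    cuts.append(u)

print("y* =", y.round().astype(int), " objective =", c @ y + sub_val)
```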
Cross-Domain Complementary Learning Using Pose for Multi-Person Part Segmentation
Supervised deep learning with pixel-wise training labels has achieved great success on multi-person part segmentation. However, data labeling at the pixel level is very expensive. To address this, researchers have explored using synthetic data to avoid manual labeling. Although it is easy to generate
labels for synthetic data, the results are much worse compared to those using
real data and manual labeling. The degradation of the performance is mainly due
to the domain gap, i.e., the discrepancy of the pixel value statistics between
real and synthetic data. In this paper, we observe that real and synthetic
humans both have a skeleton (pose) representation. We find that skeletons can effectively bridge the synthetic and real domains during training. Our
proposed approach takes advantage of the rich and realistic variations of the
real data and the easily obtainable labels of the synthetic data to learn
multi-person part segmentation on real images without any human-annotated
labels. Through experiments, we show that without any human labeling, our
method performs comparably to several state-of-the-art approaches which require
human labeling on Pascal-Person-Parts and COCO-DensePose datasets. On the other
hand, if part labels are also available in the real images during training, our
method outperforms the supervised state-of-the-art methods by a large margin.
We further demonstrate the generalizability of our method on predicting novel
keypoints in real images where no real data labels are available for the novel
keypoints detection. Code and pre-trained models are available at
https://github.com/kevinlin311tw/CDCL-human-part-segmentation
Comment: To appear in IEEE Transactions on Circuits and Systems for Video Technology; Presented at ICCV 2019 Demonstration.
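A minimal sketch of the cross-domain idea follows: one shared backbone, a pose head supervised on real images, and a part-segmentation head supervised on synthetic images, so the shared skeleton task bridges the two domains. The architecture, shapes, and toy batches are assumptions, not the paper's actual network.

```python
# Multi-task sketch: pose supervision from the real domain, part
# supervision from the synthetic domain, one shared feature extractor.
import torch
import torch.nn as nn

backbone = nn.Sequential(                     # tiny stand-in backbone
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
pose_head = nn.Conv2d(32, 17, 1)   # 17 keypoint heatmaps (COCO-style, assumed)
part_head = nn.Conv2d(32, 7, 1)    # 7 part classes (assumed)
opt = torch.optim.Adam([*backbone.parameters(),
                        *pose_head.parameters(),
                        *part_head.parameters()], lr=1e-4)

real_imgs = torch.randn(2, 3, 64, 64)         # stand-in batches
real_pose = torch.randn(2, 17, 64, 64)        # pose heatmap targets
synth_imgs = torch.randn(2, 3, 64, 64)
synth_parts = torch.randint(0, 7, (2, 64, 64))

# The skeleton (pose) task is shared by both domains, so gradients from the
# real domain shape features that the synthetic-supervised part head reuses.
pose_loss = nn.functional.mse_loss(pose_head(backbone(real_imgs)), real_pose)
part_loss = nn.functional.cross_entropy(part_head(backbone(synth_imgs)),
                                        synth_parts)
(pose_loss + part_loss).backward()
opt.step()
```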
PedX: Benchmark Dataset for Metric 3D Pose Estimation of Pedestrians in Complex Urban Intersections
This paper presents a novel dataset titled PedX, a large-scale multimodal
collection of pedestrians at complex urban intersections. PedX consists of more
than 5,000 pairs of high-resolution (12MP) stereo images and LiDAR data, along with 2D and 3D labels of pedestrians. We also present a novel 3D
model fitting algorithm for automatic 3D labeling harnessing constraints across
different modalities and novel shape and temporal priors. All annotated 3D
pedestrians are localized into the real-world metric space, and the generated
3D models are validated using a mocap system configured in a controlled outdoor
environment to simulate pedestrians in urban intersections. We also show that
the manual 2D labels can be replaced by state-of-the-art automated labeling
approaches, thereby facilitating the automatic generation of large-scale datasets.
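The flavor of such cross-modal fitting can be sketched as a joint least-squares problem that penalizes both 2D keypoint reprojection error and distance to LiDAR returns. The camera intrinsics, toy skeleton, and synthetic data below are placeholders, not the PedX pipeline.

```python
# Joint 2D-reprojection + LiDAR residuals, solved by least squares.
import numpy as np
from scipy.optimize import least_squares

K = np.array([[1000.0, 0, 640], [0, 1000.0, 360], [0, 0, 1]])  # intrinsics

def project(X):
    """Pinhole projection of Nx3 points to Nx2 pixels."""
    uvw = (K @ X.T).T
    return uvw[:, :2] / uvw[:, 2:3]

rng = np.random.default_rng(2)
gt_joints = rng.uniform([-1, -1, 8], [1, 1, 10], size=(5, 3))
kp_2d = project(gt_joints) + rng.normal(scale=1.0, size=(5, 2))  # 2D labels
lidar = gt_joints + rng.normal(scale=0.02, size=(5, 3))  # nearest returns

def residuals(x, w_lidar=10.0):
    X = x.reshape(5, 3)
    r_proj = (project(X) - kp_2d).ravel()       # image-space constraint
    r_lidar = w_lidar * (X - lidar).ravel()     # metric-space constraint
    return np.concatenate([r_proj, r_lidar])

x0 = np.tile([0, 0, 9.0], 5)                    # rough initialization
fit = least_squares(residuals, x0)
err = np.linalg.norm(fit.x.reshape(5, 3) - gt_joints, axis=1).mean()
print("mean 3D error:", err)
```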
4D Visualization of Dynamic Events from Unconstrained Multi-View Videos
We present a data-driven approach for 4D space-time visualization of dynamic
events from videos captured by hand-held multiple cameras. Key to our approach
is the use of self-supervised neural networks specific to the scene to compose
static and dynamic aspects of an event. Though captured from discrete
viewpoints, this model enables us to move around the space-time of the event
continuously. This model allows us to create virtual cameras that facilitate:
(1) freezing the time and exploring views; (2) freezing a view and moving
through time; and (3) simultaneously changing both time and view. We can also
edit the videos and reveal occluded objects for a given view if it is visible
in any of the other views. We validate our approach on challenging in-the-wild
events captured using up to 15 mobile cameras.
Comment: Project Page - http://www.cs.cmu.edu/~aayushb/Open4D
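To show only the interface such a continuous space-time model exposes, the toy coordinate network below maps an (x, y, z, t) query plus viewing direction to a color, which is what lets virtual cameras freeze or sweep time and view independently. The positional encoding and layer sizes are assumptions; the actual system composes separate static and dynamic scene components.

```python
# A coordinate network queried continuously in space-time and view.
import torch
import torch.nn as nn

def posenc(x, n_freq=4):
    """Standard sinusoidal positional encoding of coordinates."""
    freqs = 2.0 ** torch.arange(n_freq) * torch.pi
    ang = x[..., None] * freqs                  # (..., dims, n_freq)
    return torch.cat([torch.sin(ang), torch.cos(ang)], -1).flatten(-2)

field = nn.Sequential(
    nn.Linear((3 + 1 + 3) * 4 * 2, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 3), nn.Sigmoid())            # RGB in [0, 1]

# A virtual camera sample: fixed position and viewing direction, sweeping
# time ("freeze a view and move through time").
xyz = torch.tensor([[0.1, 0.2, 1.5]]).repeat(8, 1)
t = torch.linspace(0, 1, 8)[:, None]
view = torch.tensor([[0.0, 0.0, -1.0]]).repeat(8, 1)
rgb = field(posenc(torch.cat([xyz, t, view], dim=1)))
print(rgb.shape)    # torch.Size([8, 3])
```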
Weakly-Supervised Discovery of Geometry-Aware Representation for 3D Human Pose Estimation
Recent studies have shown remarkable advances in 3D human pose estimation
from monocular images, with the help of large-scale indoor 3D datasets and
sophisticated network architectures. However, the generalizability to different
environments remains an elusive goal. In this work, we propose a geometry-aware
3D representation for the human pose to address this limitation by using
multiple views in a simple auto-encoder model at the training stage and only 2D
keypoint information as supervision. A view-synthesis framework is proposed to learn the shared 3D representation between viewpoints by synthesizing the human pose from one viewpoint to another. Instead of performing a direct transfer at the raw image level, we propose a skeleton-based encoder-decoder
mechanism to distil only pose-related representation in the latent space. A
learning-based representation consistency constraint is further introduced to
facilitate the robustness of the latent 3D representation. Since the learnt
representation encodes 3D geometry information, mapping it to 3D pose will be
much easier than in conventional frameworks that use an image or 2D coordinates as the input to the 3D pose estimator. We demonstrate our approach on the task of 3D
human pose estimation. Comprehensive experiments on three popular benchmarks
show that our model can significantly improve the performance of
state-of-the-art methods by simply injecting the representation as a robust 3D prior.
Comment: Accepted as a CVPR 2019 oral paper. Project page: https://kwanyeelin.github.io
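A schematic of the view-synthesis training signal: encode 2D keypoints from view A into a latent "3D" code, rotate it by the known relative camera rotation, and decode 2D keypoints for view B, so only 2D annotations supervise the geometry. The MLPs and dimensions below are placeholders for the paper's skeleton-based encoder-decoder.

```python
# Encode view-A 2D pose -> latent 3D skeleton -> rotate -> decode view-B pose.
import math
import torch
import torch.nn as nn

J = 17                                   # number of 2D keypoints (assumed)
encoder = nn.Sequential(nn.Linear(J * 2, 128), nn.ReLU(),
                        nn.Linear(128, J * 3))
decoder = nn.Sequential(nn.Linear(J * 3, 128), nn.ReLU(),
                        nn.Linear(128, J * 2))

def synthesize(kp_a, R_ab):
    """Predict view-B keypoints from view-A keypoints via a rotated latent."""
    latent = encoder(kp_a.flatten(1)).view(-1, J, 3)   # latent 3D skeleton
    rotated = latent @ R_ab.T                          # move to view-B frame
    return decoder(rotated.flatten(1)).view(-1, J, 2)

kp_a = torch.randn(4, J, 2)              # batch of view-A 2D poses (toy data)
kp_b = torch.randn(4, J, 2)              # corresponding view-B 2D poses
c, s = math.cos(0.5), math.sin(0.5)      # known relative camera rotation
R_ab = torch.tensor([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

loss = nn.functional.mse_loss(synthesize(kp_a, R_ab), kp_b)
loss.backward()                          # only 2D keypoints supervise training
```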
Toward Characteristic-Preserving Image-based Virtual Try-On Network
Image-based virtual try-on systems for fitting new in-shop clothes into a
person image have attracted increasing research attention, yet remain challenging. A desirable pipeline should not only transform the target clothes
into the most fitting shape seamlessly but also preserve the identity of the clothes in the generated image, that is, the key characteristics (e.g., texture, logo, embroidery) that depict the original clothes. However, previous
image-conditioned generation works fail to meet these critical requirements for plausible virtual try-on performance, since they fail to handle large spatial misalignment between the input image and target clothes. Prior
work explicitly tackled spatial deformation using shape context matching, but
failed to preserve clothing details due to its coarse-to-fine strategy. In this
work, we propose a new fully-learnable Characteristic-Preserving Virtual Try-On
Network (CP-VTON) for addressing all real-world challenges in this task. First,
CP-VTON learns a thin-plate spline transformation for transforming the in-shop
clothes into fitting the body shape of the target person via a new Geometric
Matching Module (GMM) rather than computing correspondences of interest points
as prior works did. Second, to alleviate boundary artifacts of warped clothes
and make the results more realistic, we employ a Try-On Module that learns a
composition mask to integrate the warped clothes and the rendered image to
ensure smoothness. Extensive experiments on a fashion dataset demonstrate our
CP-VTON achieves the state-of-the-art virtual try-on performance both
qualitatively and quantitatively.
Comment: Accepted by ECCV 2018.
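The Try-On Module's composition step can be sketched in a few lines: a learned mask blends the TPS-warped clothes with the coarsely rendered person, which is what suppresses boundary artifacts. The mask network and random inputs here are toy stand-ins, not CP-VTON's actual layers.

```python
# Mask-based composition of warped clothes and a coarse render.
import torch
import torch.nn as nn

mask_net = nn.Sequential(                 # predicts a per-pixel blend mask
    nn.Conv2d(6, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())

warped_clothes = torch.rand(1, 3, 256, 192)   # output of the GMM warp (toy)
rendered_person = torch.rand(1, 3, 256, 192)  # coarse try-on render (toy)

m = mask_net(torch.cat([warped_clothes, rendered_person], dim=1))
try_on = m * warped_clothes + (1 - m) * rendered_person  # composed result
print(try_on.shape)    # torch.Size([1, 3, 256, 192])
```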
Soft-Gated Warping-GAN for Pose-Guided Person Image Synthesis
Despite remarkable advances in image synthesis research, existing works often
fail in manipulating images under the context of large geometric
transformations. Synthesizing person images conditioned on arbitrary poses is
one of the most representative examples where the generation quality largely
relies on the capability of identifying and modeling arbitrary transformations
on different body parts. Current generative models are often built on local
convolutions and overlook the key challenges (e.g., heavy occlusions, different views, or dramatic appearance changes) that arise when distinct geometric changes, caused by arbitrary pose manipulations, happen to each part. This paper aims to
resolve these challenges induced by geometric variability and spatial
displacements via a new Soft-Gated Warping Generative Adversarial Network
(Warping-GAN), which is composed of two stages: 1) it first synthesizes a
target part segmentation map given a target pose, which depicts the
region-level spatial layouts for guiding image synthesis with higher-level
structure constraints; 2) the Warping-GAN equipped with a soft-gated
warping-block learns feature-level mapping to render textures from the original
image into the generated segmentation map. Warping-GAN is capable of
controlling different transformation degrees given distinct target poses.
Moreover, the proposed warping-block is light-weight and flexible enough to be
injected into any networks. Human perceptual studies and quantitative
evaluations demonstrate the superiority of our Warping-GAN that significantly
outperforms all existing methods on two large datasets.
Comment: 17 pages, 14 figures.
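In miniature, a soft-gated warping block can be read as a gate that blends identity features with features warped toward the target pose, controlling the degree of transformation. The flow field, gate network, and shapes below are all assumed for illustration.

```python
# Soft gate blending warped and unwarped features.
import torch
import torch.nn as nn
import torch.nn.functional as F

feat = torch.randn(1, 32, 64, 48)               # source-image features (toy)
flow = 0.1 * torch.randn(1, 64, 48, 2)          # pose-driven offsets (toy)
gate_net = nn.Sequential(nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid())

# Build a sampling grid = identity grid + flow, then warp the features.
ys, xs = torch.meshgrid(torch.linspace(-1, 1, 64),
                        torch.linspace(-1, 1, 48), indexing="ij")
identity = torch.stack([xs, ys], dim=-1)[None]   # (1, 64, 48, 2), xy order
warped = F.grid_sample(feat, identity + flow, align_corners=True)

g = gate_net(feat)                               # soft gate in [0, 1]
out = g * warped + (1 - g) * feat                # gated warping-block output
print(out.shape)    # torch.Size([1, 32, 64, 48])
```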
Unsupervised Part-Based Disentangling of Object Shape and Appearance
Large intra-class variation is the result of changes in multiple object
characteristics. Images, however, only show the superposition of different
variable factors such as appearance or shape. Therefore, learning to
disentangle and represent these different characteristics poses a great
challenge, especially in the unsupervised case. Moreover, large object
articulation calls for a flexible part-based model. We present an unsupervised
approach for disentangling appearance and shape by learning parts consistently
over all instances of a category. Our model for learning an object
representation is trained by simultaneously exploiting invariance and
equivariance constraints between synthetically transformed images. Since no
part annotation or prior information on an object class is required, the
approach is applicable to arbitrary classes. We evaluate our approach on a wide
range of object categories and diverse tasks including pose prediction,
disentangled image synthesis, and video-to-video translation. The approach
outperforms the state-of-the-art on unsupervised keypoint prediction and
compares favorably even against supervised approaches on the task of shape and
appearance transfer.
Comment: CVPR 2019 Oral.
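The two constraints can be written down directly: part (shape) predictions should be equivariant to a synthetic spatial transform, while the appearance code should be invariant to it. The tiny networks and the horizontal flip used as the transform below are illustrative choices only.

```python
# Equivariance of part maps and invariance of the appearance code.
import torch
import torch.nn as nn

shape_net = nn.Conv2d(3, 8, 3, padding=1)    # 8 part activation maps (toy)
app_net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1),
                        nn.AdaptiveAvgPool2d(1))   # global appearance code

img = torch.rand(2, 3, 64, 64)
t = lambda x: torch.flip(x, dims=[-1])       # synthetic transform: h-flip

# Equivariance: transforming the image should transform the part maps the
# same way. Invariance: the appearance code should not change at all.
equiv = nn.functional.mse_loss(shape_net(t(img)), t(shape_net(img)))
invar = nn.functional.mse_loss(app_net(t(img)), app_net(img))
(equiv + invar).backward()
```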