
    Multi-Person Pose Estimation via Column Generation

    We study the problem of multi-person pose estimation in natural images. A pose estimate describes the spatial position and identity (head, foot, knee, etc.) of every non-occluded body part of a person. Pose estimation is difficult due to deformation and variation in body configurations and occlusion of parts, while multi-person settings add further complications: an unknown number of people, with unknown appearance and possible interactions in their poses and part locations. We give a novel integer program formulation of the multi-person pose estimation problem, in which variables correspond to assignments of parts in the image to poses in a two-tier, hierarchical way. This enables us to develop an efficient custom optimization procedure based on column generation, where columns are produced by exact optimization of very small-scale integer programs. We demonstrate improved accuracy and speed for our method on the MPII multi-person pose estimation benchmark.
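
    As a rough illustration of the column-generation idea described above, the following Python sketch solves a toy set-packing relaxation: candidate "pose" columns are subsets of detected parts with a made-up score, a restricted master LP is solved with SciPy, and a brute-force pricing step (standing in for the paper's small-scale integer programs) searches for a column with positive reduced cost. The part names, scores, and pairwise bonus are illustrative assumptions, not the paper's model.

```python
# Toy column generation for a set-packing relaxation (illustrative only).
from itertools import combinations
import numpy as np
from scipy.optimize import linprog

parts = ["head", "neck", "l_knee", "r_knee", "l_foot", "r_foot"]
# Hypothetical per-part scores and a small bonus for grouping parts into one pose.
unary = {p: s for p, s in zip(parts, [2.0, 1.5, 1.0, 1.0, 0.8, 0.8])}

def pose_score(subset):
    return sum(unary[p] for p in subset) + 0.3 * (len(subset) - 1)

def solve_restricted_master(columns):
    """LP relaxation: maximise total score, each part used by at most one pose."""
    scores = np.array([pose_score(c) for c in columns])
    A = np.array([[1.0 if p in c else 0.0 for c in columns] for p in parts])
    res = linprog(-scores, A_ub=A, b_ub=np.ones(len(parts)),
                  bounds=[(0, None)] * len(columns), method="highs")
    duals = -res.ineqlin.marginals          # dual price of each part (>= 0)
    return -res.fun, dict(zip(parts, duals))

def price_new_column(columns, duals):
    """Stand-in for the small pricing ILPs: brute-force the best subset whose
    reduced cost (score minus dual prices of its parts) is positive."""
    best, best_rc = None, 1e-9
    for r in range(2, len(parts) + 1):
        for subset in combinations(parts, r):
            if subset in columns:
                continue
            rc = pose_score(subset) - sum(duals[p] for p in subset)
            if rc > best_rc:
                best, best_rc = subset, rc
    return best

columns = [(p,) for p in parts]             # start with single-part columns
while True:
    obj, duals = solve_restricted_master(columns)
    new_col = price_new_column(columns, duals)
    if new_col is None:                     # no improving column: LP is optimal
        break
    columns.append(new_col)
print("LP value:", obj, "columns:", len(columns))
```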

    Efficient Multi-Person Pose Estimation with Provable Guarantees

    Multi-person pose estimation (MPPE) in natural images is key to the meaningful use of visual data in many fields including movement science, security, and rehabilitation. In this paper we tackle MPPE with a bottom-up approach, starting with candidate detections of body parts from a convolutional neural network (CNN) and grouping them into people. We formulate the grouping of body part detections into people as a minimum-weight set packing (MWSP) problem where the set of potential people is the power set of body part detections. We model the quality of a person hypothesis, which is a set in the MWSP, by an augmented tree-structured Markov random field where variables correspond to body parts and their state spaces correspond to the power set of the detections for that part. We describe a novel algorithm that combines efficiency with provable bounds on this MWSP problem. We employ an implicit column generation strategy where the pricing problem is formulated as a dynamic program. To solve this dynamic program efficiently we exploit the problem structure using a nested Benders decomposition (NBD) exact inference strategy, which we speed up by recycling Benders rows between calls to the pricing problem. We test our approach on the MPII-Multiperson dataset, showing that it obtains results comparable to the state-of-the-art algorithm for joint node labeling and grouping problems, and that NBD achieves considerable speed-ups relative to a naive dynamic programming approach. Typical algorithms that solve joint node labeling and grouping problems use heuristics and thus cannot provide proofs of optimality. Our approach, in contrast, proves global optimality for over 99 percent of problem instances and otherwise provides upper and lower bounds.
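
    The tractability of the pricing step rests on the tree structure of the skeleton. The toy sketch below runs a plain max-sum dynamic program over a tiny skeleton tree in which each body part either picks one candidate detection or is left absent; the paper's actual pricing problem uses power-set state spaces and nested Benders decomposition, so this is only an illustration of why tree structure helps. All scores are invented.

```python
# Max-sum dynamic programming over a toy skeleton tree (illustrative only).
children = {"torso": ["head", "l_leg", "r_leg"], "head": [], "l_leg": [], "r_leg": []}

# unary[part][d] = score of assigning detection d to the part; -1 means "absent".
unary = {
    "torso": {0: 1.2, 1: 0.4, -1: 0.0},
    "head":  {0: 0.9, -1: 0.0},
    "l_leg": {0: 0.7, 1: 0.5, -1: 0.0},
    "r_leg": {0: 0.6, -1: 0.0},
}

def pairwise(parent_state, child_state):
    # Hypothetical compatibility: reward keeping connected parts both present.
    if parent_state == -1 or child_state == -1:
        return 0.0
    return 0.3

def best_subtree(part, state):
    """Best score of the subtree rooted at `part`, given the part's own state."""
    total = unary[part][state]
    for child in children[part]:
        total += max(best_subtree(child, s) + pairwise(state, s)
                     for s in unary[child])
    return total

root_best = max(best_subtree("torso", s) for s in unary["torso"])
print("best person-hypothesis score:", root_best)
```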

    Exploiting skeletal structure in computer vision annotation with Benders decomposition

    Many annotation problems in computer vision can be phrased as integer linear programs (ILPs). The use of standard industrial solvers does not exploit the underlying structure of such problems, e.g., the skeleton in pose estimation. Leveraging the underlying structure in conjunction with industrial solvers promises increases in both speed and accuracy. Such structure can be exploited using Benders decomposition, a technique from operations research that solves complex ILPs or mixed integer linear programs by decomposing them into sub-problems that communicate via a master problem. The intuition is that, conditioned on a small subset of the variables, the solution to the remaining variables can be computed easily by taking advantage of properties of the ILP constraint matrix such as block structure. In this paper we apply Benders decomposition to a typical problem in computer vision where many sub-ILPs (e.g., partitioning of detections, body parts) are coupled to a master ILP (e.g., constructing skeletons). Dividing inference problems into a master problem and sub-problems motivates the development of a plethora of novel models and inference approaches for the field of computer vision.
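
    To make the master/sub-problem interplay concrete, here is a tiny, hypothetical Benders decomposition loop: two binary variables stand in for skeleton decisions in the master, a small LP over continuous variables stands in for the sub-problem obtained once those are fixed, and optimality cuts are built from the sub-problem duals. The coefficients are made up and the master is solved by enumeration; this is not the paper's computer-vision model.

```python
# Minimal Benders decomposition on a toy mixed-integer program (illustrative only).
import itertools
import numpy as np
from scipy.optimize import linprog

c = np.array([1.0, 1.5])            # master ("skeleton") costs for binary y
d = np.array([2.0, 3.0])            # sub-problem ("assignment") costs for x >= 0
A = np.array([[2.0, 0.0], [0.0, 2.0]])
B = np.array([[1.0, 1.0], [1.0, 2.0]])
b = np.array([3.0, 4.0])            # coupling constraints: A @ y + B @ x >= b

def solve_subproblem(y):
    """min d @ x  s.t.  B @ x >= b - A @ y,  x >= 0; returns value and duals."""
    rhs = b - A @ y
    res = linprog(d, A_ub=-B, b_ub=-rhs, bounds=[(0, None)] * len(d),
                  method="highs")
    u = -res.ineqlin.marginals      # duals of the >= constraints (>= 0)
    return res.fun, u

cuts = []                           # each cut u enforces theta >= (b - A @ y) @ u
best_y, upper = None, np.inf
for _ in range(20):                 # Benders iterations
    # Master by enumeration: min c @ y + theta, theta bounded below by all cuts.
    master_y, master_val = None, np.inf
    for y in itertools.product([0.0, 1.0], repeat=2):
        y = np.array(y)
        theta = max(((b - A @ y) @ u for u in cuts), default=0.0)
        if c @ y + theta < master_val:
            master_y, master_val = y, c @ y + theta
    sub_val, u = solve_subproblem(master_y)      # true cost of the proposed y
    if c @ master_y + sub_val < upper:
        best_y, upper = master_y, c @ master_y + sub_val
    if upper - master_val < 1e-6:   # lower bound meets upper bound: optimal
        break
    cuts.append(u)

print("optimal skeleton choice:", best_y, "cost:", upper)
```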

    Cross-Domain Complementary Learning Using Pose for Multi-Person Part Segmentation

    Supervised deep learning with pixel-wise training labels has achieved great success on multi-person part segmentation. However, data labeling at the pixel level is very expensive. To address this, researchers have explored using synthetic data to avoid manual labeling. Although it is easy to generate labels for synthetic data, the results are much worse than those obtained with real data and manual labeling. The degradation in performance is mainly due to the domain gap, i.e., the discrepancy in pixel value statistics between real and synthetic data. In this paper, we observe that real and synthetic humans both have a skeleton (pose) representation. We find that skeletons can effectively bridge the synthetic and real domains during training. Our proposed approach takes advantage of the rich and realistic variations of the real data and the easily obtainable labels of the synthetic data to learn multi-person part segmentation on real images without any human-annotated labels. Through experiments, we show that without any human labeling, our method performs comparably to several state-of-the-art approaches that require human labeling on the Pascal-Person-Parts and COCO-DensePose datasets. On the other hand, if part labels are also available for the real images during training, our method outperforms the supervised state-of-the-art methods by a large margin. We further demonstrate the generalizability of our method on predicting novel keypoints in real images where no real data labels are available for the novel keypoint detection. Code and pre-trained models are available at https://github.com/kevinlin311tw/CDCL-human-part-segmentation
    Comment: To appear in IEEE Transactions on Circuits and Systems for Video Technology; presented at ICCV 2019 Demonstration.
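
    A rough sketch of the general recipe suggested above (an assumption, not the authors' exact training procedure): a shared backbone is trained with a pose loss on real images, where keypoints are easy to obtain, and a part-segmentation loss on synthetic images, where pixel labels are free, so the pose stream ties the two domains together. The heads, losses, and toy batches below are placeholders.

```python
# Hypothetical cross-domain training loop with a shared backbone (illustrative only).
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
pose_head = nn.Conv2d(32, 17, 1)        # keypoint heatmaps (17 joints assumed)
seg_head = nn.Conv2d(32, 7, 1)          # part classes (7 parts assumed)
opt = torch.optim.Adam([*backbone.parameters(), *pose_head.parameters(),
                        *seg_head.parameters()], lr=1e-4)

def toy_batch(kind):
    """Placeholder loader: real images with pose targets, synthetic with part labels."""
    imgs = torch.rand(2, 3, 64, 64)
    if kind == "pose":
        return imgs, torch.rand(2, 17, 64, 64)        # pseudo keypoint heatmaps
    return imgs, torch.randint(0, 7, (2, 64, 64))      # synthetic part labels

for step in range(100):
    real_imgs, real_pose = toy_batch("pose")
    syn_imgs, syn_parts = toy_batch("parts")
    loss = nn.functional.mse_loss(pose_head(backbone(real_imgs)), real_pose) \
         + nn.functional.cross_entropy(seg_head(backbone(syn_imgs)), syn_parts)
    opt.zero_grad(); loss.backward(); opt.step()
```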

    PedX: Benchmark Dataset for Metric 3D Pose Estimation of Pedestrians in Complex Urban Intersections

    This paper presents a novel dataset titled PedX, a large-scale multimodal collection of pedestrians at complex urban intersections. PedX consists of more than 5,000 pairs of high-resolution (12MP) stereo images and LiDAR data, along with 2D and 3D labels of pedestrians. We also present a novel 3D model fitting algorithm for automatic 3D labeling that harnesses constraints across the different modalities together with novel shape and temporal priors. All annotated 3D pedestrians are localized in real-world metric space, and the generated 3D models are validated using a mocap system configured in a controlled outdoor environment to simulate pedestrians in urban intersections. We also show that the manual 2D labels can be replaced by state-of-the-art automated labeling approaches, thereby facilitating automatic generation of large-scale datasets.

    4D Visualization of Dynamic Events from Unconstrained Multi-View Videos

    We present a data-driven approach for 4D space-time visualization of dynamic events from videos captured by hand-held multiple cameras. Key to our approach is the use of self-supervised neural networks specific to the scene to compose static and dynamic aspects of an event. Though the event is captured from discrete viewpoints, this model enables us to move around its space-time continuously. The model allows us to create virtual cameras that facilitate: (1) freezing the time and exploring views; (2) freezing a view and moving through time; and (3) simultaneously changing both time and view. We can also edit the videos and reveal occluded objects for a given view if they are visible in any of the other views. We validate our approach on challenging in-the-wild events captured using up to 15 mobile cameras.
    Comment: Project page: http://www.cs.cmu.edu/~aayushb/Open4D

    Weakly-Supervised Discovery of Geometry-Aware Representation for 3D Human Pose Estimation

    Recent studies have shown remarkable advances in 3D human pose estimation from monocular images, with the help of large-scale indoor 3D datasets and sophisticated network architectures. However, generalizability to different environments remains an elusive goal. In this work, we propose a geometry-aware 3D representation for the human pose to address this limitation, using multiple views in a simple auto-encoder model at the training stage and only 2D keypoint information as supervision. A view synthesis framework is proposed to learn the shared 3D representation between viewpoints by synthesizing the human pose from one viewpoint to another. Instead of performing a direct transfer at the raw image level, we propose a skeleton-based encoder-decoder mechanism to distill only the pose-related representation in the latent space. A learning-based representation consistency constraint is further introduced to improve the robustness of the latent 3D representation. Since the learnt representation encodes 3D geometry information, mapping it to 3D pose is much easier than in conventional frameworks that use an image or 2D coordinates as input to the 3D pose estimator. We demonstrate our approach on the task of 3D human pose estimation. Comprehensive experiments on three popular benchmarks show that our model can significantly improve the performance of state-of-the-art methods by simply injecting the representation as a robust 3D prior.
    Comment: Accepted as a CVPR 2019 oral paper. Project page: https://kwanyeelin.github.io
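
    The following sketch is an assumed, simplified version of the skeleton-based encoder-decoder idea: 2D keypoints from one view are encoded into a latent code, the code is rotated to the other view, and the decoder must reproduce that view's 2D keypoints, so only 2D supervision is needed. The joint count, layer sizes, and the synthetic orthographic "camera" are illustrative assumptions, not the authors' architecture.

```python
# Toy multi-view keypoint autoencoder with 2D-only supervision (illustrative only).
import torch
import torch.nn as nn

J = 17                                # number of skeleton joints (assumed)
encoder = nn.Sequential(nn.Linear(2 * J, 128), nn.ReLU(), nn.Linear(128, 3 * J))
decoder = nn.Sequential(nn.Linear(3 * J, 128), nn.ReLU(), nn.Linear(128, 2 * J))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

def random_pair(batch=32):
    """Toy stand-in for a multi-view capture: a random 3D pose projected
    (orthographically) into two views related by a known rotation about y."""
    pose3d = torch.randn(batch, J, 3)
    angle = torch.rand(batch) * 3.1416
    cos, sin, zero, one = torch.cos(angle), torch.sin(angle), \
        torch.zeros_like(angle), torch.ones_like(angle)
    rot = torch.stack([torch.stack([cos, zero, sin], dim=-1),
                       torch.stack([zero, one, zero], dim=-1),
                       torch.stack([-sin, zero, cos], dim=-1)], dim=1)
    view_a = pose3d[..., :2].reshape(batch, -1)                          # 2D keypoints, view A
    view_b = (pose3d @ rot.transpose(1, 2))[..., :2].reshape(batch, -1)  # 2D keypoints, view B
    return view_a, view_b, rot

for step in range(200):
    view_a, view_b, rot = random_pair()
    latent = encoder(view_a).reshape(-1, J, 3)                 # latent "3D-ish" code
    rotated = (latent @ rot.transpose(1, 2)).reshape(-1, 3 * J)  # move code to view B
    loss = nn.functional.mse_loss(decoder(rotated), view_b)    # 2D supervision only
    opt.zero_grad(); loss.backward(); opt.step()
```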

    Toward Characteristic-Preserving Image-based Virtual Try-On Network

    Image-based virtual try-on systems, which fit new in-shop clothes onto a person image, have attracted increasing research attention, yet the task remains challenging. A desirable pipeline should not only transform the target clothes into the most fitting shape seamlessly but also preserve the identity of the clothes in the generated image, that is, the key characteristics (e.g. texture, logo, embroidery) that depict the original clothes. However, previous image-conditioned generation works fail to meet these critical requirements for plausible virtual try-on performance, since they fail to handle large spatial misalignment between the input image and target clothes. Prior work explicitly tackled spatial deformation using shape context matching, but failed to preserve clothing details due to its coarse-to-fine strategy. In this work, we propose a new fully-learnable Characteristic-Preserving Virtual Try-On Network (CP-VTON) for addressing all real-world challenges in this task. First, CP-VTON learns a thin-plate spline transformation that warps the in-shop clothes to fit the body shape of the target person via a new Geometric Matching Module (GMM), rather than computing correspondences of interest points as prior works did. Second, to alleviate boundary artifacts of the warped clothes and make the results more realistic, we employ a Try-On Module that learns a composition mask to integrate the warped clothes and the rendered image, ensuring smoothness. Extensive experiments on a fashion dataset demonstrate that CP-VTON achieves state-of-the-art virtual try-on performance both qualitatively and quantitatively.
    Comment: Accepted by ECCV 2018.
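
    A hedged sketch of the composition step described for the Try-On Module (the GMM warping and the networks themselves are omitted): a predicted soft mask blends the warped clothes with the coarsely rendered person, keeping clothing texture where the mask is confident. The shapes and toy inputs below are assumptions.

```python
# Soft composition of warped clothes and a rendered person (illustrative only).
import torch

def compose(warped_clothes, rendered_person, mask_logits):
    """Blend warped clothes into the rendered person with a learned soft mask."""
    mask = torch.sigmoid(mask_logits)            # per-pixel weights in (0, 1)
    return mask * warped_clothes + (1.0 - mask) * rendered_person

warped = torch.rand(1, 3, 256, 192)              # warped in-shop clothes image
rendered = torch.rand(1, 3, 256, 192)            # coarse person rendering
logits = torch.zeros(1, 1, 256, 192)             # would come from the Try-On network
result = compose(warped, rendered, logits)       # blended try-on result, (1, 3, 256, 192)
```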

    Soft-Gated Warping-GAN for Pose-Guided Person Image Synthesis

    Despite remarkable advances in image synthesis research, existing works often fail in manipulating images under the context of large geometric transformations. Synthesizing person images conditioned on arbitrary poses is one of the most representative examples, where the generation quality largely relies on the capability of identifying and modeling arbitrary transformations on different body parts. Current generative models are often built on local convolutions and overlook the key challenges (e.g. heavy occlusions, different views or dramatic appearance changes) when distinct geometric changes happen to each part, caused by arbitrary pose manipulations. This paper aims to resolve these challenges induced by geometric variability and spatial displacements via a new Soft-Gated Warping Generative Adversarial Network (Warping-GAN), which is composed of two stages: 1) it first synthesizes a target part segmentation map given a target pose, which depicts the region-level spatial layouts for guiding image synthesis with higher-level structure constraints; 2) the Warping-GAN, equipped with a soft-gated warping block, learns a feature-level mapping to render textures from the original image into the generated segmentation map. Warping-GAN is capable of controlling different transformation degrees given distinct target poses. Moreover, the proposed warping block is lightweight and flexible enough to be injected into any network. Human perceptual studies and quantitative evaluations demonstrate the superiority of our Warping-GAN, which significantly outperforms all existing methods on two large datasets.
    Comment: 17 pages, 14 figures.
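
    An illustrative, assumed sketch of a soft-gated warping block: features are warped by a predicted offset field, and a sigmoid gate decides per location how much of the warped versus original feature to keep. In practice the offsets and gate would come from small convolutional heads; here they are random placeholders.

```python
# Soft-gated feature warping (illustrative sketch, not the paper's exact block).
import torch
import torch.nn.functional as F

def soft_gated_warp(feat, offsets, gate_logits):
    """feat: (N, C, H, W); offsets: (N, H, W, 2) in normalized coordinates;
    gate_logits: (N, 1, H, W)."""
    n = feat.shape[0]
    theta = torch.zeros(n, 2, 3)
    theta[:, 0, 0] = 1.0
    theta[:, 1, 1] = 1.0
    grid = F.affine_grid(theta, feat.shape, align_corners=False)   # identity grid
    warped = F.grid_sample(feat, grid + offsets, align_corners=False)
    gate = torch.sigmoid(gate_logits)
    return gate * warped + (1.0 - gate) * feat

feat = torch.rand(1, 64, 32, 24)
offsets = 0.05 * torch.randn(1, 32, 24, 2)         # would come from a small conv head
gate_logits = torch.zeros(1, 1, 32, 24)
out = soft_gated_warp(feat, offsets, gate_logits)  # (1, 64, 32, 24)
```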

    Unsupervised Part-Based Disentangling of Object Shape and Appearance

    Large intra-class variation is the result of changes in multiple object characteristics. Images, however, only show the superposition of the different variable factors such as appearance or shape. Therefore, learning to disentangle and represent these different characteristics poses a great challenge, especially in the unsupervised case. Moreover, large object articulation calls for a flexible part-based model. We present an unsupervised approach for disentangling appearance and shape by learning parts consistently over all instances of a category. Our model for learning an object representation is trained by simultaneously exploiting invariance and equivariance constraints between synthetically transformed images. Since no part annotation or prior information on an object class is required, the approach is applicable to arbitrary classes. We evaluate our approach on a wide range of object categories and diverse tasks, including pose prediction, disentangled image synthesis, and video-to-video translation. The approach outperforms the state-of-the-art on unsupervised keypoint prediction and compares favorably even against supervised approaches on the task of shape and appearance transfer.
    Comment: CVPR 2019 Oral.
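
    As a hedged sketch of the equivariance constraint mentioned above: part locations predicted on a spatially transformed image should match the transformed part locations of the original image. The toy part detector, the single scaling transform, and the soft-argmax readout below are placeholder assumptions; with an untrained network the loss is not small, the snippet only shows how such a constraint can be computed as a training signal.

```python
# Toy equivariance loss for unsupervised part learning (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartNet(nn.Module):
    """Toy part detector: heatmaps -> soft-argmax part coordinates in [-1, 1]."""
    def __init__(self, parts=8):
        super().__init__()
        self.conv = nn.Conv2d(3, parts, kernel_size=5, padding=2)
    def forward(self, img):
        n, _, h, w = img.shape
        heat = self.conv(img).flatten(2).softmax(dim=-1).view(n, -1, h, w)
        ys = torch.linspace(-1, 1, h).view(1, 1, h, 1)
        xs = torch.linspace(-1, 1, w).view(1, 1, 1, w)
        return torch.stack([(heat * xs).sum((2, 3)), (heat * ys).sum((2, 3))], dim=-1)

part_net = PartNet()
img = torch.rand(4, 3, 64, 64)
theta = torch.tensor([[[0.8, 0.0, 0.0], [0.0, 0.8, 0.0]]]).repeat(4, 1, 1)  # zoom transform
grid = F.affine_grid(theta, img.shape, align_corners=False)
warped_img = F.grid_sample(img, grid, align_corners=False)

parts_orig = part_net(img)                       # (4, 8, 2) coordinates
parts_warped = part_net(warped_img)
# Equivariance: transforming the image should transform the predicted parts.
parts_expected = torch.einsum("nij,npj->npi", theta[:, :, :2].inverse(), parts_orig)
equi_loss = F.mse_loss(parts_warped, parts_expected)
```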