5 research outputs found

    Novel-View Human Action Synthesis

    Full text link
    Novel-View Human Action Synthesis aims to synthesize the movement of a body from a virtual viewpoint, given a video from a real viewpoint. We present a novel 3D reasoning approach to synthesize the target viewpoint. We first estimate the 3D mesh of the target body and transfer rough textures from the 2D images to the mesh. As this transfer may generate sparse textures on the mesh due to frame resolution or occlusions, we produce a semi-dense textured mesh by propagating the transferred textures both locally, within local geodesic neighborhoods, and globally, across symmetric semantic parts. Next, we introduce a context-based generator to learn how to correct and complete the residual appearance information. This allows the network to independently focus on learning the foreground and background synthesis tasks. We validate the proposed solution on the public NTU RGB+D dataset. The code and resources are available at https://bit.ly/36u3h4K. Comment: Asian Conference on Computer Vision (ACCV) 202
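
    A minimal sketch (not the authors' code) of the local propagation step described above: sparse per-vertex colors are spread within geodesic neighborhoods on the mesh, with geodesic distance approximated by Dijkstra over the edge graph. The function name, the radius value, and the inverse-distance weighting are illustrative assumptions.

        import numpy as np
        from scipy.sparse import coo_matrix
        from scipy.sparse.csgraph import dijkstra

        def propagate_local(vertices, faces, colors, has_color, radius=0.05):
            """vertices: (V, 3) float, faces: (F, 3) int, colors: (V, 3) float,
            has_color: (V,) bool mask of vertices that received a texture sample."""
            # Undirected edge graph weighted by Euclidean edge length.
            e = np.vstack([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]])
            w = np.linalg.norm(vertices[e[:, 0]] - vertices[e[:, 1]], axis=1)
            n = len(vertices)
            graph = coo_matrix((np.r_[w, w],
                                (np.r_[e[:, 0], e[:, 1]], np.r_[e[:, 1], e[:, 0]])),
                               shape=(n, n))
            # Approximate geodesic distances from every textured vertex, cut at radius.
            src = np.flatnonzero(has_color)
            dist = dijkstra(graph, directed=False, indices=src, limit=radius)
            out = colors.copy()
            for v in np.flatnonzero(~has_color):
                d = dist[:, v]
                near = np.isfinite(d)            # textured vertices within the radius
                if near.any():
                    wgt = 1.0 / (d[near] + 1e-6)  # inverse-distance weighting
                    out[v] = (wgt[:, None] * colors[src[near]]).sum(0) / wgt.sum()
            return out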

    3D Mesh and Pose Recovery of a Foot from Single Image

    Get PDF
    The pandemic and the major shift to online shopping have highlighted the current difficulties in getting proper sizing for clothing and shoes. Being able to accurately measure feet using readily available smartphones would help in minimizing returns and in getting a better fit. Being able to reconstruct the 3D geometry of a foot regardless of the foot pose using a smartphone would help the online shoe shopping experience. Usually, systems reconstructing a 3D foot require the foot to be in a canonical pose or require multiple perspectives. To our knowledge, there is no system that allows capturing the precise pose of the foot without expensive equipment. In many situations, the canonical pose or the multiple views are not feasible. Therefore, we propose a system that can infer the 3D reconstruction and the pose estimation of the foot, in any pose, from a single image. Our kinematic model, based on popular biomechanical models, is made of 18 rotating joints. To obtain the 3D reconstruction, we extract the silhouette of the foot and its joint landmarks from the image space. From the silhouette and the relation between the joint landmarks, we can define the shape of the 3D mesh. Most 3D reconstruction algorithms work with up-convolutions, which do not preserve the global information of the reconstructed object. Using a template mesh model of the foot and a spatial convolution network designed to learn from sparse data, we are able to recover the local features without losing sight of the global information. To develop the template mesh, we deformed the meshes of a dataset of 3D feet so that they could be used to build a PCA model. The template mesh is the PCA model with no variance added to its components. To obtain the 3D pose, we labelled the vertices of the template mesh according to the joints of our kinematic model. Those labels can be used to estimate the 3D pose from the 3D reconstruction by establishing correspondences between the two meshes. To train the system, we needed a good dataset. Since no viable one was available, we created our own dataset by using the previously described PCA model of the foot to generate random 3D meshes of feet. We used mesh deformation and inverse kinematics to capture the feet in different poses. Our system showed a good ability to generate detailed feet. However, we could not predict a reliable length and width for each foot, since our virtual dataset provides no scale cues other than the ground truths. Our experiments led to an average error of 13.65 mm on the length and 5.72 mm on the width, which is too high to recommend footwear. To improve the performance of our system, the 2D joint detection method could be modified to use the structure of the foot described by our kinematic model as a guide to detect the joint positions more accurately. The loss functions used for 3D reconstruction should also be revisited to generate more reliable reconstructions.
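
    The abstract above describes a PCA shape model whose mean, with no variance added to the components, serves as the template mesh, and random feet generated from it for training. Below is a minimal sketch of that construction, assuming the 3D foot scans are already in dense vertex correspondence; array shapes and function names are illustrative.

        import numpy as np

        def build_pca_template(meshes, n_components=10):
            """meshes: (N, V, 3) array of N registered foot meshes with V vertices."""
            N, V, _ = meshes.shape
            X = meshes.reshape(N, V * 3)
            mean = X.mean(axis=0)                       # template = mean shape, zero variance
            U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
            components = Vt[:n_components]              # principal shape directions
            stddev = S[:n_components] / np.sqrt(max(N - 1, 1))
            return mean.reshape(V, 3), components, stddev

        def sample_foot(mean_shape, components, stddev, coeffs):
            """Generate a random foot by offsetting the template along the components;
            coeffs are expressed in standard deviations."""
            offset = (coeffs * stddev) @ components
            return mean_shape + offset.reshape(mean_shape.shape)

    Drawing coeffs from a standard normal distribution would yield the kind of random 3D feet used to build a synthetic training set; posing them would still require the mesh deformation and inverse kinematics mentioned above.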

    Modality-Based Multi-View Indoor Video Synthesis

    Get PDF
    This thesis aims at reproducing the video of an indoor scene as seen from another, target view, using modalities such as depth and skeleton as guidance. However, synthesizing a video containing a moving person is challenging due to the camera placement in the scene, which causes scale differences and self-occlusions. The other key challenge is maintaining temporal consistency across the synthesized frames. Current state-of-the-art methods focus on synthesizing each frame separately, which can cause the loss of the motion information contained in the input view. Therefore, we need to model the temporal consistency for a smooth transition between the synthesized frames. We consider a neural network-based approach and use the body skeleton as a driving cue, visible texture transfer for self-occlusion, and a recurrent neural network to maintain temporal consistency in the feature space. We propose a 2D-based synthesis network that specifically disentangles the encoding of the input image and the target pose, which allows learning better features that lead to better image synthesis. We also propose a training strategy based on a pixel-wise loss function that improves high-frequency details to enhance the visual quality of the synthesized images. Moreover, we propose a novel masking scheme to account for the scale difference and the spatial shift and deformation between the input and output skeletons. We propose a new formulation of the 2D-based synthesis network to address the temporal consistency constraint on the synthesized multi-view frames. In particular, we extend recurrent neural networks to learn a spatiotemporal feature space that preserves the texture and approximates the target view. In addition, we propose a hybrid approach combining a direct texture transfer of the visible pixels from the input to the target view and a 3D-based synthesis network for refinement. Experimental results on standard image and multi-view video benchmarks show improvements over existing alternatives in terms of visual quality and the smoothness of the synthesized frames.
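
    A minimal sketch (not the thesis code) of two of the ideas above: separate encoders that disentangle the input frame from the target-pose map, and a convolutional GRU-style hidden state carried across frames to maintain temporal consistency in the feature space. All layer sizes, names, and the choice of a ConvGRU-style update are illustrative assumptions.

        import torch
        import torch.nn as nn

        def conv_block(cin, cout, stride=2):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1),
                                 nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

        class RecurrentViewSynthesis(nn.Module):
            def __init__(self, feat=64):
                super().__init__()
                self.feat = feat
                self.enc_img = nn.Sequential(conv_block(3, feat), conv_block(feat, feat))
                self.enc_pose = nn.Sequential(conv_block(3, feat), conv_block(feat, feat))
                # ConvGRU-style gates keep a spatiotemporal hidden state.
                self.gate_z = nn.Conv2d(3 * feat, feat, 3, padding=1)
                self.gate_r = nn.Conv2d(3 * feat, feat, 3, padding=1)
                self.cand = nn.Conv2d(3 * feat, feat, 3, padding=1)
                self.dec = nn.Sequential(
                    nn.Upsample(scale_factor=2), conv_block(feat, feat, stride=1),
                    nn.Upsample(scale_factor=2), nn.Conv2d(feat, 3, 3, padding=1),
                    nn.Tanh())

            def forward(self, frames, poses):
                """frames, poses: (B, T, 3, H, W); returns synthesized frames (B, T, 3, H, W)."""
                B, T, _, H, W = frames.shape
                h = frames.new_zeros(B, self.feat, H // 4, W // 4)
                outs = []
                for t in range(T):
                    # Disentangled encodings of appearance (input frame) and target pose.
                    f = torch.cat([self.enc_img(frames[:, t]),
                                   self.enc_pose(poses[:, t])], dim=1)
                    z = torch.sigmoid(self.gate_z(torch.cat([f, h], 1)))
                    r = torch.sigmoid(self.gate_r(torch.cat([f, h], 1)))
                    h_new = torch.tanh(self.cand(torch.cat([f, r * h], 1)))
                    h = (1 - z) * h + z * h_new          # recurrent feature update
                    outs.append(self.dec(h))
                return torch.stack(outs, dim=1)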

    Human Synthesis and Scene Compositing

    No full text