Novel-View Human Action Synthesis
Novel-View Human Action Synthesis aims to synthesize the movement of a body
from a virtual viewpoint, given a video from a real viewpoint. We present a
novel 3D reasoning approach to synthesize the target viewpoint. We first estimate the 3D
mesh of the target body and transfer the rough textures from the 2D images to
the mesh. As this transfer may generate sparse textures on the mesh due to
frame resolution or occlusions, we produce a semi-dense textured mesh by
propagating the transferred textures both locally, within local geodesic
neighborhoods, and globally, across symmetric semantic parts. Next, we
introduce a context-based generator to learn how to correct and complete the
residual appearance information. This allows the network to independently focus
on learning the foreground and background synthesis tasks. We validate the
proposed solution on the public NTU RGB+D dataset. The code and resources are
available at https://bit.ly/36u3h4K.
Comment: Asian Conference on Computer Vision (ACCV) 202
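The local propagation step described above can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: it approximates local geodesic neighborhoods with graph hops over the mesh connectivity and averages the nearest transferred colors; the function and parameter names are hypothetical.

```python
from collections import deque

def propagate_textures(adjacency, colors, max_hops=3):
    """Fill untextured vertices (colors[v] is None) from textured
    neighbors reachable within `max_hops` hops (a stand-in for a
    local geodesic neighborhood), averaging the colors found."""
    filled = dict(colors)
    for v, c in colors.items():
        if c is not None:
            continue
        # BFS outward through untextured vertices until textured ones are hit
        seen, frontier, found = {v}, deque([(v, 0)]), []
        while frontier:
            u, d = frontier.popleft()
            if d >= max_hops:
                continue
            for w in adjacency[u]:
                if w in seen:
                    continue
                seen.add(w)
                if colors[w] is not None:
                    found.append(colors[w])
                else:
                    frontier.append((w, d + 1))
        if found:
            filled[v] = tuple(sum(ch) / len(found) for ch in zip(*found))
    return filled
```

A global step in the same spirit would copy the averaged colors across symmetric semantic parts (e.g. left and right arm) before or after this local pass.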
3D Mesh and Pose Recovery of a Foot from Single Image
The pandemic and the major shift to online shopping have highlighted the current difficulties in getting proper sizing for clothing and shoes. Being able to accurately measure
feet using a readily available smartphone would help minimize returns and
achieve a better fit. Reconstructing the 3D geometry of a foot regardless of the
foot pose using a smartphone would improve the online shoe shopping experience. Usually,
systems reconstructing a 3D foot require the foot to be in a canonical pose or require multiple perspectives. To our knowledge, there is no system that allows capturing the precise
pose of the foot without expensive equipment. In many situations, the canonical pose or
the multiple views are not feasible. Therefore, we propose a system that can infer the 3D
reconstruction and the pose estimation of the foot from any pose in only one image. Our
kinematic model, based on popular biomechanical models, is made of 18 rotating joints. To
obtain the 3D reconstruction, we extract the silhouette of the foot and its joint landmarks
from the image space. From the silhouette and the relation between each joint landmark,
we can define the shape of the 3D mesh. Most 3D reconstruction algorithms work with
up-convolutions, which do not preserve the global information of the reconstructed object.
Using a template mesh model of the foot and a spatial convolution network designed to
learn from sparse data, we are able to recover the local features without losing sight of the
global information. To develop the template mesh, we deformed the meshes of a dataset of
3D feet so they can be used to design a PCA model. The template mesh is the PCA model
with no variance added to its components. To obtain the 3D pose, we have labelled the
vertices of the template mesh according to the joints of our kinematic model. Those labels
can be used to estimate the 3D pose from the 3D reconstruction by corresponding the two
meshes. To be able to train the system, we needed a good dataset. Since there was no viable one available, we decided to create our own dataset by using the previously described
PCA model of the foot to generate random 3D meshes of feet. We used mesh deformation
and inverse kinematics to capture the feet in different poses. Our system showed a good
ability to generate detailed feet. However, we could not predict a reliable length and width
for each foot since our virtual dataset does not support scaling indications of any kind,
other than the ground truths. Our experiments led to an average error of 13.65 mm on the
length and 5.72 mm on the width, which is too high to reliably recommend footwear sizes. To improve
the performance of our system, the 2D joint detection method could be modified to use
the structure of the foot described by our kinematic foot model as a guide to detect the
position of the joints more accurately. The loss functions used for 3D reconstruction should
also be revisited to generate more reliable reconstructions.
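The template construction described above (a PCA model fitted to registered foot meshes, with the template taken as the model with no variance added to its components, i.e. the mean shape) can be sketched roughly as follows. Function names and array shapes are illustrative assumptions, not the thesis code.

```python
import numpy as np

def build_pca_template(meshes):
    """Fit a PCA shape model to registered foot meshes (each an
    (V, 3) vertex array in one-to-one correspondence) and return
    the template: the model with zero weight on every component,
    i.e. the mean shape, plus the principal components."""
    X = np.stack([m.reshape(-1) for m in meshes])          # (N, 3V)
    mean = X.mean(axis=0)
    # Principal directions of the centered shapes via SVD
    _, singular, components = np.linalg.svd(X - mean, full_matrices=False)
    template = mean.reshape(-1, 3)                          # zero-variance shape
    return template, components, singular

def synthesize(template, components, weights):
    """Generate a new foot shape as mean + weighted components,
    as used to build a synthetic training set of random feet."""
    offset = weights @ components[: len(weights)]
    return template + offset.reshape(-1, 3)
```

Random weight vectors fed to `synthesize` (followed by inverse kinematics to pose the result) correspond to the dataset-generation step the abstract describes.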
Modality-Based Multi-View Indoor Video Synthesis
This thesis aims at reproducing the video of an indoor scene as seen from another, targeted, view using modalities such as depth and skeleton as guidance. However, synthesizing a video containing a moving person is challenging due to the camera placement in the scene, which causes scale differences and self-occlusion. The other key challenge is maintaining temporal consistency across the synthesized frames. Current state-of-the-art methods focus on synthesizing each frame separately, which can cause the loss of the motion information contained in the input view. Therefore, we need to model the temporal consistency for a smooth transition between the synthesized frames.

We consider a neural network-based approach and use the body skeleton as a driving cue, visible texture transfer for self-occlusion, and a recurrent neural network to maintain temporal consistency in the feature space. We propose a 2D-based synthesis network that specifically disentangles the encoding of the input image and the target pose, which allows learning better features that lead to better image synthesis. We also propose a training strategy based on a pixel-wise loss function that improves high-frequency details to enhance the visual quality of the synthesized images. Moreover, we propose a novel masking scheme to account for the scale difference and the spatial shift and deformation between the input and output skeletons. We propose a new formulation of the 2D-based synthesis network to address the temporal consistency constraint on the synthesized multi-view frames. In particular, we extend recurrent neural networks to learn a spatiotemporal feature space that preserves the texture and approximates the targeted view. In addition, we propose a hybrid approach combining a direct texture transfer of the visible pixels from the input to the targeted view and a 3D-based synthesis network for refinement.

Experimental results on standard image and multi-view video benchmarks show that our approach outperforms existing alternatives in terms of visual quality and the smoothness of the synthesized frames.
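The recurrent aggregation in feature space can be illustrated with a minimal GRU step in NumPy. This is a generic sketch of the idea only (the thesis operates on learned convolutional feature maps, not flat vectors), and all names, dimensions, and the random weight initialization are assumptions.

```python
import numpy as np

def gru_step(h, x, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU update: the hidden state carries texture features
    across frames, so each synthesized frame depends on its
    predecessors rather than being generated independently."""
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sig(x @ Wz + h @ Uz)                  # update gate
    r = sig(x @ Wr + h @ Ur)                  # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)  # candidate state
    return (1 - z) * h + z * h_tilde

def encode_sequence(frame_feats, dim):
    """Roll the GRU over per-frame features; each state is a
    spatiotemporal code a decoder could render the target view from."""
    rng = np.random.default_rng(0)
    params = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(6)]
    h = np.zeros(dim)
    states = []
    for x in frame_feats:
        h = gru_step(h, x, *params)
        states.append(h)
    return states
```

Because each output state mixes the previous state with the current candidate through the update gate, consecutive synthesized frames share features, which is the mechanism behind the smoother transitions the abstract targets.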