LEARNABLE MASKS FOR POSE-GUIDED VIEW SYNTHESIS

Abstract

Pose-guided human view synthesis uses a target pose to generate a new view of a person. The input view and the target pose can be processed separately by UNet architectures that combine the results in a late fusion stage. UNet architectures link their encoder and decoder with skip connections that preserve the location of spatial features by injecting input information into the decoding process. However, direct skip connections may transfer irrelevant information to the decoder. We overcome this limitation with learnable masks for skip connections that encourage the decoder to use only relevant information from the encoder. We show that adding the proposed masks to UNet architectures improves view synthesis performance with only a slight increase in inference time.
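To make the idea of masked skip connections concrete, the following is a minimal sketch in PyTorch. It assumes the learnable mask is a per-pixel, per-channel sigmoid gate predicted from the encoder feature map; the class name MaskedSkip and this particular gating choice are illustrative assumptions, not the paper's exact formulation (the actual mask could, for example, also be conditioned on the target pose).

```python
# Sketch only: assumes a sigmoid gate over encoder features; the paper's
# mask may be defined differently.
import torch
import torch.nn as nn

class MaskedSkip(nn.Module):
    """Gates an encoder feature map before it is fused into the decoder."""
    def __init__(self, channels: int):
        super().__init__()
        # A 1x1 convolution followed by a sigmoid yields a mask in [0, 1]
        # for every spatial location and channel.
        self.mask = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, enc_feat: torch.Tensor, dec_feat: torch.Tensor) -> torch.Tensor:
        # Suppress irrelevant encoder information, then concatenate with
        # the decoder features as a standard UNet skip connection would.
        gated = enc_feat * self.mask(enc_feat)
        return torch.cat([gated, dec_feat], dim=1)
```

Under these assumptions, a plain skip connection of the form `torch.cat([enc_feat, dec_feat], dim=1)` inside an existing UNet would be replaced by `MaskedSkip(channels)(enc_feat, dec_feat)`, adding only a 1x1 convolution per skip, which is consistent with the claimed slight increase in inference time.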
