Existing deep calibrated photometric stereo networks aggregate
observations under different lights using pre-defined operations such as
linear projection and max pooling. While these are effective with dense
capture, such simple first-order operations often fail to capture the high-order
interactions among observations under a small number of different lights. To
tackle this issue, this paper presents a deep sparse calibrated photometric
stereo network named {\it PS-Transformer}, which leverages a learnable
self-attention mechanism to capture the complex inter-image
interactions. PS-Transformer builds upon a dual-branch design to explore both
pixel-wise and image-wise features, and each feature is trained with
intermediate surface normal supervision to maximize geometric feasibility. A
new synthetic dataset named CyclesPS+ is also presented, together with a comprehensive
analysis, for successfully training photometric stereo networks. Extensive
results on publicly available benchmark datasets demonstrate that the
surface normal prediction accuracy of the proposed method is significantly
higher than that of other state-of-the-art algorithms given the same number of input
images, and is even comparable to that of dense algorithms that take
10× as many input images.
Comment: BMVC2021. Code and supplementary material are available at
https://github.com/satoshi-ikehata/PS-Transformer-BMVC202