Hyperspectral imaging (HSI) records visual information across a wide range of
spectral wavelengths. A representative hyperspectral image acquisition
procedure performs a 3D-to-2D encoding with the coded aperture snapshot
spectral imager (CASSI) and requires a software decoder for the 3D signal
reconstruction. Examining this physical encoding procedure reveals two major
challenges that stand in the way of a high-fidelity reconstruction. (i) To
obtain 2D measurements, CASSI dislocates multiple channels by disperser
tilting and squeezes them onto the same spatial region, yielding an entangled
data loss. (ii) The physical coded aperture causes a masked data loss by
selectively blocking the pixel-wise light exposure. To
tackle these challenges, we propose a spatial-spectral (S^2-) Transformer
network with a mask-aware learning strategy. First, we simultaneously leverage
spatial and spectral attention modeling to disentangle the blended information
in the 2D measurement along both dimensions. A series of Transformer
structures are systematically designed to fully investigate the spatial and
spectral informative properties of the hyperspectral data. Second, masked
pixels induce higher prediction difficulty and should be treated differently
from unmasked ones. We therefore adaptively prioritize the loss penalty
according to the mask structure by inferring the pixel-wise reconstruction
difficulty from the mask-encoded prediction. We theoretically discuss the
distinct convergence tendencies of masked and unmasked regions under the
proposed learning strategy. Extensive experiments demonstrate that the
proposed method achieves superior reconstruction performance. Additionally, we
empirically analyze the behaviour of spatial and spectral attention under
the proposed architecture, and comprehensively examine the impact of the
mask-aware learning.

Comment: 11 pages, 16 figures, 6 tables, Code:
https://github.com/Jiamian-Wang/S2-transformer-HS
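As a rough illustration of the encoding and learning strategy described above, the following NumPy sketch simulates the CASSI shift-and-sum measurement (coded aperture masking, disperser shift, sensor integration) and a simple difficulty-weighted loss that up-weights masked pixels. The shapes, shift step, and weighting factor `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cassi_encode(cube, mask, step=1):
    """Simulate the CASSI 3D-to-2D encoding: each spectral channel of the
    (H, W, C) cube is blocked pixel-wise by the (H, W) coded aperture,
    shifted by the disperser, and summed onto one 2D measurement."""
    H, W, C = cube.shape
    meas = np.zeros((H, W + step * (C - 1)))
    for c in range(C):
        # disperser tilting dislocates channel c before integration
        meas[:, c * step : c * step + W] += cube[:, :, c] * mask
    return meas

def mask_aware_loss(pred, target, mask, alpha=2.0):
    """Illustrative difficulty-weighted L1 loss: pixels blocked by the
    aperture (mask == 0) are harder to reconstruct, so their residuals
    are up-weighted by the hypothetical factor `alpha`."""
    weight = np.where(mask == 0, alpha, 1.0)[..., None]  # broadcast over channels
    return np.mean(weight * np.abs(pred - target))
```

In this sketch the entangled data loss appears as overlapping shifted channels summed into shared sensor columns, and the masked data loss as the zero entries of `mask`; the weighted loss simply treats the two pixel populations differently during training.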