We propose a dual-stream multi-scale hybrid vision transformer (DS-MSHViT)
architecture that processes RGB and optical flow inputs for efficient sewer
defect classification. Unlike existing methods that combine the predictions of
two separate networks trained on each modality, we jointly train a single
network with two branches for RGB and motion. Our key idea is to use
self-attention regularization to harness the complementary strengths of the RGB
and motion streams. The motion stream alone struggles to generate accurate
attention maps, as motion images lack the rich visual features present in RGB
images. To address this, we introduce an attention consistency loss between
the dual streams. By leveraging motion cues through a self-attention
regularizer, we align and enhance RGB attention maps, enabling the network to
concentrate on pertinent input regions. We evaluate our method on a public
dataset and cross-validate its performance on a novel dataset. Our
method outperforms existing models that utilize either convolutional neural
networks (CNNs) or multi-scale hybrid vision transformers (MSHViTs) without
employing attention regularization between the two streams.
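
The attention consistency idea can be illustrated with a minimal sketch. The abstract does not specify the exact form of the loss, so everything here is an assumption made for illustration: the mean-squared-difference formulation, the normalization step, the function names `normalize`, `attention_consistency_loss`, and `total_loss`, and the default `weight` of 0.1 are all hypothetical. The sketch shows one plausible way a consistency term between RGB and motion attention maps could be added to a classification loss during joint training.

```python
from typing import List

Matrix = List[List[float]]


def normalize(attn: Matrix) -> Matrix:
    """Scale a non-negative attention map so that its entries sum to 1."""
    total = sum(sum(row) for row in attn)
    if total <= 0:
        raise ValueError("attention map must have positive total mass")
    return [[value / total for value in row] for row in attn]


def attention_consistency_loss(rgb_attn: Matrix, motion_attn: Matrix) -> float:
    """Mean squared difference between two normalized attention maps.

    Assumed formulation (not taken from the paper): both maps share the
    same spatial shape and are compared element by element.
    """
    same_rows = len(rgb_attn) == len(motion_attn)
    same_cols = same_rows and all(
        len(r) == len(m) for r, m in zip(rgb_attn, motion_attn)
    )
    if not same_cols:
        raise ValueError("attention maps must have the same shape")

    p = normalize(rgb_attn)
    q = normalize(motion_attn)
    count = sum(len(row) for row in p)
    squared = sum(
        (a - b) ** 2 for p_row, q_row in zip(p, q) for a, b in zip(p_row, q_row)
    )
    return squared / count


def total_loss(
    cls_loss: float, rgb_attn: Matrix, motion_attn: Matrix, weight: float = 0.1
) -> float:
    """Classification loss plus a weighted consistency regularizer.

    `weight` is a hypothetical hyperparameter balancing the two terms.
    """
    return cls_loss + weight * attention_consistency_loss(rgb_attn, motion_attn)
```

In this sketch the regularizer is zero when both streams attend to the same regions and grows as their attention maps diverge, so minimizing the combined loss pushes the two branches toward agreement while the classification term is still optimized.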