Driver distraction causes a significant number of traffic accidents every
year, resulting in economic losses and casualties. Commercial vehicles are
still far from fully autonomous, and drivers
still play an important role in operating and controlling the vehicle.
Therefore, detecting driver distraction behavior is crucial for road safety. At
present, driver distraction detection relies primarily on traditional
convolutional neural networks (CNNs) and supervised learning methods. However,
these approaches still face challenges such as the high cost of labeled datasets, a limited
ability to capture high-level semantic information, and weak generalization
performance. To address these problems, this paper proposes a new
self-supervised learning method based on masked image modeling (MIM) for driver
distraction behavior detection. First, a self-supervised learning framework
based on MIM is introduced to reduce the heavy labor and
material costs of dataset labeling. Second, the Swin
Transformer is employed as the encoder. Performance is enhanced by reconfiguring
the Swin Transformer block and adjusting the distribution of
window multi-head self-attention (W-MSA) and shifted window multi-head
self-attention (SW-MSA) heads across all stages, which also makes the model
more lightweight. Finally, various data augmentation strategies are used along
with the best random masking strategy to strengthen the model's recognition and
generalization ability. Test results on a large-scale driver distraction
behavior dataset show that the proposed self-supervised learning method
achieves an accuracy of 99.60%, approaching the performance
of advanced supervised learning methods.
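
The random masking step that masked image modeling relies on can be sketched as follows. This is a minimal NumPy illustration, not the implementation used in this paper: the function names `random_patch_mask` and `masked_l1_loss`, the 32-pixel patch size, the 0.6 mask ratio, and the L1 reconstruction loss restricted to masked pixels are all assumptions chosen for illustration.

```python
import numpy as np


def random_patch_mask(image, patch_size=32, mask_ratio=0.6, seed=0):
    """Zero out a random subset of non-overlapping square patches.

    patch_size and mask_ratio are illustrative defaults (assumptions),
    not values taken from the paper.
    Returns the masked image and the patch-level boolean mask.
    """
    h, w = image.shape[:2]
    if h % patch_size or w % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    rows, cols = h // patch_size, w // patch_size
    num_patches = rows * cols
    num_masked = int(round(mask_ratio * num_patches))

    # Pick which patches to hide, without replacement.
    rng = np.random.default_rng(seed)
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    mask = mask.reshape(rows, cols)

    # Expand the patch-level mask to pixel resolution and apply it.
    pixel_mask = np.repeat(np.repeat(mask, patch_size, axis=0), patch_size, axis=1)
    masked = image.copy()
    masked[pixel_mask] = 0
    return masked, mask


def masked_l1_loss(prediction, target, mask, patch_size=32):
    """Mean absolute error computed only over the masked pixels (assumed loss)."""
    pixel_mask = np.repeat(np.repeat(mask, patch_size, axis=0), patch_size, axis=1)
    diff = np.abs(prediction.astype(np.float64) - target.astype(np.float64))
    return diff[pixel_mask].mean()


# Example: mask a dummy 224x224 RGB frame.
frame = np.ones((224, 224, 3), dtype=np.float32)
masked_frame, patch_mask = random_patch_mask(frame)
loss = masked_l1_loss(masked_frame, frame, patch_mask)
```

In a full pipeline, an encoder such as the Swin Transformer would receive `masked_frame`, a lightweight decoder would predict the hidden pixels, and the loss would compare that prediction with the original frame, so no class labels are needed during pre-training.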