HRFuser: A Multi-resolution Sensor Fusion Architecture for 2D Object Detection

Abstract

Besides standard cameras, autonomous vehicles typically include multiple additional sensors, such as lidars and radars, which help acquire richer information for perceiving the content of the driving scene. While several recent works focus on fusing certain pairs of sensors - such as camera and lidar or camera and radar - by using architectural components specific to the examined setting, a generic and modular sensor fusion architecture is missing from the literature. In this work, we focus on 2D object detection, a fundamental high-level task which is defined on the 2D image domain, and propose HRFuser, a multi-resolution sensor fusion architecture that scales straightforwardly to an arbitrary number of input modalities. The design of HRFuser is based on state-of-the-art high-resolution networks for image-only dense prediction and incorporates a novel multi-window cross-attention block as the means to perform fusion of multiple modalities at multiple resolutions. Even though cameras alone provide very informative features for 2D detection, we demonstrate via extensive experiments on the nuScenes and Seeing Through Fog datasets that our model effectively leverages complementary features from additional modalities, substantially improving upon camera-only performance and consistently outperforming state-of-the-art fusion methods for 2D detection both in normal and adverse conditions. The source code will be made publicly available.
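To make the fusion idea in the abstract concrete, the snippet below is a minimal, hypothetical sketch of cross-attention fusion between a camera feature map and one additional modality (e.g., lidar) at a single resolution, written in PyTorch. The module name, head count, and tensor shapes are illustrative assumptions; this is not the paper's multi-window cross-attention block, which fuses multiple modalities at multiple resolutions.

```python
# Illustrative sketch only: camera features act as queries, the auxiliary
# modality provides keys/values. Names and shapes are assumptions.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse a camera feature map with another sensor's feature map
    via multi-head cross-attention at one resolution."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cam_feat: torch.Tensor, aux_feat: torch.Tensor) -> torch.Tensor:
        # cam_feat, aux_feat: (B, C, H, W) feature maps at the same resolution.
        B, C, H, W = cam_feat.shape
        q = cam_feat.flatten(2).transpose(1, 2)   # (B, H*W, C) queries from camera
        kv = aux_feat.flatten(2).transpose(1, 2)  # (B, H*W, C) keys/values from aux sensor
        fused, _ = self.attn(q, kv, kv)           # cross-attention over spatial tokens
        fused = self.norm(q + fused)              # residual connection + layer norm
        return fused.transpose(1, 2).reshape(B, C, H, W)

# Usage: fuse 64-channel camera and lidar feature maps at 32x32 resolution.
cam = torch.randn(2, 64, 32, 32)
lidar = torch.randn(2, 64, 32, 32)
out = CrossAttentionFusion(dim=64)(cam, lidar)
print(out.shape)  # torch.Size([2, 64, 32, 32])
```

Because the camera branch always supplies the queries, adding a further modality under this scheme amounts to attaching another such block, which is consistent with the abstract's claim that the architecture scales straightforwardly to an arbitrary number of input modalities.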