Multi-task visual perception has a wide range of applications in scene
understanding, such as autonomous driving. In this work, we devise an efficient
unified framework to solve multiple common perception tasks, including instance
segmentation, semantic segmentation, monocular 3D detection, and depth
estimation. Simply sharing the same visual feature representations across these
tasks impairs performance, while fully independent task-specific feature
extractors lead to parameter redundancy and latency. Thus, we design two
feature-merge branches to learn feature bases that are useful to, and thus
shared by, multiple perception tasks. Each task then takes the corresponding
feature basis as the input to its prediction head. In particular, one
feature-merge branch is designed for instance-level recognition and the other
for dense predictions. To enhance inter-branch
communication, the instance branch passes pixel-wise spatial information of
each instance to the dense branch using efficient dynamic convolution
weighting. Moreover, a simple but effective dynamic routing mechanism is
proposed to isolate task-specific features and leverage common properties among
tasks. Our proposed framework, termed D2BNet, demonstrates a unique approach to
parameter-efficient predictions for multi-task perception. In addition, as
tasks benefit from co-training with each other, our solution achieves on-par
results in partially labeled settings on nuScenes and outperforms previous
works on 3D detection and depth estimation on the Cityscapes dataset with full
supervision.
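
To illustrate the dynamic convolution weighting mentioned above, the sketch
below shows one plausible way the instance branch could pass pixel-wise spatial
information to the dense branch: per-instance vectors are treated as 1x1
convolution kernels and applied to the dense-branch feature map. This is a
minimal, hypothetical PyTorch sketch, not the authors' implementation; the
function name, tensor shapes, and framework choice are assumptions.

    import torch
    import torch.nn.functional as F

    def dynamic_conv_message(dense_feat, inst_weights):
        """Apply per-instance 1x1 dynamic convolutions to dense-branch features.

        dense_feat:   (B, C, H, W) feature map from the dense-prediction branch.
        inst_weights: (B, N, C) per-instance kernels predicted by the instance branch.
        returns:      (B, N, H, W) per-instance spatial response maps.
        """
        B, C, H, W = dense_feat.shape
        N = inst_weights.shape[1]
        outputs = []
        for b in range(B):
            # Each instance's weight vector acts as a 1x1 convolution kernel.
            kernels = inst_weights[b].view(N, C, 1, 1)
            outputs.append(F.conv2d(dense_feat[b:b + 1], kernels))  # (1, N, H, W)
        return torch.cat(outputs, dim=0)

    # Toy usage with random tensors (2 images, 64 channels, 10 instances each).
    dense_feat = torch.randn(2, 64, 32, 32)
    inst_weights = torch.randn(2, 10, 64)
    print(dynamic_conv_message(dense_feat, inst_weights).shape)  # (2, 10, 32, 32)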