Convolutional Neural Networks (CNNs) have been used successfully for important
automotive visual perception tasks, including object recognition, motion and
depth estimation, and visual SLAM. However, these tasks are typically
explored and modeled independently. In this paper, we propose a joint
multi-task network design for learning several tasks simultaneously. Our main
motivation is the computational efficiency achieved by sharing the expensive
initial convolutional layers between all tasks. Indeed, the main bottleneck in
automated driving systems is the limited processing power available on
deployment hardware. There is also some evidence of other benefits, namely
improved accuracy for some tasks and reduced development effort. The design also
scales to additional tasks, which can leverage existing features and achieve better
generalization. We survey various CNN-based solutions for visual perception
tasks in automated driving. Then we propose a unified CNN model for the
important tasks and discuss several advanced optimization and architecture
design techniques to improve the baseline model. The paper is partly a review and
partly a position paper, with several preliminary results that are promising
for future research. We first demonstrate results of multi-stream learning and
auxiliary learning, which are important ingredients for scaling to a large
multi-task model. Finally, we implement a two-stream, three-task network that
performs better in many cases than the corresponding single-task
models, while maintaining network size.

Comment: Accepted for Oral Presentation at IEEE Intelligent Transportation
Systems Conference (ITSC) 201