Semantic segmentation serves as the backbone of many vision systems, ranging
from self-driving cars and robot navigation to augmented reality and
teleconferencing. Because these systems frequently operate under stringent
latency constraints within a limited resource envelope, optimising for efficient
execution becomes important. At the same time, the heterogeneous capabilities of the target
platforms and the diverse constraints of different applications require the
design and training of multiple target-specific segmentation models, leading to
excessive maintenance costs. To this end, we propose a framework for converting
state-of-the-art segmentation CNNs to Multi-Exit Semantic Segmentation (MESS)
networks: specially trained models that employ parametrised early exits along
their depth to i) dynamically save computation during inference on easier
samples and ii) save training and maintenance cost by offering a post-training
customisable speed-accuracy trade-off. Designing and training such networks
naively can hurt performance. Thus, we propose a novel two-staged training
scheme for multi-exit networks. Furthermore, the parametrisation of MESS
enables co-optimising the number, placement and architecture of the attached
segmentation heads, along with the exit policy, at deployment time via exhaustive
search in <1 GPUh. This allows MESS to rapidly adapt to the device capabilities
and application requirements for each target use-case, offering a
train-once-deploy-everywhere solution. MESS variants achieve latency gains of
up to 2.83x with the same accuracy, or 5.33 pp higher accuracy for the same
computational budget, compared to the original backbone network. Lastly, MESS
delivers orders of magnitude faster architectural customisation, compared to
state-of-the-art techniques.

Comment: (Extended version) Accepted at ECCV 202
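
The early-exit mechanism summarised in the abstract (stopping at a shallower segmentation head on easier samples) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: the `confidence` metric (mean top-class probability over pixels), the fixed `threshold` exit policy, and the toy `stages` and `heads` are assumptions chosen only to show the control flow, not the authors' actual metric, policy, or architecture.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical representation: one class-probability vector per pixel.
ProbMap = List[List[float]]


def confidence(prob_map: ProbMap) -> float:
    """Mean top-class probability over all pixels (an assumed metric)."""
    return sum(max(pixel) for pixel in prob_map) / len(prob_map)


def multi_exit_predict(
    stages: Sequence[Callable[[object], object]],
    heads: Sequence[Callable[[object], ProbMap]],
    x: object,
    threshold: float,
) -> Tuple[ProbMap, int]:
    """Run backbone stages in order and return at the first exit whose
    confidence meets the threshold; the last exit always returns."""
    features = x
    last = len(stages) - 1
    for i, (stage, head) in enumerate(zip(stages, heads)):
        features = stage(features)      # compute up to this depth only
        prob_map = head(features)       # early-exit segmentation head
        if i == last or confidence(prob_map) >= threshold:
            return prob_map, i
    raise ValueError("at least one stage is required")


# Toy stand-ins (assumptions): the "input" is just a difficulty score, and
# deeper heads are more confident while harder inputs are less so.
def make_head(depth: int) -> Callable[[object], ProbMap]:
    def head(difficulty: object) -> ProbMap:
        p = 0.7 + 0.1 * depth - 0.15 * float(difficulty)
        return [[p, 1.0 - p] for _ in range(4)]  # 4 pixels, 2 classes
    return head


stages = [lambda f: f, lambda f: f, lambda f: f]
heads = [make_head(d) for d in range(3)]

_, easy_exit = multi_exit_predict(stages, heads, 0, threshold=0.75)
_, hard_exit = multi_exit_predict(stages, heads, 1, threshold=0.75)
print(easy_exit, hard_exit)
```

In this toy setup the easy input leaves at an intermediate exit while the hard input runs through to the final one. Raising or lowering `threshold` after training shifts the balance between computation and accuracy, which is the kind of post-training speed-accuracy trade-off the abstract refers to.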