The recent surge of interest surrounding Multimodal Neural Networks (MM-NNs)
is attributed to their ability to effectively process and integrate multiscale
information from diverse data sources. MM-NNs extract and fuse features from
multiple modalities using adequate unimodal backbones and specific fusion
networks. Although this helps strengthen the multimodal information
representation, designing such networks is labor-intensive. It requires tuning
the architectural parameters of the unimodal backbones, choosing the fusion
point, and selecting the operations for fusion. Furthermore, multimodal AI
is emerging as a cutting-edge option in Internet of Things (IoT) systems where
inference latency and energy consumption are critical metrics in addition to
accuracy. In this paper, we propose Harmonic-NAS, a framework for the joint
optimization of unimodal backbones and multimodal fusion networks with hardware
awareness on resource-constrained devices. Harmonic-NAS involves a two-tier
optimization approach for the unimodal backbone architectures and fusion
strategy and operators. With the hardware dimension incorporated into the
optimization, Harmonic-NAS outperforms state-of-the-art approaches in
evaluations on various devices and multimodal datasets, achieving up to 10.9%
accuracy improvement, 1.91x latency reduction,
and 2.14x energy efficiency gain.

Comment: Accepted to the 15th Asian Conference on Machine Learning (ACML 2023)