Automated audio captioning (AAC) generates textual descriptions of
audio content. Existing AAC models achieve good results but use only the
high-dimensional representation produced by the encoder. Because
high-dimensional representations carry a large amount of information, models
that rely on them alone often learn that information insufficiently. In this paper, a new
encoder-decoder model called the Low- and High-Dimensional Feature Fusion
(LHDFF) is proposed. LHDFF uses a new PANNs encoder called Residual PANNs
(RPANNs) to fuse low- and high-dimensional features. Low-dimensional features
contain limited information about specific audio scenes. The fusion of low- and
high-dimensional features can improve model performance by repeatedly
emphasizing specific audio scene information. To fully exploit the fused
features, LHDFF uses a dual transformer decoder structure to generate captions
in parallel. Experimental results show that LHDFF outperforms existing audio
captioning models.

Comment: INTERSPEECH 2023. arXiv admin note: substantial text overlap with
arXiv:2210.0503
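
The two ideas named in the abstract, fusing low- and high-dimensional encoder features and combining two decoders, can be illustrated with a toy sketch. This is not the LHDFF implementation: the abstract does not specify the fusion rule or how the two decoder outputs are merged, so the linear projection followed by element-wise addition and the averaging of next-token probabilities below are assumptions, and the names `project`, `residual_fuse`, and `combine_decoders` are hypothetical.

```python
from typing import List

Vector = List[float]


def project(vec: Vector, weights: List[Vector]) -> Vector:
    """Linear projection; `weights` holds one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def residual_fuse(low: Vector, high: Vector, weights: List[Vector]) -> Vector:
    """Project low-dimensional features to the high-dimensional size and add.

    Assumption: fusion is a residual-style element-wise sum.
    """
    projected = project(low, weights)
    if len(projected) != len(high):
        raise ValueError("projected features must match the high-dimensional size")
    return [p + h for p, h in zip(projected, high)]


def combine_decoders(probs_a: Vector, probs_b: Vector) -> Vector:
    """Average two next-token distributions.

    Assumption: the two decoders' outputs are merged by a simple mean.
    """
    return [(a + b) / 2 for a, b in zip(probs_a, probs_b)]


# Toy values only; real features would come from the audio encoder.
low = [1.0, 2.0]                                   # early-layer (low-dimensional) feature
high = [0.5, 0.5, 0.5]                             # final-layer (high-dimensional) feature
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # hypothetical projection matrix
fused = residual_fuse(low, high, weights)
next_token = combine_decoders([0.25, 0.75], [0.75, 0.25])
```

In this sketch one decoder would read `fused` and the other `high`, each producing a next-token distribution that `combine_decoders` merges at every step.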