Dual Transformer Decoder based Features Fusion Network for Automated Audio Captioning

Kılıç, Volkan; Liu, Xubo; Mei, Xinhao; Plumbley, Mark D.; Sun, Jianyuan; Wang, Wenwu

Authors: Volkan Kılıç
Xubo Liu
Xinhao Mei
Mark D. Plumbley
Jianyuan Sun
Wenwu Wang
Publication date: 30 May 2023

Abstract

Automated audio captioning (AAC) generates textual descriptions of audio content. Existing AAC models achieve good results but use only the high-dimensional representation produced by the encoder. Because high-dimensional representations carry a large amount of information, models that rely on them alone tend to learn that information insufficiently. In this paper, a new encoder-decoder model called the Low- and High-Dimensional Feature Fusion (LHDFF) model is proposed. LHDFF uses a new PANNs encoder, called Residual PANNs (RPANNs), to fuse low- and high-dimensional features. Low-dimensional features contain limited information about specific audio scenes. Fusing low- and high-dimensional features can improve model performance by repeatedly emphasizing specific audio scene information. To fully exploit the fused features, LHDFF uses a dual transformer decoder structure to generate captions in parallel. Experimental results show that LHDFF outperforms existing audio captioning models.

Comment: INTERSPEECH 2023. arXiv admin note: substantial text overlap with arXiv:2210.0503
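
The abstract describes two ideas: fusing low- and high-dimensional encoder features, and decoding with two branches in parallel. The toy sketch below illustrates the general shape of that data flow in plain Python. It is not the authors' implementation. The function names, the average pooling used to align the two feature sequences, the element-wise sum used for fusion, the choice of which features feed each branch, and the averaging of the two branches' scores are all assumptions made for illustration; the abstract does not specify any of them. The "decoder" is a stand-in linear scorer, not a transformer.

```python
# Toy sketch only. Pooling, fusion by addition, branch inputs, and score
# averaging are illustrative assumptions, not details taken from the paper.

def pool_to_length(frames, target_len):
    """Average-pool a list of feature frames down to target_len frames.

    Assumes len(frames) is a multiple of target_len.
    """
    step = len(frames) // target_len
    pooled = []
    for i in range(target_len):
        chunk = frames[i * step:(i + 1) * step]
        dim = len(chunk[0])
        pooled.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return pooled


def fuse(low, high):
    """Residual-style fusion (assumed): element-wise sum of time-aligned features.

    Assumes low and high frames share the same feature dimension.
    """
    low_aligned = pool_to_length(low, len(high))
    return [[l + h for l, h in zip(lf, hf)] for lf, hf in zip(low_aligned, high)]


def toy_decoder(features, weights):
    """Stand-in for a decoder: mean-pool over time, then map to vocabulary scores."""
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / len(features) for d in range(dim)]
    return [sum(w * m for w, m in zip(row, mean)) for row in weights]


def dual_decode(fused, high, w_a, w_b):
    """Run two independent decoding branches and average their scores (assumed)."""
    scores_a = toy_decoder(fused, w_a)  # branch A sees the fused features
    scores_b = toy_decoder(high, w_b)   # branch B sees high-dimensional features only
    return [(a + b) / 2 for a, b in zip(scores_a, scores_b)]


# Hypothetical inputs: 8 low-level frames and 4 high-level frames, 4 dims each.
low = [[1.0] * 4 for _ in range(8)]
high = [[2.0] * 4 for _ in range(4)]
weights = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 1, 1]]  # 3-word toy vocabulary

fused = fuse(low, high)
scores = dual_decode(fused, high, weights, weights)
print(scores)
```

In a real model the two feature maps would generally differ in channel count as well as time resolution, so a learned projection would be needed before any sum; the sketch sidesteps this by giving both inputs the same dimension.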

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2305.18753

Last time updated on 02/06/2023