SPICE+: EVALUATION OF AUTOMATIC AUDIO CAPTIONING SYSTEMS WITH PRE-TRAINED LANGUAGE MODELS

Abstract

Audio captioning aims at describing acoustic scenes with natural language. Systems are currently evaluated with the image captioning metrics CIDEr and SPICE. However, recent studies have highlighted a poor correlation of these metrics with human assessments. In this paper, we propose SPICE+, a modification of SPICE that improves caption annotation and comparison with pre-trained language models. The metric parses captions into semantic graphs with a deep dependency annotation model and a refined set of linguistic rules, then compares sentence embeddings of candidate and reference semantic elements. We formulate a score for general-purpose captioning evaluation that can be tailored to more specific applications. Combined with fluency error detection, the metric achieves competitive performance on the FENSE benchmark, with 84.0% accuracy on AudioCaps and 74.1% on Clotho. Further experiments show that the metric behaves similarly to full-sentence embedding similarity, while the decomposition into semantic elements makes scores more interpretable and can provide additional information on the properties of captioning systems.
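
The following Python sketch informally illustrates the element-level comparison the abstract describes: semantic elements extracted from a candidate and a reference caption are embedded with a sentence encoder and matched by cosine similarity. The model name, the hand-written example elements, and the soft precision/recall aggregation are illustrative assumptions, not the paper's exact pipeline.

    # Minimal sketch of embedding-based matching between semantic elements
    # of a candidate and a reference caption. Uses the sentence-transformers
    # library; the model choice and the toy "semantic elements" below are
    # assumptions for illustration, not the SPICE+ implementation.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Elements that a dependency parse might extract from each caption.
    candidate_elems = ["dog barking", "rain falling"]
    reference_elems = ["a dog barks", "water dripping", "wind blowing"]

    cand_emb = model.encode(candidate_elems, convert_to_tensor=True)
    ref_emb = model.encode(reference_elems, convert_to_tensor=True)

    # Cosine similarity matrix: rows = candidate, columns = reference.
    sim = util.cos_sim(cand_emb, ref_emb)

    # Soft precision/recall: match each element to its best counterpart.
    precision = sim.max(dim=1).values.mean().item()
    recall = sim.max(dim=0).values.mean().item()
    f_score = 2 * precision * recall / (precision + recall)
    print(f"precision={precision:.3f} recall={recall:.3f} F={f_score:.3f}")

Because the score aggregates per-element similarities rather than a single sentence-pair similarity, each element's contribution can be inspected, which is the interpretability property the abstract highlights.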