Target speech extraction aims to extract, based on a given conditioning cue,
a target speech signal that is corrupted by interfering sources, such as noise
or competing speakers. Building on the state-of-the-art (SOTA)
time-frequency speaker separation model TF-GridNet, we propose
AV-GridNet, a visually grounded variant that incorporates the face recording
of a target speaker as a conditioning factor during the extraction process.
Recognizing the inherent dissimilarities between speech and noise signals as
interfering sources, we also propose SAV-GridNet, a scenario-aware model that
first identifies the type of interfering scenario and then applies a dedicated
expert model trained specifically for that scenario. Our proposed model
achieves SOTA results on the second COG-MHEAR Audio-Visual Speech Enhancement
Challenge, outperforming other models by a significant margin, both in
objective metrics and in a listening test. We also perform an extensive
analysis of the results under
the two scenarios.