86 research outputs found
BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR
The recently proposed serialized output training (SOT) simplifies
multi-talker automatic speech recognition (ASR) by generating speaker
transcriptions separated by a special token. However, frequent speaker changes
can make speaker change prediction difficult. To address this, we propose
boundary-aware serialized output training (BA-SOT), which explicitly
incorporates boundary knowledge into the decoder via a speaker change detection
task and boundary constraint loss. We also introduce a two-stage connectionist
temporal classification (CTC) strategy that incorporates token-level SOT CTC to
restore temporal context information. Besides the typical character error rate
(CER), we introduce utterance-dependent character error rate (UD-CER) to
further measure the precision of speaker change prediction. Compared to the
original SOT, BA-SOT reduces CER/UD-CER by 5.1%/14.0%, and leveraging a
pre-trained ASR model for BA-SOT model initialization further reduces
CER/UD-CER by 8.4%/19.9%.
Comment: Accepted by INTERSPEECH 202
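The serialization idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the token name `<sc>`, the ordering of utterances by start time, and the character-level tokenization are all assumptions made for illustration. The second helper only derives a per-token 0/1 label marking where the special token occurs, which is one plausible form of target for a speaker change detection task.

```python
SC_TOKEN = "<sc>"  # hypothetical name for the special speaker-change token


def serialize_sot(utterances):
    """Join per-speaker transcriptions into one token sequence.

    utterances: list of (start_time, text) pairs, one per speaker turn.
    Assumption: turns are ordered by start time and split into characters.
    """
    ordered = sorted(utterances, key=lambda u: u[0])
    tokens = []
    for i, (_, text) in enumerate(ordered):
        if i > 0:
            tokens.append(SC_TOKEN)  # separator between consecutive turns
        tokens.extend(ch for ch in text if not ch.isspace())
    return tokens


def speaker_change_labels(tokens):
    """Return 1 at positions holding the speaker-change token, else 0."""
    return [1 if t == SC_TOKEN else 0 for t in tokens]


if __name__ == "__main__":
    seq = serialize_sot([(1.2, "bc"), (0.0, "a")])
    print(seq)
    print(speaker_change_labels(seq))
```

With frequent speaker changes, the separator token appears often in such a sequence, which is the situation the abstract identifies as difficult for the original SOT.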
Seeing Through Noise: Visually Driven Speaker Separation and Enhancement
Isolating the voice of a specific person while filtering out other voices or
background noises is challenging when video is shot in noisy environments. We
propose audio-visual methods to isolate the voice of a single speaker and
eliminate unrelated sounds. First, face motions captured in the video are used
to estimate the speaker's voice, by passing the silent video frames through a
video-to-speech neural network-based model. Then the speech predictions are
applied as a filter on the noisy input audio. This approach avoids using
mixtures of sounds in the learning process, as the number of such possible
mixtures is huge, and would inevitably bias the trained model. We evaluate our
method on two audio-visual datasets, GRID and TCD-TIMIT, and show that our
method attains significant SDR and PESQ improvements over both the raw
video-to-speech predictions and a well-known audio-only method.
Comment: Supplementary video: https://www.youtube.com/watch?v=qmsyj7vAzo
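The step of applying the speech predictions as a filter on the noisy audio can be sketched as follows. This is only an illustration under stated assumptions, not the method from the paper: it assumes both signals are available as magnitude spectrograms of the same shape (nested lists of time frames by frequency bins), and it uses a simple ratio mask clipped to 1. The function names and the `eps` parameter are hypothetical.

```python
def soft_mask(predicted_mag, noisy_mag, eps=1e-8):
    """Ratio of predicted to noisy magnitude per bin, clipped to at most 1.

    Assumption: inputs are non-negative magnitudes with identical shapes.
    """
    return [
        [min(1.0, p / (n + eps)) for p, n in zip(p_row, n_row)]
        for p_row, n_row in zip(predicted_mag, noisy_mag)
    ]


def apply_filter(predicted_mag, noisy_mag):
    """Attenuate noisy bins where the predicted speech has little energy."""
    mask = soft_mask(predicted_mag, noisy_mag)
    return [
        [m * n for m, n in zip(m_row, n_row)]
        for m_row, n_row in zip(mask, noisy_mag)
    ]


if __name__ == "__main__":
    predicted = [[1.0, 0.0], [4.0, 1.0]]  # toy video-to-speech prediction
    noisy = [[2.0, 3.0], [2.0, 1.0]]      # toy noisy input
    print(apply_filter(predicted, noisy))
```

In this sketch, bins where the prediction is silent are zeroed in the noisy input, while bins where the prediction meets or exceeds the noisy magnitude pass through unchanged.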
- …