Streaming Target-Speaker ASR with Neural Transducer
Although recent advances in deep learning technology have boosted automatic
speech recognition (ASR) performance in the single-talker case, it remains
difficult to recognize multi-talker speech in which many voices overlap. One
conventional approach to tackle this problem is to use a cascade of a speech
separation or target speech extraction front-end with an ASR back-end. However,
the extra computation costs of the front-end module are a critical barrier to
quick response, especially for streaming ASR. In this paper, we propose a
target-speaker ASR (TS-ASR) system that implicitly integrates the target speech
extraction functionality within a streaming end-to-end (E2E) ASR system, i.e.
recurrent neural network-transducer (RNNT). Our system uses an idea similar to
that adopted for target speech extraction, but implements it directly at the level
of the encoder of RNNT. This allows TS-ASR to be realized without placing extra
computation costs on the front-end. This study differs from prior work on E2E
TS-ASR in two major respects: we investigate streaming models and base our
study on Conformer models, whereas prior studies used RNN-based systems and
considered only offline processing. We confirm in experiments that our TS-ASR
achieves recognition performance comparable to that of conventional cascade
systems in the offline setting, while reducing computation costs and realizing
streaming TS-ASR.
Comment: Accepted to Interspeech 202
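The abstract describes integrating target-speaker conditioning directly into the
ASR encoder. The sketch below illustrates one common way such conditioning can be
implemented, namely modulating encoder features with an embedding derived from an
enrollment utterance; all module names, dimensions, and the multiplicative fusion
are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch (not the paper's code): conditioning a streaming ASR encoder
# on a target-speaker embedding by element-wise modulation of hidden features.
import torch
import torch.nn as nn

class SpeakerConditionedEncoderBlock(nn.Module):
    """One illustrative encoder block; a real system would use Conformer layers."""
    def __init__(self, feat_dim=256, spk_dim=128):
        super().__init__()
        self.proj = nn.Linear(feat_dim, feat_dim)
        self.spk_to_scale = nn.Linear(spk_dim, feat_dim)  # enrollment embedding -> per-feature scale

    def forward(self, x, spk_emb):
        # x: (batch, time, feat_dim) acoustic features of the mixture
        # spk_emb: (batch, spk_dim) embedding derived from an enrollment utterance
        scale = torch.sigmoid(self.spk_to_scale(spk_emb)).unsqueeze(1)  # (batch, 1, feat_dim)
        return torch.relu(self.proj(x)) * scale  # emphasize target-speaker features

# usage
block = SpeakerConditionedEncoderBlock()
mix = torch.randn(2, 100, 256)        # 100 frames of mixture features
enrollment_emb = torch.randn(2, 128)  # speaker embedding from an enrollment utterance
out = block(mix, enrollment_emb)      # (2, 100, 256), fed to the rest of the encoder
```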
Target Speech Extraction with Pre-trained Self-supervised Learning Models
Pre-trained self-supervised learning (SSL) models have achieved remarkable
success in various speech tasks. However, their potential in target speech
extraction (TSE) has not been fully exploited. TSE aims to extract the speech
of a target speaker in a mixture guided by enrollment utterances. We exploit
pre-trained SSL models for two purposes within a TSE framework, i.e., to
process the input mixture and to derive speaker embeddings from the enrollment.
In this paper, we focus on how to effectively use SSL models for TSE. We first
introduce a novel TSE downstream task following the SUPERB principles. This
simple experiment shows the potential of SSL models for TSE, but extraction
performance remains far behind the state-of-the-art. We then extend a powerful
TSE architecture by incorporating two SSL-based modules: an Adaptive Input
Enhancer (AIE) and a speaker encoder. Specifically, the proposed AIE utilizes
intermediate representations from the CNN encoder, adjusting the time
resolution of the CNN encoder and transformer blocks through progressive
upsampling to capture both fine-grained and hierarchical features. Our method
outperforms current TSE systems, achieving an SI-SDR improvement of 14.0 dB on
LibriMix. Moreover, we can further improve performance by 0.7 dB by fine-tuning
the whole model, including the SSL model parameters.
Comment: Accepted to ICASSP 202
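The AIE described above adapts the time resolution of intermediate SSL
representations before fusing them with the extraction network. A minimal sketch
of that general idea, progressive upsampling of a low-frame-rate SSL stream
followed by additive fusion, is shown below; the layer choices, shapes, and fusion
rule are assumptions for illustration, not the published design.

```python
# Minimal sketch (assumed shapes, not the paper's implementation): fuse a
# low-frame-rate SSL representation with a higher-rate TSE feature stream by
# progressive upsampling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveInputEnhancerSketch(nn.Module):
    def __init__(self, ssl_dim=768, tse_dim=256, upsample_factor=2, stages=2):
        super().__init__()
        self.stages = nn.ModuleList(
            [nn.ConvTranspose1d(ssl_dim if i == 0 else tse_dim, tse_dim,
                                kernel_size=upsample_factor * 2,
                                stride=upsample_factor,
                                padding=upsample_factor // 2)
             for i in range(stages)]
        )

    def forward(self, ssl_feats, tse_feats):
        # ssl_feats: (batch, ssl_dim, T_low)  intermediate SSL features at a low frame rate
        # tse_feats: (batch, tse_dim, T_high) finer-resolution TSE features
        x = ssl_feats
        for layer in self.stages:
            x = F.relu(layer(x))                            # progressively raise the frame rate
        x = F.interpolate(x, size=tse_feats.shape[-1])      # match lengths exactly
        return tse_feats + x                                # additive fusion of the two streams

aie = AdaptiveInputEnhancerSketch()
ssl = torch.randn(1, 768, 50)
tse = torch.randn(1, 256, 200)
fused = aie(ssl, tse)  # (1, 256, 200)
```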
SoundBeam: Target sound extraction conditioned on sound-class labels and enrollment clues for increased performance and continuous learning
In many situations, we would like to hear desired sound events (SEs) while
being able to ignore interference. Target sound extraction (TSE) aims at
tackling this problem by estimating the sound of target SE classes in a mixture
while suppressing all other sounds. We can achieve this with a neural network
that extracts the target SEs by conditioning it on clues representing the
target SE classes. Two types of clues have been proposed, i.e., target SE class
labels and enrollment sound samples similar to the target sound. Systems based
on SE class labels can directly optimize embedding vectors representing the SE
classes, resulting in high extraction performance. However, extending these
systems to the extraction of new SE classes not encountered during training is
not easy. Enrollment-based approaches extract SEs by finding sounds in the
mixtures that share characteristics similar to those of the enrollment. These approaches
do not explicitly rely on SE class definitions and can thus handle new SE
classes. In this paper, we introduce a TSE framework, SoundBeam, that combines
the advantages of both approaches. We also perform an extensive evaluation of
the different TSE schemes using synthesized and real mixtures, which shows the
potential of SoundBeam.
Comment: Submitted to IEEE/ACM Trans. Audio, Speech, and Language Processing
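A minimal sketch of the clue-combination idea, assuming a shared embedding space
in which both a class label and an enrollment recording can condition the
extraction network; the modules and the averaging rule below are illustrative,
not SoundBeam's actual implementation.

```python
# Minimal sketch (illustrative only) of conditioning a target sound extraction
# network on a sound-class label, an enrollment clip, or both, by mapping each
# clue into a shared embedding space.
import torch
import torch.nn as nn

class ClueEncoderSketch(nn.Module):
    def __init__(self, num_classes=20, emb_dim=128, enroll_feat_dim=80):
        super().__init__()
        self.class_emb = nn.Embedding(num_classes, emb_dim)      # learned per-class embeddings
        self.enroll_net = nn.GRU(enroll_feat_dim, emb_dim, batch_first=True)

    def forward(self, class_id=None, enroll_feats=None):
        clues = []
        if class_id is not None:
            clues.append(self.class_emb(class_id))               # (batch, emb_dim)
        if enroll_feats is not None:
            _, h = self.enroll_net(enroll_feats)                 # h: (1, batch, emb_dim)
            clues.append(h.squeeze(0))
        return torch.stack(clues, dim=0).mean(dim=0)             # average the available clues

enc = ClueEncoderSketch()
emb = enc(class_id=torch.tensor([3]), enroll_feats=torch.randn(1, 120, 80))
# emb (1, 128) then conditions the extraction network, e.g. by modulating its features
```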
How does end-to-end speech recognition training impact speech enhancement artifacts?
Jointly training a speech enhancement (SE) front-end and an automatic speech
recognition (ASR) back-end has been investigated as a way to mitigate the
influence of processing distortion generated by single-channel SE on
ASR. In this paper, we investigate the effect of such joint training on the
signal-level characteristics of the enhanced signals from the viewpoint of the
decomposed noise and artifact errors. The experimental analyses provide two
novel findings: 1) ASR-level training of the SE front-end reduces the artifact
errors while increasing the noise errors, and 2) simply interpolating the
enhanced and observed signals, which achieves a similar effect of reducing
artifacts and increasing noise, improves ASR performance without jointly
modifying the SE and ASR modules, even for a strong ASR back-end using a WavLM
feature extractor. Our findings provide a better understanding of the effect of
joint training and a novel insight for designing an ASR-agnostic SE front-end.
Comment: 5 pages, 1 figure, 1 table
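The interpolation mentioned in finding 2 is a simple weighted sum of the enhanced
and observed waveforms. A sketch is given below; the weight value is an arbitrary
placeholder, not the setting used in the paper.

```python
# Minimal sketch of the signal interpolation discussed above: mixing the
# enhanced output back with the observed (noisy) signal before ASR.
import numpy as np

def interpolate_signals(enhanced, observed, alpha=0.8):
    """Return alpha * enhanced + (1 - alpha) * observed.

    Keeping some of the observed signal reintroduces noise but removes part of
    the enhancement artifacts, which the paper finds is the ASR-friendly trade-off.
    The value of alpha here is illustrative only.
    """
    return alpha * enhanced + (1.0 - alpha) * observed

enhanced = np.random.randn(16000)   # 1 s of enhanced audio at 16 kHz (placeholder)
observed = np.random.randn(16000)   # the corresponding noisy observation (placeholder)
asr_input = interpolate_signals(enhanced, observed)
```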
Exploiting Attention-based Sequence-to-Sequence Architectures for Sound Event Localization
Sound event localization frameworks based on deep neural networks have shown
increased robustness with respect to reverberation and noise in comparison to
classical parametric approaches. In particular, recurrent architectures that
incorporate temporal context into the estimation process seem to be well-suited
for this task. This paper proposes a novel approach to sound event localization
by utilizing an attention-based sequence-to-sequence model. These types of
models have been successfully applied to problems in natural language
processing and automatic speech recognition. In this work, a multi-channel
audio signal is encoded to a latent representation, which is subsequently
decoded to a sequence of estimated directions-of-arrival. Herein, the attention
mechanism captures temporal dependencies in the audio signal by focusing on
specific frames that are relevant for estimating the activity and
direction-of-arrival of sound events at the current time-step. The framework is
evaluated on three publicly available datasets for sound event localization. It
yields superior localization performance compared to state-of-the-art methods
in both anechoic and reverberant conditions.
Comment: Published in Proceedings of the 28th European Signal Processing Conference (EUSIPCO), 202
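A minimal sketch of an attention-based encoder-decoder for this task, mapping
stacked multi-channel features to per-frame direction-of-arrival logits; the
layer types, dimensions, and output discretization are assumptions, and the
published model will differ.

```python
# Minimal sketch (assumed dimensions, not the published model): an
# attention-based encoder-decoder that maps multi-channel audio features to a
# sequence of direction-of-arrival (DOA) estimates.
import torch
import torch.nn as nn

class Seq2SeqDOASketch(nn.Module):
    def __init__(self, in_dim=4 * 64, hidden=128, num_doa_classes=36):
        super().__init__()
        self.encoder = nn.GRU(in_dim, hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_doa_classes)  # e.g. azimuth in 10-degree bins

    def forward(self, feats):
        # feats: (batch, time, channels * freq_bins) stacked multi-channel features
        enc, _ = self.encoder(feats)
        # each output step attends over all encoder frames
        ctx, _ = self.attn(enc, enc, enc)
        dec, _ = self.decoder(ctx)
        return self.out(dec)  # (batch, time, num_doa_classes) per-frame DOA logits

model = Seq2SeqDOASketch()
x = torch.randn(2, 100, 256)   # 2 utterances, 100 frames, 4 channels x 64 bins
doa_logits = model(x)          # (2, 100, 36)
```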