End-to-End Open Vocabulary Keyword Search With Multilingual Neural Representations
Conventional keyword search systems operate on automatic speech recognition
(ASR) outputs, which burdens them with a complex indexing and search
pipeline. This has led to interest in ASR-free approaches that simplify the
search procedure. We recently proposed a neural ASR-free keyword search model
that achieves competitive performance while maintaining an efficient,
simplified pipeline: queries and documents are encoded with a pair of
recurrent neural network encoders, and the encodings are combined with a dot
product. In this article, we extend this work with multilingual pretraining
and a detailed analysis of the model. Our experiments show that the proposed
multilingual training significantly improves model performance. Although the
model does not match a strong conventional ASR-based keyword search system on
short queries and queries composed of in-vocabulary words, it outperforms the
ASR-based system on long queries and on queries that do not appear in the
training data.
Comment: Accepted by IEEE/ACM Transactions on Audio, Speech and Language
Processing (TASLP), 202
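As a rough illustration of the dual-encoder scheme described above, the following PyTorch sketch encodes a query and a document with a pair of recurrent encoders and scores the pair with a dot product. The module sizes, the choice of GRUs, and the sigmoid readout are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Bidirectional GRU that maps a feature sequence to one embedding."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.rnn = nn.GRU(input_dim, hidden_dim, batch_first=True,
                          bidirectional=True)

    def forward(self, x):
        # x: (batch, time, input_dim) -> (batch, 2 * hidden_dim)
        _, h = self.rnn(x)
        return torch.cat([h[-2], h[-1]], dim=-1)

class KeywordSearchModel(nn.Module):
    """Scores a query/document pair with a dot product of their encodings."""
    def __init__(self, feat_dim, hidden_dim=256):
        super().__init__()
        self.query_encoder = Encoder(feat_dim, hidden_dim)
        self.doc_encoder = Encoder(feat_dim, hidden_dim)

    def forward(self, query, doc):
        q = self.query_encoder(query)              # (batch, 2 * hidden_dim)
        d = self.doc_encoder(doc)
        return torch.sigmoid((q * d).sum(dim=-1))  # detection posterior

model = KeywordSearchModel(feat_dim=40)
scores = model(torch.randn(8, 12, 40), torch.randn(8, 200, 40))  # (8,)
```

Because documents and queries are encoded independently, document encodings can be precomputed and indexed, which is what keeps the search pipeline simple.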
Quantization of variable-length spectral sequences for very low bit rate speech coding
This paper deals with the coding of spectral parameters for very low bit rate speech coding. We present a new interpretation of research previously published by Chou-Lookabaugh and Černocký-Baudoin-Chollet on the quantization of variable-length spectral sequences, known respectively as "Variable to Variable length Vector Quantization" (VVVQ) and multigram quantization (MGQ). We also studied the influence of limiting the delay introduced by the method and propose a technique to optimize performance under an imposed maximum delay; we found that a delay of 400 ms is generally sufficient. Finally, we propose introducing long sequences into the codebook through linear interpolation of short sequences.
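To make the segmentation idea behind VVVQ/MGQ concrete, here is a toy Python sketch that quantizes a spectral sequence against a codebook of variable-length entries, using dynamic programming to pick the segmentation with minimum total distortion. The random codebook, the squared-error distortion, and the offline (unbounded-delay) search are illustrative assumptions; the method discussed above additionally bounds the buffering delay (around 400 ms) and builds long codebook entries by linearly interpolating short ones.

```python
import numpy as np

def quantize_sequence(frames, codebook):
    """frames: (T, D) spectral vectors; codebook: list of (L_i, D) arrays.
    Returns the chosen codebook indices and the total distortion."""
    T = len(frames)
    best = np.full(T + 1, np.inf)      # best[t]: min distortion of frames[:t]
    best[0] = 0.0
    choice = [None] * (T + 1)          # (codebook index, length) used at t
    for t in range(1, T + 1):
        for idx, entry in enumerate(codebook):
            L = len(entry)
            if L > t:
                continue
            cost = best[t - L] + np.sum((frames[t - L:t] - entry) ** 2)
            if cost < best[t]:
                best[t] = cost
                choice[t] = (idx, L)
    indices, t = [], T                 # backtrack the optimal segmentation
    while t > 0:
        idx, L = choice[t]
        indices.append(idx)
        t -= L
    return indices[::-1], best[T]

rng = np.random.default_rng(0)
codebook = [rng.normal(size=(L, 10)) for L in (1, 2, 4)]  # variable lengths
indices, distortion = quantize_sequence(rng.normal(size=(50, 10)), codebook)
```

Transmitting one index per matched sequence rather than per frame is what drives the bit rate down; the delay constraint in the paper limits how far ahead this segmentation may look.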
Target Speech Extraction with Pre-trained Self-supervised Learning Models
Pre-trained self-supervised learning (SSL) models have achieved remarkable
success in various speech tasks. However, their potential in target speech
extraction (TSE) has not been fully exploited. TSE aims to extract the speech
of a target speaker in a mixture guided by enrollment utterances. We exploit
pre-trained SSL models for two purposes within a TSE framework, i.e., to
process the input mixture and to derive speaker embeddings from the enrollment.
In this paper, we focus on how to effectively use SSL models for TSE. We first
introduce a novel TSE downstream task following the SUPERB principles. This
simple experiment shows the potential of SSL models for TSE, but extraction
performance remains far behind the state-of-the-art. We then extend a powerful
TSE architecture by incorporating two SSL-based modules: an Adaptive Input
Enhancer (AIE) and a speaker encoder. Specifically, the proposed AIE utilizes
intermediate representations from the CNN encoder, adjusting the time
resolution of the CNN encoder and transformer blocks through progressive
upsampling to capture both fine-grained and hierarchical features. Our method
outperforms current TSE systems, achieving an SI-SDR improvement of 14.0 dB on
LibriMix. Moreover, we can further improve performance by 0.7 dB by
fine-tuning the whole model, including the SSL model parameters.
Comment: Accepted to ICASSP 202
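A hedged sketch of the fusion step such an Adaptive Input Enhancer has to perform: intermediate feature maps from the SSL front-end's CNN encoder and transformer blocks arrive at different frame rates, so each is projected, upsampled to a common time resolution, and fused with learned weights. The dimensions, layer counts, and single-shot interpolation (the work above describes progressive upsampling) are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveInputEnhancer(nn.Module):
    """Fuses multi-rate SSL features at a common time resolution."""
    def __init__(self, layer_dims, out_dim, target_len):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, out_dim) for d in layer_dims)
        self.weights = nn.Parameter(torch.zeros(len(layer_dims)))
        self.target_len = target_len

    def forward(self, feats):
        # feats: list of (batch, T_i, D_i) tensors with different T_i.
        w = torch.softmax(self.weights, dim=0)
        fused = 0.0
        for wi, proj, f in zip(w, self.proj, feats):
            f = proj(f).transpose(1, 2)             # (batch, out_dim, T_i)
            f = F.interpolate(f, size=self.target_len, mode="linear",
                              align_corners=False)  # upsample along time
            fused = fused + wi * f.transpose(1, 2)  # (batch, len, out_dim)
        return fused

# Example: a CNN-encoder map at 400 frames and a transformer map at 100.
aie = AdaptiveInputEnhancer([512, 768], out_dim=256, target_len=400)
fused = aie([torch.randn(2, 400, 512), torch.randn(2, 100, 768)])
```

Upsampling the coarse transformer features back to the CNN frame rate is what lets the extractor exploit fine-grained and hierarchical features at once.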
An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification
In recent years, the self-supervised learning paradigm has received extensive
attention due to its great success on various downstream tasks. However,
fine-tuning strategies for adapting those pre-trained models to the speaker
verification task have yet to be fully explored. In this paper, we analyze
several feature extraction approaches built on top of a pre-trained model, as
well as regularization and a learning rate schedule that stabilize the
fine-tuning process and further boost performance. Multi-head factorized
attentive pooling is proposed to factorize the comparison of speaker
representations into multiple phonetic clusters. We regularize toward the
parameters of the pre-trained model and set a different learning rate for
each of its layers during fine-tuning. The experimental results show that our
method can significantly shorten the training time to 4 hours and achieve
state-of-the-art performance: 0.59%, 0.79% and 1.77% EER on Vox1-O, Vox1-E
and Vox1-H, respectively.
Comment: Accepted by SLT202
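The two fine-tuning stabilizers mentioned above, layer-wise learning rates and regularization toward the pre-trained parameters, can be sketched in a few lines of PyTorch. The toy model, decay factor, and penalty weight below are assumptions for illustration, not the paper's settings.

```python
import copy
import torch
import torch.nn as nn

def layerwise_param_groups(model, base_lr=1e-4, decay=0.9):
    """Layers closer to the input get geometrically smaller learning rates."""
    layers = list(model.children())
    return [{"params": layer.parameters(),
             "lr": base_lr * decay ** (len(layers) - 1 - depth)}
            for depth, layer in enumerate(layers)]

def l2_to_pretrained(model, reference, weight=1e-3):
    """Penalty pulling fine-tuned weights back toward pre-trained values."""
    return weight * sum(((p - q) ** 2).sum()
                        for p, q in zip(model.parameters(),
                                        reference.parameters()))

model = nn.Sequential(*[nn.Linear(16, 16) for _ in range(4)])  # toy stand-in
pretrained = copy.deepcopy(model).requires_grad_(False)        # frozen copy
optimizer = torch.optim.AdamW(layerwise_param_groups(model))

x = torch.randn(8, 16)
loss = model(x).pow(2).mean() + l2_to_pretrained(model, pretrained)
loss.backward()
optimizer.step()
```

Both devices limit how far fine-tuning can drift from the pre-trained solution, which is what makes the short 4-hour adaptation stable.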
Probing Self-supervised Learning Models with Target Speech Extraction
Large-scale pre-trained self-supervised learning (SSL) models have shown
remarkable advancements in speech-related tasks. However, the utilization of
these models in complex multi-talker scenarios, such as extracting a target
speaker in a mixture, is yet to be fully evaluated. In this paper, we introduce
target speech extraction (TSE) as a novel downstream task to evaluate the
feature extraction capabilities of pre-trained SSL models. TSE uniquely
requires both speaker identification and speech separation, distinguishing it
from other tasks in the Speech processing Universal PERformance Benchmark
(SUPERB) evaluation. Specifically, we propose a TSE downstream model composed
of two lightweight task-oriented modules based on the same frozen SSL model.
One module functions as a speaker encoder to obtain target speaker information
from an enrollment speech, while the other estimates the target speaker's mask
to extract its speech from the mixture. Experimental results on the Libri2mix
dataset reveal the relevance of the TSE downstream task for probing SSL
models, as its performance cannot be simply deduced from other related tasks
such as speaker verification and separation.
Comment: Accepted to ICASSP 2024, Self-supervision in Audio, Speech, and
Beyond (SASB) workshop
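A minimal sketch of the probing setup described above, assuming a frozen SSL upstream feeding two lightweight heads: a speaker encoder that pools the enrollment into an embedding, and a mask estimator conditioned on that embedding. The SSL model is stubbed with a linear layer and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class TSEProbe(nn.Module):
    """Frozen SSL upstream with two lightweight task-oriented heads."""
    def __init__(self, ssl_model, ssl_dim=768, emb_dim=256):
        super().__init__()
        self.ssl = ssl_model.requires_grad_(False)  # upstream stays frozen
        self.speaker_encoder = nn.Sequential(       # enrollment -> embedding
            nn.Linear(ssl_dim, emb_dim), nn.ReLU(),
            nn.Linear(emb_dim, emb_dim))
        self.mask_estimator = nn.Sequential(        # mixture + embedding -> mask
            nn.Linear(ssl_dim + emb_dim, emb_dim), nn.ReLU(),
            nn.Linear(emb_dim, ssl_dim), nn.Sigmoid())

    def forward(self, mixture, enrollment):
        mix = self.ssl(mixture)                      # (batch, T, ssl_dim)
        enr = self.ssl(enrollment)                   # (batch, T_e, ssl_dim)
        spk = self.speaker_encoder(enr.mean(dim=1))  # temporal average pooling
        spk = spk.unsqueeze(1).expand(-1, mix.size(1), -1)
        mask = self.mask_estimator(torch.cat([mix, spk], dim=-1))
        return mask * mix                            # masked target features

probe = TSEProbe(nn.Linear(40, 768))  # linear stub for a real SSL front-end
extracted = probe(torch.randn(2, 100, 40), torch.randn(2, 60, 40))
```

Keeping the upstream frozen and the heads lightweight is what makes this a probe: extraction quality then reflects what the SSL representations already encode about speaker identity and separability.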