
    End-to-End Open Vocabulary Keyword Search With Multilingual Neural Representations

    Conventional keyword search systems operate on automatic speech recognition (ASR) outputs, which leaves them with a complex indexing and search pipeline. This has led to interest in ASR-free approaches that simplify the search procedure. We recently proposed a neural ASR-free keyword search model that achieves competitive performance while maintaining an efficient, simplified pipeline in which queries and documents are encoded with a pair of recurrent neural network encoders and the encodings are combined with a dot product. In this article, we extend that work with multilingual pretraining and a detailed analysis of the model. Our experiments show that the proposed multilingual training significantly improves model performance and that, although it does not match a strong conventional ASR-based keyword search system on short queries and queries made up of in-vocabulary words, the proposed model outperforms the ASR-based system on long queries and on queries that do not appear in the training data.
    Comment: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 202
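    The dual-encoder scoring described in this abstract can be illustrated with a minimal sketch. The encoder sizes, feature dimensions, and final sigmoid below are illustrative assumptions, not the authors' actual configuration.

```python
# Minimal sketch of a dual-encoder keyword search scorer: a query encoder and
# a document encoder each produce a fixed-size vector, and their dot product
# serves as the detection score. All sizes are toy values (assumptions).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, input_dim, hidden_dim=256):
        super().__init__()
        self.rnn = nn.GRU(input_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, x):                       # x: (batch, time, input_dim)
        _, h = self.rnn(x)                      # h: (2, batch, hidden_dim)
        return torch.cat([h[0], h[1]], dim=-1)  # (batch, 2 * hidden_dim)

query_enc = Encoder(input_dim=32)   # e.g. grapheme embeddings of the query
doc_enc = Encoder(input_dim=80)     # e.g. filterbank features of the document

query = torch.randn(1, 7, 32)       # toy query sequence
doc = torch.randn(1, 500, 80)       # toy speech document

score = (query_enc(query) * doc_enc(doc)).sum(-1)  # dot-product detection score
print(torch.sigmoid(score))         # probability-like occurrence score
```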

    Quantization of variable-length spectral sequences for very-low-bit-rate speech coding

    This paper deals with the coding of spectral parameters for very-low-bit-rate speech coding. We present a new interpretation of research previously published by Chou-Lookabaugh and Černocký-Baudoin-Chollet on the quantization of variable-length spectral sequences, under the respective names of Variable-to-Variable-length Vector Quantization (VVVQ) and multigram quantization (MGQ). We have also studied the influence of limiting the delay introduced by the method and propose a technique for optimizing performance under an imposed maximum delay; we found that a delay of 400 ms is generally sufficient. Finally, we propose introducing long sequences into the codebook by linear interpolation of short sequences.
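    As an illustration of the final idea, here is a minimal sketch (not the paper's code) of stretching a short codebook sequence to a longer one by linear interpolation along the time axis; the sequence lengths and dimensions are arbitrary toy values.

```python
# Synthesize a longer codebook sequence by linearly interpolating a short
# (T, D) spectral sequence to a target number of frames.
import numpy as np

def stretch(seq, target_len):
    """Linearly interpolate a (T, D) spectral sequence to target_len frames."""
    T, D = seq.shape
    src = np.linspace(0.0, 1.0, T)          # original frame positions
    dst = np.linspace(0.0, 1.0, target_len)  # target frame positions
    return np.stack([np.interp(dst, src, seq[:, d]) for d in range(D)], axis=1)

short = np.random.randn(4, 10)   # 4-frame sequence of 10-dim spectral vectors
stretched = stretch(short, 9)    # interpolated 9-frame candidate entry
print(stretched.shape)           # (9, 10)
```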

    Target Speech Extraction with Pre-trained Self-supervised Learning Models

    Pre-trained self-supervised learning (SSL) models have achieved remarkable success in various speech tasks. However, their potential in target speech extraction (TSE) has not been fully exploited. TSE aims to extract the speech of a target speaker from a mixture, guided by enrollment utterances. We exploit pre-trained SSL models for two purposes within a TSE framework, i.e., to process the input mixture and to derive speaker embeddings from the enrollment. In this paper, we focus on how to effectively use SSL models for TSE. We first introduce a novel TSE downstream task following the SUPERB principles. This simple experiment shows the potential of SSL models for TSE, but extraction performance remains far behind the state of the art. We then extend a powerful TSE architecture by incorporating two SSL-based modules: an Adaptive Input Enhancer (AIE) and a speaker encoder. Specifically, the proposed AIE utilizes intermediate representations from the CNN encoder, adjusting the time resolution of the CNN encoder and transformer blocks through progressive upsampling to capture both fine-grained and hierarchical features. Our method outperforms current TSE systems, achieving an SI-SDR improvement of 14.0 dB on LibriMix. Moreover, we can further improve performance by 0.7 dB by fine-tuning the whole model, including the SSL model parameters.
    Comment: Accepted to ICASSP 202
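    A rough sketch of the progressive-upsampling idea behind an AIE-like module follows; the layer dimensions, linear projections, and fusion by averaging are assumptions for illustration, not the paper's actual design.

```python
# Fuse intermediate feature maps at different time resolutions by upsampling
# the coarser ones back to the finest resolution and averaging.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressiveFusion(nn.Module):
    def __init__(self, dims, out_dim):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, out_dim) for d in dims)

    def forward(self, feats):
        # feats: list of (batch, T_i, dim_i) with decreasing T_i
        target_len = feats[0].shape[1]
        fused = 0
        for proj, f in zip(self.proj, feats):
            f = proj(f).transpose(1, 2)          # (batch, out_dim, T_i)
            f = F.interpolate(f, size=target_len, mode="linear", align_corners=False)
            fused = fused + f.transpose(1, 2)    # back to (batch, T, out_dim)
        return fused / len(feats)

fusion = ProgressiveFusion(dims=[512, 512, 768], out_dim=256)
feats = [torch.randn(1, 400, 512), torch.randn(1, 200, 512), torch.randn(1, 100, 768)]
print(fusion(feats).shape)  # torch.Size([1, 400, 256])
```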

    An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification

    In recent years, the self-supervised learning paradigm has received extensive attention due to its great success in various downstream tasks. However, fine-tuning strategies for adapting these pre-trained models to the speaker verification task have yet to be fully explored. In this paper, we analyze several feature extraction approaches built on top of a pre-trained model, as well as regularization and learning rate schedules that stabilize the fine-tuning process and further boost performance: multi-head factorized attentive pooling is proposed to factorize the comparison of speaker representations into multiple phonetic clusters, we regularize towards the parameters of the pre-trained model, and we set a different learning rate for each layer of the pre-trained model during fine-tuning. The experimental results show our method can significantly shorten the training time to 4 hours while achieving state-of-the-art performance: 0.59%, 0.79%, and 1.77% EER on Vox1-O, Vox1-E, and Vox1-H, respectively.
    Comment: Accepted by SLT202
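    One ingredient above, per-layer learning rates for the pre-trained model, is easy to sketch with optimizer parameter groups; the decay factor and the toy transformer stack are illustrative assumptions, not the paper's schedule.

```python
# Assign a different learning rate to each transformer layer during
# fine-tuning: the top layer keeps the base rate, lower layers get
# geometrically smaller rates (layer-wise decay).
import torch
import torch.nn as nn

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=6,
)

base_lr, decay = 1e-4, 0.8
groups = [
    {"params": layer.parameters(),
     "lr": base_lr * decay ** (len(encoder.layers) - 1 - i)}
    for i, layer in enumerate(encoder.layers)
]
optimizer = torch.optim.AdamW(groups)
for g in optimizer.param_groups:
    print(g["lr"])  # lower layers receive smaller learning rates
```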

    Probing Self-supervised Learning Models with Target Speech Extraction

    Large-scale pre-trained self-supervised learning (SSL) models have shown remarkable advancements in speech-related tasks. However, their use in complex multi-talker scenarios, such as extracting a target speaker from a mixture, has yet to be fully evaluated. In this paper, we introduce target speech extraction (TSE) as a novel downstream task for evaluating the feature extraction capabilities of pre-trained SSL models. TSE uniquely requires both speaker identification and speech separation, distinguishing it from other tasks in the Speech processing Universal PERformance Benchmark (SUPERB) evaluation. Specifically, we propose a TSE downstream model composed of two lightweight task-oriented modules built on the same frozen SSL model. One module functions as a speaker encoder that obtains target-speaker information from an enrollment utterance, while the other estimates the target speaker's mask to extract their speech from the mixture. Experimental results on the Libri2Mix dataset reveal the relevance of the TSE downstream task for probing SSL models, as its performance cannot simply be deduced from related tasks such as speaker verification and separation.
    Comment: Accepted to ICASSP 2024, Self-supervision in Audio, Speech, and Beyond (SASB) workshop
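    A toy sketch of the two-module layout described above follows: one head pools frozen features of the enrollment into a speaker embedding, the other predicts a mask over the mixture's features conditioned on that embedding. The FrozenSSL stand-in and all sizes are placeholders, not the actual SSL model or downstream heads.

```python
# Two lightweight heads on top of a shared frozen feature extractor:
# a speaker encoder (mean-pooled embedding) and a mask estimator.
import torch
import torch.nn as nn

class FrozenSSL(nn.Module):          # placeholder for a frozen SSL upstream
    def __init__(self, dim=768):
        super().__init__()
        self.proj = nn.Linear(80, dim)
        for p in self.parameters():
            p.requires_grad = False  # upstream stays frozen

    def forward(self, x):            # x: (batch, time, 80) features
        return self.proj(x)          # (batch, time, dim)

ssl = FrozenSSL()
speaker_head = nn.Linear(768, 256)                       # lightweight speaker encoder
mask_head = nn.Sequential(nn.Linear(768 + 256, 768), nn.Sigmoid())

mix = torch.randn(1, 300, 80)        # mixture features
enroll = torch.randn(1, 150, 80)     # enrollment features

spk = speaker_head(ssl(enroll)).mean(dim=1)              # (1, 256) speaker embedding
mix_feats = ssl(mix)                                     # (1, 300, 768)
cond = torch.cat([mix_feats, spk[:, None, :].expand(-1, 300, -1)], dim=-1)
mask = mask_head(cond)                                   # (1, 300, 768) in [0, 1]
extracted = mask * mix_feats                             # masked target-speaker features
print(extracted.shape)
```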