109 research outputs found

    Deep Learning for Distant Speech Recognition

    Deep learning is an emerging technology that is considered one of the most promising directions for reaching higher levels of artificial intelligence. Among other achievements, building computers that understand speech represents a crucial leap towards intelligent machines. Despite the great efforts of the past decades, however, natural and robust human-machine speech interaction still appears to be out of reach, especially when users interact with a distant microphone in noisy and reverberant environments. These disturbances severely hamper the intelligibility of a speech signal, making Distant Speech Recognition (DSR) one of the major open challenges in the field. This thesis addresses this scenario and proposes novel techniques, architectures, and algorithms to improve the robustness of distant-talking acoustic models. We first elaborate on methodologies for realistic data contamination, with a particular emphasis on DNN training with simulated data. We then investigate approaches for better exploiting speech contexts, proposing original methodologies for both feed-forward and recurrent neural networks. Lastly, inspired by the idea that cooperation across different DNNs could be the key to counteracting the harmful effects of noise and reverberation, we propose a novel deep learning paradigm called network of deep neural networks. The analysis of these concepts was based on extensive experimental validation conducted on both real and simulated data, considering different corpora, microphone configurations, environments, noise conditions, and ASR tasks. Comment: PhD Thesis Unitn, 201
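    The "data contamination" mentioned in this abstract is commonly done by convolving clean speech with a room impulse response and adding noise at a chosen signal-to-noise ratio. The sketch below illustrates that general recipe only; it is not taken from the thesis, and the function names (`convolve`, `power`, `contaminate`) and the SNR convention (measured against the reverberant signal) are assumptions made for illustration.

```python
import math

def convolve(signal, impulse_response):
    """Full discrete convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def power(samples):
    """Mean squared amplitude of a sample list."""
    return sum(v * v for v in samples) / len(samples)

def contaminate(clean, impulse_response, noise, snr_db):
    """Simulate distant speech: reverberate, then add noise at snr_db.

    Assumption: the SNR is defined between the reverberant speech and
    the scaled noise, and the noise is at least as long as the speech.
    """
    if len(noise) < len(clean):
        raise ValueError("noise must be at least as long as the clean signal")
    # Reverberation: convolve with the impulse response, keep original length.
    reverberant = convolve(clean, impulse_response)[:len(clean)]
    noise = noise[:len(clean)]
    # Choose the gain so that 10*log10(P_speech / P_scaled_noise) == snr_db.
    gain = math.sqrt(power(reverberant) / (power(noise) * 10 ** (snr_db / 10)))
    return [r + gain * n for r, n in zip(reverberant, noise)]
```

    In practice the impulse responses and noise recordings would be measured or simulated for the target rooms; the toy lists used to exercise this function carry no acoustic meaning.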

    Automatic Speech Recognition for Low-resource Languages and Accents Using Multilingual and Crosslingual Information

    This thesis explores methods to rapidly bootstrap automatic speech recognition systems for languages that lack resources for speech and language processing. We focus on approaches that use data from multiple languages to improve performance for those languages at different levels, such as feature extraction, acoustic modeling, and language modeling. On the application side, this thesis also includes research on non-native and code-switching speech.

    Noise-Robust Speech Recognition Using Deep Neural Network

    Ph.D., Doctor of Philosophy

    Temporally Varying Weight Regression for Speech Recognition

    Ph.D., Doctor of Philosophy

    Environmentally robust ASR front-end for deep neural network acoustic models

    This paper examines the individual and combined impacts of various front-end approaches on the performance of deep neural network (DNN) based speech recognition systems in distant-talking situations, where acoustic environmental distortion degrades recognition performance. Training a DNN-based acoustic model consists of generating state alignments and then learning the network parameters. This paper first shows that the network parameters are more sensitive to speech quality than the alignments are, and thus that the parameter-learning stage is the one requiring improvement. Then, various front-end robustness approaches to this problem are categorised by functionality. The degree to which each class of approaches affects the performance of DNN-based acoustic models is examined experimentally. Based on the results, a front-end processing pipeline is proposed for efficiently combining different classes of approaches. Using this front-end, the combined effects of different classes of approaches are further evaluated in a single distant microphone-based meeting transcription task with both speaker-independent (SI) and speaker adaptive training (SAT) set-ups. By combining multiple speech enhancement results, multiple types of features, and feature transformation, the front-end shows relative performance gains of 7.24% and 9.83% in the SI and SAT scenarios, respectively, over competitive DNN-based systems using log mel-filter bank features. This is the final version of the article. It first appeared from Elsevier via http://dx.doi.org/10.1016/j.csl.2014.11.00
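    For readers unfamiliar with how a "relative performance gain" is computed, the following is a minimal sketch, assuming the gain is a relative reduction in word error rate (WER). The abstract does not state the metric or the underlying error rates, so the function name `relative_gain` and the WER values in the example are hypothetical and chosen only to reproduce a figure of the same size.

```python
def relative_gain(baseline_wer, system_wer):
    """Relative error reduction in percent, measured against the baseline."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer

# Hypothetical error rates, not taken from the paper.
baseline = 50.0   # baseline system WER in percent (assumed)
improved = 46.38  # improved system WER in percent (assumed)
gain = round(relative_gain(baseline, improved), 2)
print(gain)
```

    A relative gain is always stated with respect to the baseline error, so the same absolute improvement yields a larger relative gain when the baseline error is lower.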

    Efficient, end-to-end and self-supervised methods for speech processing and generation

    Deep learning has affected the speech processing and generation fields in many directions. First, end-to-end architectures allow the direct injection and synthesis of waveform samples. Second, the exploration of efficient solutions allows these systems to be implemented in computationally restricted environments, such as smartphones. Finally, the latest trends exploit audio-visual data with minimal supervision. This thesis explores these three directions. First, we propose the use of recent pseudo-recurrent structures, such as self-attention models and quasi-recurrent networks, to build acoustic models for text-to-speech. The proposed system, QLAD, synthesizes faster on CPU and GPU than its recurrent counterpart while preserving synthesis quality, which is competitive with state-of-the-art vocoder-based models. Then, a generative adversarial network named SEGAN is proposed for speech enhancement. This model works as a speech-to-speech conversion system in the time domain, where a single inference operation through a fully convolutional structure processes all samples. This implies a gain in modeling efficiency with respect to other existing models, which are auto-regressive and also work in the time domain. SEGAN achieves prominent results in noise suppression and in the preservation of speech naturalness and intelligibility when compared with other classic and deep regression-based systems. We also show that SEGAN transfers efficiently to new languages and noises: a SEGAN trained on English achieves comparable performance on Catalan and Korean with only 24 seconds of adaptation data. Finally, we unveil the generative capacity of the model to recover signals from several distortions, and hence propose the concept of generalized speech enhancement. First, the model proves effective at recovering voiced speech from whispered speech.
    Then the model is scaled up to handle other distortions that require recomposing damaged parts of the signal, such as extending the bandwidth or recovering lost temporal sections, among others. The model improves when additional acoustic losses are included in a multi-task setup to impose a relevant perceptual weighting on the generated result. Moreover, a two-step training schedule is proposed to stabilize the adversarial training after the addition of such losses, and both components boost SEGAN's performance across distortions. Finally, we propose a problem-agnostic speech encoder, named PASE, together with the framework to train it. PASE is a fully convolutional network that yields compact representations from speech waveforms. These representations contain abstract information such as the speaker identity, the prosodic features, and the spoken contents. A self-supervised framework is also proposed to train this encoder, which represents a new step towards unsupervised learning for speech processing. Once the encoder is trained, it can be exported to solve different tasks that require speech as input. We first explore the performance of PASE codes on speaker recognition, emotion recognition, and speech recognition. PASE performs competitively with well-designed classic features in these tasks, especially after some supervised adaptation. Finally, PASE also provides good descriptors of identity for multi-speaker modeling in text-to-speech, which is advantageous for modeling novel identities without retraining the model. Postprint (published version)

    A Review of Deep Learning Techniques for Speech Processing

    The field of speech processing has undergone a transformative shift with the advent of deep learning. The use of multiple processing layers has enabled the creation of models capable of extracting intricate features from speech data. This development has paved the way for major advances in automatic speech recognition, text-to-speech synthesis, and emotion recognition, propelling the performance of these tasks to unprecedented heights. The power of deep learning techniques has opened up new avenues for research and innovation in speech processing, with far-reaching implications for a range of industries and applications. This review paper provides a comprehensive overview of the key deep learning models and their applications in speech-processing tasks. We begin by tracing the evolution of speech processing research, from early approaches, such as MFCC and HMM, to more recent advances in deep learning architectures, such as CNNs, RNNs, transformers, conformers, and diffusion models. We categorize the approaches and compare their strengths and weaknesses for solving speech-processing tasks. Furthermore, we extensively cover various speech-processing tasks, datasets, and benchmarks used in the literature and describe how different deep-learning networks have been used to tackle these tasks. Additionally, we discuss the challenges and future directions of deep learning in speech processing, including the need for more parameter-efficient, interpretable models and the potential of deep learning for multimodal speech processing. By examining the field's evolution, comparing and contrasting different approaches, and highlighting future directions and challenges, we hope to inspire further research in this exciting and rapidly advancing field.

    Discriminative and adaptive training for robust speech recognition and understanding

    Robust automatic speech recognition (ASR) and understanding (ASU) under various conditions remains a challenging problem even with the advances of deep learning. To achieve robust ASU, two discriminative training objectives are proposed for keyword spotting and topic classification: (1) To accurately recognize the semantically important keywords, non-uniform error cost minimum classification error training of deep neural network (DNN) and bi-directional long short-term memory (BLSTM) acoustic models is proposed to minimize the recognition errors of only the keywords. (2) To compensate for the mismatched objectives of speech recognition and understanding, minimum semantic error cost training of the BLSTM acoustic model is proposed to generate semantically accurate lattices for topic classification. Further, to extend the ASU system to various conditions, four adaptive training approaches are proposed to improve the robustness of the ASR: (1) To suppress the effect of inter-speaker variability on a speaker-independent DNN acoustic model, speaker-invariant training is proposed to learn a deep representation in the DNN that is both senone-discriminative and speaker-invariant through adversarial multi-task training. (2) To achieve condition-robust unsupervised adaptation with parallel data, adversarial teacher-student learning is proposed to suppress multiple factors of condition variability during knowledge transfer from a well-trained source-domain LSTM acoustic model to the target domain. (3) To further improve adversarial learning for unsupervised adaptation with unparallel data, domain separation networks are used to enhance the domain-invariance of the senone-discriminative deep representation by explicitly modeling the private component that is unique to each domain. (4) To achieve robust far-field ASR, an LSTM adaptive beamforming network is proposed to estimate the real-time beamforming filter coefficients to cope with non-stationary environmental noise and the dynamic nature of source and microphone positions. Ph.D