127 research outputs found

    Sound Object Recognition

    Humans are constantly exposed to a variety of acoustic stimuli, ranging from music and speech to more complex acoustic scenes such as a noisy marketplace. The human auditory perception mechanism is able to analyze these different kinds of sounds and extract meaningful information, suggesting that the same processing mechanism is capable of representing different sound classes. In this thesis, we test this hypothesis by proposing a high-dimensional sound object representation framework that captures the various modulations of sound by performing a multi-resolution mapping. We then show that this model is able to capture a wide variety of sound classes (speech, music, soundscapes) by applying it to the tasks of speech recognition, speaker verification, musical instrument recognition, and acoustic soundscape recognition. We propose a multi-resolution analysis approach that captures the detailed variations in the spectral characteristics as a basis for recognizing sound objects. We then show how such a system can be fine-tuned to capture both the message information (speech content) and the messenger information (speaker identity). This system is shown to outperform state-of-the-art systems in noise robustness on both automatic speech recognition and speaker verification tasks. The proposed analysis scheme, with its added ability to analyze temporal modulations, is used to capture musical sound objects. We show that, using a model of cortical processing, we are able to accurately replicate human perceptual similarity judgments and to achieve good classification performance on a large set of musical instruments. We also show that neither the spectral features alone nor the marginals of the proposed model are sufficient to capture human perception. Moreover, we extend this model to continuous musical recordings by proposing a new method to extract notes from the recordings.
Complex acoustic scenes such as a sports stadium have multiple sources producing sounds at the same time. We show that the proposed representation scheme not only captures these complex acoustic scenes but also provides a flexible mechanism to adapt to target sources of interest. The human auditory perception system is known to be a complex system with both bottom-up analysis pathways and top-down feedback mechanisms. The top-down feedback enhances the output of the bottom-up system to better realize the target sounds. In this thesis we propose an implementation of a top-down attention module that is complementary to the high-dimensional acoustic feature extraction mechanism. This attention module is a distributed system operating at multiple stages of representation, effectively acting as a retuning mechanism that adapts the same system to different tasks. We show that such an adaptation mechanism substantially improves the performance of the system at detecting the target source in the presence of various distracting background sources.

    A novel neural feature for a text-dependent speaker identification system

    This study proposes a novel feature for a speaker identification system, based on the simulated neural response of the auditory periphery. A well-known computational model of the auditory-nerve (AN) fiber by Zilany and colleagues, which incorporates most of the stages and the relevant nonlinearities observed in the peripheral auditory system, was employed to simulate neural responses to speech signals from different speakers. Neurograms were constructed from responses of inner-hair-cell (IHC)-AN synapses with characteristic frequencies spanning the dynamic range of hearing. The synapse responses were subjected to an analytical function to incorporate the effects of absolute and relative refractory periods. The proposed IHC-AN neurogram feature was then used to train and test the text-dependent speaker identification system using standard classifiers. The performance of the proposed method was compared with the results of existing baseline methods in both quiet and noisy conditions. While the performance of the proposed feature was comparable to that of existing methods in quiet environments, the neural feature achieved substantially better classification accuracy in noisy conditions, especially with white Gaussian and street noise. The performance of the proposed system was also relatively independent of the type of distortion in the acoustic signal and of the classifier used. The proposed feature can be employed to design a robust speech recognition system.

    Deep spiking neural networks with applications to human gesture recognition

    Spiking neural networks (SNNs), the third generation of artificial neural networks (ANNs), are a class of event-driven neuromorphic algorithms that potentially have a wide range of application domains and are applicable to a variety of extremely low-power neuromorphic hardware. The work presented in this thesis addresses the challenges of human gesture recognition using novel SNN algorithms. It discusses the design of these algorithms for human gesture recognition in both the visual and auditory domains, as well as event-based pre-processing toolkits for audio signals. For visual gesture recognition, a novel SNN-based event-driven hand gesture recognition system is proposed. This system is shown to be effective in a hand gesture recognition experiment with its spiking recurrent convolutional neural network (SCRNN) design, which combines a purpose-designed convolution operation with recurrent connectivity to maintain spatial and temporal relations in address-event-representation (AER) data. The proposed SCRNN architecture can achieve arbitrary temporal resolution, which means it can exploit temporal correlations between event collections. This design uses a backpropagation-based training algorithm and does not suffer from vanishing or exploding gradients. For audio, a novel end-to-end spiking speech emotion recognition (SER) system is proposed. This system employs MFCCs as its main speech features, together with a self-designed latency coding algorithm that efficiently converts the raw signal to AER input usable by an SNN. A two-layer spiking recurrent architecture is proposed to address temporal correlations between spike trains. The robustness of this system is supported by experiments on several public datasets, which demonstrate state-of-the-art recognition accuracy and a significant reduction in network size, computational cost, and training time.
In addition to directly contributing to neuromorphic SER, this thesis proposes a novel speech-coding algorithm based on the working mechanism of the human auditory system. The algorithm mimics the functionality of the cochlea and provides an alternative method of event-data acquisition for audio data. The algorithm is then further simplified and extended into a speech-enhancement application that is used jointly with the proposed SER system. This speech-enhancement method uses the lateral inhibition mechanism as a frequency coincidence detector to remove uncorrelated noise in the time-frequency spectrum. Experiments show the method to be effective for up to six types of noise.
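To make the latency coding step above more concrete, here is a minimal sketch of a generic time-to-first-spike encoder. It is not the thesis's self-designed algorithm, which the abstract does not describe. The function name `latency_encode`, the window length `t_max`, and the min-max normalisation are all assumptions chosen for illustration. Each feature value (for example one MFCC coefficient in a frame) becomes one event, and larger values fire earlier.

```python
def latency_encode(frame, t_max=20.0):
    """Map each feature value to a spike time: larger values fire earlier.

    Hypothetical helper: t_max (window length, arbitrary time units) and the
    min-max normalisation are illustrative assumptions.
    """
    lo, hi = min(frame), max(frame)
    span = hi - lo
    events = []
    for channel, value in enumerate(frame):
        # Normalise to [0, 1]; a constant frame maps every channel to 0.
        norm = (value - lo) / span if span > 0 else 0.0
        spike_time = (1.0 - norm) * t_max
        events.append((spike_time, channel))
    # AER-style stream: events ordered by time, each tagged with its address.
    events.sort()
    return events


# One frame with four feature channels.
events = latency_encode([0.2, 1.0, 0.6, 0.2])
for spike_time, channel in events:
    print(f"t={spike_time:5.2f}  channel={channel}")
```

The strongest channel spikes at the start of the window and the weakest channels spike at its end, so the ordering of events carries the information that the magnitudes carried in the original frame.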

    Efficient, end-to-end and self-supervised methods for speech processing and generation

    Deep learning has affected the speech processing and generation fields in many directions. First, end-to-end architectures allow the direct injection and synthesis of waveform samples. Second, the exploration of efficient solutions allows these systems to be implemented in computationally restricted environments, such as smartphones. Finally, the latest trends exploit audio-visual data with minimal supervision. This thesis explores these three directions. First, we propose the use of recent pseudo-recurrent structures, such as self-attention models and quasi-recurrent networks, to build acoustic models for text-to-speech. The proposed system, QLAD, synthesizes faster on CPU and GPU than its recurrent counterpart while preserving synthesis quality, which is competitive with state-of-the-art vocoder-based models. Then, a generative adversarial network for speech enhancement, named SEGAN, is proposed. This model works as a speech-to-speech conversion system in the time domain, where a single inference operation through a fully convolutional structure processes all samples. This implies a gain in modeling efficiency with respect to other existing models that also work in the time domain but are auto-regressive. SEGAN achieves prominent results in noise suppression and in preserving speech naturalness and intelligibility when compared with other classic and deep-regression-based systems. We also show that SEGAN is efficient in transferring its operations to new languages and noises. A SEGAN trained on English reaches comparable performance on Catalan and Korean with only 24 seconds of adaptation data. Finally, we unveil the generative capacity of the model to recover signals from several distortions, and hence propose the concept of generalized speech enhancement. First, the model proves effective at recovering voiced speech from whispered speech.
Then the model is scaled up to solve other distortions that require reconstructing damaged parts of the signal, such as extending the bandwidth or recovering lost temporal sections, among others. The model improves by including additional acoustic losses in a multi-task setup, which predicts acoustic features at the output of the GAN's discriminator network, to impose a relevant perceptual weighting on the generated result. Moreover, a two-step training schedule is proposed to stabilize the adversarial training after the addition of such losses, and both components boost SEGAN's performance across distortions. Finally, we propose a problem-agnostic speech encoder, named PASE, together with the framework to train it. PASE is a fully convolutional network that yields compact representations from speech waveforms. These representations contain abstract information such as speaker identity, prosodic features, and spoken content. A self-supervised multi-task framework is also proposed to train this encoder, which represents a new step towards unsupervised learning for speech processing. Once the encoder is trained, it can be exported to solve different tasks that require speech as input. We first explore the performance of PASE codes on speaker recognition, emotion recognition, and speech recognition. PASE performs competitively with well-designed classic features in these tasks, especially after some supervised adaptation. Finally, PASE also provides good descriptors of identity for multi-speaker modeling in text-to-speech, which is advantageous for modeling novel identities without retraining the model. Postprint (published version)

    Parts-based models and local features for automatic speech recognition

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. Cataloged from PDF version of thesis. Includes bibliographical references (p. 101-108). While automatic speech recognition (ASR) systems have steadily improved and are now in widespread use, their accuracy continues to lag behind human performance, particularly in adverse conditions. This thesis revisits the basic acoustic modeling assumptions common to most ASR systems and argues that improvements to the underlying model of speech are required to address these shortcomings. A number of problems with the standard method of hidden Markov models (HMMs) and features derived from fixed, frame-based spectra (e.g. MFCCs) are discussed. Based on these problems, a set of desirable properties of an improved acoustic model is proposed, and we present a "parts-based" framework as an alternative. The parts-based model (PBM), based on previous work in machine vision, uses graphical models to represent speech with a deformable template of spectro-temporally localized "parts", as opposed to modeling speech as a sequence of fixed spectral profiles. We discuss the proposed model's relationship to HMMs and segment-based recognizers, and describe how they can be viewed as special cases of the PBM. Two variations of PBMs are described in detail. The first represents each phonetic unit with a set of time-frequency (T-F) "patches" which act as filters over a spectrogram. The model structure encodes the patches' relative T-F positions. The second variation, referred to as a "speech schematic" model, more directly encodes the information in a spectrogram by using simple edge detectors and focusing more on modeling the constraints between parts. We demonstrate the proposed models on various isolated recognition tasks and show the benefits over baseline systems, particularly in noisy conditions and when only limited training data is available.
We discuss efficient implementation of the models and describe how they can be combined to build larger recognition systems. It is argued that the flexible templates used in parts-based modeling may provide a better generative model of speech than typical HMMs. by Kenneth Thomas Schutte. Ph.D.

    Single-Channel Speech Enhancement Based on Deep Neural Networks

    Speech enhancement (SE) aims to improve the quality of degraded speech. Recently, researchers have turned to deep learning as a primary tool for speech enhancement, typically using deterministic models trained in a supervised manner. A neural network is trained as a mapping function that converts features of noisy speech to targets from which clean speech can be reconstructed. These methods have focused on estimating the spectral magnitude of clean speech, because estimating spectral phase with neural networks is difficult due to the wrapping effect. As an alternative, complex spectrum estimation implicitly resolves the phase estimation problem and has been shown to outperform spectral magnitude estimation. In the first contribution of this thesis, a fully convolutional neural network (FCN) is proposed for complex spectrogram estimation. Stacked frequency-dilated convolution is employed to obtain an exponential growth of the receptive field in the frequency domain. The proposed network also features an efficient implementation that requires far fewer parameters than a conventional deep neural network (DNN) or convolutional neural network (CNN) while still yielding comparable performance. Speech enhancement is only useful in noisy conditions, yet conventional SE methods often do not adapt to different noise conditions. In the second contribution, we propose a model that provides an automatic "on/off" switch for speech enhancement. It is capable of scaling its computational complexity under different signal-to-noise ratio (SNR) levels by detecting clean or near-clean speech, which requires no processing. By adopting an information maximizing generative adversarial network (InfoGAN) in a deterministic, supervised manner, we incorporate the functionality of an SNR indicator into the model at little additional cost to the system.
We evaluate the proposed SE methods with two objectives: speech intelligibility and application to automatic speech recognition (ASR). Experimental results show that the CNN-based model is applicable to both objectives, while the InfoGAN-based model is more useful in terms of speech intelligibility. The experiments also show that SE for ASR may be more challenging than improving speech intelligibility, since a series of factors, including the training dataset and the neural network model, affect ASR performance.
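The exponential receptive-field growth attributed to stacked frequency-dilated convolution can be checked with a short calculation. For stride-1 layers with kernel size k and dilations d_1, ..., d_L, the receptive field along the dilated axis is 1 + (k - 1) * (d_1 + ... + d_L). The sketch below assumes a kernel size of 3 and dilations that double per layer (1, 2, 4, ...); the abstract gives neither value, so both are illustrative assumptions, as is the helper name `receptive_field`.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in frequency bins) of stacked stride-1 convolutions."""
    rf = 1
    for d in dilations:
        # Each layer widens the field by (kernel_size - 1) * dilation bins.
        rf += (kernel_size - 1) * d
    return rf


# Assumed setup: kernel size 3, dilation doubling with depth.
dilated = [receptive_field(3, [2 ** i for i in range(n)]) for n in range(1, 6)]
# Same depth without dilation, for comparison.
plain = [receptive_field(3, [1] * n) for n in range(1, 6)]

print("dilated:", dilated)  # [3, 7, 15, 31, 63]
print("plain:  ", plain)    # [3, 5, 7, 9, 11]
```

Under these assumptions five dilated layers cover 63 frequency bins, whereas five undilated layers cover 11, with the same number of weights per layer.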

    Robust speaker recognition in presence of non-trivial environmental noise (toward greater biometric security)

    The aim of this thesis is to investigate speaker recognition in the presence of environmental noise and to develop a robust speaker recognition method. Speaker recognition has recently been the subject of considerable research because of its wide use in various areas. Despite major developments in this field, many limitations and challenges remain. Environmental noise and its variations rank high among these challenges, since it is impossible to guarantee a noise-free environment. A novel approach is proposed to address the performance degradation caused by environmental noise. The approach estimates the signal-to-noise ratio (SNR) and detects the ambient noise in the recognition signal, then uses them to re-train the reference model of the claimed speaker and generate a new, noise-adapted model that reduces the noise mismatch with the recognition utterances. This approach is termed “Training on the fly” for robust speaker recognition in noisy environments. Two techniques are proposed to detect the noise in the recognition signal. The first generates an emulated noise from the estimated power spectrum of the original noise, using a 1/3-octave band filter bank and a white noise signal; the emulated noise closely approximates the original noise contained in the input (recognition) signal. The second extracts the noise from the input signal using a speech enhancement algorithm based on spectral subtraction. The training-on-the-fly approach (using both techniques) has been examined using two feature approaches and two different kinds of artificial clean and noisy speech databases collected in different environments. Furthermore, the speech samples were text independent. The training-on-the-fly approach yields a significant improvement in performance compared with conventional speaker recognition (based on clean reference models). Moreover, training on the fly based on noise extraction showed the best results for all types of noisy data.
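
The abstract above describes re-training a clean reference model with noise matched to the estimated SNR of the recognition signal. As a rough illustration only, the mixing step might be sketched as below. The function name `mix_at_snr` and the use of mean power as the level measure are assumptions made for this sketch, not details taken from the thesis.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean speech so the mixture has the requested SNR (dB).

    Hypothetical helper: the thesis does not specify how the estimated
    noise is combined with the clean training data.
    """
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Repeat or trim the noise so it covers the whole clean signal.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    # Mean power of each signal.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain that makes p_clean / (gain^2 * p_noise) equal 10^(snr_db / 10).
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    scaled_noise = gain * noise
    return clean + scaled_noise, scaled_noise
```

In such a sketch, `snr_db` would come from the SNR estimate of the recognition utterance, and `noise` from either the emulated or the extracted noise.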

    Speech Enhancement with Improved Deep Learning Methods

    In real-world environments, speech signals are often corrupted by ambient noise during acquisition, which degrades the quality and intelligibility of the speech for a listener. As one of the central topics in speech processing, speech enhancement aims to recover clean speech from such a noisy mixture. Many traditional speech enhancement methods based on statistical signal processing have been proposed and widely used in the past. However, these methods offered limited performance and thus failed in complex acoustic scenarios. Over the last decade, deep learning, as a primary tool for developing data-driven information systems, has led to revolutionary advances in speech enhancement. In this context, speech enhancement is treated as a supervised learning problem, which does not suffer from the issues faced by traditional methods. This supervised learning problem has three main components: input features, learning machine, and training target. In this thesis, various deep learning architectures and methods are developed to deal with the current limitations of these three components. First, we propose a serial hybrid neural network model integrating a new low-complexity fully convolutional neural network (CNN) and a long short-term memory (LSTM) network to estimate a phase-sensitive mask for speech enhancement. Instead of using traditional acoustic features as the input of the model, a CNN is employed to automatically extract sophisticated speech features that can maximize the performance of the model. Then, an LSTM network is chosen as the learning machine to model the strong temporal dynamics of speech. The model is designed to take full advantage of the temporal dependencies and spectral correlations present in the input speech signal while keeping the model complexity low. Also, an attention technique is embedded to adaptively recalibrate the useful CNN-extracted features. 
Through extensive comparative experiments, we show that the proposed model significantly outperforms some known neural network-based speech enhancement methods in the presence of highly non-stationary noise, while having relatively few model parameters compared to some commonly employed deep neural network (DNN)-based methods. Most of the available approaches for speech enhancement using deep neural networks face a number of limitations: they do not exploit the information contained in the phase spectrum, and their high computational complexity and memory requirements make them unsuited for real-time applications. Hence, a new phase-aware composite deep neural network (PACDNN) is proposed to address these challenges. Specifically, magnitude processing with a spectral mask and phase reconstruction using the phase derivative are proposed as key subtasks of the new network to simultaneously enhance the magnitude and phase spectra. In addition, the neural network is carefully designed to take advantage of the strong temporal and spectral dependencies of speech, while its components operate independently and in parallel to speed up the computation. The advantages of the proposed PACDNN model over some well-known DNN-based speech enhancement methods are demonstrated through extensive comparative experiments. Considering that some acoustic scenarios could be better handled using a number of low-complexity sub-DNNs, each specifically designed to perform a particular task, we propose another very low-complexity, fully convolutional framework that performs speech enhancement in the short-time modified discrete cosine transform (STMDCT) domain. This framework is made up of two main stages: classification and mapping. In the former stage, a CNN-based network is proposed to classify the input speech based on its utterance-level attributes, i.e., signal-to-noise ratio and gender. 
In the latter stage, four well-trained CNNs, each specialized for a specific and simple task, transform the STMDCT of the noisy input speech to that of the clean speech. Since this framework is designed to operate in the STMDCT domain, there is no need to deal with the phase information, i.e., no phase-related computation is required. Moreover, the training target length is only one-half of that in the previous chapters, leading to lower computational complexity and lower demands on the mapping CNNs. Although there are multiple branches in the model, only one of the expert CNNs is active at a time, i.e., the computational burden is related only to a single branch at any time. Also, the mapping CNNs are fully convolutional, and their computations are performed in parallel, thus reducing the computational time. Moreover, the proposed framework reduces the latency by 55% compared to the models in the previous chapters. Through extensive experimental studies, it is shown that the MBSE framework not only gives a superior speech enhancement performance but also has a lower complexity compared to some existing deep learning-based methods.
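
The abstract above names a phase-sensitive mask as the training target of the first model. A minimal sketch of how such a target is commonly computed from clean and noisy short-time spectra follows. It uses the widely cited definition |S|/|Y| · cos(θ_S − θ_Y), with clipping to [0, 1]. The function name, the `eps` regularizer, and the clipping range are assumptions for illustration, and the thesis may define or truncate the mask differently.

```python
import numpy as np

def phase_sensitive_mask(clean_stft, noisy_stft, eps=1e-8, clip=True):
    """Phase-sensitive mask from complex spectra of equal shape.

    Computes Re(S * conj(Y)) / |Y|^2, which equals
    |S| / |Y| * cos(theta_S - theta_Y). Illustrative helper only.
    """
    s = np.asarray(clean_stft, dtype=complex)
    y = np.asarray(noisy_stft, dtype=complex)
    # eps guards against division by zero in silent time-frequency bins.
    mask = np.real(s * np.conj(y)) / (np.abs(y) ** 2 + eps)
    if clip:
        # Truncate to [0, 1], a common (assumed) choice for training targets.
        mask = np.clip(mask, 0.0, 1.0)
    return mask
```

At inference time, a network's estimate of this mask would be multiplied element-wise with the noisy spectrum before the inverse transform.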