980 research outputs found

    Speech enhancement using deep learning

    Get PDF
    This thesis explores the possibility to achieve enhancement on noisy speech signals using Deep Neural Networks. Signal enhancement is a classic problem in speech processing. In the last years, researches using deep learning has been used in many speech processing tasks since they have provided very satisfactory results. As a first step, a Signal Analysis Module has been implemented in order to calculate the magnitude and phase of each audio file in the database. The signal is represented into its magnitude and its phase, where the magnitude is modified by the neural network, and then it is reconstructed with the original phase. The implementation of the Neural Networks is divided into two stages.The first stage was the implementation of a Speech Activity Detection Deep Neural Network (SAD-DNN). The magnitude previously calculated, applied to the noisy data, will train the SAD-DNN in order to classify each frame in speech or non-speech. This classification is useful for the network that does the final cleaning. The Speech Activity Detection Deep Neural Network is followed by a Denoising Auto-Encoder (DAE). The magnitude and the label speech or non-speech will be the input of this second Deep Neural Network in charge of denoising the speech signal. The first stage is also optimized to be adequate for the final task in this second stage. In order to do the training, Neural Networks require datasets. In this project the Timit corpus [9] has been used as dataset for the clean voice (target) and the QUT-NOISE TIMIT corpus[4] as noisy dataset (source). Finally, Signal Synthesis Module reconstructs the clean speech signal from the enhanced magnitudes and the phase. In the end, the results provided by the system have been analysed using both objective and subjective measures.Esta tesis explora la posibilidad de conseguir mejorar señales de voz con ruido utilizando Redes Neuronales Profundas. La mejora de señales es un problema clásico del procesado de señal, pero recientemente se esta investigando con deep learning, ya que son técnicas que han dado resultados muy satisfactorios en muchas tareas del procesado de señal. Como primer paso, se ha implementado un Módulo de Análisis de Señal con el objetivo de extraer el módulo y fase de cada archivo de voz de la base de datos. La señal se representa en módulo y fase, donde el módulo se modifica con la red neuronal y posteriormente se reconstruye con la fase original. La implementación de la Red Neuronal consta de dos etapas. En la primera etapa se implementó una Red Neuronal de Detección de Actividad de Voz. El módulo previamente calculado, aplicado a los datos con ruido, se utiliza como entrada para entrenar esta red, de manera que se consigue clasificar cada trama en voz o no voz. Esta clasificación es útil para la red que se encarga de hacer la limpieza. A continuación de la Red Neuronal de Detección de Actividad de Voz se implementa otra, con el objetivo de eliminar el ruido. El módulo junto con la etiqueta obtenida en la red anterior serán la entrada de esta nueva red. En esta segunda etapa también se optimiza la primera para adaptarse a la tarea final. Las Redes Neuronales requieren bases de datos para el entrenamiento. En este proyecto se ha utilizado el Timit corpus [9] como base de datos de voz limpia (objetivo) y el QUT-NOISE TIMIT [4] como base de datos con ruido (fuente). A continuación, el Módulo de Síntesis de Señal reconstruye la señal de voz limpia a partir del módulo sin ruido y la fase original.Aquesta tesis explora la possibilitat d'aconseguir millorar senyals de veu amb soroll, utilitzant Xarxes Neuronals Profundes. La millora de senyals és un problema clàssic del processat de senyal, però recentment s'està investigant amb deep learning, ja que són tècniques que han donat resultats molt satisfactoris en moltes tasques de processament de veu. Com a primer pas, s'ha implementat un Mòdul d'Anàlisi de Senyal amb l'objectiu d'extreure el mòdul i la fase de cada arxiu d'àudio de la base de dades. El senyal es representa en mòdul i fase, on el mòdul es modifica amb la xarxa neuronal i posteriorment es reconstrueix amb la fase original. La implementació de les Xarxes Neuronals consta de dues etapes. En la primera etapa es va implementar una Xarxa Neuronal de Detecció d'Activitat de Veu. El mòdul prèviament calculat, aplicat a les dades amb soroll, s'utilitza com entrada per entrenar aquesta xarxa, de manera que s'aconsegueix classificar cada trama en veu o no veu. Aquesta classificació és útil per la xarxa que fa la neteja final. A continuació de la Xarxa Neuronal de Detecció d'Activitat de Veu s'implementa una altra amb l'objectiu d'eliminar el soroll. El mòdul, juntament amb la etiqueta obtinguda en la xarxa anterior, seran l'entrada d'aquesta nova xarxa. En aquesta segona etapa també s'optimitza la primera per adaptar-se a la tasca final. Les Xarxes Neuronals requereixen bases de dades per fer l'entrenament. En aquest projecte s'ha utilitzat el Timit corpus [9] com a base de dades de veu neta (objectiu) i el QUT-NOISE TIMIT[4] com a base de dades amb soroll (font). A continuació, el Mòdul de Síntesi de Senyal reconstrueix el senyal de veu net a partir del mòdul netejat i la fase original. Finalment, els resultats obtinguts del sistema van ser analitzats utilitzant mesures objectives i subjectives

    Noise adaptive training for subspace Gaussian mixture models

    Get PDF
    Noise adaptive training (NAT) is an effective approach to normalise the environmental distortions in the training data. This paper investigates the model-based NAT scheme using joint uncertainty decoding (JUD) for subspace Gaussian mixture models (SGMMs). A typical SGMM acoustic model has much larger number of surface Gaussian components, which makes it computationally infeasible to compensate each Gaussian explicitly. JUD tackles the problem by sharing the compensation parameters among the Gaussians and hence reduces the computational and memory demands. For noise adaptive training, JUD is reformulated into a generative model, which leads to an efficient expectation-maximisation (EM) based algorithm to update the SGMM acoustic model parameters. We evaluated the SGMMs with NAT on the Aurora 4 database, and obtained higher recognition accuracy compared to systems without adaptive training. Index Terms: adaptive training, noise robustness, joint uncertainty decoding, subspace Gaussian mixture model

    Multimodal person recognition for human-vehicle interaction

    Get PDF
    Next-generation vehicles will undoubtedly feature biometric person recognition as part of an effort to improve the driving experience. Today's technology prevents such systems from operating satisfactorily under adverse conditions. A proposed framework for achieving person recognition successfully combines different biometric modalities, borne out in two case studies

    Single-channel speech enhancement using implicit Wiener filter for high-quality speech communication

    Get PDF
    Speech enables easy human-to-human communication as well as human-to-machine interaction. However, the quality of speech degrades due to background noise in the environment, such as drone noise embedded in speech during search and rescue operations. Similarly, helicopter noise, airplane noise, and station noise reduce the quality of speech. Speech enhancement algorithms reduce background noise, resulting in a crystal clear and noise-free conversation. For many applications, it is also necessary to process these noisy speech signals at the edge node level. Thus, we propose implicit Wiener filter-based algorithm for speech enhancement using edge computing system. In the proposed algorithm, a first order recursive equation is used to estimate the noise. The performance of the proposed algorithm is evaluated for two speech utterances, one uttered by a male speaker and the other by a female speaker. Both utterances are degraded by different types of non-stationary noises such as exhibition, station, drone, helicopter, airplane, and white Gaussian stationary noise with different signal-to-noise ratios. Further, we compare the performance of the proposed speech enhancement algorithm with the conventional spectral subtraction algorithm. Performance evaluations using objective speech quality measures demonstrate that the proposed speech enhancement algorithm outperforms the spectral subtraction algorithm in estimating the clean speech from the noisy speech. Finally, we implement the proposed speech enhancement algorithm, in addition to the spectral subtraction algorithm, on the Raspberry Pi 4 Model B, which is a low power edge computing device.publishedVersio

    DESIGN AND EVALUATION OF HARMONIC SPEECH ENHANCEMENT AND BANDWIDTH EXTENSION

    Get PDF
    Improving the quality and intelligibility of speech signals continues to be an important topic in mobile communications and hearing aid applications. This thesis explored the possibilities of improving the quality of corrupted speech by cascading a log Minimum Mean Square Error (logMMSE) noise reduction system with a Harmonic Speech Enhancement (HSE) system. In HSE, an adaptive comb filter is deployed to harmonically filter the useful speech signal and suppress the noisy components to noise floor. A Bandwidth Extension (BWE) algorithm was applied to the enhanced speech for further improvements in speech quality. Performance of this algorithm combination was evaluated using objective speech quality metrics across a variety of noisy and reverberant environments. Results showed that the logMMSE and HSE combination enhanced the speech quality in any reverberant environment and in the presence of multi-talker babble. The objective improvements associated with the BWE were found to be minima

    Effects of noise suppression and envelope dynamic range compression on the intelligibility of vocoded sentences for a tonal language

    Get PDF
    Vocoder simulation studies have suggested that the carrier signal type employed affects the intelligibility of vocoded speech. The present work further assessed how carrier signal type interacts with additional signal processing, namely, single-channel noise suppression and envelope dynamic range compression, in determining the intelligibility of vocoder simulations. In Experiment 1, Mandarin sentences that had been corrupted by speech spectrum-shaped noise (SSN) or two-talker babble (2TB) were processed by one of four single-channel noise-suppression algorithms before undergoing tone-vocoded (TV) or noise-vocoded (NV) processing. In Experiment 2, dynamic ranges of multiband envelope waveforms were compressed by scaling of the mean-removed envelope waveforms with a compression factor before undergoing TV or NV processing. TV Mandarin sentences yielded higher intelligibility scores with normal-hearing (NH) listeners than did noise-vocoded sentences. The intelligibility advantage of noise-suppressed vocoded speech depended on the masker type (SSN vs 2TB). NV speech was more negatively influenced by envelope dynamic range compression than was TV speech. These findings suggest that an interactional effect exists between the carrier signal type employed in the vocoding process and envelope distortion caused by signal processing

    Sound Zones as an Optimal Filtering Problem

    Get PDF

    Deep Learning-based Speech Enhancement for Real-life Applications

    Get PDF
    Speech enhancement is the process of improving speech quality and intelligibility by suppressing noise. Inspired by the outstanding performance of the deep learning approach for speech enhancement, this thesis aims to add to this research area through the following contributions. The thesis presents an experimental analysis of different deep neural networks for speech enhancement, to compare their performance and investigate factors and approaches that improve the performance. The outcomes of this analysis facilitate the development of better speech enhancement networks in this work. Moreover, this thesis proposes a new deep convolutional denoising autoencoderbased speech enhancement architecture, in which strided and dilated convolutions were applied to improve the performance while keeping network complexity to a minimum. Furthermore, a two-stage speech enhancement approach is proposed that reduces distortion, by performing a speech denoising first stage in the frequency domain, followed by a second speech reconstruction stage in the time domain. This approach was proven to reduce speech distortion, leading to better overall quality of the processed speech in comparison to state-of-the-art speech enhancement models. Finally, the work presents two deep neural network speech enhancement architectures for hearing aids and automatic speech recognition, as two real-world speech enhancement applications. A smart speech enhancement architecture was proposed for hearing aids, which is an integrated hearing aid and alert system. This architecture enhances both speech and important emergency noise, and only eliminates undesired noise. The results show that this idea is applicable to improve the performance of hearing aids. On the other hand, the architecture proposed for automatic speech recognition solves the mismatch issue between speech enhancement automatic speech recognition systems, leading to significant reduction in the word error rate of a baseline automatic speech recognition system, provided by Intelligent Voice for research purposes. In conclusion, the results presented in this thesis show promising performance for the proposed architectures for real time speech enhancement applications
    corecore