
    Two-stage Neural Network for ICASSP 2023 Speech Signal Improvement Challenge

    In the ICASSP 2023 speech signal improvement challenge, we developed a dual-stage neural model which improves the quality of speech signals degraded by different distortions in a stage-wise divide-and-conquer fashion. Specifically, in the first stage, the speech improvement network focuses on recovering the missing components of the spectrum, while in the second stage, our model aims to further suppress noise, reverberation, and artifacts introduced by the first-stage model. Achieving 0.446 in the final score and 0.517 in the P.835 score, our system ranks 4th in the non-real-time track. Comment: Accepted by ICASSP 2023
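As a rough illustration of the stage-wise idea described above (and not the authors' actual architecture), the sketch below cascades two hypothetical networks: a first stage that predicts a restored magnitude spectrum and a second stage that estimates a suppression mask from the noisy and restored spectra. All layer sizes and module names are assumptions.

```python
# Minimal two-stage cascade sketch (illustrative only, not the challenge system).
import torch
import torch.nn as nn

class RepairStage(nn.Module):          # hypothetical stage 1: restore the spectrum
    def __init__(self, n_freq=257):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_freq, 512), nn.ReLU(),
            nn.Linear(512, n_freq))    # predicts a restored magnitude spectrum
    def forward(self, mag):
        return torch.relu(self.net(mag))

class SuppressStage(nn.Module):        # hypothetical stage 2: suppress residuals
    def __init__(self, n_freq=257):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * n_freq, 512), nn.ReLU(),
            nn.Linear(512, n_freq), nn.Sigmoid())  # suppression mask in [0, 1]
    def forward(self, noisy_mag, repaired_mag):
        mask = self.net(torch.cat([noisy_mag, repaired_mag], dim=-1))
        return mask * repaired_mag     # attenuate noise, reverb, stage-1 artifacts

noisy = torch.rand(1, 100, 257)        # (batch, frames, frequency bins)
repaired = RepairStage()(noisy)
enhanced = SuppressStage()(noisy, repaired)
print(enhanced.shape)
```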

    Vocoder-Based Speech Synthesis from Silent Videos

    Both acoustic and visual information influence human perception of speech. For this reason, the lack of audio in a video sequence results in extremely low speech intelligibility for untrained lip readers. In this paper, we present a way to synthesise speech from the silent video of a talker using deep learning. The system learns a mapping function from raw video frames to acoustic features and reconstructs the speech with a vocoder synthesis algorithm. To improve speech reconstruction performance, our model is also trained to predict text information in a multi-task learning fashion, and it is able to simultaneously reconstruct and recognise speech in real time. The results in terms of estimated speech quality and intelligibility show the effectiveness of our method, which exhibits an improvement over existing video-to-speech approaches. Comment: Accepted to Interspeech 202
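A minimal sketch of the multi-task setup described above, under assumed shapes and layer sizes rather than the paper's actual model: a shared encoder maps per-frame video features to a latent sequence, one head regresses vocoder acoustic features, and a second head predicts character logits for the auxiliary recognition task.

```python
# Multi-task video-to-speech sketch (assumed dimensions, not the published model).
import torch
import torch.nn as nn

class VideoToSpeech(nn.Module):
    def __init__(self, n_acoustic=64, n_chars=30):
        super().__init__()
        self.encoder = nn.GRU(input_size=2048, hidden_size=256, batch_first=True)
        self.acoustic_head = nn.Linear(256, n_acoustic)   # vocoder features per frame
        self.text_head = nn.Linear(256, n_chars)          # character logits per frame

    def forward(self, frame_feats):
        h, _ = self.encoder(frame_feats)
        return self.acoustic_head(h), self.text_head(h)

frames = torch.rand(2, 75, 2048)       # (batch, video frames, per-frame features)
acoustic, text_logits = VideoToSpeech()(frames)
# Training would combine a regression loss on the acoustic features with a
# recognition loss on the text logits (multi-task learning).
```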

    Towards More Efficient DNN-Based Speech Enhancement Using Quantized Correlation Mask

    Many studies on deep learning-based speech enhancement (SE) utilizing the computational auditory scene analysis method typically employ the ideal binary mask or the ideal ratio mask to reconstruct the enhanced speech signal. However, many SE applications in real scenarios demand a desirable balance between denoising capability and computational cost. In this study, first, an improvement over the ideal ratio mask to attain superior SE performance is proposed by introducing an efficient adaptive correlation-based factor for adjusting the ratio mask. The proposed method exploits the correlation coefficients among the noisy speech, noise, and clean speech to effectively re-distribute the power ratio of the speech and noise during the ratio mask construction phase. Second, to make the supervised SE system more computationally efficient, quantization techniques are considered to reduce the number of bits needed to represent floating-point numbers, leading to a more compact SE model. The proposed quantized correlation mask is utilized in conjunction with a 4-layer deep neural network (DNN-QCM) comprising dropout regularization, pre-training and noise-aware training to derive a robust and high-order mapping in enhancement, and to improve generalization capability in unseen conditions. Results show that the quantized correlation mask outperforms the conventional ratio mask representation and the other SE algorithms used for comparison. When compared to a DNN with the ideal ratio mask as its learning target, the DNN-QCM provided an improvement of approximately 6.5% in the short-time objective intelligibility score and 11.0% in the perceptual evaluation of speech quality score. The introduction of the quantization method can reduce the neural network weights from a 32-bit to a 5-bit representation, while effectively suppressing stationary and non-stationary noise. Timing analyses also show that with the techniques incorporated in the proposed DNN-QCM system to increase its compac..
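The sketch below illustrates, under assumed formulas rather than the paper's exact definitions, the two ingredients named above: an ideal-ratio-style mask adjusted by a correlation-based factor computed between the noisy and clean spectra, and uniform quantization of the resulting mask to a small number of bits.

```python
# Correlation-adjusted ratio mask plus uniform quantization (illustrative forms only).
import numpy as np

def correlation_adjusted_mask(clean_mag, noise_mag, noisy_mag):
    irm = clean_mag**2 / (clean_mag**2 + noise_mag**2 + 1e-8)   # ideal ratio mask
    # Frame-wise correlation between noisy and clean magnitudes, used here as an
    # assumed stand-in for the paper's correlation-based adjustment factor.
    rho = np.array([np.corrcoef(n, c)[0, 1]
                    for n, c in zip(noisy_mag, clean_mag)])
    return np.clip(irm * rho[:, None], 0.0, 1.0)

def quantize(x, n_bits=5):
    levels = 2**n_bits - 1
    return np.round(x * levels) / levels      # uniform quantization to n_bits

clean = np.abs(np.random.randn(100, 257))     # (frames, frequency bins)
noise = np.abs(np.random.randn(100, 257))
noisy = clean + noise
mask = quantize(correlation_adjusted_mask(clean, noise, noisy), n_bits=5)
```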

    Real-time implementation of a dual microphone beamformer : a thesis presented in partial fulfilment of the requirements for the degree of Master of Engineering in Computer Systems at Massey University, Albany, New Zealand

    The main objective of this project is to develop a microphone array system that captures the speech signal for a speech-related application. This system should allow the user to move freely and acquire speech from adverse acoustic environments. The most important problem when the distance between the speaker and the microphone increases is that the quality of the speech signal is often degraded by background noise and reverberation. As a result, speech-related applications fail to perform well under these circumstances. These unwanted noise components present in the acquired signal have to be removed in order to improve the performance of these applications. This thesis describes the development of a dual microphone beamformer on a Digital Signal Processor (DSP). The development kit used in this project is the Texas Instruments TMS320C6711 DSP Starter Kit (DSK). The switched Griffiths-Jim beamformer was selected as the algorithm to be implemented on the DSK. Van Compernolle developed this algorithm in 1990 by modifying the Griffiths-Jim beamformer structure. This beamformer algorithm is used to improve the quality of the desired speech signal by reducing the background noise. The algorithm requires at least two input channels to obtain the spatial characteristics of the acquired signal. Therefore, the PCM3003 audio daughter card is used to access the two microphone signals. The software implementation of the switched Griffiths-Jim beamformer algorithm has two main stages. The first stage is to identify the presence of speech in the acquired signal. A simple Voice Activity Detector (VAD) based on the energy of the acquired signal is used to distinguish between the wanted speech signal and the unwanted noise signals. The second stage is the adaptive beamformer, which uses the results obtained from the VAD algorithm to reduce the background noise. The adaptive beamformer consists of two adaptive filters based on the Normalised Least Mean Squares (NLMS) algorithm. The first filter behaves like a beam-steering filter and is updated only while speech and noise are both present. The second filter behaves like an Adaptive Noise Canceller (ANC) and is updated only during noise-only periods. The VAD algorithm controls the updating process of these NLMS filters, and only one of the filters is updated at any given time. The algorithm was successfully implemented on the chosen DSK using the Code Composer Studio (CCS) software. The implementation was tested in real time using a speech recognition system, programmed in Visual Basic using the Microsoft Speech SDK components. This dual microphone system allows the user to move around freely and acquire the desired speech signal. The results show a reasonable amount of enhancement in the output signal, and a significant improvement in the ease of using the speech recognition system is achieved.
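A much-simplified software sketch of the switched Griffiths-Jim structure described above (filter length, the energy threshold, and the per-sample buffering are illustrative assumptions, not the thesis's DSP implementation): an energy VAD decides which NLMS filter adapts, the beam-steering filter while speech is present or the adaptive noise canceller during noise-only periods, while the beamformer output is produced at every sample.

```python
# Switched Griffiths-Jim beamformer sketch with a crude energy VAD (illustrative only).
import numpy as np

def nlms_step(w, x_buf, d, mu=0.1, eps=1e-6):
    """One NLMS update of filter w toward predicting d from x_buf (in place)."""
    e = d - w @ x_buf
    w += mu * e * x_buf / (x_buf @ x_buf + eps)
    return e

L = 32                                   # filter length (assumed)
w_steer = np.zeros(L)                    # beam-steering filter
w_anc = np.zeros(L)                      # adaptive noise canceller
mic1 = np.random.randn(16000)            # stand-in microphone signals
mic2 = np.random.randn(16000)
noise_ref = np.zeros_like(mic1)
out = np.zeros_like(mic1)

for n in range(L, len(mic1)):
    s_buf = 0.5 * (mic1[n - L + 1:n + 1] + mic2[n - L + 1:n + 1])   # sum (speech) path
    d_buf = mic1[n - L + 1:n + 1] - mic2[n - L + 1:n + 1]           # difference (noise) path
    speech = np.mean(s_buf**2) > 1e-3    # crude energy VAD with an assumed threshold

    # Noise reference: difference path with speech leakage removed by the steering
    # filter; the steering filter adapts only while speech is present.
    noise_ref[n] = d_buf[-1] - w_steer @ s_buf
    if speech:
        nlms_step(w_steer, s_buf, d_buf[-1])

    # Output: sum path minus the ANC estimate of the noise; the ANC adapts only
    # during noise-only periods, as selected by the VAD.
    r_buf = noise_ref[n - L + 1:n + 1]
    out[n] = s_buf[-1] - w_anc @ r_buf
    if not speech:
        nlms_step(w_anc, r_buf, s_buf[-1])
```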

    Development of Digital Signal Processing Platform for Digital Hearing Aid

    ABSTRACT: There has been tremendous growth over the past few years in the fields of VLSI and real-time signal processing. A number of signal processing algorithms have been developed that allow the user to process real-time signals such as speech and deliver the signal with the desired quality. The development of CMOS technology provides a wide selection of low-power FPGAs to be used in digital system design. This paper introduces an efficient method to design a Digital Programmable Hearing Aid system working in association with an analog interface module. Real-time signal processing algorithms such as feedback and echo cancellation, adaptive filtering, dynamic range compression, and a digital decimation filter, along with FFT and IFFT algorithms, are discussed using a 32-bit RISC processor core and a hardware digital signal processing block, both residing in a Xilinx FPGA. Hardware configuration and software implementation are discussed in detail. A Xilinx Spartan-3 FPGA is used in the system design. The Spartan-3 board has an inbuilt ADC inside the chip, so there is no need to use an external CODEC in the design. Finally, the challenges and the future work for additional improvement are discussed.
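As an illustration of one of the processing blocks listed above, the sketch below shows a basic dynamic range compressor with an envelope follower and a static gain curve; the parameter values and floating-point arithmetic are assumptions and do not reflect the paper's fixed-point FPGA implementation.

```python
# Basic dynamic range compressor sketch (assumed parameters, floating point).
import numpy as np

def compress(x, fs=16000, threshold_db=-30.0, ratio=4.0,
             attack_ms=5.0, release_ms=50.0):
    a_att = np.exp(-1.0 / (fs * attack_ms / 1000.0))
    a_rel = np.exp(-1.0 / (fs * release_ms / 1000.0))
    env = 0.0
    y = np.zeros_like(x)
    for n, s in enumerate(x):
        level = abs(s)
        a = a_att if level > env else a_rel
        env = a * env + (1 - a) * level                 # attack/release envelope follower
        level_db = 20 * np.log10(env + 1e-9)
        over = max(0.0, level_db - threshold_db)
        gain_db = -over * (1 - 1 / ratio)               # static compression curve
        y[n] = s * 10 ** (gain_db / 20)
    return y

tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
out = compress(tone)
```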

    Using deep learning methods for supervised speech enhancement in noisy and reverberant environments

    In real world environments, the speech signals received by our ears are usually a combination of different sounds that include not only the target speech, but also acoustic interference like music, background noise, and competing speakers. This interference has a negative effect on speech perception and degrades the performance of speech processing applications such as automatic speech recognition (ASR), speaker identification, and hearing aid devices. One way to solve this problem is to use source separation algorithms to separate the desired speech from the interfering sounds. Many source separation algorithms have been proposed to improve the performance of ASR systems and hearing aid devices, but it is still challenging for these systems to work efficiently in noisy and reverberant environments. On the other hand, humans have a remarkable ability to separate desired sounds and listen to a specific talker among noise and other talkers. Inspired by the capabilities of the human auditory system, a popular method known as auditory scene analysis (ASA) was proposed to separate different sources in a two-stage process of segmentation and grouping. The main goal of source separation in ASA is to estimate time-frequency masks that optimally match and separate noise signals from a mixture of speech and noise. In this work, multiple algorithms are proposed to improve upon source separation in noisy and reverberant acoustic environments. First, a simple and novel algorithm is proposed to increase the discriminability between two sound sources by scaling (magnifying) the head-related transfer function of the interfering source. Experimental results from applications of this algorithm show a significant increase in the quality of the recovered target speech. Second, a time-frequency masking-based source separation algorithm is proposed that can separate a male speaker from a female speaker in reverberant conditions by using the spatial cues of the source signals. Furthermore, the proposed algorithm has the ability to preserve the location of the sources after separation. Three major aims are proposed for supervised speech separation based on deep neural networks to estimate either the time-frequency masks or the clean speech spectrum. Firstly, a novel monaural acoustic feature set based on a gammatone filterbank is presented to be used as the input of the deep neural network (DNN) based speech separation model, which shows significant improvement in objective speech intelligibility and speech quality in different testing conditions. Secondly, a complementary binaural feature set is proposed to increase the ability of source separation in adverse environments with non-stationary background noise and high reverberation using 2-channel recordings. Experimental results show that the combination of spatial features with this complementary feature set significantly improves speech intelligibility and speech quality in noisy and reverberant conditions. Thirdly, a novel dilated convolutional neural network is proposed to improve the generalization of the monaural supervised speech enhancement model to different untrained speakers, unseen noises, and simulated rooms. This model increases the speech intelligibility and speech quality of the recovered speech significantly, while being computationally more efficient and requiring less memory in comparison to other models. In addition, the proposed model is modified with recurrent layers and dilated causal convolution layers for real-time processing.
This model is causal, which makes it suitable for implementation in hearing aid devices and ASR systems, while having fewer trainable parameters and using only information about previous time frames in output prediction. The main goal of the proposed algorithms is to increase the intelligibility and the quality of the speech recovered from noisy and reverberant environments, which has the potential to improve both speech processing applications and signal processing strategies for hearing aid and cochlear implant technology.
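A minimal sketch of a causal dilated convolutional enhancement model of the kind described for real-time processing (channel counts, depth, and the masking output are assumptions, not the thesis architecture): each layer pads only past frames, so the prediction at a given frame uses no future information, and the dilation doubles per layer to grow the receptive field.

```python
# Causal dilated-convolution enhancement sketch (assumed sizes, illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv(nn.Module):
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # pad past frames only
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
    def forward(self, x):                                # x: (batch, channels, frames)
        return F.relu(self.conv(F.pad(x, (self.pad, 0))))

class EnhancerSketch(nn.Module):
    def __init__(self, n_freq=161, channels=64, n_layers=6):
        super().__init__()
        self.inp = nn.Conv1d(n_freq, channels, 1)
        self.blocks = nn.Sequential(*[CausalDilatedConv(channels, dilation=2**i)
                                      for i in range(n_layers)])
        self.out = nn.Conv1d(channels, n_freq, 1)        # mask over frequency bins
    def forward(self, noisy_spec):
        return torch.sigmoid(self.out(self.blocks(self.inp(noisy_spec)))) * noisy_spec

spec = torch.rand(1, 161, 200)           # (batch, frequency bins, frames)
enhanced = EnhancerSketch()(spec)
```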