3 research outputs found

    A Compact CNN-Based Speech Enhancement With Adaptive Filter Design Using Gabor Function And Region-Aware Convolution

    Get PDF
    Speech enhancement (SE) is used in many applications, such as hearing devices, to improve speech intelligibility and quality. Convolutional neural network-based (CNN-based) SE algorithms in literature often employ generic convolutional filters that are not optimized for SE applications. This paper presents a CNN-based SE algorithm with an adaptive filter design (named ‘CNN-AFD’) using Gabor function and region-aware convolution. The proposed algorithm incorporates fixed Gabor functions into convolutional filters to model human auditory processing for improved denoising performance. The feature maps obtained from the Gabor-incorporated convolutional layers serve as learnable guided masks (tuned at backpropagation) for generating adaptive custom region-aware filters. The custom filters extract features from speech regions (i.e., ‘region-aware’) while maintaining translation-invariance. To reduce the high cost of inference of the CNN, skip convolution and activation analysis-wise pruning are explored. Employing skip convolution allowed the training time per epoch to be reduced by close to 40%. Pruning of neurons with high numbers of zero activations complements skip convolution and significantly reduces model parameters by more than 30%. The proposed CNN-AFD outperformed all four CNN-based SE baseline algorithms (i.e., a CNN-based SE employing generic filters, a CNN-based SE without region-aware convolution, a CNN-based SE trained with complex spectrograms and a CNN-based SE processing in the time-domain) with an average of 0.95, 1.82 and 0.82 in short-time objective intelligibility (STOI), perceptual evaluation of speech quality (PESQ) and logarithmic spectral distance (LSD) scores, respectively, when tasked to denoise speech contaminated with NOISEX-92 noises at −5, 0 and 5 dB signal-to-noise ratios (SNRs)

    Hardware efficient speech enhancement with noise aware multi-target deep learning

    No full text
    This paper describes a supervised speech enhancement (SE) method utilising a noise-aware four-layer deep neural network and training target switching. For optimal speech denoising, the SE system, trained with multiple-target joint learning, switches between mapping-based, masking-based, or complementary processing, depending on the level of noise contamination detected. Optimisation techniques, including ternary quantisation, structural pruning, efficient sparse matrix representation and cost-effective approximations for complex computations, were implemented to reduce area, memory, and power requirements. Up to 19.1x compression was obtained, and all weights could be stored on the on-chip memory. When processing NOISEX-92 noises, the system achieved an average short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ) scores of 0.81 and 1.62, respectively, outperforming SE algorithms trained with only a single learning target. The proposed SE processor was implemented on a field programmable gate array (FPGA) for proof of concept. Mapping the design on a 65-nm CMOS process led to a chip core area of 3.88~mm^{2} and a power consumption of 1.91 mW when operating at a 10 MHz clock frequency.</p

    A compact CNN-based speech enhancement with adaptive filter design using Gabor function and region-aware convolution

    No full text
    Speech enhancement (SE) is used in many applications, such as hearing devices, to improve speech intelligibility and quality. Convolutional neural network-based (CNN-based) SE algorithms in literature often employ generic convolutional filters that are not optimized for SE applications. This paper presents a CNN-based SE algorithm with an adaptive filter design (named 'CNN-AFD') using Gabor function and region-aware convolution. The proposed algorithm incorporates fixed Gabor functions into convolutional filters to model human auditory processing for improved denoising performance. The feature maps obtained from the Gabor-incorporated convolutional layers serve as learnable guided masks (tuned at backpropagation) for generating adaptive custom region-aware filters. The custom filters extract features from speech regions (i.e., 'region-aware') while maintaining translation-invariance. To reduce the high cost of inference of the CNN, skip convolution and activation analysis-wise pruning are explored. Employing skip convolution allowed the training time per epoch to be reduced by close to 40%. Pruning of neurons with high numbers of zero activations complements skip convolution and significantly reduces model parameters by more than 30%. The proposed CNN-AFD outperformed all four CNN-based SE baseline algorithms (i.e., a CNN-based SE employing generic filters, a CNN-based SE without region-aware convolution, a CNN-based SE trained with complex spectrograms and a CNN-based SE processing in the time-domain) with an average of 0.95, 1.82 and 0.82 in short-time objective intelligibility (STOI), perceptual evaluation of speech quality (PESQ) and logarithmic spectral distance (LSD) scores, respectively, when tasked to denoise speech contaminated with NOISEX-92 noises at -5, 0 and 5 dB signal-to-noise ratios (SNRs).</p
    corecore