158 research outputs found

    DMF-Net: A decoupling-style multi-band fusion model for full-band speech enhancement

    Full text link
    Owing to the difficulty and large computational complexity of modeling a wider range of frequency bands, full-band speech enhancement based on deep neural networks remains challenging. Previous studies usually adopt compressed full-band speech features on the Bark and ERB scales, whose relatively low frequency resolution degrades performance, especially in the high-frequency region. In this paper, we propose a decoupling-style multi-band fusion model to perform full-band speech denoising and dereverberation. Instead of optimizing the full-band speech with a single network, we decompose the full-band target into multiple sub-band speech features and then employ a multi-stage chain optimization strategy to estimate the clean spectrum stage by stage. Specifically, the low- (0-8 kHz), middle- (8-16 kHz), and high-frequency (16-24 kHz) regions are mapped by three separate sub-networks, whose outputs are then fused to obtain the full-band clean STFT spectrum. Comprehensive experiments on two public datasets demonstrate that the proposed method outperforms previous advanced systems and yields promising speech quality and intelligibility in real complex scenarios.
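    As an illustration of the band-split-and-fuse idea, here is a minimal numpy sketch, not the authors' code: it assumes a 48 kHz sampling rate and a 960-point STFT, and low_net, mid_net and high_net are hypothetical stand-ins for the three sub-networks.

        # Split a full-band STFT (freq_bins x frames) into the three regions
        # named in the abstract, then fuse the enhanced sub-bands back together.
        import numpy as np

        def split_bands(spec, sr=48000, n_fft=960):
            hz_per_bin = sr / n_fft  # 50 Hz per bin for these settings
            k8, k16 = int(8000 / hz_per_bin), int(16000 / hz_per_bin)
            return spec[:k8], spec[k8:k16], spec[k16:]

        def fuse_bands(low, mid, high):
            # Concatenate along the frequency axis to recover the full band.
            return np.concatenate([low, mid, high], axis=0)

        # low, mid, high = split_bands(noisy_spec)
        # full = fuse_bands(low_net(low), mid_net(mid), high_net(high))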

    Continuous Modeling of the Denoising Process for Speech Enhancement Based on Deep Learning

    Full text link
    In this paper, we explore a continuous modeling approach for deep-learning-based speech enhancement, focusing on the denoising process. We use a state variable to indicate the stage of denoising: the starting state is noisy speech and the ending state is clean speech. The noise component in the state variable decreases as the state index changes, until it reaches zero. During training, a UNet-like neural network learns to estimate every state variable sampled from the continuous denoising process. At test time, we feed a controlling factor, ranging from zero to one, to the neural network as an embedding, allowing us to control the level of noise reduction. This approach enables controllable speech enhancement and is adaptable to various application scenarios. Experimental results indicate that preserving a small amount of noise in the clean target benefits speech enhancement, as evidenced by improvements in both objective speech measures and automatic speech recognition performance.
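    One plausible reading of the state variable is a noise component scaled by a factor between one (noisy) and zero (clean); the sketch below illustrates that reading only, with model as a hypothetical network conditioned on the factor.

        import numpy as np

        def sample_state(clean, noise, factor):
            # State variable: clean speech plus a scaled noise component.
            return clean + factor * noise

        # Training: regress onto a randomly sampled intermediate state.
        # factor = np.random.uniform(0.0, 1.0)
        # target = sample_state(clean, noise, factor)
        # loss = ((model(noisy, factor) - target) ** 2).mean()
        # Testing: the chosen factor controls how much residual noise is kept.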

    Complexity Scaling for Speech Denoising

    Full text link
    Computational complexity is critical when deploying deep learning-based speech denoising models for on-device applications. Most prior research focused on optimizing model architectures to meet specific computational cost constraints, often creating distinct neural network architectures for different complexity limitations. This study conducts complexity scaling for speech denoising tasks, aiming to consolidate models of various complexities into a unified architecture. We present a Multi-Path Transform-based (MPT) architecture to handle both low- and high-complexity scenarios. A series of MPT networks delivers high performance across a wide range of computational complexities on the DNS challenge dataset. Moreover, inspired by the scaling experiments in natural language processing, we explore the empirical relationship between model performance and computational cost on the denoising task. As the computational complexity, measured in multiply-accumulate operations (MACs), is scaled from 50M/s to 15G/s for MPT networks, we observe that PESQ-WB and SI-SNR increase linearly with the logarithm of MACs, which may contribute to the understanding and application of complexity scaling in speech denoising tasks.
    Comment: Submitted to ICASSP202
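    The reported trend, quality linear in the logarithm of MACs, can be captured with a one-line fit; the numbers below are placeholders, not values from the paper.

        import numpy as np

        macs = np.array([50e6, 200e6, 1e9, 4e9, 15e9])  # MACs per second
        pesq = np.array([2.6, 2.9, 3.2, 3.5, 3.8])      # hypothetical PESQ-WB

        # Fit quality = slope * log10(MACs) + intercept.
        slope, intercept = np.polyfit(np.log10(macs), pesq, deg=1)
        print(slope * np.log10(2e9) + intercept)  # interpolate at 2 GMAC/s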

    Advanced Telecommunications and Signal Processing Program

    Get PDF
    Contains an introduction and reports on seven research projects. Sponsors: Advanced Telecommunications Research Program, AT&T Fellowship, GEM Fellowship, U.S. Federal Bureau of Investigation, Lucent Technologies Fellowship, Charles S. Draper Laboratory, and U.S. Navy - Office of Naval Research NDSEG Graduate Fellowship.

    Spectral Reconstruction and Noise Model Estimation Based on a Masking Model for Noise Robust Speech Recognition

    Get PDF
    An effective way to increase noise robustness in automatic speech recognition (ASR) systems is feature enhancement based on an analytical distortion model that describes the effects of noise on the speech features. One such distortion model, reported to achieve a good trade-off between accuracy and simplicity, is the masking model. Under this model, speech distortion caused by environmental noise is seen as a spectral mask, so noisy speech features are either reliable (speech is not masked by noise) or unreliable (speech is masked). In this paper, we present a detailed overview of this model and its applications to noise-robust ASR. First, using the masking model, we derive a spectral reconstruction technique aimed at enhancing the noisy speech features. Two problems must be solved to perform spectral reconstruction under the masking model: (1) mask estimation, i.e. determining the reliability of the noisy features, and (2) feature imputation, i.e. estimating speech for the unreliable features. Unlike missing-data imputation techniques, where the two problems are treated as independent, our technique addresses them jointly by exploiting a priori knowledge of the speech and noise sources in the form of a statistical model. Second, we propose an algorithm for estimating the noise model required by the feature enhancement technique. The proposed algorithm fits a Gaussian mixture model to the noise by iteratively maximising the likelihood of the noisy speech signal, so that noise can be estimated even during speech-dominated frames. A comprehensive set of experiments carried out on the Aurora-2 and Aurora-4 databases shows that the proposed method achieves significant improvements over the baseline system and other similar missing-data imputation techniques.
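    The masking model itself is compact enough to state in a few lines; the sketch below shows only the model and the reliable/unreliable split, not the paper's mask estimation or GMM-based noise estimation.

        import numpy as np

        def masking_model(log_speech, log_noise):
            # In the log-spectral domain, each noisy feature is approximated
            # by whichever source dominates: y = max(s, n).
            return np.maximum(log_speech, log_noise)

        def reliability_mask(log_speech, log_noise):
            # Reliable (True) where speech is not masked by noise.
            return log_speech > log_noise

        # Unreliable features carry only an upper bound on clean speech
        # (s <= y), which is what the feature-imputation step exploits.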

    Speech Enhancement Based on LWT and Artificial Neural Network and Using MMSE Estimate of Spectral Amplitude

    Get PDF
    In this chapter, we detail a new speech enhancement technique based on the Lifting Wavelet Transform (LWT) and an Artificial Neural Network (ANN). This technique also uses the MMSE estimate of spectral amplitude. The first step consists in applying the LWT to the noisy speech signal in order to obtain two noisy detail coefficients, cD1 and cD2, and one approximation coefficient, cA2. After that, cD1 and cD2 are denoised by soft thresholding, which requires suitable thresholds thr_j, 1 ≤ j ≤ 2; these thresholds are determined by the ANN. The soft thresholding of cD1 and cD2 yields two denoised coefficients, cDd1 and cDd2. Then the denoising technique based on the MMSE estimate of spectral amplitude is applied to the noisy approximation cA2 in order to obtain a denoised coefficient, cAd2. Finally, the enhanced speech signal is obtained by applying the inverse LWT, LWT⁻¹, to cDd1, cDd2 and cAd2. The performance of the proposed technique is justified by computing the Signal-to-Noise Ratio (SNR), Segmental SNR (SSNR) and Perceptual Evaluation of Speech Quality (PESQ).
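    The chain can be sketched with PyWavelets, with stand-ins clearly noted: a standard DWT replaces the LWT, fixed constants replace the ANN-predicted thresholds, and mmse_denoise is a placeholder for the MMSE spectral-amplitude step.

        import pywt

        def enhance(noisy, thr1=0.1, thr2=0.1, mmse_denoise=lambda c: c):
            # Two-level decomposition: approximation cA2, details cD2 and cD1.
            cA2, cD2, cD1 = pywt.wavedec(noisy, 'db4', level=2)
            # Soft-threshold the detail coefficients (ANN-chosen in the paper).
            cDd1 = pywt.threshold(cD1, thr1, mode='soft')
            cDd2 = pywt.threshold(cD2, thr2, mode='soft')
            # Denoise the approximation coefficient (MMSE step in the paper).
            cAd2 = mmse_denoise(cA2)
            # Inverse transform reconstructs the enhanced signal.
            return pywt.waverec([cAd2, cDd2, cDd1], 'db4')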

    Speech Enhancement By Exploiting The Baseband Phase Structure Of Voiced Speech For Effective Non-Stationary Noise Estimation

    Get PDF
    Speech enhancement is one of the most important and challenging problems in speech communication and signal processing. It aims to minimize the effect of additive noise on the quality and intelligibility of the speech signal. Speech quality measures how much noise remains after processing and how pleasant the resulting speech sounds, while intelligibility refers to the accuracy of understanding speech. Speech enhancement algorithms are designed to remove the additive noise with minimum speech distortion. The task is challenging because little is known about the corrupting noise, so the hardest part is estimating the noise that degrades the speech. Several approaches have been adopted for noise estimation, falling mainly into two categories, single-channel and multiple-channel algorithms, and speech enhancement algorithms are broadly classified the same way. In this thesis, speech enhancement is studied in the acoustic and modulation domains, with both amplitude and phase enhancement. We propose a noise estimation technique based on spectral sparsity, detected by using the harmonic property of voiced segments of speech. We estimate the frame-to-frame phase difference of the clean speech from the available corrupted speech and use it to detect noise-only frequency bins even in voiced frames, giving better noise estimates for highly non-stationary noises such as babble, restaurant and subway noise. This noise estimate, together with the phase difference as an additional prior, is used to extend the standard spectral subtraction algorithm. We also verify the effectiveness of this noise estimation technique when used with the Minimum Mean Squared Error Short-Time Spectral Amplitude (MMSE STSA) speech enhancement algorithm. The combination of MMSE STSA and spectral subtraction further improves speech quality.
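    The phase cue rests on a standard identity: a steady harmonic in STFT bin k advances in phase by 2*pi*hop*k/n_fft between frames. Below is a hedged numpy sketch of flagging bins that violate it; the threshold and details differ from the thesis.

        import numpy as np

        def noise_only_bins(stft, hop, n_fft, tol=1.0):
            # stft: complex array (freq_bins x frames).
            k = np.arange(stft.shape[0])[:, None]
            expected = 2 * np.pi * hop * k / n_fft
            observed = np.diff(np.angle(stft), axis=1)
            # Wrap the deviation to (-pi, pi] before thresholding.
            dev = np.angle(np.exp(1j * (observed - expected)))
            return np.abs(dev) > tol  # True where the harmonic cue fails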

    On detection of OFDM signals for cognitive radio applications

    Get PDF
    As the demand for wireless telecommunications services continues to grow, it has become increasingly important to ensure that the Radio Frequency (RF) spectrum is managed efficiently. Under the current spectrum allocation policy, portions of RF spectrum belonging to licensed users are often severely underutilised at particular times and geographical locations. Awareness of this problem has led to the development of Dynamic Spectrum Access (DSA) and Cognitive Radio (CR) as possible solutions. In one variation of the shared-use model for DSA, the inefficient use of licensed spectrum could be overcome by enabling unlicensed users to opportunistically access the spectrum when the licensed user is not transmitting. For an unlicensed device to make such decisions, it must be aware of its own RF environment; it has therefore been proposed that DSA could be enabled using CR. One approach that has been identified to allow the CR to gain information about its operating environment is spectrum sensing. An interesting solution for spectrum sensing is cyclostationary detection, which exploits the inherent periodicity of the second-order statistics of many communications signals. One of the most common modulation formats in use today is Orthogonal Frequency Division Multiplexing (OFDM), which exhibits cyclostationarity due to the addition of a Cyclic Prefix (CP). This thesis examines several statistical tests for cyclostationarity in OFDM signals that may be used for spectrum sensing in DSA and CR. In particular, focus is placed on statistical tests that rely on estimation of the Cyclic Autocorrelation Function (CAF). By splitting the CAF into two complex component functions, several new statistical tests are introduced and shown to improve detection performance over existing algorithms. The performance of each new algorithm is assessed in Additive White Gaussian Noise (AWGN) and impulsive noise, and under impairments such as multipath fading and Carrier Frequency Offset (CFO). Finally, each algorithm is targeted for Field Programmable Gate Array (FPGA) implementation using a Xilinx 7 series device. To keep resource costs to a minimum, it is suggested that the new algorithms be implemented on the FPGA using hardware sharing, and a simple mathematical rearrangement of certain test statistics is proposed to circumvent a costly division operation.
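    For reference, the CAF that these tests estimate is R_x(alpha, tau) = (1/N) * sum_n x[n] x*[n+tau] exp(-j 2 pi alpha n); for CP-OFDM a peak appears at the lag equal to the useful symbol length. A minimal estimator follows; the thesis' specific test statistics are not reproduced.

        import numpy as np

        def caf(x, alpha, tau):
            # Estimate the cyclic autocorrelation of x at normalized cyclic
            # frequency alpha (cycles/sample) and integer lag tau.
            n = np.arange(len(x) - tau)
            prod = x[:len(x) - tau] * np.conj(x[tau:])
            return np.mean(prod * np.exp(-2j * np.pi * alpha * n))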

    Echo Cancellation for Hands-Free Systems

    Get PDF

    SwG-former: Sliding-window Graph Convolutional Network Integrated with Conformer for Sound Event Localization and Detection

    Full text link
    Sound event localization and detection (SELD) is a joint task of sound event detection (SED) and direction-of-arrival (DoA) estimation. SED mainly relies on temporal dependencies to distinguish different sound classes, while DoA estimation depends on spatial correlations to estimate source directions. To jointly optimize the two subtasks, a SELD system should extract spatial correlations and model temporal dependencies simultaneously; however, many models handle the two separately. In this paper, the interdependence of spatial-temporal information in audio signals is exploited to extract both simultaneously and enhance model performance. To this end, a novel graph representation leveraging a graph convolutional network (GCN) in non-Euclidean space is developed to extract spatial-temporal information concurrently. A sliding-window graph (SwG) module is designed based on this representation: it exploits sliding windows of different sizes to learn temporal context and dynamically constructs graph vertices in the frequency-channel (F-C) domain to capture spatial correlations. Furthermore, as the cornerstone of message passing, a robust Conv2dAgg function is proposed and embedded into the SwG module to aggregate the features of neighbouring vertices. To improve SELD performance in natural spatial acoustic environments, a general and efficient SwG-former model is proposed by integrating the SwG module with the Conformer; it outperforms recent advanced SELD models. To further validate the generality and efficiency of SwG-former, it is seamlessly integrated into the event-independent network version 2 (EINV2), yielding SwG-EINV2, which surpasses state-of-the-art (SOTA) methods under the same acoustic environment.
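    As a very loose sketch of the sliding-window idea only (the actual graph construction and the Conv2dAgg aggregation are more involved), overlapping frame windows can each be pooled into a single vertex feature:

        import numpy as np

        def sliding_window_pool(feats, win=5, hop=1):
            # feats: (time, channel, freq). Mean-pool each overlapping window
            # of frames as a crude stand-in for aggregating its F-C vertices.
            t = feats.shape[0]
            return np.stack([feats[s:s + win].mean(axis=0)
                             for s in range(0, t - win + 1, hop)])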