DMF-Net: A decoupling-style multi-band fusion model for full-band speech enhancement
Because modeling a larger number of frequency bands is difficult and computationally expensive, full-band speech enhancement based on deep neural networks remains challenging. Previous studies usually adopt full-band speech features compressed to the Bark or ERB scale, whose relatively low frequency resolution degrades performance, especially in the high-frequency region. In this paper, we propose a decoupling-style multi-band fusion model to perform full-band speech denoising and dereverberation. Instead of optimizing the full-band speech with a single network structure, we decompose the full-band target into multiple sub-band speech features and then employ a multi-stage chain optimization strategy to estimate the clean spectrum stage by stage. Specifically,
the low- (0-8 kHz), middle- (8-16 kHz), and high-frequency (16-24 kHz) regions
are mapped by three separate sub-networks and are then fused to obtain the
full-band clean target STFT spectrum. Comprehensive experiments on two public
datasets demonstrate that the proposed method outperforms previous advanced
systems and yields promising performance in terms of speech quality and
intelligibility in realistic, complex scenarios.
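The paper does not include code, but the decouple-then-fuse idea can be sketched in a few lines of NumPy. The 8/16/24 kHz band boundaries follow the abstract; the helper names `split_bands`/`fuse_bands` are hypothetical, and the three learned sub-networks are replaced by identity mappings for illustration:

```python
import numpy as np

def split_bands(spec, sr=48000):
    """Split a full-band STFT (freq_bins x frames) into low (0-8 kHz),
    mid (8-16 kHz) and high (16-24 kHz) sub-band features."""
    n_bins, nyquist = spec.shape[0], sr / 2
    lo = int(n_bins * 8000 / nyquist)
    mid = int(n_bins * 16000 / nyquist)
    return spec[:lo], spec[lo:mid], spec[mid:]

def fuse_bands(low, mid, high):
    """Fuse the three sub-band estimates back into a full-band spectrum.
    In the paper each band would first pass through its own sub-network;
    here the bands are left untouched."""
    return np.concatenate([low, mid, high], axis=0)
```

With identity sub-networks, fusing the split bands reproduces the input spectrum exactly, so the decomposition itself is lossless by construction.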
Continuous Modeling of the Denoising Process for Speech Enhancement Based on Deep Learning
In this paper, we explore a continuous modeling approach for
deep-learning-based speech enhancement, focusing on the denoising process. We
use a state variable to represent the denoising process: the starting state is noisy speech, the ending state is clean speech, and the noise component in the state variable decreases as the state index advances until it reaches zero. During training, a UNet-like neural network learns to estimate
every state variable sampled from the continuous denoising process. In testing,
we introduce a controlling factor as an embedding, ranging from zero to one, to
the neural network, allowing us to control the level of noise reduction. This
approach enables controllable speech enhancement and is adaptable to various
application scenarios. Experimental results indicate that preserving a small
amount of noise in the clean target benefits speech enhancement, as evidenced
by improvements in both objective speech measures and automatic speech
recognition performance.
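The described state variable admits a direct sketch. The linear interpolation between the clean and noise components and the name `denoising_state` are illustrative assumptions; the abstract only specifies that the noise component shrinks to zero as the state moves from the noisy endpoint to the clean one, with a control factor between zero and one:

```python
import numpy as np

def denoising_state(clean, noise, t):
    """State variable of the denoising process for control factor t in [0, 1]:
    t = 0 returns the noisy mixture, t = 1 returns clean speech, and
    intermediate t keeps a scaled residue of the noise."""
    return clean + (1.0 - t) * noise
```

At inference, feeding a control factor slightly below one leaves a small amount of residual noise, which the abstract reports can benefit both objective measures and ASR.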
Complexity Scaling for Speech Denoising
Computational complexity is critical when deploying deep learning-based
speech denoising models for on-device applications. Most prior research focused
on optimizing model architectures to meet specific computational cost
constraints, often creating distinct neural network architectures for different
complexity limitations. This study conducts complexity scaling for speech
denoising tasks, aiming to consolidate models with various complexities into a
unified architecture. We present a Multi-Path Transform-based (MPT)
architecture to handle both low- and high-complexity scenarios. A series of MPT networks achieves high performance across a wide range of computational complexities on the DNS challenge dataset. Moreover, inspired by the scaling
experiments in natural language processing, we explore the empirical
relationship between model performance and computational cost on the denoising
task. As the complexity number of multiply-accumulate operations (MACs) is
scaled from 50M/s to 15G/s on MPT networks, we observe a linear increase in the
values of PESQ-WB and SI-SNR, proportional to the logarithm of MACs, which
might contribute to the understanding and application of complexity scaling in
speech denoising tasks.Comment: Submitted to ICASSP202
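The reported log-linear trend can be checked with an ordinary least-squares fit of the quality score against log10(MACs). The numbers below are synthetic stand-ins generated from an assumed log-linear rule, not the paper's measurements:

```python
import numpy as np

# Synthetic (MACs/s, quality) pairs following an assumed log-linear law.
macs = np.array([50e6, 200e6, 1e9, 5e9, 15e9])
pesq = 0.4 * np.log10(macs) + 0.1  # hypothetical scaling-law coefficients

# Fit quality = slope * log10(MACs) + intercept.
slope, intercept = np.polyfit(np.log10(macs), pesq, 1)
```

A positive slope on held-out models of varying complexity would support the claim that quality grows linearly in the logarithm of the computational budget.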
Advanced Telecommunications and Signal Processing Program
Contains an introduction and reports on seven research projects.
Sponsors: Advanced Telecommunications Research Program; AT&T Fellowship; GEM Fellowship; U.S. Federal Bureau of Investigation; Lucent Technologies Fellowship; Charles S. Draper Laboratory; U.S. Navy - Office of Naval Research NDSEG Graduate Fellowship
Spectral Reconstruction and Noise Model Estimation Based on a Masking Model for Noise Robust Speech Recognition
An effective way to increase noise robustness in automatic speech recognition (ASR) systems is feature enhancement based on an analytical distortion model that describes the effects of noise on the speech features. One such distortion model that has been reported to achieve a good trade-off between accuracy and simplicity is the masking model. Under this model, speech distortion caused by environmental noise is seen as a spectral mask and, as a result, noisy speech features can be either reliable (speech is not masked by noise) or unreliable (speech is masked). In this paper, we present a detailed overview of this model and its applications to noise robust ASR. Firstly, using the masking model, we derive a spectral reconstruction technique aimed at enhancing the noisy speech features. Two problems must be solved in order to perform spectral reconstruction using the masking model: (1) mask estimation, i.e. determining the reliability of the noisy features, and (2) feature imputation, i.e. estimating speech for the unreliable features. Unlike missing-data imputation techniques, where the two problems are treated as independent, our technique addresses them jointly by exploiting a priori knowledge of the speech and noise sources in the form of a statistical model. Secondly, we propose an algorithm for estimating the noise model required by the feature enhancement technique. The proposed algorithm fits a Gaussian mixture model to the noise by iteratively maximising the likelihood of the noisy speech signal, so that noise can be estimated even during speech-dominated frames. A comprehensive set of experiments carried out on the Aurora-2 and Aurora-4 databases shows that the proposed method achieves significant improvements over the baseline system and other similar missing-data imputation techniques.
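A minimal sketch of the two steps, under simplifying assumptions: the 3 dB reliability margin and the impute-then-cap rule below are illustrative stand-ins, not the paper's GMM-based joint estimator:

```python
import numpy as np

def reconstruct(noisy_db, noise_db, prior_db, margin_db=3.0):
    """Masking-model reconstruction sketch on log-spectral features.
    Step 1 (mask estimation): a bin is reliable when the noisy observation
    exceeds the noise estimate by margin_db.
    Step 2 (feature imputation): unreliable bins take the prior mean,
    capped by the observation (masked speech cannot exceed the noisy bin)."""
    reliable = noisy_db > noise_db + margin_db
    imputed = np.minimum(prior_db, noisy_db)
    return np.where(reliable, noisy_db, imputed), reliable
```

The cap on imputed values encodes the masking model's core constraint: wherever noise dominates, the hidden speech energy is bounded above by what was observed.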
Speech Enhancement Based on LWT and Artificial Neural Network and Using MMSE Estimate of Spectral Amplitude
In this chapter, we detail a new speech enhancement technique based on the Lifting Wavelet Transform (LWT) and an Artificial Neural Network (ANN). This technique also uses the MMSE estimate of the spectral amplitude. In the first step, the LWT is applied to the noisy speech signal in order to obtain two noisy detail coefficients, cD1 and cD2, and one approximation coefficient, cA2. After that, cD1 and cD2 are denoised by soft thresholding, which requires suitable thresholds thr_j, 1 ≤ j ≤ 2. Those thresholds are determined using the ANN. The soft thresholding of cD1 and cD2 yields two denoised coefficients, cDd1 and cDd2. Then the denoising technique based on the MMSE estimate of the spectral amplitude is applied to the noisy approximation cA2 in order to obtain a denoised coefficient, cAd2. Finally, the enhanced speech signal is obtained by applying the inverse LWT, LWT⁻¹, to cDd1, cDd2 and cAd2. The performance of the proposed speech enhancement technique is justified by computing the Signal to Noise Ratio (SNR), Segmental SNR (SSNR) and Perceptual Evaluation of Speech Quality (PESQ).
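The soft-thresholding step applied to cD1 and cD2 is the standard shrinkage rule; the sketch below shows it in isolation, with the ANN-chosen thresholds replaced by a fixed value passed in by the caller:

```python
import numpy as np

def soft_threshold(coeffs, thr):
    """Soft thresholding: shrink each coefficient toward zero by thr,
    zeroing anything whose magnitude falls below thr."""
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - thr, 0.0)
```

In the chapter's pipeline this would be called as cDd1 = soft_threshold(cD1, thr_1) and cDd2 = soft_threshold(cD2, thr_2), with each thr_j predicted by the ANN.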
Speech Enhancement By Exploiting The Baseband Phase Structure Of Voiced Speech For Effective Non-Stationary Noise Estimation
Speech enhancement is one of the most important and challenging issues in the speech communication and signal processing field. It aims to minimize the effect of additive noise on the quality and intelligibility of the speech signal. Speech quality measures how much noise remains after processing and how pleasant the resulting speech sounds, while intelligibility refers to the accuracy of understanding speech. Speech enhancement algorithms are designed to remove the additive noise with minimum speech distortion.
The task of speech enhancement is challenging due to the lack of knowledge about the corrupting noise. Hence, the most challenging task is to estimate the noise which degrades the speech. Several approaches have been adopted for noise estimation, which mainly fall into two categories: single-channel algorithms and multiple-channel algorithms. Accordingly, speech enhancement algorithms are also broadly classified as single- and multiple-channel enhancement algorithms.
In this thesis, speech enhancement is studied in the acoustic and modulation domains, with both amplitude and phase enhancement. We propose a noise estimation technique based on spectral sparsity, detected by using the harmonic property of voiced segments of the speech. We estimate the frame-to-frame phase difference of the clean speech from the available corrupted speech. This estimated frame-to-frame phase difference is used as a means of detecting noise-only frequency bins even in voiced frames. This gives better noise estimation for highly non-stationary noises such as babble, restaurant and subway noise. This noise estimate, with the phase difference as an additional prior, is used to extend the standard spectral subtraction algorithm. We also verify the effectiveness of this noise estimation technique when used with the Minimum Mean Squared Error Short Time Spectral Amplitude Estimator (MMSE STSA) speech enhancement algorithm. The combination of MMSE STSA and spectral subtraction results in further improvement of speech quality.
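The extended algorithm builds on standard magnitude spectral subtraction, which is compact enough to sketch. The over-subtraction factor and spectral floor below are common additions with hypothetical default values, not parameters taken from the thesis:

```python
import numpy as np

def spectral_subtract(noisy_mag, noise_mag, alpha=1.0, floor=0.02):
    """Standard magnitude spectral subtraction: remove the noise estimate,
    then clamp to a fraction of the noisy magnitude so no bin goes
    negative (unchecked, such bins cause audible musical noise)."""
    est = noisy_mag - alpha * noise_mag
    return np.maximum(est, floor * noisy_mag)
```

The thesis's contribution plugs in here: a better noise_mag estimate (from the phase-difference detection of noise-only bins) directly improves the subtraction result.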
On detection of OFDM signals for cognitive radio applications
As the requirement for wireless telecommunications services continues to grow, it has become increasingly important to ensure that the Radio Frequency (RF) spectrum is managed efficiently. As a result of the current spectrum allocation policy, it has been found that portions of RF spectrum belonging to licensed users are often severely underutilised at particular times and geographical locations. Awareness of this problem has led to the development of Dynamic Spectrum Access (DSA) and Cognitive Radio (CR) as possible solutions. In one variation of the shared-use model for DSA, it is proposed that the inefficient use of licensed spectrum could be overcome by enabling unlicensed users to opportunistically access the spectrum when the licensed user is not transmitting. In order for an unlicensed device to make such decisions, it must be aware of its own RF environment and, therefore, it has been proposed that DSA could be enabled using CR. One approach that has been identified to allow the CR to gain information about its operating environment is spectrum sensing. An interesting solution that has been identified for spectrum sensing is cyclostationary detection, which exploits the inherent periodicity of the second-order statistics of many communications signals. One of the most common modulation formats in use today is Orthogonal Frequency Division Multiplexing (OFDM), which exhibits cyclostationarity due to the addition of a Cyclic Prefix (CP). This thesis examines several statistical tests for cyclostationarity in OFDM signals that may be used for spectrum sensing in DSA and CR. In particular, focus is placed on statistical tests that rely on estimation of the Cyclic Autocorrelation Function (CAF). Based on splitting the CAF into two complex component functions, several new statistical tests are introduced and are shown to lead to an improvement in detection performance when compared to the existing algorithms.
The performance of each new algorithm is assessed in Additive White Gaussian Noise (AWGN), in impulsive noise, and when subjected to impairments such as multipath fading and Carrier Frequency Offset (CFO). Finally, each algorithm is targeted for Field Programmable Gate Array (FPGA) implementation using a Xilinx 7 series device. In order to keep resource costs to a minimum, it is suggested that the new algorithms are implemented on the FPGA using hardware sharing, and a simple mathematical re-arrangement of certain test statistics is proposed to circumvent a costly division operation.
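A point estimate of the CAF underlying these tests follows directly from its definition. The symbol and CP lengths below are hypothetical, and the thesis's actual tests build full statistics (including the two complex component functions) on top of such estimates:

```python
import numpy as np

def caf_estimate(x, lag, alpha):
    """Estimate the cyclic autocorrelation R_x(alpha, lag) of x: the mean
    of x[n + lag] * conj(x[n]) demodulated at cyclic frequency alpha
    (in cycles per sample)."""
    n = np.arange(len(x) - lag)
    return np.mean(x[n + lag] * np.conj(x[n]) * np.exp(-2j * np.pi * alpha * n))

# OFDM-like toy signal: the cyclic prefix repeats each symbol's tail,
# so |R_x(0, n_fft)| is clearly non-zero, unlike for white noise.
rng = np.random.default_rng(0)
n_fft, n_cp, n_sym = 64, 16, 100
symbols = rng.standard_normal((n_sym, n_fft)) + 1j * rng.standard_normal((n_sym, n_fft))
ofdm = np.concatenate([np.concatenate([s[-n_cp:], s]) for s in symbols])
noise = rng.standard_normal(ofdm.size) + 1j * rng.standard_normal(ofdm.size)

stat_ofdm = abs(caf_estimate(ofdm, n_fft, 0.0))
stat_noise = abs(caf_estimate(noise, n_fft, 0.0))
```

Thresholding such a statistic against its noise-only distribution gives a basic detector for the CP-induced cyclostationarity; the thesis's refinements improve on exactly this kind of test.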
SwG-former: Sliding-window Graph Convolutional Network Integrated with Conformer for Sound Event Localization and Detection
Sound event localization and detection (SELD) is a joint task of sound event
detection (SED) and direction of arrival (DoA) estimation. SED mainly relies on
temporal dependencies to distinguish different sound classes, while DoA
estimation depends on spatial correlations to estimate source directions. To
jointly optimize two subtasks, the SELD system should extract spatial
correlations and model temporal dependencies simultaneously. However, numerous
models mainly extract spatial correlations and model temporal dependencies
separately. In this paper, the interdependence of spatial-temporal information
in audio signals is exploited for simultaneous extraction to enhance the model
performance. In response, a novel graph representation leveraging graph
convolutional network (GCN) in non-Euclidean space is developed to extract
spatial-temporal information concurrently. A sliding-window graph (SwG) module
is designed based on the graph representation. It exploits sliding windows of different sizes to learn temporal context information and dynamically
constructs graph vertices in the frequency-channel (F-C) domain to capture
spatial correlations. Furthermore, as the cornerstone of message passing, a
robust Conv2dAgg function is proposed and embedded into the SwG module to
aggregate the features of neighbor vertices. To improve the performance of SELD
in a natural spatial acoustic environment, a general and efficient SwG-former
model is proposed by integrating the SwG module with the Conformer. It exhibits
superior performance in comparison to recent advanced SELD models. To further
validate the generality and efficiency of the SwG-former, it is seamlessly integrated into the event-independent network version 2 (EINV2), yielding SwG-EINV2. SwG-EINV2 surpasses state-of-the-art (SOTA) methods under the same acoustic environment.
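The sliding-window vertex construction can be illustrated with plain NumPy. The cosine-similarity k-nearest-neighbour adjacency below is an assumed stand-in for the paper's learned graph construction, and all names are hypothetical:

```python
import numpy as np

def sliding_window_graphs(x, win, k=4):
    """For each non-overlapping window of `win` frames over a
    (time, freq, channel) tensor, flatten the frequency-channel grid into
    F*C vertices (features = the windowed time samples) and connect each
    vertex to its k most similar neighbours by cosine similarity."""
    T, F, C = x.shape
    graphs = []
    for t in range(0, T - win + 1, win):
        v = x[t:t + win].transpose(1, 2, 0).reshape(F * C, win)
        vn = v / (np.linalg.norm(v, axis=1, keepdims=True) + 1e-8)
        sim = vn @ vn.T
        np.fill_diagonal(sim, -np.inf)          # exclude self-loops
        nbrs = np.argsort(sim, axis=1)[:, -k:]  # k nearest neighbours
        adj = np.zeros((F * C, F * C))
        adj[np.repeat(np.arange(F * C), k), nbrs.ravel()] = 1.0
        graphs.append((v, adj))
    return graphs
```

Each resulting adjacency would then drive message passing (the paper's Conv2dAgg) to aggregate neighbour features, letting spatial correlations in the F-C plane and temporal context in the window be captured together.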