40 research outputs found
An Indirect Speech Enhancement Framework Through Intermediate Noisy Speech Targets
Noise presents a severe challenge in speech communication and processing systems. Speech enhancement aims at removing the inference and restoring speech quality. It is an essential step in a speech processing pipeline in many modern electronic devices, such as mobile phones and smart speakers. Traditionally, speech engineers have relied on signal processing techniques, such as spectral subtraction or Wiener filtering. Since the advent of deep learning, data-driven methods have offered an alternative solution to speech enhancement. Researchers and engineers have proposed various neural network architectures to map noisy speech features into clean ones. In this thesis, we refer to this class of mapping based data-driven techniques collectively as a direct method in speech enhancement. The output speech from direct mapping methods usually contains noise residue and unpleasant distortion if the speech power is low relative to the noise power or the background noise is very complex. The former adverse condition refers to low signal-to-noise-ratio (SNR). The latter condition implies difficult noise types. Researchers have proposed improving the SNR of speech signal incrementally during enhancement to overcome such difficulty, known as SNR-progressive speech enhancement. This design breaks down the problem of direct mapping into manageable sub-tasks. Inspired by the previous work, we propose to adopt a multi-stage indirect approach to speech enhancement in challenging noise conditions. Unlike SNR-progressive speech enhancement, we gradually transform noisy speech from difficult background noise to speech in simple noise types. The thesis's focus will include the characterization of background noise, speech transformation techniques, and integration of an indirect speech enhancement system.Ph.D
Recommended from our members
Single Channel auditory source separation with neural network
Although distinguishing different sounds in noisy environment is a relative easy task for human, source separation has long been extremely difficult in audio signal processing. The problem is challenging for three reasons: the large variety of sound type, the abundant mixing conditions and the unclear mechanism to distinguish sources, especially for similar sounds.
In recent years, the neural network based methods achieved impressive successes in various problems, including the speech enhancement, where the task is to separate the clean speech out of the noise mixture. However, the current deep learning based source separator does not perform well on real recorded noisy speech, and more importantly, is not applicable in a more general source separation scenario such as overlapped speech.
In this thesis, we firstly propose extensions for the current mask learning network, for the problem of speech enhancement, to fix the scale mismatch problem which is usually occurred in real recording audio. We solve this problem by combining two additional restoration layers in the existing mask learning network. We also proposed a residual learning architecture for the speech enhancement, further improving the network generalization under different recording conditions. We evaluate the proposed speech enhancement models on CHiME 3 data. Without retraining the acoustic model, the best bi-direction LSTM with residue connections yields 25.13% relative WER reduction on real data and 34.03% WER on simulated data.
Then we propose a novel neural network based model called “deep clustering” for more general source separation tasks. We train a deep network to assign contrastive embedding vectors to each time-frequency region of the spectrogram in order to implicitly predict the segmentation labels of the target spectrogram from the input mixtures. This yields a deep network-based analogue to spectral clustering, in that the embeddings form a low-rank pairwise affinity matrix that approximates the ideal affinity matrix, while enabling much faster performance. At test time, the clustering step “decodes” the segmentation implicit in the embeddings by optimizing K-means with respect to the unknown assignments. Experiments on single channel mixtures from multiple speakers show that a speaker-independent model trained on two-speaker and three speakers mixtures can improve signal quality for mixtures of held-out speakers by an average over 10dB.
We then propose an extension for deep clustering named “deep attractor” network that allows the system to perform efficient end-to-end training. In the proposed model, attractor points for each source are firstly created the acoustic signals which pull together the time-frequency bins corresponding to each source by finding the centroids of the sources in the embedding space, which are subsequently used to determine the similarity of each bin in the mixture to each source. The network is then trained to minimize the reconstruction error of each source by optimizing the embeddings. We showed that this frame work can achieve even better results.
Lastly, we introduce two applications of the proposed models, in singing voice separation and the smart hearing aid device. For the former, a multi-task architecture is proposed, which combines the deep clustering and the classification based network. And a new state of the art separation result was achieved, where the signal to noise ratio was improved by 11.1dB on music and 7.9dB on singing voice. In the application of smart hearing aid device, we combine the neural decoding with the separation network. The system firstly decodes the user’s attention, which is further used to guide the separator for the targeting source. Both objective study and subjective study show the proposed system can accurately decode the attention and significantly improve the user experience
Source Separation in the Presence of Side-information
The source separation problem involves the separation of unknown signals from their mixture. This problem is relevant in a wide range of applications from audio signal processing, communication, biomedical signal processing and art investigation to name a few. There is a vast literature on this problem which is based on either making strong assumption on the source signals or availability of additional data. This thesis proposes new algorithms for source separation with side information where one observes the linear superposition of two source signals plus two additional signals that are correlated with the mixed ones. The first algorithm is based on two ingredients: first, we learn a Gaussian mixture model (GMM) for the joint distribution of a source signal and the corresponding correlated side information signal; second, we separate the signals using standard computationally efficient conditional mean estimators. This also puts forth new recovery guarantees for this source separation algorithm. In particular, under the assumption that the signals can be perfectly described by a GMM model, we characterize necessary and sufficient conditions for reliable source separation in the asymptotic regime of low-noise as a function of the geometry of the underlying signals and their interaction. It is shown that if the subspaces spanned by the innovation components of the source signals with respect to the side information signals have zero intersection, provided that we observe a certain number of linear measurements from the mixture, then we can reliably separate the sources; otherwise we cannot. The second algorithms is based on deep learning where we introduce a novel self-supervised algorithm for the source separation problem. Source separation is intrinsically unsupervised and the lack of training data makes it a difficult task for artificial intelligence to solve. The proposed framework takes advantage of the available data and delivers near perfect separation results in real data scenarios. Our proposed frameworks – which provide new ways to incorporate side information to aid the solution of the source separation problem – are also employed in a real-world art investigation application involving the separation of mixtures of X-Ray images. The simulation results showcase the superiority of our algorithm against other state-of-the-art algorithms
Artificial Intelligence for Multimedia Signal Processing
Artificial intelligence technologies are also actively applied to broadcasting and multimedia processing technologies. A lot of research has been conducted in a wide variety of fields, such as content creation, transmission, and security, and these attempts have been made in the past two to three years to improve image, video, speech, and other data compression efficiency in areas related to MPEG media processing technology. Additionally, technologies such as media creation, processing, editing, and creating scenarios are very important areas of research in multimedia processing and engineering. This book contains a collection of some topics broadly across advanced computational intelligence algorithms and technologies for emerging multimedia signal processing as: Computer vision field, speech/sound/text processing, and content analysis/information mining
A Robust Hybrid Neural Network Architecture for Blind Source Separation of Speech Signals Exploiting Deep Learning
In the contemporary era, blind source separation has emerged as a highly appealing and significant research topic within the field of signal processing. The imperative for the integration of blind source separation techniques within the context of beyond fifth-generation and sixth-generation networks arises from the increasing demand for reliable and efficient communication systems that can effectively handle the challenges posed by high-density networks, dynamic interference environments, and the coexistence of diverse signal sources, thereby enabling enhanced signal extraction and separation for improved system performance. Particularly, audio processing presents a critical domain where the challenge lies in effectively handling files containing a mixture of human speech, silence, and music. Addressing this challenge, speech separation systems can be regarded as a specialized form of human speech recognition or audio signal classification systems that are leveraged to separate, identify, or delineate segments of audio signals encompassing human speech. In various applications such as volume reduction, quality enhancement, detection, and identification, the need arises to separate human speech by eliminating silence, music, or environmental noise from the audio signals. Consequently, the development of robust methods for accurate and efficient speech separation holds paramount importance in optimizing audio signal processing tasks. This study proposes a novel three-way neural network architecture that incorporates transfer learning, a pre-trained dual-path recurrent neural network, and a transformer. In addition to learning the time series associated with audio signals, this network possesses the unique capability of direct context-awareness for modeling the speech sequence within the transformer framework. A comprehensive array of simulations is meticulously conducted to evaluate the performance of the proposed model, which is benchmarked with seven prominent state-of-the-art deep learning-based architectures. The results obtained from these evaluations demonstrate notable advancements in multiple objective metrics. Specifically, our proposed solution showcases an average improvement of 4.60% in terms of short-time objective intelligibility, 14.84% in source-to-distortion ratio, and 9.87% in scale-invariant signal-to-noise ratio. These extraordinary advancements surpass those achieved by the nearest rival, namely the dual-path recurrent neural network time-domain audio separation network, firmly establishing the superiority of our proposed model's performance
Deep Learning Based Speech Enhancement and Its Application to Speech Recognition
Speech enhancement is the task that aims to improve the quality and the intelligibility of a speech signal that is degraded by ambient noise and room reverberation. Speech enhancement algorithms are used extensively in many audio- and communication systems, including mobile handsets, speech recognition, speaker verification systems and hearing aids. Recently, deep learning has achieved great success in many applications, such as computer vision, nature language processing and speech recognition. Speech enhancement methods have been introduced that use deep-learning techniques, as these techniques are capable of learning complex hierarchical functions using large-scale training data. This dissertation investigates the deep learning based speech enhancement and its application to robust Automatic Speech Recognition (ASR).
We start our work by exploring generative adversarial network (GAN) based speech enhancement. We explore the techniques to extract information about the noise to aid in the reconstruction of the speech signals. The proposed framework, referred to as ForkGAN, is a novel general adversarial learning-based framework that combines deep-learning with conventional noise reduction techniques. We further extend ForkGAN to M-ForkGAN, which integrates feature mapping and mask learning into a unified framework using ForkGAN. Another variant of ForkGAN, named S-ForkGAN, operates on spectral-domain features, which could directly apply to ASR. Systematic evaluations demonstrate the effectiveness of the proposed approaches.
Then, we propose a novel multi-stage learning speech enhancement system. Each stage comprises a self-attention (SA) block followed by stacks of temporal convolutional network (TCN) blocks with doubling dilation factors. Each stage generates a prediction that is refined in a subsequent stage. A fusion block is inserted at the input of later stages to re-inject original information. Moreover, we design several multi-scale architectures with perceptual loss. Experiments show that our proposed architectures can achieve the state of the art performance on several public datasets.
Recently, modeling to learn the acoustic noisy-clean speech mapping has been enhanced by including auxiliary information such as visual cues, phonetic and linguistic information, and speaker information. We propose a novel speaker-aware speech enhancement (SASE) method that extracts speaker information from a clean reference using long short-term memory (LSTM) layers, and then uses a convolutional recurrent neural network (CRN) to embed the extracted speaker information. The SASE framework is extended with a self-attention mechanism. It is shown that a few seconds of clean reference speech is sufficient, and that the proposed SASE method performs well for a wide range of scenarios.
Even though speech enhancement methods that are based on deep learning have demonstrated state-of-the-art performance when compared with conventional methodologies, current deep learning approaches heavily rely on supervised learning, which requires a large number of noisy- and clean-speech sample pairs for training. This is generally not practical in a realistic environment. One cannot simultaneously obtain both noisy and clean speech samples. Thus, most speech enhancement approaches are trained with simulated speech and clean targets. In addition, it would be hard to collect large-scale dataset for the low-resource languages. We propose a novel noise-to-noise speech enhancement (N2N-SE) method that addresses the parallel noisy-clean training data issue, we leverage signal reconstruction techniques by only using corrupted speech. The proposed N2N-SE framework includes a noise conversion module that is an auto-encoder that learns to mix noise with speech, and a speech enhancement module, that learns to reconstruct corrupted speech signals.
In addition to additive noise, speech is also affected by reverberation, which is caused by the attenuated and delayed reflections of sound waves. These distortions, particularly when combined, can severely degrade speech intelligibility for human listeners and impact applications, e.g., automatic speech recognition (ASR) and speaker recognition. Thus, effective speech denoising and dereverberation will benefit both speech processing applications and human listeners. We investigate the deep-learning based approaches for both speech dereverberation and speech denoising using the cascade Conformer architecture. The experimental results show that the proposed cascade Conformer can be effective to suppress the noise and reverberation