Parallel Gated Neural Network With Attention Mechanism For Speech Enhancement
Deep learning algorithms are increasingly used for speech enhancement (SE). In
supervised methods, both global and local information is required for accurate
spectral mapping, yet a common limitation is poor capture of key contextual
information. To leverage long-term context for target speakers and to compensate
for distortions in the cleaned speech, this paper adopts a sequence-to-sequence (S2S)
mapping structure and proposes a novel monaural speech enhancement system,
consisting of a Feature Extraction Block (FEB), a Compensation Enhancement
Block (ComEB), and a Mask Block (MB). In the FEB, a U-Net block extracts
abstract features from complex-valued spectra; one path suppresses the
background noise in the magnitude domain using masking methods, while the MB
takes magnitude features from the FEB and compensates for the lost
complex-domain information using features produced by the ComEB to restore the
final cleaned speech. Experiments are conducted on the Librispeech dataset,
and results show that the proposed model obtains better performance than
recent models in terms of ESTOI and PESQ scores. Comment: 5 pages, 6 figures,
references added
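The magnitude-domain masking step described above can be illustrated with a minimal sketch (the function name and the oracle mask here are hypothetical; in a real system such as the one proposed, a network predicts the mask):

```python
import numpy as np

def apply_magnitude_mask(noisy_stft, mask):
    """Suppress noise by scaling each time-frequency bin's magnitude
    while keeping the noisy phase (a common masking-based SE step)."""
    magnitude = np.abs(noisy_stft)
    phase = np.angle(noisy_stft)
    enhanced_magnitude = mask * magnitude          # mask values in [0, 1]
    return enhanced_magnitude * np.exp(1j * phase)

# Toy example: a 2-bin, 2-frame complex spectrogram and a hand-built mask.
noisy = np.array([[1 + 1j, 2 + 0j],
                  [0 + 2j, 1 - 1j]])
mask = np.array([[1.0, 0.5],
                 [0.0, 1.0]])
enhanced = apply_magnitude_mask(noisy, mask)
```

Because only the magnitude is modified, any residual phase distortion remains; complex-domain compensation of the kind the ComEB provides targets exactly that gap.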
Model-based Speech Enhancement for Intelligibility Improvement in Binaural Hearing Aids
Speech intelligibility is often severely degraded among hearing impaired
individuals in situations such as the cocktail party scenario. The performance
of the current hearing aid technology has been observed to be limited in these
scenarios. In this paper, we propose a binaural speech enhancement framework
that takes into consideration the speech production model. The enhancement
framework proposed here is based on the Kalman filter that allows us to take
the speech production dynamics into account during the enhancement process. The
usage of a Kalman filter requires the estimation of clean speech and noise
short term predictor (STP) parameters, and the clean speech pitch parameters.
In this work, a binaural codebook-based method is proposed for estimating the
STP parameters, and a directional pitch estimator based on the harmonic model
and maximum likelihood principle is used to estimate the pitch parameters. The
proposed method for estimating the STP and pitch parameters jointly uses the
information from left and right ears, leading to a more robust estimation of
the filter parameters. Objective measures such as PESQ and STOI have been used
to evaluate the enhancement framework in different acoustic scenarios
representative of the cocktail party scenario. We have also conducted
subjective listening tests on a set of nine normal hearing subjects, to
evaluate the performance in terms of intelligibility and quality improvement.
The listening tests show that the proposed algorithm, even with access to only
a single channel noisy observation, significantly improves the overall speech
quality and the speech intelligibility by up to 15%. Comment: after revision
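The core of such a Kalman-filter enhancement loop can be sketched for the simplest case, an AR(1) short-term predictor with known parameters (the paper estimates full STP and pitch parameters from binaural codebooks; everything below is an illustrative simplification):

```python
import numpy as np

def kalman_denoise(y, a, q, r):
    """Minimal scalar Kalman filter: the clean speech sample s[n] is
    modelled by a one-coefficient short-term predictor
    s[n] = a*s[n-1] + w[n], observed as y[n] = s[n] + v[n].
    q and r are the excitation and observation-noise variances."""
    s_hat, p = 0.0, 1.0
    out = []
    for obs in y:
        # Predict from the speech production (AR) model.
        s_pred = a * s_hat
        p_pred = a * a * p + q
        # Update with the noisy observation.
        k = p_pred / (p_pred + r)          # Kalman gain
        s_hat = s_pred + k * (obs - s_pred)
        p = (1.0 - k) * p_pred
        out.append(s_hat)
    return np.array(out)

# Usage: denoise a synthetic AR(1) "speech" signal in white noise.
rng = np.random.default_rng(0)
n = 2000
s = np.zeros(n)
for i in range(1, n):
    s[i] = 0.95 * s[i - 1] + rng.normal(scale=1.0)
y = s + rng.normal(scale=2.0, size=n)
s_hat = kalman_denoise(y, a=0.95, q=1.0, r=4.0)
```

The proposed framework additionally models pitch and uses higher-order predictors estimated jointly from both ears, but the predict/update structure is the same.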
Two-stage binaural speech enhancement with Wiener filter for high-quality speech communication
Speech enhancement has been researched extensively for many years to provide high-quality speech communication in the presence of background noise and concurrent interference signals. Human listeners are robust to these acoustic interferences using only two ears, whereas state-of-the-art two-channel algorithms perform poorly. Motivated by psychoacoustic studies of binaural hearing (equalization–cancellation (EC) theory), in this paper we propose a two-stage binaural speech enhancement with Wiener filter (TS-BASE/WF) approach, a two-input two-output system. In the proposed TS-BASE/WF, interference signals are first estimated by equalizing and cancelling the target signal in a way inspired by the EC theory; a time-variant Wiener filter is then applied to enhance the target signal given the noisy mixture signals. The main advantages of the proposed TS-BASE/WF are (1) effectiveness in dealing with non-stationary multiple-source interference signals, and (2) success in preserving binaural cues after processing. These advantages were confirmed by comprehensive objective and subjective evaluations in different acoustical spatial configurations, in terms of both speech enhancement and binaural cue preservation.
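A time-variant Wiener filter of this kind reduces to a per-bin gain G = ξ/(1+ξ) on the a-priori SNR ξ. A minimal sketch, assuming a noise PSD estimate (as the EC-based first stage would supply) and using simple spectral subtraction for ξ — an illustrative choice, not the paper's exact estimator:

```python
import numpy as np

def wiener_gain(noisy_psd, noise_psd, floor=1e-3):
    """Per-bin Wiener gain G = snr / (1 + snr), with the a-priori SNR
    estimated by spectral subtraction; a gain floor limits musical noise."""
    snr = np.maximum(noisy_psd - noise_psd, 0.0) / np.maximum(noise_psd, 1e-12)
    return np.maximum(snr / (1.0 + snr), floor)

# Example: a speech-dominated bin gets a gain near 1,
# while a noise-only bin is attenuated to the floor.
noisy_psd = np.array([10.0, 1.0])
noise_psd = np.array([1.0, 1.0])
g = wiener_gain(noisy_psd, noise_psd)
```

Applying the same real-valued gain to the left and right channels is one simple way to preserve interaural cues, since the interaural level and phase differences of each bin are left untouched.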
Towards An Intelligent Fuzzy Based Multimodal Two Stage Speech Enhancement System
This thesis presents a novel two stage multimodal speech enhancement system, making use of both visual and audio information to filter speech, and explores the extension of
this system with the use of fuzzy logic to demonstrate proof of concept for an envisaged autonomous, adaptive, and context aware multimodal system. The design of the proposed cognitively inspired framework is scalable, meaning that it is possible for the techniques used in individual parts of the system to be upgraded and there is scope for the initial framework presented here to be expanded.
In the proposed system, the concept of single modality two stage filtering is extended to include the visual modality. Noisy speech information received by a microphone array is first pre-processed by visually derived Wiener filtering, employing the novel use of the Gaussian Mixture Regression (GMR) technique and making use of associated visual speech information extracted with a state of the art Semi Adaptive Appearance Models (SAAM) based lip tracking approach. This pre-processed speech is then enhanced further by audio-only beamforming using a state of the art Transfer Function Generalised Sidelobe Canceller (TFGSC) approach. The resulting system is designed to function in challenging noisy speech environments (using speech sentences with different speakers from the GRID corpus and a range of noise recordings), and both objective and subjective test results (employing the widely used Perceptual Evaluation of Speech Quality (PESQ) measure, a composite objective measure, and subjective listening tests) show that this initial system is capable of delivering very encouraging results with regard to filtering speech mixtures in difficult reverberant speech environments.
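Visually derived Wiener filtering rests on Gaussian Mixture Regression mapping visual features to spectral estimates. A pure-NumPy sketch of the GMR prediction step, with hand-built mixture parameters standing in for a model fitted to audio-visual training data (all names and values here are illustrative):

```python
import numpy as np

def gmr_predict(weights, means, covs, x, dx):
    """Gaussian Mixture Regression: given mixture parameters over joint
    vectors [x, y] (x = first dx dims), return E[y | x] as a
    responsibility-weighted sum of per-component conditional means."""
    resp, cond_means = [], []
    for w, mu, s in zip(weights, means, covs):
        mu_x, mu_y = mu[:dx], mu[dx:]
        s_xx, s_yx = s[:dx, :dx], s[dx:, :dx]
        diff = x - mu_x
        # Marginal likelihood of x under this component (responsibility).
        norm = np.exp(-0.5 * diff @ np.linalg.solve(s_xx, diff)) \
               / np.sqrt(((2 * np.pi) ** dx) * np.linalg.det(s_xx))
        resp.append(w * norm)
        # Conditional mean: mu_y + S_yx S_xx^{-1} (x - mu_x).
        cond_means.append(mu_y + s_yx @ np.linalg.solve(s_xx, diff))
    resp = np.array(resp) / np.sum(resp)
    return np.sum(resp[:, None] * np.array(cond_means), axis=0)

# Two hand-built 2-D components (1-D visual feature x, 1-D spectral target y).
weights = [0.5, 0.5]
means = [np.array([0.0, 0.0]), np.array([4.0, 8.0])]
covs = [np.array([[1.5, 0.5], [0.5, 1.5]])] * 2   # correlated x and y
y_hat = gmr_predict(weights, means, covs, np.array([4.0]), dx=1)
```

A query near the second component's visual mean yields a prediction near that component's spectral mean, which is exactly the soft lookup behaviour the visually derived filter exploits.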
Some limitations of this initial framework are identified, and the extension of this multimodal system is explored through the development of a fuzzy logic based framework and a proof of concept demonstration. Results show that this proposed autonomous, adaptive, and context aware multimodal framework is capable of delivering very positive results in difficult noisy speech environments, with cognitively inspired use of audio and visual information depending on environmental conditions. Finally, some concluding remarks are made, along with proposals for future work.