Spectral analysis for nonstationary audio
A new approach for the analysis of nonstationary signals is proposed, with a
focus on audio applications. Following earlier contributions, nonstationarity
is modeled via stationarity-breaking operators acting on Gaussian stationary
random signals. The focus is on time warping and amplitude modulation, and an
approximate maximum-likelihood approach based on suitable approximations in the
wavelet transform domain is developed. This paper provides theoretical analysis
of the approximations, and introduces JEFAS, a corresponding estimation
algorithm. The latter is tested and validated on synthetic as well as real
audio signals.
Comment: IEEE/ACM Transactions on Audio, Speech and Language Processing, Institute of Electrical and Electronics Engineers, in press.
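For concreteness, the deformation model described above can be written schematically as follows; the symbol names are chosen here for illustration and are not taken from the paper:

```latex
% Stationarity-breaking model: a stationary Gaussian signal x is
% time-warped by a smooth increasing function \gamma and amplitude-
% modulated by a slowly varying positive function a.
\[
  y(t) = a(t)\, x\bigl(\gamma(t)\bigr), \qquad \gamma'(t) > 0,\ a(t) > 0 .
\]
% JEFAS-style estimation then recovers \gamma' and a from suitable
% approximations of the wavelet transform of y.
```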
Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement
Phase information has a significant impact on speech perceptual quality and
intelligibility. However, existing speech enhancement methods encounter
limitations in explicit phase estimation due to the non-structural nature and
wrapping characteristics of the phase, leading to a bottleneck in enhanced
speech quality. To overcome this issue, in this paper we propose
MP-SENet, a novel Speech Enhancement Network which explicitly enhances
Magnitude and Phase spectra in parallel. The proposed MP-SENet adopts a codec
architecture in which the encoder and decoder are bridged by time-frequency
Transformers along both time and frequency dimensions. The encoder aims to
encode time-frequency representations derived from the input distorted
magnitude and phase spectra. The decoder comprises dual-stream magnitude and
phase decoders, directly enhancing magnitude and wrapped phase spectra by
incorporating a magnitude estimation architecture and a phase parallel
estimation architecture, respectively. To train the MP-SENet model effectively,
we define multi-level loss functions, including mean square error and
perceptual metric loss of magnitude spectra, anti-wrapping loss of phase
spectra, as well as mean square error and consistency loss of short-time
complex spectra. Experimental results demonstrate that our proposed MP-SENet
excels in high-quality speech enhancement across multiple tasks, including
speech denoising, dereverberation, and bandwidth extension. Compared to
existing phase-aware speech enhancement methods, it successfully avoids the
bidirectional compensation effect between the magnitude and phase, leading to
better harmonic restoration. Notably, for the speech denoising task, MP-SENet
yields state-of-the-art performance with a PESQ of 3.60 on the public
VoiceBank+DEMAND dataset.
Comment: Submitted to IEEE Transactions on Audio, Speech and Language Processing.
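As a rough illustration of the anti-wrapping idea mentioned above, the sketch below penalises phase errors modulo 2π, so that a discrepancy of exactly 2π costs nothing. The function and tensor names are assumptions of this example; the paper's full objective combines this kind of term with the magnitude and consistency losses listed in the abstract.

```python
import torch

def anti_wrapping(delta: torch.Tensor) -> torch.Tensor:
    # Fold a phase difference back into (-pi, pi] before taking its
    # magnitude, making the penalty insensitive to 2*pi phase wraps.
    return torch.abs(delta - 2 * torch.pi * torch.round(delta / (2 * torch.pi)))

def phase_loss(phase_est: torch.Tensor, phase_ref: torch.Tensor) -> torch.Tensor:
    # Illustrative instantaneous-phase term of an anti-wrapping loss.
    return anti_wrapping(phase_est - phase_ref).mean()
```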
SkipConvGAN: Monaural Speech Dereverberation using Generative Adversarial Networks via Complex Time-Frequency Masking
With the advancements in deep learning approaches, the performance of speech
enhancement systems in the presence of background noise has shown significant
improvements. However, improving robustness against reverberation
is still a work in progress, as reverberation tends to cause loss of formant
structure due to smearing effects in time and frequency. A wide range of deep
learning-based systems either enhance the magnitude response and reuse the
distorted phase or enhance the complex spectrogram using a complex time-frequency
mask. Though these approaches have demonstrated satisfactory performance, they
do not directly address the lost formant structure caused by reverberation. We
believe that retrieving the formant structure can help improve the efficiency
of existing systems. In this study, we propose SkipConvGAN - an extension of
our prior work SkipConvNet. The proposed system's generator network tries to
estimate an efficient complex time-frequency mask, while the discriminator
network aids in driving the generator to restore the lost formant structure. We
evaluate the performance of our proposed system on simulated and real
recordings of reverberant speech from the single-channel task of the REVERB
challenge corpus. The proposed system shows a consistent improvement across
multiple room configurations over other deep learning-based generative
adversarial frameworks.
Comment: Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing (Volume 30).
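A minimal sketch of the complex time-frequency masking operation that the generator estimates, assuming a complex-valued mask of the same shape as the noisy STFT (the names are illustrative, not the paper's):

```python
import numpy as np

def apply_complex_mask(noisy_stft: np.ndarray, mask: np.ndarray) -> np.ndarray:
    # Element-wise complex product per time-frequency bin: the mask's
    # modulus rescales each bin's magnitude and its argument rotates the
    # phase, unlike a real-valued mask, which can only rescale magnitude.
    assert noisy_stft.shape == mask.shape
    return mask * noisy_stft
```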
Deep learning for speech enhancement : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Albany, New Zealand
Speech enhancement, aiming at improving the intelligibility and overall perceptual quality of a contaminated speech signal, is an effective way to improve speech communications. In this thesis, we propose three novel deep learning methods to improve speech enhancement performance.
Firstly, we propose an adversarial latent representation learning for latent space exploration of generative adversarial network based speech enhancement. Based on adversarial feature learning, this method employs an extra encoder to learn an inverse mapping from the generated data distribution to the latent space. The encoder establishes an inner connection with the generator and contributes to latent information learning.
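A minimal sketch of the inverse-mapping idea in this first method, with toy fully connected networks standing in for the thesis's actual generator and encoder architectures:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim, feat_dim = 128, 512  # illustrative sizes only
G = nn.Sequential(nn.Linear(latent_dim, feat_dim), nn.Tanh())  # generator: z -> enhanced features
E = nn.Linear(feat_dim, latent_dim)                            # encoder: generated features -> z

z = torch.randn(16, latent_dim)
z_rec = E(G(z))
# Training E to invert G ties the generated distribution back to the
# latent space: the "inner connection" described above.
latent_loss = F.mse_loss(z_rec, z)
```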
Secondly, we propose an adversarial multi-task learning with inverse mappings method for effective speech representation. This speech enhancement method focuses on enhancing the generator's capability of speech information capture and representation learning. To implement this method, two extra networks are developed to learn the inverse mappings from the generated distribution to the input data domains.
Thirdly, we propose a self-supervised learning based phone-fortified method to improve specific speech characteristics learning for speech enhancement. This method explicitly imports phonetic characteristics into a deep complex convolutional network via a contrastive predictive coding model pre-trained with self-supervised learning.
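The contrastive predictive coding component of this third method can be sketched with the standard InfoNCE objective; treating the rest of the batch as negatives, and the names below, are assumptions of this example rather than the thesis's exact setup:

```python
import torch
import torch.nn.functional as F

def info_nce(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # pred[i] is the context-based prediction of the future encoding
    # target[i]; all other rows of target act as negatives, so each
    # row of the similarity matrix should score highest on the diagonal.
    logits = pred @ target.t()
    labels = torch.arange(pred.size(0), device=pred.device)
    return F.cross_entropy(logits, labels)
```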
The experimental results demonstrate that the proposed methods outperform previous speech enhancement methods and achieve state-of-the-art performance in terms of speech intelligibility and overall perceptual quality.
Mask-based enhancement of very noisy speech
When speech is contaminated by high levels of additive noise, both its perceptual quality and its intelligibility are reduced. Studies show that conventional approaches to speech enhancement are able to improve quality but not intelligibility. However, in recent years, algorithms that estimate a time-frequency mask from noisy speech using a supervised machine learning approach and then apply this mask to the noisy speech have been shown to be capable of improving intelligibility.
The most direct way of measuring intelligibility is to carry out listening tests with human test subjects. However, in situations where listening tests are impractical and where some additional uncertainty in the results is permissible, for example during the development phase of a speech enhancer, intrusive intelligibility metrics can provide an alternative to listening tests. This thesis begins by outlining a new intrusive intelligibility metric, WSTOI, that is a development of the existing STOI metric. WSTOI improves STOI by weighting the intelligibility contributions of different time-frequency regions with an estimate of their intelligibility content. The prediction accuracies of WSTOI and STOI are compared for a range of noises and noise suppression algorithms and it is found that WSTOI outperforms STOI in all tested conditions.
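Schematically, WSTOI replaces STOI's uniform average of per-region intelligibility contributions with a weighted one; the array names and shapes below are illustrative:

```python
import numpy as np

def weighted_intelligibility(d: np.ndarray, w: np.ndarray) -> float:
    # d[j, m]: intelligibility contribution of frequency band j, frame m
    # w[j, m]: estimated intelligibility content of that region
    # Uniform weights recover the plain STOI-style average.
    return float(np.sum(w * d) / np.sum(w))
```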
The thesis then investigates the best choice of mask-estimation algorithm, target mask, and method of applying the estimated mask. A new target mask, the HSWOBM, is proposed that optimises a modified version of WSTOI with a higher frequency resolution. The HSWOBM is optimised for a stochastic noise signal to encourage a mask estimator trained on it to generalise better to unseen noise conditions. A high-frequency-resolution version of WSTOI is optimised because this gives improvements in predicted quality compared with optimising WSTOI itself. Of the tested approaches to target mask estimation, the best-performing approach uses a feed-forward neural network with a loss function based on WSTOI. The best-performing feature set is based on the gains produced by a classical speech enhancer and an estimate of the local voiced-speech-plus-noise to noise ratio in different time-frequency regions, which is obtained with the aid of a pitch estimator.
When the estimated target mask is applied in the conventional way, by multiplying the speech by the mask in the time-frequency domain, it can result in speech with very poor perceptual quality. The final chapter of this thesis therefore investigates alternative approaches to applying the estimated mask to the noisy speech, in order to improve both intelligibility and quality. An approach is developed that uses the mask to supply prior information about the speech presence probability to a classical speech enhancer that minimises the expected squared error in the log spectral amplitudes. The proposed end-to-end enhancer outperforms existing algorithms in terms of predicted quality and intelligibility for most noise types.
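For reference, the conventional mask application criticised above is just a time-frequency multiplication followed by resynthesis. A minimal sketch follows; the parameter choices are illustrative, and the thesis instead feeds the mask to a classical enhancer as a speech-presence prior:

```python
import numpy as np
from scipy.signal import stft, istft

def apply_mask(noisy: np.ndarray, mask: np.ndarray, fs: int = 16000) -> np.ndarray:
    # Conventional application: scale each noisy STFT bin by the mask
    # and invert; the thesis shows this can leave poor perceptual quality.
    _, _, Y = stft(noisy, fs=fs, nperseg=512)
    assert mask.shape == Y.shape  # mask must match the T-F grid
    _, enhanced = istft(mask * Y, fs=fs, nperseg=512)
    return enhanced
```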
Behavioural and neural insights into the recognition and motivational salience of familiar voice identities
The majority of voices encountered in everyday life belong to people we know, such as close friends, relatives, or romantic partners. However, research to date has overlooked this type of familiarity when investigating voice identity perception. This thesis aimed to address this gap in the literature, through a detailed investigation of voice perception across different types of familiarity: personally familiar voices, famous voices, and lab-trained voices. The experimental chapters of the thesis cover two broad research topics: 1) Measuring the recognition and representation of personally familiar voice identities in comparison with lab-trained identities, and 2) Investigating motivation and reward in relation to hearing personally valued voices compared with unfamiliar voice identities. In the first of these, an exploration of the extent of human voice recognition capabilities was undertaken using personally familiar voices of romantic partners. The perceptual benefits of personal familiarity for voice and speech perception were examined, as well as an investigation into how voice identity representations are formed through exposure to new voice identities. Evidence for highly robust voice representations for personally familiar voices was found in the face of perceptual challenges, which greatly exceeded those found for lab-trained voices of varying levels of familiarity. Conclusions are drawn about the relevance of the amount and type of exposure on speaker recognition, the expertise we have with certain voices, and the framing of familiarity as a continuum rather than a binary categorisation. The second topic utilised voices of famous singers and their “super-fans” as listeners to probe reward and motivational responses to hearing these valued voices, using behavioural and neuroimaging experiments. Listeners were found to work harder, as evidenced by faster reaction times, to hear their musical idol compared to less valued voices in an effort-based decision-making task, and the neural correlates of these effects are reported and examined.