Neural Speech Phase Prediction based on Parallel Estimation Architecture and Anti-Wrapping Losses
This paper presents a novel speech phase prediction model which predicts
wrapped phase spectra directly from amplitude spectra by neural networks. The
proposed model is a cascade of a residual convolutional network and a parallel
estimation architecture. The parallel estimation architecture is composed of
two parallel linear convolutional layers and a phase calculation formula,
imitating the process of calculating the phase spectra from the real and
imaginary parts of complex spectra and strictly restricting the predicted phase
values to the principal value interval. To avoid the error expansion issue
caused by phase wrapping, we design anti-wrapping training losses defined
between the predicted wrapped phase spectra and natural ones by activating the
instantaneous phase error, group delay error and instantaneous angular
frequency error using an anti-wrapping function. Experimental results show that
our proposed neural speech phase prediction model outperforms the iterative
Griffin-Lim algorithm and other neural network-based methods in terms of both
reconstructed speech quality and generation speed.
Comment: Accepted by ICASSP 2023. Codes are available.
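The two ingredients described above can be sketched in NumPy. This is an illustrative reading of the abstract, not the authors' code: the function names, the equal weighting of the three loss terms, and the (frames, bins) array layout are all assumptions.

```python
import numpy as np

def parallel_estimation(pseudo_real, pseudo_imag):
    # Phase calculation formula: atan2 on the two parallel-layer outputs
    # strictly restricts the result to the principal interval (-pi, pi].
    return np.arctan2(pseudo_imag, pseudo_real)

def anti_wrap(x):
    # Anti-wrapping function: distance from x to the nearest multiple of
    # 2*pi, so a 2*pi jump between predicted and natural phase costs nothing.
    return np.abs(x - 2 * np.pi * np.round(x / (2 * np.pi)))

def anti_wrapping_loss(pred, true):
    # Instantaneous phase error, group delay error (difference along the
    # frequency axis), and instantaneous angular frequency error (difference
    # along the time axis), each activated by the anti-wrapping function.
    # Equal weighting is an assumption. Shapes: (frames, freq_bins).
    ip = anti_wrap(pred - true).mean()
    gd = anti_wrap(np.diff(pred, axis=1) - np.diff(true, axis=1)).mean()
    iaf = anti_wrap(np.diff(pred, axis=0) - np.diff(true, axis=0)).mean()
    return ip + gd + iaf
```

Note how `anti_wrapping_loss(p + 2*np.pi, p)` is (numerically) zero: a prediction that differs from the target only by full wraps is not penalized, which is exactly the error-expansion issue the losses are designed to avoid.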
A Deep Generative Model of Speech Complex Spectrograms
This paper proposes an approach to the joint modeling of the short-time
Fourier transform magnitude and phase spectrograms with a deep generative
model. We assume that the magnitude follows a Gaussian distribution and the
phase follows a von Mises distribution. To improve the consistency of the phase
values in the time-frequency domain, we also apply the von Mises distribution
to the phase derivatives, i.e., the group delay and the instantaneous
frequency. Based on these assumptions, we explore and compare several
combinations of loss functions for training our models. Built upon the
variational autoencoder framework, our model consists of three convolutional
neural networks acting as an encoder, a magnitude decoder, and a phase decoder.
In addition to the latent variables, we propose to also condition the phase
estimation on the estimated magnitude. Evaluated on a time-domain speech
reconstruction task, our models can generate speech with high perceptual
quality and high intelligibility.
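A minimal sketch of the von Mises negative log-likelihood such a phase decoder could be trained with. The closed-form NumPy version below is an assumption for illustration; in the paper the mean and concentration would be produced by the decoder networks, and the same loss form would also apply to the group delay and instantaneous frequency.

```python
import numpy as np

def von_mises_nll(theta, mu, kappa):
    # p(theta; mu, kappa) = exp(kappa * cos(theta - mu)) / (2*pi * I0(kappa)),
    # where I0 is the modified Bessel function of order 0 (np.i0).
    # The density depends on theta - mu only through cos(), so the loss is
    # invariant to 2*pi wrapping -- the key property for phase targets.
    return -kappa * np.cos(theta - mu) + np.log(2 * np.pi * np.i0(kappa))
```

A Gaussian loss on raw phase would penalize a harmless 2*pi jump heavily; the von Mises form does not, which is why it suits circular quantities like phase and its derivatives.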
Long-frame-shift Neural Speech Phase Prediction with Spectral Continuity Enhancement and Interpolation Error Compensation
Speech phase prediction, which is a significant research focus in the field
of signal processing, aims to recover speech phase spectra from
amplitude-related features. However, existing speech phase prediction methods
are constrained to recovering phase spectra with short frame shifts, which are
considerably smaller than the theoretical upper bound required for exact
waveform reconstruction from the short-time Fourier transform (STFT). To tackle this
issue, we present a novel long-frame-shift neural speech phase prediction
(LFS-NSPP) method which enables precise prediction of long-frame-shift phase
spectra from long-frame-shift log amplitude spectra. The proposed method
consists of three stages: interpolation, prediction and decimation. The
short-frame-shift log amplitude spectra are first constructed from
long-frame-shift ones through frequency-by-frequency interpolation to enhance
the spectral continuity, and then employed to predict short-frame-shift phase
spectra using an NSPP model, thereby compensating for interpolation errors.
Ultimately, the long-frame-shift phase spectra are obtained from
short-frame-shift ones through frame-by-frame decimation. Experimental results
show that the proposed LFS-NSPP method yields higher quality in predicting
long-frame-shift phase spectra than the original NSPP model and other
signal-processing-based phase estimation algorithms.
Comment: Published in IEEE Signal Processing Letters.
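The three-stage pipeline can be sketched as below. Linear interpolation along the frame axis and the factor-of-`factor` decimation are assumptions about the details; `nspp_model` is a placeholder for the trained short-frame-shift phase predictor.

```python
import numpy as np

def interpolate_frames(log_amp_long, factor):
    # Stage 1: frequency-by-frequency interpolation along the frame axis,
    # turning long-frame-shift log amplitudes of shape (T, F) into
    # short-frame-shift ones of shape (factor*(T-1)+1, F). Linear
    # interpolation here is an assumption, chosen to enhance continuity.
    T, F = log_amp_long.shape
    t_long = np.arange(T)
    t_short = np.arange((T - 1) * factor + 1) / factor
    return np.stack([np.interp(t_short, t_long, log_amp_long[:, f])
                     for f in range(F)], axis=1)

def decimate_frames(phase_short, factor):
    # Stage 3: frame-by-frame decimation back to the long frame shift.
    return phase_short[::factor]

def lfs_nspp(log_amp_long, nspp_model, factor):
    # Stage 2: the NSPP model predicts short-frame-shift phase spectra
    # from the interpolated short-frame-shift log amplitudes.
    amp_short = interpolate_frames(log_amp_long, factor)
    phase_short = nspp_model(amp_short)
    return decimate_frames(phase_short, factor)
```

With an identity stand-in for `nspp_model`, the decimated output lands back on the original frame positions, which shows that stages 1 and 3 are inverse resamplings around the prediction step.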
A Flexible Online Framework for Projection-Based STFT Phase Retrieval
Several recent contributions in the field of iterative STFT phase retrieval
have demonstrated that the performance of the classical Griffin-Lim method can
be considerably improved upon. By using the same projection operators as
Griffin-Lim, but combining them in innovative ways, these approaches achieve
better results in terms of both reconstruction quality and required number of
iterations, while retaining a similar computational complexity per iteration.
However, like Griffin-Lim, these algorithms operate in an offline manner and
thus require an entire spectrogram as input, which is an unrealistic
requirement for many real-world speech communication applications. We propose
to extend RTISI -- an existing online (frame-by-frame) variant of the
Griffin-Lim algorithm -- into a flexible framework that enables straightforward
online implementation of any algorithm based on iterative projections. We
further employ this framework to implement online variants of the fast
Griffin-Lim algorithm, the accelerated Griffin-Lim algorithm, and two
algorithms from the optics domain. Evaluation results on speech signals show
that, similarly to the offline case, these algorithms can achieve a
considerable performance gain compared to RTISI.
Comment: Submitted to ICASSP 2
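For context, the two projection operators that all of these iterative methods combine can be sketched as follows. This is a plain offline Griffin-Lim loop, not the proposed online framework; the Hann window, frame and hop sizes, and iteration count are illustrative choices, and a real implementation would use a library STFT.

```python
import numpy as np

FRAME, HOP = 256, 64  # illustrative window length and hop size

def stft(x):
    win = np.hanning(FRAME)
    n_frames = 1 + (len(x) - FRAME) // HOP
    return np.stack([np.fft.rfft(win * x[i * HOP:i * HOP + FRAME])
                     for i in range(n_frames)])

def istft(S):
    # Overlap-add with the same window, normalized by the summed
    # squared window so that istft(stft(x)) recovers interior samples.
    win = np.hanning(FRAME)
    n = (len(S) - 1) * HOP + FRAME
    x, norm = np.zeros(n), np.zeros(n)
    for i, frame in enumerate(S):
        x[i * HOP:i * HOP + FRAME] += win * np.fft.irfft(frame, FRAME)
        norm[i * HOP:i * HOP + FRAME] += win ** 2
    return x / np.maximum(norm, 1e-8)

def proj_consistency(S):
    # P_C: project onto the set of consistent spectrograms, i.e. complex
    # arrays that are the STFT of some time-domain signal.
    return stft(istft(S))

def proj_magnitude(S, target_mag):
    # P_A: keep the current phases, replace magnitudes with the target.
    return target_mag * np.exp(1j * np.angle(S))

def griffin_lim(target_mag, n_iter=30, seed=0):
    # Classical offline Griffin-Lim: alternate the two projections,
    # starting from random phases.
    rng = np.random.default_rng(seed)
    S = target_mag * np.exp(2j * np.pi * rng.random(target_mag.shape))
    for _ in range(n_iter):
        S = proj_magnitude(proj_consistency(S), target_mag)
    return istft(S)
```

The faster offline variants mentioned above keep these same `proj_consistency` and `proj_magnitude` operators but combine them differently (e.g. with momentum terms); the RTISI-style online setting instead applies them frame by frame with limited look-ahead rather than over the whole spectrogram.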