Long-frame-shift Neural Speech Phase Prediction with Spectral Continuity Enhancement and Interpolation Error Compensation
Speech phase prediction, which is a significant research focus in the field
of signal processing, aims to recover speech phase spectra from
amplitude-related features. However, existing speech phase prediction methods
are constrained to recovering phase spectra with short frame shifts, which are
considerably smaller than the theoretical upper bound required for exact
waveform reconstruction of short-time Fourier transform (STFT). To tackle this
issue, we present a novel long-frame-shift neural speech phase prediction
(LFS-NSPP) method which enables precise prediction of long-frame-shift phase
spectra from long-frame-shift log amplitude spectra. The proposed method
consists of three stages: interpolation, prediction and decimation. The
short-frame-shift log amplitude spectra are first constructed from
long-frame-shift ones through frequency-by-frequency interpolation to enhance
the spectral continuity, and then employed to predict short-frame-shift phase
spectra using an NSPP model, thereby compensating for interpolation errors.
Ultimately, the long-frame-shift phase spectra are obtained from
short-frame-shift ones through frame-by-frame decimation. Experimental results
show that the proposed LFS-NSPP method yields better quality in predicting
long-frame-shift phase spectra than both the original NSPP model and other
signal-processing-based phase estimation algorithms.

Comment: Published in IEEE Signal Processing Letters
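The three-stage pipeline (interpolation, prediction, decimation) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `predict_phase` is a stand-in for a trained NSPP model, and frequency-by-frequency linear interpolation along the time axis is an assumed interpolation scheme.

```python
import numpy as np

def lfs_nspp_sketch(log_amp_lfs, factor, predict_phase):
    """Sketch of the interpolate -> predict -> decimate pipeline.

    log_amp_lfs   : (n_frames, n_bins) long-frame-shift log amplitude spectra
    factor        : ratio of long frame shift to short frame shift (e.g. 4)
    predict_phase : stand-in for a trained NSPP model (hypothetical callable)
    """
    n_frames, n_bins = log_amp_lfs.shape
    # Stage 1: frequency-by-frequency interpolation along the time axis,
    # producing short-frame-shift log amplitude spectra with enhanced
    # spectral continuity.
    t_long = np.arange(n_frames)
    t_short = np.linspace(0.0, n_frames - 1, (n_frames - 1) * factor + 1)
    log_amp_sfs = np.stack(
        [np.interp(t_short, t_long, log_amp_lfs[:, k]) for k in range(n_bins)],
        axis=1,
    )
    # Stage 2: predict short-frame-shift phase spectra with the NSPP model,
    # which compensates for interpolation errors in the amplitude input.
    phase_sfs = predict_phase(log_amp_sfs)
    # Stage 3: frame-by-frame decimation back to the long frame shift.
    return phase_sfs[::factor]
```

With an identity stand-in for the predictor, decimation exactly inverts the interpolation, which makes the frame bookkeeping easy to check.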
Prediction of speech intelligibility based on a correlation metric in the envelope power spectrum domain
A powerful tool to investigate speech perception is the use of speech intelligibility prediction models. Recently, a model was presented, termed the correlation-based speech-based envelope power spectrum model (sEPSMcorr) [1], based on the auditory processing of the multi-resolution speech-based envelope power spectrum model (mr-sEPSM) [2], combined with the correlation back-end of the Short-Time Objective Intelligibility (STOI) measure [3]. The sEPSMcorr can accurately predict normal-hearing (NH) listeners' data for a broad range of listening conditions, e.g., additive noise, phase jitter and ideal binary mask processing.
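The correlation back-end can be illustrated with a simplified stand-in: a per-channel Pearson correlation between clean and degraded envelope-power representations, averaged across channels. The function name and the (channels x time) input layout are assumptions; the actual sEPSMcorr operates on the mr-sEPSM internal representation.

```python
import numpy as np

def envelope_power_correlation(clean_env, degraded_env):
    """Average per-channel Pearson correlation between two envelope-power
    representations of shape (n_channels, n_frames) -- a simplified
    stand-in for the STOI-style correlation back-end."""
    corrs = []
    for c, d in zip(clean_env, degraded_env):
        c = c - c.mean()          # remove per-channel mean
        d = d - d.mean()
        denom = np.sqrt((c ** 2).sum() * (d ** 2).sum())
        # Guard against constant (zero-variance) channels.
        corrs.append((c * d).sum() / denom if denom > 0 else 0.0)
    return float(np.mean(corrs))
```

An undistorted input correlates perfectly with itself, giving a score of 1.0; heavier distortion drives the average correlation, and hence the predicted intelligibility, downward.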
Time-Varying Quasi-Closed-Phase Analysis for Accurate Formant Tracking in Speech Signals
In this paper, we propose a new method for the accurate estimation and
tracking of formants in speech signals using time-varying quasi-closed-phase
(TVQCP) analysis. Conventional formant tracking methods typically adopt a
two-stage estimate-and-track strategy wherein an initial set of formant
candidates are estimated using short-time analysis (e.g., 10--50 ms), followed
by a tracking stage based on dynamic programming or a linear state-space model.
One of the main disadvantages of these approaches is that the tracking stage,
however good it may be, cannot improve upon the formant estimation accuracy of
the first stage. The proposed TVQCP method provides a single-stage formant
tracking that combines the estimation and tracking stages into one. TVQCP
analysis combines three approaches to improve formant estimation and tracking:
(1) it uses temporally weighted quasi-closed-phase analysis to derive
closed-phase estimates of the vocal tract with reduced interference from the
excitation source, (2) it increases the sparsity of the prediction residual
through optimization, and (3) it uses time-varying linear prediction analysis over long
time windows (e.g., 100--200 ms) to impose a continuity constraint on the vocal
tract model and hence on the formant trajectories. Formant tracking experiments
with a wide variety of synthetic and natural speech signals show that the
proposed TVQCP method performs better than conventional and popular formant
tracking tools, such as Wavesurfer and Praat (based on dynamic programming),
the KARMA algorithm (based on Kalman filtering), and DeepFormants (based on
deep neural networks trained in a supervised manner). Matlab scripts for the
proposed method can be found at: https://github.com/njaygowda/ftrac
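Ingredient (1), temporally weighted linear prediction, can be sketched as a weighted least-squares problem: prediction errors are weighted sample by sample so that the quasi-closed-phase regions of each glottal cycle dominate the fit. The `weighted_lp` helper below is an illustrative assumption, not the authors' Matlab implementation (which additionally imposes time-varying coefficients and sparsity).

```python
import numpy as np

def weighted_lp(x, order, w):
    """Temporally weighted linear prediction: minimise sum_n w[n] * e[n]^2,
    where e[n] = x[n] - sum_k a[k] * x[n-k]. A stand-in for the
    quasi-closed-phase weighting used in TVQCP analysis."""
    N = len(x)
    rows, targets, weights = [], [], []
    for n in range(order, N):
        # Predict x[n] from the previous `order` samples.
        rows.append(x[n - order:n][::-1])
        targets.append(x[n])
        weights.append(w[n])
    A = np.asarray(rows)
    b = np.asarray(targets)
    sw = np.sqrt(np.asarray(weights))
    # Weighted least squares: scale rows by sqrt of the weights.
    a, *_ = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)
    return a  # predictor coefficients a[1..order]
```

On a noiseless autoregressive signal the fit recovers the generating coefficients exactly; in real speech, the weighting downweights the open-phase samples where the glottal source interferes with the vocal-tract estimate.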
Extraction of vocal-tract system characteristics from speech signals
We propose methods to track natural variations in the characteristics of the vocal-tract system from speech signals. We are especially interested in the cases where these characteristics vary over time, as happens in dynamic sounds such as consonant-vowel transitions. We show that the selection of appropriate analysis segments is crucial in these methods, and we propose a selection based on estimated instants of significant excitation. These instants are obtained by a method based on the average group-delay property of minimum-phase signals. In voiced speech, they correspond to the instants of glottal closure. The vocal-tract system is characterized by its formant parameters, which are extracted from the analysis segments. Because the segments are always at the same relative position in each pitch period, in voiced speech the extracted formants are consistent across successive pitch periods. We demonstrate the results of the analysis for several difficult cases of speech signals.
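The average-group-delay quantity underlying the excitation-instant detection can be sketched with the standard DFT identity relating group delay to the transforms of x[n] and n·x[n]. This shows only the per-frame computation; the framing, smoothing and peak-picking steps of the authors' method are omitted.

```python
import numpy as np

def average_group_delay(frame):
    """Average group delay of one analysis frame.

    Uses the identity tau(w) = Re{ Y(w) X*(w) } / |X(w)|^2, where X is the
    DFT of x[n] and Y is the DFT of n * x[n]. The frequency average of
    tau(w) indicates the dominant excitation location within the frame
    (illustrative per-frame step, not the authors' full procedure).
    """
    n = np.arange(len(frame))
    X = np.fft.rfft(frame)
    Y = np.fft.rfft(n * frame)
    eps = 1e-12  # guard against near-zero spectral magnitudes
    tau = np.real(Y * np.conj(X)) / (np.abs(X) ** 2 + eps)
    return float(np.mean(tau))
```

For a single impulse at sample d, the group delay equals d at every frequency, so the frame average locates the impulse exactly; tracking this average across sliding frames is what reveals the instants of significant excitation.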