Harmonic-aligned Frame Mask Based on Non-stationary Gabor Transform with Application to Content-dependent Speaker Comparison
We propose a harmonic-aligned frame mask for speech signals using the
non-stationary Gabor transform (NSGT). A frame mask operates on the transform
coefficients of a signal and thereby converts the signal into a
counterpart signal. It thus characterizes the difference between the two signals. In
preceding studies, frame masks based on the regular Gabor transform were applied to
single-note instrumental sound analysis. This study extends the frame mask
approach to speech signals. For voiced speech, the fundamental frequency
usually varies continuously over time. We employ an NSGT with pitch-dependent
and therefore time-varying frequency resolution to attain harmonic alignment in
the transform domain and hence yield harmonic-aligned frame masks for speech
signals. We propose to apply the harmonic-aligned frame mask to
content-dependent speaker comparison. Frame masks, computed from voiced signals
of the same vowel but from different speakers, were utilized as similarity
measures to compare and distinguish the speaker identities (SID). Results
obtained with deep neural networks demonstrate that the proposed frame mask is
valid for representing speaker characteristics and shows potential for SID
applications in limited-data scenarios.
Comment: Interspeech201
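
The frame-mask idea described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: a regular (stationary) windowed-FFT Gabor analysis stands in for the NSGT, so there is no pitch-dependent resolution and no harmonic alignment; the mask is taken as a regularized element-wise ratio of target to source coefficients, which may differ from the authors' definition; and the names `gabor_coeffs`, `frame_mask`, `harmonic_tone`, and the parameter `eps` are hypothetical.

```python
import numpy as np

def gabor_coeffs(x, win_len=256, hop=64):
    """Regular Gabor analysis: Hann-windowed frames, one real FFT per frame."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop:i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # shape: (time frames, frequency bins)

def frame_mask(c_src, c_tgt, eps=1e-8):
    """Element-wise mask m such that m * c_src approximates c_tgt.

    Assumption: a regularized ratio; eps keeps bins with near-zero
    source energy from producing huge mask values.
    """
    return c_tgt * np.conj(c_src) / (np.abs(c_src) ** 2 + eps)

def harmonic_tone(f0, amps, fs=8000, dur=1.0):
    """Synthetic 'voiced' signal: harmonics of f0 with the given amplitudes."""
    t = np.arange(int(fs * dur)) / fs
    return sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t)
               for k, a in enumerate(amps))

# Two toy "speakers": same pitch, different harmonic amplitude profiles.
src = harmonic_tone(200.0, [1.0, 0.5, 0.25])
tgt = harmonic_tone(200.0, [1.0, 0.8, 0.1])

c_src, c_tgt = gabor_coeffs(src), gabor_coeffs(tgt)
mask = frame_mask(c_src, c_tgt)

# Applying the mask to the source coefficients approximates the target's.
c_est = mask * c_src
```

In this toy setting the mask encodes how the harmonic amplitudes of the two signals differ, which is the kind of information the abstract proposes to use as a similarity measure. With a time-varying pitch, the fixed frequency grid used here would no longer keep harmonics in the same bins across frames, which is the problem the pitch-dependent NSGT is meant to address.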