Traditional Machine Learning for Pitch Detection
Pitch detection is a fundamental problem in speech processing as F0 is used
in a large number of applications. Recent articles have proposed deep learning
for robust pitch tracking. In this paper, we consider voicing detection as a
classification problem and F0 contour estimation as a regression problem. For
both tasks, acoustic features from multiple domains and traditional machine
learning methods are used. The discrimination power of existing and proposed
features is assessed through mutual information. Multiple supervised and
unsupervised approaches are compared. A significant relative reduction of
voicing errors over the best baseline is obtained: 20% with the best clustering
method (K-means) and 45% with a Multi-Layer Perceptron. For F0 contour
estimation, however, the benefits of regression techniques are limited. We
investigate whether those objective gains translate to a parametric synthesis
task. Clear perceptual preferences are observed for the proposed approach over
two widely used baselines (RAPT and DIO).
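As a rough illustration of voicing detection by clustering, the sketch below runs a two-cluster K-means on a single per-frame feature and labels the higher cluster as voiced. The feature (log-energy), the sample values, and the function names are assumptions made for this example only; the abstract does not state which features or K-means configuration the authors used.

```python
def kmeans_1d_two_clusters(values, iterations=20):
    """Split 1-D values into a low and a high cluster with plain K-means.

    Returns the two centroids as (low_centroid, high_centroid).
    """
    lo, hi = min(values), max(values)  # start the centroids at the extremes
    for _ in range(iterations):
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        if not low or not high:
            break  # degenerate case, e.g. all values identical
        new_lo = sum(low) / len(low)
        new_hi = sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # converged
        lo, hi = new_lo, new_hi
    return lo, hi


def voicing_decisions(values):
    """Label a frame as voiced when it is closer to the high centroid."""
    lo, hi = kmeans_1d_two_clusters(values)
    return [abs(v - hi) < abs(v - lo) for v in values]


# Hypothetical per-frame log-energy values (illustrative, not from the paper).
frame_log_energy = [-9.1, -8.7, -2.3, -1.9, -2.1, -8.9, -1.7]
voiced = voicing_decisions(frame_log_energy)
```

With these made-up values, the low-energy frames fall into the unvoiced cluster and the remaining frames into the voiced one. A realistic system would cluster a multi-dimensional feature vector rather than a single energy value.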
Pitch-Synchronous Single Frequency Filtering Spectrogram for Speech Emotion Recognition
Convolutional neural networks (CNN) are widely used for speech emotion
recognition (SER). In such cases, the short-time Fourier transform (STFT)
spectrogram is the most popular choice for representing speech, which is fed as
input to the CNN. However, the uncertainty principle of the short-time Fourier
transform prevents it from achieving high time and frequency resolution
simultaneously. On the other hand, the recently proposed single frequency
filtering (SFF) spectrogram promises to be a better alternative because it
provides good time and frequency resolution simultaneously. In this work, we
explore the SFF spectrogram as an alternative representation of speech for SER.
We have modified the SFF spectrogram by taking the average of the amplitudes of
all the samples between two successive glottal closure instant (GCI)
locations. The duration between two successive GCI locations gives the pitch period,
motivating us to name the modified SFF spectrogram as pitch-synchronous SFF
spectrogram. The GCI locations were detected using the zero frequency filtering
approach. The proposed pitch-synchronous SFF spectrogram produced accuracy
values of 63.95% (unweighted) and 70.4% (weighted) on the IEMOCAP dataset.
These correspond to an improvement of +7.35% (unweighted) and +4.3% (weighted)
over the state-of-the-art result on the STFT spectrogram using a CNN. Specifically, the
proposed method recognized 22.7% of the happy emotion samples correctly,
whereas this number was 0% for the state-of-the-art results. These results also
promise a much wider use of the proposed pitch-synchronous SFF spectrogram for
other speech-based applications.
Comment: 11 pages and fewer than 20 figures
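The pitch-synchronous averaging described in the abstract above can be sketched as follows for a single frequency channel. This is a minimal illustration under stated assumptions: the function names, the convention that each segment includes its starting GCI sample and excludes the next one, and the example values are all hypothetical and are not taken from the paper. GCI locations are assumed to be strictly increasing sample indices, already detected by some other means.

```python
def pitch_synchronous_average(envelope, gci_locations):
    """Average the envelope samples between each pair of successive GCIs.

    envelope: amplitude envelope of one frequency channel (list of floats).
    gci_locations: strictly increasing sample indices of the GCIs.
    Each segment runs from one GCI (inclusive) to the next (exclusive);
    that boundary convention is an assumption of this sketch.
    """
    averages = []
    for start, end in zip(gci_locations, gci_locations[1:]):
        segment = envelope[start:end]
        averages.append(sum(segment) / len(segment))
    return averages


def f0_from_gcis(gci_locations, sample_rate):
    """Estimate F0 in Hz per cycle from the spacing of successive GCIs."""
    return [
        sample_rate / (end - start)
        for start, end in zip(gci_locations, gci_locations[1:])
    ]


# Hypothetical envelope and GCI indices (illustrative values only).
envelope = [1.0, 3.0, 2.0, 2.0, 5.0, 6.0, 5.0, 1.0]
cycle_averages = pitch_synchronous_average(envelope, [0, 2, 5, 8])

# GCIs spaced 160 samples apart at 16 kHz correspond to a 10 ms pitch period.
f0_values = f0_from_gcis([0, 160, 320], 16000)
```

Here `cycle_averages` holds one value per glottal cycle, and `f0_values` evaluates to 100 Hz for both cycles. Applying the same averaging to every frequency channel would give one spectrogram column per glottal cycle instead of one per sample.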