Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Abstract
The way a non-native speaker pronounces the phones of a language
is an important predictor of their proficiency. In grading
spontaneous speech, the pairwise distances between generative
statistical models trained on each phone have been shown to be
powerful features. This paper presents a deep learning alternative
to model-based phone distances: a tunable Siamese network
feature extractor that derives distance metrics directly
from the audio frame sequence. Features are extracted at
the phone instance level and combined into phone-level representations
using an attention mechanism. Pairwise distances between
phone features are then projected through a feed-forward
layer to predict the score. The extraction stage is initialised either on
a binary phone instance-pair classification task or to mimic
the model-based features; the whole system is then fine-tuned
end-to-end, optimising the distance metric for
the score prediction task. This method is therefore more adaptable
and more sensitive to phone instance level phenomena. Its
performance is compared agains