Automatic Detection of Phonological Errors in Child Speech Using Siamese Recurrent Autoencoder
Speech sound disorder (SSD) refers to the developmental disorder in which
children encounter persistent difficulties in correctly pronouncing words.
Assessment of SSD has relied largely on trained speech and language
pathologists (SLPs). With the increasing demand for and long-lasting shortage
of SLPs, automated assessment of speech disorder becomes a highly desirable
approach to assisting clinical work. This paper describes a study on automatic
detection of phonological errors in Cantonese speech of kindergarten children,
based on a newly collected large speech corpus. The proposed approach to speech
error detection involves the use of a Siamese recurrent autoencoder, which is
trained to learn the similarity and discrepancy between phone segments in the
embedding space. Training of the model requires only speech data from typically
developing (TD) children. To distinguish disordered speech from typical speech,
the cosine distance between the embeddings of the test segment and the
reference segment is computed. Different model architectures and training
strategies are compared. Results on detecting the 6 most common consonant errors
demonstrate satisfactory performance of the proposed model, with average
precision ranging from 0.82 to 0.93.
Comment: Accepted to INTERSPEECH 2020, Shanghai, China
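The scoring step described in the abstract above (cosine distance between a test-segment embedding and a reference-segment embedding) can be illustrated with a minimal sketch. It assumes the embeddings have already been produced by some encoder; the function names, the toy vectors, and the threshold of 0.5 are hypothetical and are not taken from the paper.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def is_flagged(test_emb, ref_emb, threshold=0.5):
    """Flag a test segment whose embedding lies far from the reference.

    The threshold is an illustrative assumption; in practice it would be
    tuned on held-out data.
    """
    return cosine_distance(test_emb, ref_emb) > threshold


# Toy embeddings (hypothetical values, for illustration only).
reference = [1.0, 0.0, 0.0]
close_segment = [0.9, 0.1, 0.0]
far_segment = [0.0, 1.0, 0.0]

print(is_flagged(close_segment, reference))
print(is_flagged(far_segment, reference))
```

A larger distance means the test segment is less similar to the reference pronunciation, so sweeping the threshold trades precision against recall.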
CNN-based Spoken Term Detection and Localization without Dynamic Programming
In this paper, we propose a spoken term detection algorithm for simultaneous
prediction and localization of in-vocabulary and out-of-vocabulary terms within
an audio segment. The proposed algorithm infers whether a term was uttered
within a given speech signal or not by predicting the word embeddings of
various parts of the speech signal and comparing them to the word embedding of
the desired term. The algorithm utilizes an existing embedding space for this
task and does not need to train a task-specific embedding space. At inference time,
the algorithm simultaneously predicts all possible locations of the target term
and does not need dynamic programming for optimal search. We evaluate our
system on several spoken term detection tasks on read speech corpora.
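The comparison step described in the abstract above can be sketched as follows: each part of the speech signal has a predicted embedding, and every part is scored independently against the embedding of the target term, so all candidate locations are found in a single pass with no dynamic-programming search. This is only a sketch under stated assumptions. In the paper the embeddings are predicted by a CNN, whereas here they are given as toy vectors, and `detect_term`, the cell layout, and the threshold of 0.8 are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_term(cell_embeddings, term_embedding, threshold=0.8):
    """Return the indices of all signal parts that match the target term.

    Each part is compared to the term embedding independently, so every
    candidate location is scored in one pass. The threshold is an
    illustrative assumption.
    """
    return [
        index
        for index, embedding in enumerate(cell_embeddings)
        if cosine_similarity(embedding, term_embedding) >= threshold
    ]


# Toy predicted embeddings for three parts of a signal (hypothetical values).
cells = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
term = [1.0, 0.0]

print(detect_term(cells, term))
```

An empty result means the term was judged absent from the segment; a non-empty result gives both the detection decision and the locations at once.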