1 research outputs found
Learning Efficient Representations for Keyword Spotting with Triplet Loss
In the past few years, triplet loss-based metric embeddings have become a
de-facto standard for several important computer vision problems, most
no-tably, person reidentification. On the other hand, in the area of speech
recognition the metric embeddings generated by the triplet loss are rarely used
even for classification problems. We fill this gap showing that a combination
of two representation learning techniques: a triplet loss-based embedding and a
variant of kNN for classification instead of cross-entropy loss significantly
(by 26% to 38%) improves the classification accuracy for convolutional networks
on a LibriSpeech-derived LibriWords datasets. To do so, we propose a novel
phonetic similarity based triplet mining approach. We also improve the current
best published SOTA for Google Speech Commands dataset V1 10+2 -class
classification by about 34%, achieving 98.55% accuracy, V2 10+2-class
classification by about 20%, achieving 98.37% accuracy, and V2 35-class
classification by over 50%, achieving 97.0% accuracy.Comment: Submitted to SPECOM 202