Knowledge Distillation for Singing Voice Detection
Singing Voice Detection (SVD) has been an active area of research in music
information retrieval (MIR). Currently, two deep neural network-based methods,
one based on CNN and the other on RNN, exist in literature that learn optimized
features for the voice detection (VD) task and achieve state-of-the-art
performance on common datasets. Both of these models have a large number of
parameters (1.4M for the CNN and 65.7K for the RNN) and are hence not suitable for
deployment on devices like smartphones or embedded sensors with limited
capacity in terms of memory and computation power. The most popular method to
address this issue is known as knowledge distillation in deep learning
literature (in addition to model compression) where a large pretrained network
known as the teacher is used to train a smaller student network. However, to
the best of our knowledge, such methods have not been explored yet in the
domain of SVD. In this paper, efforts have been made to investigate this issue
using both conventional as well as ensemble knowledge distillation techniques.
Through extensive experimentation on the publicly available Jamendo dataset, we
show that not only is it possible to achieve comparable accuracies with far
smaller models (up to 1000x smaller in terms of parameters), but, fascinatingly,
in some cases, smaller models trained with distillation even surpass the
current state-of-the-art models on voice detection performance.
Comment: 5 pages, 3 figures
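
The teacher-student training described in the abstract can be illustrated with a minimal sketch of a distillation loss. The abstract does not give the loss actually used in the paper, so everything here is an assumption: the temperature-softened formulation that is common in the distillation literature, the averaging of several teachers' soft outputs as one possible form of ensemble distillation, and the hypothetical names `softmax`, `distillation_loss`, `temperature`, and `alpha`.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits_list, label,
                      temperature=2.0, alpha=0.5):
    """Per-example distillation loss (illustrative, not the paper's exact loss).

    student_logits      -- logits of the small student model, e.g. [voice, no-voice]
    teacher_logits_list -- one logit list per teacher; a single-element list
                           corresponds to conventional distillation
    label               -- index of the ground-truth class
    alpha               -- assumed weight balancing the soft and hard terms
    """
    # Assumption: ensemble distillation averages the teachers' softened outputs.
    teacher_probs = [softmax(t, temperature) for t in teacher_logits_list]
    n_teachers = len(teacher_probs)
    soft_targets = [
        sum(p[k] for p in teacher_probs) / n_teachers
        for k in range(len(student_logits))
    ]

    # Soft term: cross-entropy between teacher targets and softened student output.
    student_soft = softmax(student_logits, temperature)
    soft_loss = -sum(t * math.log(s) for t, s in zip(soft_targets, student_soft))

    # Hard term: ordinary cross-entropy against the ground-truth label.
    hard_loss = -math.log(softmax(student_logits)[label])

    # The temperature**2 factor keeps the soft term's gradient scale comparable.
    return alpha * temperature ** 2 * soft_loss + (1 - alpha) * hard_loss


if __name__ == "__main__":
    teachers = [[3.0, -2.0], [1.5, -0.5]]  # made-up logits for two teachers
    print(distillation_loss([2.0, -1.0], teachers, label=0))
```

In this sketch a student whose logits agree with the teachers and the label receives a lower loss than one that disagrees, which is the signal a small model would be trained on.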