MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition
We present MatchboxNet, an end-to-end neural network for speech command
recognition. MatchboxNet is a deep residual network composed of blocks of 1D
time-channel separable convolution, batch normalization, ReLU, and dropout
layers. MatchboxNet reaches state-of-the-art accuracy on the Google Speech
Commands dataset while having significantly fewer parameters than similar
models. The small footprint of MatchboxNet makes it an attractive candidate for
devices with limited computational resources. The model is highly scalable, so
model accuracy can be improved with modest additional memory and compute.
Finally, we show how intensive data augmentation using an auxiliary noise
dataset improves robustness in the presence of background noise.
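As a rough illustration of the building block named above, the sketch below implements a 1D time-channel separable convolution in plain NumPy: a per-channel (depthwise) temporal filter followed by a pointwise (1x1) channel-mixing step. The function name and shapes are illustrative choices, not code from the paper; batch normalization, ReLU, dropout, and the residual connection are omitted.

```python
import numpy as np

def time_channel_separable_conv1d(x, depthwise_k, pointwise_w):
    """Sketch of a 1D time-channel separable convolution.

    x:           (channels, time) input feature map
    depthwise_k: (channels, kernel) one temporal filter per channel
    pointwise_w: (out_channels, channels) 1x1 channel-mixing weights
    Assumes an odd kernel width so "same" padding preserves length.
    """
    c, t = x.shape
    k = depthwise_k.shape[1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))  # "same" padding along time
    # Depthwise step: each channel is filtered independently along time.
    dw = np.stack([np.convolve(xp[i], depthwise_k[i], mode="valid")
                   for i in range(c)])
    # Pointwise step: mix channels at every time step (a 1x1 convolution).
    return pointwise_w @ dw
```

For C input channels, O output channels, and kernel width K, this factorization uses C*K + O*C weights instead of the O*C*K of a full 1D convolution, which is where the small footprint comes from.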
Few-Shot Keyword Spotting With Prototypical Networks
Keyword spotting, the task of recognizing a particular command or keyword, is
widely used in many voice interfaces such as Amazon's Alexa and Google Home. In
order to recognize a set of keywords, most of the recent deep learning based
approaches use a neural network trained with a large number of samples to
identify certain pre-defined keywords. This restricts the system from
recognizing new, user-defined keywords. Therefore, we first formulate this
problem as few-shot keyword spotting and approach it using metric learning.
To enable this research, we also synthesize and publish a Few-shot Google
Speech Commands dataset. We then propose a solution to the few-shot keyword
spotting problem using temporal and dilated convolutions on prototypical
networks. Our comparative experimental results demonstrate keyword spotting of
new keywords using just a small number of samples.
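The prototypical-network idea behind this approach can be sketched in a few lines: each class prototype is the mean embedding of that class's support examples, and a query is assigned to its nearest prototype. This is a generic sketch of the metric-learning classification step over already-computed embeddings, not the paper's temporal/dilated-convolution encoder.

```python
import numpy as np

def prototypes(support, labels):
    """Class prototype = mean embedding of that class's support examples."""
    classes = np.unique(labels)
    protos = np.stack([support[labels == c].mean(axis=0) for c in classes])
    return classes, protos

def classify(queries, classes, protos):
    """Assign each query embedding to the nearest prototype
    (squared Euclidean distance)."""
    d = ((queries[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return classes[d.argmin(axis=1)]
```

Because a new keyword only needs a few support examples to form a prototype, the classifier extends to user-defined keywords without retraining the encoder.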
LiCo-Net: Linearized Convolution Network for Hardware-efficient Keyword Spotting
This paper proposes a hardware-efficient architecture, Linearized Convolution
Network (LiCo-Net) for keyword spotting. It is optimized specifically for
low-power processor units like microcontrollers. ML operators exhibit
heterogeneous efficiency profiles on power-efficient hardware: for the same
theoretical computation cost, int8 operators are more computationally efficient
than float operators, and linear layers are often more efficient than other
layers. The proposed LiCo-Net is a dual-phase system that uses the efficient
int8 linear operators at the inference phase and applies streaming convolutions
at the training phase to maintain a high model capacity. The experimental
results show that LiCo-Net outperforms the singular value decomposition filter
(SVDF) on hardware efficiency with on-par detection performance. Compared to
SVDF, LiCo-Net reduces cycles by 40% on the HiFi4 DSP.
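To make the int8-versus-float point concrete, here is a minimal sketch of a symmetric int8 quantized linear layer: weights and activations are mapped to int8, the matrix multiply is accumulated in int32, and the result is rescaled back to float. This illustrates the general technique only; LiCo-Net's actual quantization scheme and operators are not specified in the abstract.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_linear(x, w):
    """Linear layer with int8 operands, int32 accumulation,
    and float dequantization."""
    qx, sx = quantize_int8(x)
    qw, sw = quantize_int8(w)
    acc = qx.astype(np.int32) @ qw.astype(np.int32).T  # integer matmul
    return acc * (sx * sw)  # dequantize the accumulated result
```

On microcontroller-class hardware the integer matmul is the cheap part; the float scaling is a single multiply per output, which is why int8 linear operators are attractive at inference time.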
AraSpot: Arabic Spoken Command Spotting
Spoken keyword spotting (KWS) is the task of identifying a keyword in an
audio stream and is widely used in smart devices at the edge in order to
activate voice assistants and perform hands-free tasks. The task is
challenging: such systems must achieve high accuracy while continuing to run
efficiently on devices with low power and possibly limited computational
capabilities. This work presents AraSpot
for Arabic keyword spotting, trained on 40 Arabic keywords with various forms
of online data augmentation, and introduces a ConformerGRU model architecture.
Finally, we further improve the performance of the model by training a
text-to-speech model for synthetic data generation. AraSpot achieved a
state-of-the-art (SOTA) result of 99.59%, outperforming previous approaches.
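The kind of online augmentation mentioned above can be sketched as a random time shift plus mixing in background noise at a target signal-to-noise ratio. The specific transforms and parameters below are illustrative assumptions, not AraSpot's actual pipeline.

```python
import numpy as np

def augment(wave, noise, snr_db=10.0, max_shift=1600, rng=None):
    """Sketch of online waveform augmentation (illustrative, not AraSpot's):
    random circular time shift, then additive noise scaled to a target SNR."""
    rng = np.random.default_rng() if rng is None else rng
    shift = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.roll(wave, shift)
    sig_p = np.mean(shifted ** 2)
    noise_p = np.mean(noise ** 2) + 1e-12
    # Gain that puts the noise snr_db decibels below the signal power.
    gain = np.sqrt(sig_p / (noise_p * 10 ** (snr_db / 10)))
    return shifted + gain * noise
```

Applying such transforms on the fly during training means the model never sees the exact same waveform twice, which is the usual motivation for online rather than offline augmentation.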