EfficientWord-Net: An Open Source Hotword Detection Engine based on One-shot Learning
Voice assistants such as Siri, Google Assistant, and Alexa are widely used
across the globe for home automation. They require special phrases, known as
hotwords, to wake them up and perform an action, e.g. "Hey Alexa!",
"Ok Google!", and "Hey Siri!". These hotwords are detected by lightweight
real-time engines that listen for the phrases uttered by the user. This paper
presents the design and implementation of a hotword detection engine based on
one-shot learning, which detects a hotword uttered by the user in real time
from just one or a few training samples of that hotword. This approach is
efficient compared to existing implementations, in which adding a new hotword
requires enormous amounts of positive and negative training samples and the
model must be retrained for every hotword, making those systems inefficient in
terms of both computation and cost. The architecture proposed in this paper
achieves an accuracy of 94.51%.
Comment: 9 pages, 17 figures
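The paper's implementation is not reproduced here, but the one-shot idea it
describes can be sketched roughly: enrollment averages the embeddings of the
one or few user recordings, and detection thresholds cosine similarity between
that reference and each live audio frame. In the sketch below the feature
extractor is a toy spectral stand-in for the paper's learned similarity model,
and the threshold value is purely illustrative.

```python
import numpy as np

def embed(audio_frame: np.ndarray) -> np.ndarray:
    # Toy stand-in for a learned similarity model: log-magnitude
    # spectrum as a crude fixed-length feature vector.
    return np.log1p(np.abs(np.fft.rfft(audio_frame, n=2048)))

def enroll_hotword(samples: list) -> np.ndarray:
    # One-shot enrollment: average the embeddings of the one or few
    # user-provided recordings of the hotword.
    return np.mean([embed(s) for s in samples], axis=0)

def is_hotword(frame: np.ndarray, reference: np.ndarray,
               threshold: float = 0.8) -> bool:
    # Cosine similarity between the live frame's embedding and the
    # enrolled reference; scores above the threshold count as detections.
    e = embed(frame)
    score = np.dot(e, reference) / (
        np.linalg.norm(e) * np.linalg.norm(reference) + 1e-9)
    return bool(score >= threshold)
```

Under this scheme no retraining is needed to add a hotword: enrolling a new
phrase only computes and stores a new reference embedding, which is the
efficiency argument the abstract makes.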
Learning weakly supervised multimodal phoneme embeddings
Recent works have explored deep architectures for learning multimodal speech
representations (e.g. audio and images, or articulation and audio) in a
supervised way. Here we investigate the role of combining different speech
modalities, i.e. audio and visual information representing lip movements, in a weakly
supervised way using Siamese networks and lexical same-different side
information. In particular, we ask whether one modality can benefit from the
other to provide a richer representation for phone recognition in a weakly
supervised setting. We introduce mono-task and multi-task methods for merging
speech and visual modalities for phone recognition. The mono-task method
applies a Siamese network to the concatenation of the two modalities, while
the multi-task method receives several different combinations of modalities at
training time. We show that multi-task learning
enhances discriminability for visual and multimodal inputs while minimally
impacting auditory inputs. Furthermore, we present a qualitative analysis of
the obtained phone embeddings, and show that cross-modal visual input can
improve the discriminability of phonological features that are visually
discernible (rounding, open/close, labial place of articulation), resulting in
representations that are closer to abstract linguistic features than those
based on audio only.
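The abstract does not include the architecture itself; the following is a
minimal PyTorch sketch of the mono-task variant under stated assumptions
(feature dimensions, layer sizes, and the contrastive margin are invented for
illustration): a single encoder over concatenated audio and lip features,
trained with a contrastive loss driven only by lexical same/different pair
labels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonoTaskSiamese(nn.Module):
    # One shared encoder applied to the concatenation of the two
    # modalities, as in the mono-task setup described above.
    def __init__(self, audio_dim=40, visual_dim=20, embed_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, 128),
            nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, audio, visual):
        return self.net(torch.cat([audio, visual], dim=-1))

def same_different_loss(emb_a, emb_b, same, margin=1.0):
    # Weak supervision: each pair carries only a lexical same (1) /
    # different (0) label. The contrastive loss pulls same pairs
    # together and pushes different pairs beyond the margin.
    dist = F.pairwise_distance(emb_a, emb_b)
    return torch.mean(same * dist.pow(2)
                      + (1.0 - same) * F.relu(margin - dist).pow(2))
```

The multi-task variant described in the abstract would keep the same loss but
feed the encoder different modality combinations (audio-only, visual-only,
audio+visual) across training batches.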