Zero-shot keyword spotting for visual speech recognition in-the-wild
Visual keyword spotting (KWS) is the problem of estimating whether a text
query occurs in a given recording using only video information. This paper
focuses on visual KWS for words unseen during training, a real-world, practical
setting which so far has received no attention from the community. To this end,
we devise an end-to-end architecture comprising (a) a state-of-the-art visual
feature extractor based on spatiotemporal Residual Networks, (b) a
grapheme-to-phoneme model based on sequence-to-sequence neural networks, and
(c) a stack of recurrent neural networks which learn how to correlate visual
features with the keyword representation. Unlike prior works on KWS,
which try to learn word representations merely from sequences of graphemes
(i.e. letters), we propose the use of a grapheme-to-phoneme encoder-decoder
model which learns how to map words to their pronunciation. We demonstrate that
our system obtains very promising visual-only KWS results on the challenging
LRS2 database, for keywords unseen during training. We also show that our
system outperforms a baseline which addresses KWS via automatic speech
recognition (ASR), while drastically improving over other recently proposed
ASR-free KWS methods.
Comment: Accepted at ECCV-201
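As a concrete illustration of the grapheme-to-phoneme component, the sketch
below (PyTorch) shows a minimal encoder-decoder whose encoder state doubles as
a keyword representation that a recurrent KWS stack could consume. All layer
choices, vocabulary sizes, and names are illustrative assumptions, not the
authors' implementation.

import torch
import torch.nn as nn

class G2PSeq2Seq(nn.Module):
    def __init__(self, n_graphemes=30, n_phonemes=45, dim=128):
        super().__init__()
        self.g_embed = nn.Embedding(n_graphemes, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.p_embed = nn.Embedding(n_phonemes, dim)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, n_phonemes)

    def forward(self, graphemes, phonemes):
        # graphemes: (B, Tg) int ids; phonemes: (B, Tp) teacher-forced ids
        _, h = self.encoder(self.g_embed(graphemes))      # h: (1, B, dim)
        dec, _ = self.decoder(self.p_embed(phonemes), h)  # condition on h
        return self.out(dec), h.squeeze(0)  # phoneme logits + keyword vector

model = G2PSeq2Seq()
logits, kw = model(torch.randint(0, 30, (2, 8)), torch.randint(0, 45, (2, 10)))
print(kw.shape)  # torch.Size([2, 128]) -- the pronunciation-aware keyword code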
Multilingual Query-by-Example Keyword Spotting with Metric Learning and Phoneme-to-Embedding Mapping
In this paper, we propose a multilingual query-by-example keyword spotting
(KWS) system based on a residual neural network. The model is trained as a
classifier on a multilingual keyword dataset extracted from Common Voice
sentences and fine-tuned using circle loss. We demonstrate the generalization
ability of the model to new languages and report a mean reduction in EER of
59.2% for previously seen and 47.9% for unseen languages compared to a
competitive baseline. We show that the word embeddings learned by the KWS model
can be accurately predicted from the phoneme sequences using a simple LSTM
model. Our system achieves promising accuracy for streaming keyword spotting
and keyword search on Common Voice audio using just 5 examples per keyword.
Experiments on the Hey-Snips dataset show good performance, with a false
negative rate of 5.4% at only 0.1 false alarms per hour.
Comment: Accepted to ICASSP 202
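The circle-loss fine-tuning mentioned above admits a compact sketch. The code
below follows the published circle-loss formulation (Sun et al., 2020) on
cosine similarities of positive (sp) and negative (sn) pairs; the margin and
scale values are illustrative, not this paper's settings.

import torch
import torch.nn.functional as F

def circle_loss(sp, sn, m=0.25, gamma=64.0):
    # sp: similarities of positive pairs; sn: of negative pairs (in [-1, 1])
    ap = torch.clamp_min(1 + m - sp, 0.0)   # adaptive positive weighting
    an = torch.clamp_min(sn + m, 0.0)       # adaptive negative weighting
    logit_p = -gamma * ap * (sp - (1 - m))  # positive margin Delta_p = 1 - m
    logit_n = gamma * an * (sn - m)         # negative margin Delta_n = m
    # log(1 + sum_j exp(logit_n_j) * sum_i exp(logit_p_i))
    return F.softplus(torch.logsumexp(logit_n, dim=0) +
                      torch.logsumexp(logit_p, dim=0))

sp, sn = torch.rand(5), torch.rand(12)
print(circle_loss(sp, sn))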
PhonMatchNet: Phoneme-Guided Zero-Shot Keyword Spotting for User-Defined Keywords
This study presents a novel zero-shot user-defined keyword spotting model
that utilizes the audio-phoneme relationship of the keyword to improve
performance. Unlike previous approaches that estimate at the utterance level,
we use both utterance- and phoneme-level information. Our proposed method
comprises a two-stream speech encoder architecture, a self-attention-based
pattern extractor, and a phoneme-level detection loss for high performance in various
pronunciation environments. Based on experimental results, our proposed model
outperforms the baseline model and achieves competitive performance compared
with full-shot keyword spotting models. Our proposed model significantly
improves the EER and AUC across all datasets, including familiar words, proper
nouns, and indistinguishable pronunciations, with an average relative
improvement of 67% and 80%, respectively. The implementation code of our
proposed model is available at https://github.com/ncsoft/PhonMatchNet.
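As a rough illustration of combining utterance- and phoneme-level information
(not the released code at the link above), the sketch below lets the keyword's
phoneme embeddings attend over audio frames and applies a binary detection
loss at both granularities; every shape, layer, and toy label here is an
assumption.

import torch
import torch.nn as nn

B, Ta, Tp, D = 2, 50, 6, 64           # batch, audio frames, phonemes, dim
audio = torch.randn(B, Ta, D)         # speech-encoder output
phone = torch.randn(B, Tp, D)         # phoneme embeddings of the keyword

attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
patt, _ = attn(query=phone, key=audio, value=audio)  # (B, Tp, D) patterns

phone_logit = nn.Linear(D, 1)(patt).squeeze(-1)  # (B, Tp) phoneme detections
utt_logit = phone_logit.mean(dim=1)              # (B,) utterance-level score

bce = nn.BCEWithLogitsLoss()
phone_tgt = torch.ones(B, Tp)   # 1 where the phoneme is spoken (toy labels)
utt_tgt = torch.ones(B)
loss = bce(phone_logit, phone_tgt) + bce(utt_logit, utt_tgt)
print(loss)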
Matching Latent Encoding for Audio-Text based Keyword Spotting
Using audio and text embeddings jointly for Keyword Spotting (KWS) has shown
high-quality results, but the key challenge of how to semantically align two
embeddings for multi-word keywords of different sequence lengths remains
largely unsolved. In this paper, we propose an audio-text-based end-to-end
model architecture for flexible KWS, which builds upon
learned audio and text embeddings. Our architecture uses a novel dynamic
programming-based algorithm, Dynamic Sequence Partitioning (DSP), to optimally
partition the audio sequence into the same length as the word-based text
sequence using the monotonic alignment of spoken content. Our proposed model
consists of an encoder block to get audio and text embeddings, a projector
block to project individual embeddings to a common latent space, and an
audio-text aligner containing a novel DSP algorithm, which aligns the audio and
text embeddings to determine if the spoken content is the same as the text.
Experimental results show that our DSP is more effective than other
partitioning schemes, and the proposed architecture outperforms the
state-of-the-art results on the public dataset in terms of Area Under the ROC
Curve (AUC) and Equal Error Rate (EER), by 14.4% and 28.9%, respectively.
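Since the abstract does not spell out the algorithm, here is a hedged sketch
of what a monotonic dynamic-programming partitioner in the spirit of DSP can
look like: it splits Ta audio frames into K contiguous segments, one per text
token, maximizing the summed segment-token similarity. The mean-similarity
scoring and all names are assumptions rather than the paper's exact method.

import numpy as np

def partition(sim):
    # sim: (Ta, K) frame-to-token similarity; returns score and cut indices
    Ta, K = sim.shape
    best = np.full((K, Ta), -1e9)      # best[k, t]: frames 0..t as k+1 segments
    back = np.zeros((K, Ta), dtype=int)
    csum = np.cumsum(sim, axis=0)      # prefix sums for fast segment scores

    def seg(k, s, e):                  # mean similarity of frames s..e, token k
        total = csum[e, k] - (csum[s - 1, k] if s > 0 else 0.0)
        return total / (e - s + 1)

    for t in range(Ta):
        best[0, t] = seg(0, 0, t)
    for k in range(1, K):
        for t in range(k, Ta):         # segment k starts at some s in [k, t]
            scores = [best[k - 1, s - 1] + seg(k, s, t) for s in range(k, t + 1)]
            back[k, t] = k + int(np.argmax(scores))
            best[k, t] = float(np.max(scores))

    cuts, t = [], Ta - 1               # recover the segment start indices
    for k in range(K - 1, 0, -1):
        t = back[k, t]
        cuts.append(t)
        t -= 1
    return best[K - 1, Ta - 1], sorted(cuts)

score, cuts = partition(np.random.rand(20, 4))
print(score, cuts)  # total alignment score and the 3 segment boundaries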
Sign language video retrieval with free-form textual queries
Systems that can efficiently search collections of sign language videos have
been highlighted as a useful application of sign language technology. However,
the problem of searching videos beyond individual keywords has received
limited attention in the literature. To address this gap, in this work we
introduce the task of sign language retrieval with textual queries: given a
written query (e.g. a sentence) and a large collection of sign language
videos, the objective is to find the signing video that best matches the
written query. We propose to tackle this task by learning cross-modal
embeddings on the recently introduced large-scale How2Sign dataset of
American Sign Language (ASL). We identify that a key bottleneck in the
performance of the system is the quality of the sign video embedding, which
suffers from a scarcity of labelled training data. We therefore propose
SPOT-ALIGN, a framework for interleaving iterative rounds of sign spotting
and feature alignment to expand the scope and scale of available training
data. We validate the effectiveness of SPOT-ALIGN for learning a robust sign
video embedding through improvements in both sign recognition and the
proposed video retrieval task.
This work was supported by project PID2020-117142GB-I00, funded by
MCIN/AEI/10.13039/501100011033, ANR project CorVis ANR-21-CE23-0003-01, and
gifts from Google and Adobe. AD received support from la Caixa Foundation
(ID 100010434), fellowship code LCF/BQ/IN18/11660029.
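A cross-modal embedding objective of the kind such retrieval systems commonly
train with can be sketched as a symmetric InfoNCE loss over matched text-video
pairs; the dimensions, temperature, and names below are illustrative, not the
paper's exact recipe.

import torch
import torch.nn.functional as F

def infonce(text_emb, video_emb, tau=0.07):
    # text_emb, video_emb: (B, D); row i of each side is a matched pair
    t = F.normalize(text_emb, dim=1)
    v = F.normalize(video_emb, dim=1)
    logits = t @ v.T / tau                 # (B, B) similarity matrix
    labels = torch.arange(len(t))          # diagonal entries are positives
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))

loss = infonce(torch.randn(8, 256), torch.randn(8, 256))
print(loss)  # at test time, retrieval ranks videos by the t @ v.T scores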
Flexible Keyword Spotting based on Homogeneous Audio-Text Embedding
Spotting user-defined/flexible keywords represented in text frequently uses
an expensive text encoder for joint analysis with an audio encoder in an
embedding space, which can suffer from heterogeneous modality representation
(i.e., large mismatch) and increased complexity. In this work, we propose a
novel architecture to efficiently detect arbitrary keywords based on an
audio-compliant text encoder whose representation is inherently homogeneous
with the audio embedding and which is also much smaller than a comparable text
encoder. Our text encoder converts the text to phonemes using a
grapheme-to-phoneme (G2P) model, and then to an embedding using representative
phoneme vectors, extracted from the paired audio encoder on rich speech
datasets. We further augment our method with confusable keyword generation to
develop an audio-text embedding verifier with strong discriminative power.
Experimental results show that our scheme outperforms the state-of-the-art
results on the LibriPhrase-hard dataset, increasing the Area Under the ROC
Curve (AUC) from 84.21% to 92.7% and reducing the Equal Error Rate (EER) from
23.36% to 14.4%.
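To make the idea concrete, here is a hedged sketch of an audio-compliant text
encoder: a G2P step yields phoneme ids that index a table of representative
phoneme vectors precomputed from the paired audio encoder (random placeholders
here). The small GRU on top is an assumption; the paper's exact layers may
differ.

import torch
import torch.nn as nn

N_PHONEMES, D = 45, 128
# In practice: per-phoneme centroids of audio-encoder frames on speech data.
proto = nn.Embedding(N_PHONEMES, D)        # representative phoneme vectors

class TextEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(D, D, batch_first=True)

    def forward(self, phoneme_ids):        # (B, Tp) ids from a G2P model
        x = proto(phoneme_ids)             # (B, Tp, D), already audio-like
        _, h = self.rnn(x)
        return h.squeeze(0)                # (B, D) text embedding

emb = TextEncoder()(torch.randint(0, N_PHONEMES, (2, 7)))
print(emb.shape)  # compared against the audio embedding, e.g. via cosine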
iPhonMatchNet: Zero-Shot User-Defined Keyword Spotting Using Implicit Acoustic Echo Cancellation
In response to the increasing interest in human-machine communication across
various domains, this paper introduces a novel approach called iPhonMatchNet,
which addresses the challenge of barge-in scenarios, wherein user speech
overlaps with device playback audio, thereby creating a self-referencing
problem. The proposed model leverages implicit acoustic echo cancellation
(iAEC) techniques to increase the efficiency of user-defined keyword spotting
models, achieving a remarkable 95% reduction in mean absolute error with a
minimal increase in model size (0.13%) compared to the baseline model,
PhonMatchNet. We also present an efficient model structure and demonstrate its
capability to learn iAEC functionality without requiring a clean signal. The
findings of our study indicate that the proposed model achieves competitive
performance in real-world deployment conditions of smart devices.
Comment: Submitted to ICASSP 202
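A very rough sketch of the implicit-AEC idea follows: instead of an explicit
echo canceller, the front end consumes microphone features together with the
device-playback reference and learns suppression end-to-end from the keyword
objective alone. The layer choices are assumptions, not the iPhonMatchNet
architecture.

import torch
import torch.nn as nn

B, T, D = 2, 100, 40                 # batch, frames, feature dim
mic = torch.randn(B, T, D)           # microphone features (speech + echo)
ref = torch.randn(B, T, D)           # device-playback reference features

fuse = nn.GRU(2 * D, D, batch_first=True)          # joint mic+ref stream
enhanced, _ = fuse(torch.cat([mic, ref], dim=-1))  # echo-aware representation
print(enhanced.shape)  # (2, 100, 40), fed to the keyword-spotting head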
Fully Unsupervised Training of Few-shot Keyword Spotting
For training a few-shot keyword spotting (FS-KWS) model, a large labeled
dataset containing a massive number of target keywords is known to be
essential for generalizing to arbitrary target keywords from only a few
enrollment samples. To alleviate the expense of such data collection and
labeling, in this paper we propose a novel FS-KWS system trained only on
synthetic data. The proposed system is based on metric learning, enabling
target keywords to be detected using distance metrics. Exploiting a speech
synthesis model that generates speech from pseudo-phonemes instead of text,
we easily obtain a large collection of multi-view samples sharing the same
semantics. These samples are sufficient for training, since metric learning
does not intrinsically necessitate labeled data. None of the components in
our framework requires supervision, making our method fully unsupervised.
Experimental results on real datasets show that our proposed method is
competitive even without any labeled or real training data.
Comment: Accepted by IEEE SLT 202
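The distance-based detection that such metric-learning FS-KWS systems rely on
can be sketched in a few lines: average the embeddings of a few enrollment
examples into a prototype and flag a test clip whose cosine distance falls
below a threshold. The embeddings and threshold below are placeholders for an
actual trained encoder's outputs.

import torch
import torch.nn.functional as F

def detect(test_emb, enroll_embs, threshold=0.4):
    # enroll_embs: (N, D) embeddings of N enrollment samples of one keyword
    proto = F.normalize(enroll_embs.mean(dim=0), dim=0)  # keyword prototype
    dist = 1 - F.cosine_similarity(test_emb, proto, dim=0)
    return dist.item() < threshold       # True -> keyword detected

enroll = F.normalize(torch.randn(5, 128), dim=1)  # 5 enrollment embeddings
test = F.normalize(torch.randn(128), dim=0)       # embedding of a test clip
print(detect(test, enroll))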