Sequence Discriminative Training for Deep Learning based Acoustic Keyword Spotting
Speech recognition is a sequence prediction problem. Besides employing various deep learning approaches for frame-level classification, sequence-level discriminative training has proved indispensable for achieving state-of-the-art performance in large vocabulary continuous speech recognition (LVCSR). However, keyword spotting (KWS), one of the most common speech recognition tasks, benefits almost exclusively from frame-level deep learning because competing sequence hypotheses are difficult to obtain. The few studies on sequence discriminative training for KWS are limited to fixed-vocabulary or LVCSR-based methods and have not been compared to state-of-the-art deep learning based KWS approaches. In this paper, a sequence discriminative training framework is proposed for both fixed-vocabulary and unrestricted acoustic KWS. Sequence discriminative training for both sequence-level generative and discriminative models is systematically investigated. By introducing word-independent phone lattices or non-keyword blank symbols to construct competing hypotheses, feasible and efficient sequence discriminative training approaches are proposed for acoustic KWS. Experiments show that the proposed approaches obtain consistent and significant improvements in both fixed-vocabulary and unrestricted KWS tasks, compared to previous frame-level
deep learning based acoustic KWS methods.
Comment: accepted by Speech Communication, 08/02/2018
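For context, the canonical form of sequence discriminative training alluded to here is the maximum mutual information (MMI) criterion. The following is a generic statement of it, not necessarily the paper's exact objective, with the denominator's competing hypotheses drawn from a word-independent phone lattice:

```latex
% Generic MMI sequence-discriminative criterion (illustrative, not the
% paper's exact objective). The numerator scores the reference transcript
% W_u for utterance u; the denominator sums over competing hypotheses,
% e.g. the paths of a word-independent phone lattice L_u.
\mathcal{F}_{\mathrm{MMI}}(\theta) = \sum_{u} \log
  \frac{p_\theta(X_u \mid W_u)\, P(W_u)}
       {\sum_{W \in \mathcal{L}_u} p_\theta(X_u \mid W)\, P(W)}
```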
Streaming Small-Footprint Keyword Spotting using Sequence-to-Sequence Models
We develop streaming keyword spotting systems using a recurrent neural
network transducer (RNN-T) model: an all-neural, end-to-end trained,
sequence-to-sequence model which jointly learns acoustic and language model
components. Our models are trained to predict either phonemes or graphemes as
subword units, thus allowing us to detect arbitrary keyword phrases, without
any out-of-vocabulary words. In order to adapt the models to the requirements
of keyword spotting, we propose a novel technique which biases the RNN-T system
towards a specific keyword of interest.
Our systems are compared against a strong sequence-trained, connectionist
temporal classification (CTC) based "keyword-filler" baseline, which is
augmented with a separate phoneme language model. Overall, our RNN-T system
with the proposed biasing technique significantly improves performance over the
baseline system.
Comment: To appear in Proceedings of IEEE ASRU 2017
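The abstract does not detail the biasing mechanism itself. As a hedged illustration of the general idea, here is a shallow-fusion-style keyword boost one could apply to beam-search hypotheses; the function name, boost value, and example strings are hypothetical, not the paper's method:

```python
def biased_score(base_logprob: float, hyp: str, keyword: str, boost: float = 2.0) -> float:
    """Shallow-fusion-style keyword biasing: a generic stand-in for the
    paper's technique. Hypotheses that currently end on a prefix of the
    keyword get a bonus proportional to the matched length, so beam search
    keeps keyword-consistent paths alive."""
    # Longest prefix of `keyword` that the hypothesis ends with (0 if none).
    matched = max(k for k in range(len(keyword) + 1) if hyp.endswith(keyword[:k]))
    return base_logprob + boost * matched

# Two hypotheses with the same base score: the keyword-consistent one wins.
print(biased_score(-5.0, "turn on the li", "light"))  # -1.0 (prefix "li" matched)
print(biased_score(-5.0, "turn on the lo", "light"))  # -5.0 (no prefix matched)
```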
Multi-Task Network for Noise-Robust Keyword Spotting and Speaker Verification using CTC-based Soft VAD and Global Query Attention
Keyword spotting (KWS) and speaker verification (SV) have been studied
independently although it is known that acoustic and speaker domains are
complementary. In this paper, we propose a multi-task network that performs KWS
and SV simultaneously to fully utilize the interrelated domain information. The
multi-task network tightly combines sub-networks aiming at performance
improvement in challenging conditions such as noisy environments,
open-vocabulary KWS, and short-duration SV, by introducing novel techniques of
connectionist temporal classification (CTC)-based soft voice activity detection
(VAD) and global query attention. Frame-level acoustic and speaker information is integrated with phonetically originated weights so as to form a word-level global representation, which is then used to aggregate feature vectors into discriminative embeddings. Our proposed approach shows 4.06% and
26.71% relative improvements in equal error rate (EER) compared to the
baselines for both tasks. We also present a visualization example and results
of ablation experiments.
Comment: Accepted to Interspeech 2020
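One plausible reading of the CTC-based soft VAD idea, sketched below under that assumption (not the authors' code): the posterior mass a CTC model assigns to non-blank labels serves as a per-frame speech weight for aggregating frame-level features.

```python
import numpy as np

def ctc_soft_vad_weights(log_posteriors: np.ndarray, blank_id: int = 0) -> np.ndarray:
    """Minimal sketch of CTC-based soft VAD: the posterior mass NOT assigned
    to the CTC blank symbol is treated as a per-frame speech weight."""
    posteriors = np.exp(log_posteriors)      # (T, num_labels), rows sum to 1
    return 1.0 - posteriors[:, blank_id]     # (T,) soft speech/non-speech weight

def weighted_pool(features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Aggregate frame-level features into one vector, down-weighting frames
    the soft VAD considers non-speech."""
    w = weights / (weights.sum() + 1e-8)
    return (features * w[:, None]).sum(axis=0)

# Toy example: 4 frames, 3 labels (label 0 = blank), 8-dim features.
logp = np.log(np.array([[0.9, 0.05, 0.05],
                        [0.1, 0.80, 0.10],
                        [0.2, 0.10, 0.70],
                        [0.8, 0.10, 0.10]]))
emb = weighted_pool(np.random.randn(4, 8), ctc_soft_vad_weights(logp))
print(emb.shape)  # (8,)
```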
Query-by-example Spoken Term Detection using Attention-based Multi-hop Networks
Retrieving spoken content with spoken queries, or query-by-example spoken term detection (STD), is attractive because it enables matching signals directly at the acoustic level, without transcribing them into text.
Here, we propose an end-to-end query-by-example STD model based on an
attention-based multi-hop network, whose input is a spoken query and an audio
segment containing several utterances; the output states whether the audio
segment includes the query. The model can be trained in either a supervised
scenario using labeled data, or in an unsupervised fashion. In the supervised
scenario, we find that the attention mechanism and multiple hops improve
performance, and that the attention weights indicate the time span of the
detected terms. In the unsupervised setting, the model mimics the behavior of
the existing query-by-example STD system, yielding performance comparable to
the existing system but with a lower search time complexity.
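A minimal sketch of the multi-hop attention mechanism, under my reading of the abstract rather than the authors' exact architecture: a query embedding attends over the frames of the audio segment, and the attended summary refines the query for the next hop; the final attention weights point at the detected term's time span.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_hop_attention(query: np.ndarray, memory: np.ndarray, hops: int = 3):
    """Query vector `query` (D,) attends over segment frames `memory` (T, D);
    each hop refines the query with the attended context. Returns the final
    context vector and the last hop's attention weights."""
    q = query
    for _ in range(hops):
        scores = memory @ q            # (T,) dot-product relevance per frame
        alpha = softmax(scores)        # attention distribution over frames
        context = alpha @ memory       # (D,) attended summary of the segment
        q = q + context                # refined query for the next hop
    return context, alpha

# Toy example: 50-frame audio segment, 16-dim embeddings.
ctx, att = multi_hop_attention(np.random.randn(16), np.random.randn(50, 16))
print(ctx.shape, att.argmax())  # (16,) plus the frame the model attends to most
```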
Learning acoustic word embeddings with phonetically associated triplet network
Previous research on acoustic word embeddings used in query-by-example spoken term detection has shown remarkable performance improvements when using a triplet network. However, the triplet network is trained using only limited information about the acoustic similarity between words. In this paper, we propose a novel architecture, the phonetically associated triplet network (PATN), which aims to increase the discriminative power of acoustic word embeddings by utilizing phonetic information as well as word identity. The proposed model is trained to minimize a combined loss function, formed by introducing a cross-entropy loss at a lower layer of the LSTM-based triplet network. We
observed that the proposed method performs significantly better than the
baseline triplet network on a word discrimination task with the WSJ dataset
resulting in over 20% relative improvement in recall rate at 1.0 false alarm
per hour. Finally, we examined the generalization ability by conducting the
out-of-domain test on the RM dataset.
Comment: 5 pages, 4 figures, submitted to ICASSP 2019
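A minimal sketch of the combined objective described above; the margin and weighting values are illustrative, not the paper's settings:

```python
import torch
import torch.nn.functional as F

def patn_loss(anchor, positive, negative, lower_logits, phone_targets,
              margin: float = 0.4, lam: float = 0.5):
    """Combined PATN-style objective (illustrative weighting): a triplet loss
    on word-level embeddings plus a cross-entropy loss attached to a lower
    LSTM layer that predicts frame-level phonetic targets."""
    triplet = F.triplet_margin_loss(anchor, positive, negative, margin=margin)
    # lower_logits: (T, num_phones) frame-level predictions from a lower layer.
    phone_ce = F.cross_entropy(lower_logits, phone_targets)
    return triplet + lam * phone_ce

# Toy example: batch of 8 embeddings (64-dim), 100 frames, 40 phone classes.
a, p, n = (torch.randn(8, 64) for _ in range(3))
loss = patn_loss(a, p, n, torch.randn(100, 40), torch.randint(0, 40, (100,)))
print(loss.item())
```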
Online Keyword Spotting with a Character-Level Recurrent Neural Network
In this paper, we propose a context-aware keyword spotting model employing a
character-level recurrent neural network (RNN) for spoken term detection in
continuous speech. The RNN is end-to-end trained with connectionist temporal
classification (CTC) to generate the probabilities of character and
word-boundary labels. There is no need for phonetic transcriptions, senone modeling, or a system dictionary in training and testing. Also, keywords can easily be added and modified by editing the text-based keyword list, without retraining the RNN. Moreover, the unidirectional RNN processes arbitrarily long input audio streams without pre-segmentation, and keywords are detected with low latency before the utterance is finished. Experimental results show
that the proposed keyword spotter significantly outperforms the deep neural
network (DNN) and hidden Markov model (HMM) based keyword-filler model even
with less computation.
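As a simplified illustration of the editable text-based keyword list (the actual system scores keywords from CTC probabilities and word-boundary labels; greedy decoding and substring matching are stand-ins here):

```python
import numpy as np

def ctc_greedy_decode(log_posteriors: np.ndarray, charset: str, blank_id: int = 0) -> str:
    """Standard CTC greedy decoding: collapse repeats, then remove blanks."""
    best = log_posteriors.argmax(axis=1)
    out, prev = [], blank_id
    for idx in best:
        if idx != blank_id and idx != prev:
            out.append(charset[idx - 1])   # label 0 is blank; characters start at 1
        prev = idx
    return "".join(out)

def spot_keywords(decoded: str, keyword_list: list[str]) -> list[str]:
    """Keywords are plain strings matched against the decoded character
    stream, so adding or editing one requires no retraining."""
    return [kw for kw in keyword_list if kw in decoded]

# Toy example over a 6-character alphabet (7 labels = blank + 6 characters);
# the real model also predicts word-boundary labels, omitted for brevity.
T, labels = 40, 7
logp = np.log(np.random.dirichlet(np.ones(labels), size=T))
print(spot_keywords(ctc_greedy_decode(logp, "abcdef"), ["cab", "fed"]))
```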
DONUT: CTC-based Query-by-Example Keyword Spotting
Keyword spotting, or wakeword detection, is an essential feature for
hands-free operation of modern voice-controlled devices. With such devices
becoming ubiquitous, users might want to choose a personalized custom wakeword.
In this work, we present DONUT, a CTC-based algorithm for online
query-by-example keyword spotting that enables custom wakeword detection. The
algorithm works by recording a small number of training examples from the user,
generating a set of label sequence hypotheses from these training examples, and
detecting the wakeword by aggregating the scores of all the hypotheses given a
new audio recording. Our method combines the generalization and
interpretability of CTC-based keyword spotting with the user-adaptation and
convenience of a conventional query-by-example system. DONUT has low
computational requirements and is well-suited for both learning and inference
on embedded systems without requiring private user data to be uploaded to the
cloud.
Comment: Accepted to NeurIPS 2018 Workshop on Interpretability and Robustness for Audio, Speech, and Language
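A minimal sketch of the scoring step described in the abstract: label-sequence hypotheses come from decoding the user's enrollment recordings, and at test time their CTC scores against new audio are aggregated into a single wakeword score. Log-sum-exp aggregation is one plausible choice, assumed here rather than taken from the paper.

```python
import numpy as np

def ctc_log_likelihood(logp: np.ndarray, labels: list[int], blank: int = 0) -> float:
    """Standard CTC forward algorithm: log p(labels | audio) from per-frame
    label log-posteriors of shape (T, num_labels)."""
    ext = [blank]
    for l in labels:
        ext += [l, blank]                       # interleave blanks: -a-b-c-
    T, S = logp.shape[0], len(ext)
    alpha = np.full(S, -np.inf)
    alpha[0], alpha[1] = logp[0, ext[0]], logp[0, ext[1]]
    for t in range(1, T):
        prev = alpha.copy()
        for s in range(S):
            cands = [prev[s]]
            if s >= 1:
                cands.append(prev[s - 1])
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(prev[s - 2])
            alpha[s] = np.logaddexp.reduce(cands) + logp[t, ext[s]]
    return np.logaddexp(alpha[-2], alpha[-1])

def wakeword_score(logp: np.ndarray, hypotheses: list[list[int]]) -> float:
    """Aggregate the CTC scores of all enrollment-derived hypotheses."""
    return float(np.logaddexp.reduce([ctc_log_likelihood(logp, h) for h in hypotheses]))

# Toy example: 30 frames, 5 labels (0 = blank), two label-sequence hypotheses.
logp = np.log(np.random.dirichlet(np.ones(5), size=30))
print(wakeword_score(logp, [[1, 3, 2], [1, 4, 2]]))
```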
Semantic query-by-example speech search using visual grounding
A number of recent studies have started to investigate how speech systems can
be trained on untranscribed speech by leveraging accompanying images at
training time. Examples of tasks include keyword prediction and within- and
across-mode retrieval. Here we consider how such models can be used for
query-by-example (QbE) search, the task of retrieving utterances relevant to a
given spoken query. We are particularly interested in semantic QbE, where the
task is not only to retrieve utterances containing exact instances of the
query, but also utterances whose meaning is relevant to the query. We follow a
segmental QbE approach where variable-duration speech segments (queries, search
utterances) are mapped to fixed-dimensional embedding vectors. We show that a
QbE system using an embedding function trained on visually grounded speech data
outperforms a purely acoustic QbE system in terms of both exact and semantic
retrieval performance.
Comment: Accepted to ICASSP 2019
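Once queries and search segments are mapped to fixed-dimensional vectors, retrieval reduces to nearest-neighbour ranking. A minimal sketch, assuming some embedding function has already produced the vectors (visually grounded or purely acoustic):

```python
import numpy as np

def rank_utterances(query_emb: np.ndarray, segment_embs: np.ndarray) -> np.ndarray:
    """Rank candidate segments by cosine similarity to the query embedding;
    returns indices of the best-matching segments first."""
    q = query_emb / np.linalg.norm(query_emb)
    S = segment_embs / np.linalg.norm(segment_embs, axis=1, keepdims=True)
    return np.argsort(-(S @ q))

# Toy example: one 128-dim query against 1000 candidate segment embeddings.
order = rank_utterances(np.random.randn(128), np.random.randn(1000, 128))
print(order[:5])
```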
Acoustic Word Embedding System for Code-Switching Query-by-example Spoken Term Detection
In this paper, we propose a deep convolutional neural network-based acoustic word embedding system for code-switching query-by-example spoken term detection. Unlike previous configurations, we combine audio data from two languages for training instead of using only a single language. We transform the acoustic features of keyword templates and of the search content into fixed-dimensional vectors, and calculate the distances between keyword segments and search-content segments obtained in a sliding manner. An auxiliary variability-invariant loss is also applied to training data from the same word but different speakers. This strategy prevents the extractor from encoding undesired speaker- or accent-related information into the acoustic word embeddings. Experimental results show that our proposed system produces promising search results in the code-switching test scenario. With an increased number of templates and the employment of the variability-invariant loss, search performance is further enhanced.
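A minimal sketch of the sliding-window search procedure described above; the window and hop sizes are illustrative, and the mean-pool-plus-projection `embed` function is a hypothetical stand-in for the paper's CNN extractor:

```python
import numpy as np

def sliding_min_distance(keyword_emb: np.ndarray,
                         frame_feats: np.ndarray,
                         embed,                    # frames -> fixed-dim vector
                         win: int, hop: int) -> float:
    """Embed windows of the search content in a sliding manner and keep the
    smallest cosine distance to the keyword-template embedding."""
    best = np.inf
    for start in range(0, max(1, len(frame_feats) - win + 1), hop):
        seg = embed(frame_feats[start:start + win])
        cos = seg @ keyword_emb / (np.linalg.norm(seg) * np.linalg.norm(keyword_emb))
        best = min(best, 1.0 - cos)
    return best                                    # small distance => likely hit

# Toy stand-in for the embedding extractor: mean-pool frames, then project.
proj = np.random.randn(40, 64)
embed = lambda f: f.mean(axis=0) @ proj
print(sliding_min_distance(np.random.randn(64), np.random.randn(300, 40),
                           embed, win=80, hop=20))
```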
Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition
We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feedforward deep neural networks (DNNs) as
acoustic models for speech recognition. More recently, we have shown that the
performance of sequence trained context dependent (CD) hidden Markov model
(HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained
phone models initialized with connectionist temporal classification (CTC). In
this paper, we present techniques that further improve performance of LSTM RNN
acoustic models for large vocabulary speech recognition. We show that frame
stacking and reduced frame rate lead to more accurate models and faster
decoding. CD phone modeling leads to further improvements. We also present
initial results for LSTM RNN models outputting words directly.
Comment: To be published in the INTERSPEECH 2015 proceedings
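A minimal sketch of frame stacking with a reduced frame rate; the stacking and subsampling factors here are illustrative, not necessarily the paper's configuration:

```python
import numpy as np

def stack_and_subsample(frames: np.ndarray, stack: int = 3, rate: int = 3) -> np.ndarray:
    """Concatenate each frame with its `stack - 1` successors into a wider
    frame, then keep every `rate`-th stacked frame, so the recurrent layers
    take fewer, wider steps per second of audio."""
    T, D = frames.shape
    n = T - stack + 1
    stacked = np.concatenate([frames[i:i + n] for i in range(stack)], axis=1)
    return stacked[::rate]                         # (ceil(n / rate), stack * D)

# Toy example: 100 frames of 40-dim filterbanks -> 33 frames of 120 dims,
# i.e. a 3x lower frame rate for the LSTM to process.
print(stack_and_subsample(np.random.randn(100, 40)).shape)
```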