Query-by-example Spoken Term Detection using Attention-based Multi-hop Networks
Retrieving spoken content with spoken queries, or query-by-example spoken
term detection (STD), is attractive because it makes it possible to match
signals directly at the acoustic level without transcribing them into text.
Here, we propose an end-to-end query-by-example STD model based on an
attention-based multi-hop network, whose input is a spoken query and an audio
segment containing several utterances; the output states whether the audio
segment includes the query. The model can be trained in either a supervised
scenario using labeled data, or in an unsupervised fashion. In the supervised
scenario, we find that the attention mechanism and multiple hops improve
performance, and that the attention weights indicate the time span of the
detected terms. In the unsupervised setting, the model mimics the behavior of
the existing query-by-example STD system, yielding performance comparable to
the existing system but with lower search-time complexity.
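For readers unfamiliar with this kind of model, the sketch below shows one way an attention hop over segment frames can score a spoken query against an audio segment in PyTorch. The GRU encoders, layer sizes, and residual update are illustrative assumptions; this is not the authors' exact multi-hop architecture.

```python
import torch
import torch.nn as nn

class AttentionHop(nn.Module):
    """One attention hop: use the current query vector to attend over
    audio-segment frame encodings and return a refined query vector."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, dim, bias=False)

    def forward(self, query, frames):
        # query: (batch, dim); frames: (batch, time, dim)
        logits = torch.einsum("bd,btd->bt", self.score(query), frames)
        weights = torch.softmax(logits, dim=-1)           # where the query matches
        context = torch.einsum("bt,btd->bd", weights, frames)
        return query + context, weights                   # residual update

class QbeDetector(nn.Module):
    """Toy query-by-example detector: encode query and segment with GRUs,
    apply several attention hops, then classify match / no match."""
    def __init__(self, feat_dim=40, dim=128, hops=3):
        super().__init__()
        self.query_enc = nn.GRU(feat_dim, dim, batch_first=True)
        self.seg_enc = nn.GRU(feat_dim, dim, batch_first=True)
        self.hops = nn.ModuleList(AttentionHop(dim) for _ in range(hops))
        self.clf = nn.Linear(dim, 1)

    def forward(self, query_feats, seg_feats):
        _, q = self.query_enc(query_feats)        # final state as query vector
        frames, _ = self.seg_enc(seg_feats)       # per-frame segment encodings
        q = q.squeeze(0)
        for hop in self.hops:
            q, weights = hop(q, frames)
        return self.clf(q).squeeze(-1), weights   # detection logit + last-hop attention

# Example with random features (batch of 2, 40-dim frames).
model = QbeDetector()
logit, attn = model(torch.randn(2, 50, 40), torch.randn(2, 400, 40))
```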
DONUT: CTC-based Query-by-Example Keyword Spotting
Keyword spotting--or wakeword detection--is an essential feature for
hands-free operation of modern voice-controlled devices. With such devices
becoming ubiquitous, users might want to choose a personalized custom wakeword.
In this work, we present DONUT, a CTC-based algorithm for online
query-by-example keyword spotting that enables custom wakeword detection. The
algorithm works by recording a small number of training examples from the user,
generating a set of label sequence hypotheses from these training examples, and
detecting the wakeword by aggregating the scores of all the hypotheses given a
new audio recording. Our method combines the generalization and
interpretability of CTC-based keyword spotting with the user-adaptation and
convenience of a conventional query-by-example system. DONUT has low
computational requirements and is well-suited for both learning and inference
on embedded systems without requiring private user data to be uploaded to the
cloud.
Comment: Accepted to NeurIPS 2018 Workshop on Interpretability and Robustness
for Audio, Speech, and Language
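As a rough illustration of the scoring step, the sketch below uses PyTorch's CTC loss to compute -log p(hypothesis | audio) for each hypothesized label sequence of the wakeword and aggregates the hypotheses with a weighted log-sum-exp. The hypothesis generation from enrollment recordings and DONUT's exact aggregation rule are not reproduced; the weights, vocabulary size, and blank index are assumptions.

```python
import torch
import torch.nn.functional as F

def ctc_hypothesis_score(log_probs, hypothesis):
    """-log p(hypothesis | audio) under a CTC model.
    log_probs: (time, vocab) frame-level log-probabilities (blank = index 0).
    hypothesis: list of label indices, e.g. one phone sequence for the wakeword."""
    t, _ = log_probs.shape
    targets = torch.tensor(hypothesis, dtype=torch.long)
    return F.ctc_loss(
        log_probs.unsqueeze(1),                  # (time, batch=1, vocab)
        targets.unsqueeze(0),                    # (batch=1, target_len)
        input_lengths=torch.tensor([t]),
        target_lengths=torch.tensor([len(hypothesis)]),
        blank=0, reduction="sum")

def wakeword_score(log_probs, hypotheses, weights):
    """Aggregate evidence over all label-sequence hypotheses for the wakeword.
    Here: a weighted log-sum-exp of per-hypothesis log-likelihoods (one
    plausible aggregation; the paper's exact rule may differ)."""
    scores = torch.stack([-ctc_hypothesis_score(log_probs, h) for h in hypotheses])
    return torch.logsumexp(scores + torch.log(torch.tensor(weights)), dim=0)

# Toy usage: 80 frames, vocab of 10 units, two hypothesized pronunciations.
log_probs = torch.randn(80, 10).log_softmax(dim=-1)
print(wakeword_score(log_probs, [[3, 5, 2], [3, 5, 4, 2]], [0.7, 0.3]))
```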
Streaming Small-Footprint Keyword Spotting using Sequence-to-Sequence Models
We develop streaming keyword spotting systems using a recurrent neural
network transducer (RNN-T) model: an all-neural, end-to-end trained,
sequence-to-sequence model which jointly learns acoustic and language model
components. Our models are trained to predict either phonemes or graphemes as
subword units, thus allowing us to detect arbitrary keyword phrases, without
any out-of-vocabulary words. In order to adapt the models to the requirements
of keyword spotting, we propose a novel technique which biases the RNN-T system
towards a specific keyword of interest.
Our systems are compared against a strong sequence-trained, connectionist
temporal classification (CTC) based "keyword-filler" baseline, which is
augmented with a separate phoneme language model. Overall, our RNN-T system
with the proposed biasing technique significantly improves performance over the
baseline system.
Comment: To appear in Proceedings of IEEE ASRU 201
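The abstract does not spell out the biasing technique, so the sketch below illustrates a generic alternative: a shallow-fusion-style boost that adds a log-probability bonus to the next expected subword of the keyword during decoding. The bonus value, vocabulary, and prefix-matching rule are assumptions, not the paper's method.

```python
import torch

def bias_towards_keyword(step_log_probs, keyword_units, emitted_units, bonus=2.0):
    """Shallow-fusion-style biasing: boost the log-probability of the next
    subword unit of the keyword, given what has already been emitted.
    step_log_probs: (vocab,) decoder log-probs at the current step.
    keyword_units: subword ids of the keyword phrase.
    emitted_units: subword ids emitted so far in the current hypothesis."""
    # Find how much of the keyword prefix matches the tail of the emitted sequence.
    matched = 0
    for k in range(min(len(keyword_units), len(emitted_units)), 0, -1):
        if emitted_units[-k:] == keyword_units[:k]:
            matched = k
            break
    biased = step_log_probs.clone()
    if matched < len(keyword_units):
        biased[keyword_units[matched]] += bonus   # encourage continuing the keyword
    return torch.log_softmax(biased, dim=-1)      # renormalise

# Toy usage: vocab of 6 units, keyword = [4, 2, 5], already emitted [4].
print(bias_towards_keyword(torch.randn(6).log_softmax(-1), [4, 2, 5], [4]))
```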
Semantic query-by-example speech search using visual grounding
A number of recent studies have started to investigate how speech systems can
be trained on untranscribed speech by leveraging accompanying images at
training time. Examples of tasks include keyword prediction and within- and
across-mode retrieval. Here we consider how such models can be used for
query-by-example (QbE) search, the task of retrieving utterances relevant to a
given spoken query. We are particularly interested in semantic QbE, where the
task is not only to retrieve utterances containing exact instances of the
query, but also utterances whose meaning is relevant to the query. We follow a
segmental QbE approach where variable-duration speech segments (queries, search
utterances) are mapped to fixed-dimensional embedding vectors. We show that a
QbE system using an embedding function trained on visually grounded speech data
outperforms a purely acoustic QbE system in terms of both exact and semantic
retrieval performance.
Comment: Accepted to ICASSP 201
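A minimal sketch of the segmental QbE comparison is given below: embed the query and each search utterance with some embedding function and rank by cosine similarity. The mean-pooling embed function is only a stand-in for the learned, visually grounded embedder, and the feature dimensions are arbitrary.

```python
import numpy as np

def embed(segment_feats):
    """Stand-in embedding function: mean-pool the frame features.
    In the paper this is a learned, visually grounded speech embedder."""
    return segment_feats.mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def qbe_rank(query_feats, utterances):
    """Rank search utterances by similarity between the query embedding and
    each utterance embedding (segmental QbE reduced to a single comparison)."""
    q = embed(query_feats)
    scores = [cosine(q, embed(u)) for u in utterances]
    return np.argsort(scores)[::-1], scores

# Toy usage: one 60-frame query and three utterances of 39-dim features.
rng = np.random.default_rng(0)
query = rng.normal(size=(60, 39))
utts = [rng.normal(size=(n, 39)) for n in (200, 150, 300)]
order, scores = qbe_rank(query, utts)
print(order, [round(s, 3) for s in scores])
```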
Learning acoustic word embeddings with phonetically associated triplet network
Previous research on acoustic word embeddings used in query-by-example
spoken term detection has shown remarkable performance improvements when using
a triplet network. However, the triplet network is trained using only limited
information about acoustic similarity between words. In this paper, we propose
a novel architecture, phonetically associated triplet network (PATN), which
aims at increasing discriminative power of acoustic word embeddings by
utilizing phonetic information as well as word identity. The proposed model is
trained to minimize a combined loss function formed by adding a cross-entropy
loss at the lower layer of the LSTM-based triplet network. We
observed that the proposed method performs significantly better than the
baseline triplet network on a word discrimination task with the WSJ dataset,
resulting in over 20% relative improvement in recall rate at 1.0 false alarm
per hour. Finally, we examined the generalization ability by conducting the
out-of-domain test on the RM dataset.
Comment: 5 pages, 4 figures, submitted to ICASSP 201
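The sketch below illustrates one plausible form of such a combined objective in PyTorch: a triplet margin loss on the top-layer word embeddings plus a frame-level cross-entropy loss on phone predictions from the lower LSTM layer. Layer sizes, the phone inventory, and the weighting factor alpha are assumptions.

```python
import torch
import torch.nn as nn

class PhoneticTripletNet(nn.Module):
    """Sketch of a triplet network with an auxiliary phonetic objective:
    a lower LSTM layer predicts frame-level phone labels (cross entropy),
    and the top layer yields the acoustic word embedding (triplet loss)."""
    def __init__(self, feat_dim=39, hidden=256, num_phones=40):
        super().__init__()
        self.lower = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.upper = nn.LSTM(hidden, hidden, batch_first=True)
        self.phone_clf = nn.Linear(hidden, num_phones)

    def forward(self, feats):
        low, _ = self.lower(feats)                  # (batch, time, hidden)
        _, (h, _) = self.upper(low)
        return self.phone_clf(low), h.squeeze(0)    # phone logits, word embedding

def combined_loss(model, anchor, positive, negative, phone_labels, alpha=0.5):
    """Triplet margin loss on the embeddings plus cross entropy on the anchor's
    lower-layer phone predictions; alpha weights the auxiliary term."""
    logits_a, emb_a = model(anchor)
    _, emb_p = model(positive)
    _, emb_n = model(negative)
    triplet = nn.functional.triplet_margin_loss(emb_a, emb_p, emb_n, margin=1.0)
    ce = nn.functional.cross_entropy(
        logits_a.reshape(-1, logits_a.size(-1)), phone_labels.reshape(-1))
    return triplet + alpha * ce

# Toy usage: batches of 4 words, 80 frames each, 39-dim features.
model = PhoneticTripletNet()
labels = torch.randint(0, 40, (4, 80))
print(combined_loss(model, torch.randn(4, 80, 39), torch.randn(4, 80, 39),
                    torch.randn(4, 80, 39), labels))
```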
Acoustic Word Embedding System for Code-Switching Query-by-example Spoken Term Detection
In this paper, we propose a deep convolutional neural network-based acoustic
word embedding system for code-switching query-by-example spoken term detection.
Different from previous configurations, we combine audio data in two languages
for training instead of only using one single language. We transform the
acoustic features of keyword templates and search content into
fixed-dimensional vectors and calculate the distances between keyword segments
and search-content segments obtained in a sliding manner. An auxiliary
variability-invariant loss is also applied to training data within the same
word but different speakers. This strategy is used to prevent the extractor
from encoding undesired speaker- or accent-related information into the
acoustic word embeddings. Experimental results show that our proposed system
produces promising search results in the code-switching test scenario. With an
increased number of templates and the variability-invariant loss, search
performance is further enhanced.
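One plausible form of the variability-invariant term is sketched below: the mean squared distance between embeddings of the same word drawn from different examples (e.g. different speakers) in a batch. The exact loss used in the paper may differ, and the batch construction here is illustrative.

```python
import torch

def variability_invariant_loss(embeddings, word_ids):
    """Pull together embeddings of the same word from different examples:
    mean squared distance over all same-word pairs in the batch (one plausible
    form of the auxiliary loss described in the abstract).
    embeddings: (batch, dim); word_ids: (batch,) integer word identities."""
    same_word = word_ids.unsqueeze(0) == word_ids.unsqueeze(1)          # (B, B)
    same_word = same_word & ~torch.eye(len(word_ids), dtype=torch.bool)
    if not same_word.any():
        return embeddings.new_zeros(())
    dists = torch.cdist(embeddings, embeddings) ** 2                    # squared L2
    return dists[same_word].mean()

# Toy usage: 6 embeddings, word identities [0, 0, 1, 1, 1, 2] from different speakers.
emb = torch.randn(6, 128)
print(variability_invariant_loss(emb, torch.tensor([0, 0, 1, 1, 1, 2])))
```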
A Survey on Dialogue Systems: Recent Advances and New Frontiers
Dialogue systems have attracted increasing attention. Recent advances in
dialogue systems are overwhelmingly contributed by deep learning techniques,
which have been employed to enhance a wide range of big data applications such
as computer vision, natural language processing, and recommender systems. For
dialogue systems, deep learning can leverage a massive amount of data to learn
meaningful feature representations and response generation strategies, while
requiring a minimum amount of hand-crafting. In this article, we give an
overview of these recent advances in dialogue systems from various perspectives
and discuss some possible research directions. In particular, we generally
divide existing dialogue systems into task-oriented and non-task-oriented
models, then detail how deep learning techniques help them with representative
algorithms and finally discuss some appealing research directions that can
bring dialogue system research to a new frontier.
Comment: 13 pages. arXiv admin note: text overlap with arXiv:1703.01008 by
other authors
Improved Audio Embeddings by Adjacency-Based Clustering with Applications in Spoken Term Detection
Embedding audio signal segments into vectors of fixed dimensionality is
attractive because all subsequent processing, for example modeling, classifying,
or indexing, becomes easier and more efficient. The previously proposed Audio
Word2Vec was shown to represent audio segments of spoken words as such vectors,
carrying information about the phonetic structure of the signal segments.
However, each linguistic unit (word, syllable, phoneme in text form) corresponds
to an unlimited number of audio segments, whose vector representations
inevitably spread over the embedding space, causing some confusion. It is
therefore desired to better cluster the audio embeddings such that those
corresponding to the same linguistic unit can be more compactly distributed. In
this paper, inspired by Siamese networks, we propose some approaches to achieve
the above goal. These include identifying positive and negative pairs from
unlabeled data for Siamese style training, disentangling acoustic factors such
as speaker characteristics from the audio embedding, handling unbalanced data
distribution, and having the embedding processes learn from the adjacency
relationships among data points. All these can be done in an unsupervised way.
Improved performance was obtained in preliminary experiments on the LibriSpeech
data set, including analysis of clustering characteristics and an application
to spoken term detection.
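As an illustration of how positive and negative pairs might be identified from unlabeled data, the sketch below treats each embedding's nearest neighbours as positives and its farthest points as negatives under cosine distance. This is a simple stand-in for the adjacency-based selection the abstract describes, with arbitrary choices of k.

```python
import numpy as np

def mine_pairs(embeddings, k_pos=2, k_neg=2):
    """Identify training pairs from unlabeled embeddings by adjacency:
    each point's k nearest neighbours become positive pairs, and its k
    farthest points become negative pairs."""
    # Pairwise cosine distances.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dist = 1.0 - normed @ normed.T
    np.fill_diagonal(dist, np.inf)          # never pair a point with itself

    positives, negatives = [], []
    for i, row in enumerate(dist):
        order = np.argsort(row)
        positives += [(i, int(j)) for j in order[:k_pos]]            # nearest
        negatives += [(i, int(j)) for j in order[-(k_neg + 1):-1]]   # farthest finite
    return positives, negatives

# Toy usage with 20 random 64-dim audio embeddings.
rng = np.random.default_rng(1)
pos, neg = mine_pairs(rng.normal(size=(20, 64)))
print(pos[:3], neg[:3])
```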
Fast ASR-free and almost zero-resource keyword spotting using DTW and CNNs for humanitarian monitoring
We use dynamic time warping (DTW) as supervision for training a convolutional
neural network (CNN) based keyword spotting system using a small set of spoken
isolated keywords. The aim is to allow rapid deployment of a keyword spotting
system in a new language to support urgent United Nations (UN) relief
programmes in parts of Africa where languages are extremely under-resourced and
the development of annotated speech resources is infeasible. First, we use 1920
recorded keywords (40 keyword types, 34 minutes of speech) as exemplars in a
DTW-based template matching system and apply it to untranscribed broadcast
speech. Then, we use the resulting DTW scores as targets to train a CNN on the
same unlabelled speech. In this way we use just 34 minutes of labelled speech,
but leverage a large amount of unlabelled data for training. While the
resulting CNN keyword spotter cannot match the performance of the DTW-based
system, it substantially outperforms a CNN classifier trained only on the
keywords, improving the area under the ROC curve from 0.54 to 0.64. Because our
CNN system is several orders of magnitude faster at runtime than the DTW
system, it represents the most viable keyword spotter on this extremely limited
dataset.
Comment: 5 pages, 4 figures, 3 tables, accepted at Interspeech 201
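For reference, a minimal NumPy implementation of a DTW cost that could serve as such a supervision signal is sketched below. The cosine frame distance and the normalisation by the sum of the two sequence lengths are assumptions, not necessarily the paper's exact configuration.

```python
import numpy as np

def dtw_cost(a, b):
    """Normalised DTW alignment cost between two feature sequences
    (frames x dims), using cosine frame distances; scores like these act
    as regression targets for the CNN in the setup described above."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    local = 1.0 - a @ b.T                          # (len_a, len_b) frame distances

    acc = np.full((len(a) + 1, len(b) + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            acc[i, j] = local[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[-1, -1] / (len(a) + len(b))         # crude path-length normalisation

# Toy usage: a keyword exemplar vs. a window of search speech (39-dim features).
rng = np.random.default_rng(2)
print(dtw_cost(rng.normal(size=(60, 39)), rng.normal(size=(75, 39))))
```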
An Iterative Deep Learning Framework for Unsupervised Discovery of Speech Features and Linguistic Units with Applications on Spoken Term Detection
In this work we aim to discover high-quality speech features and linguistic
units directly from unlabeled speech data in a zero resource scenario. The
results are evaluated using the metrics and corpora proposed in the Zero
Resource Speech Challenge organized at Interspeech 2015. A Multi-layered
Acoustic Tokenizer (MAT) was proposed for automatic discovery of multiple sets
of acoustic tokens from the given corpus. Each acoustic token set is specified
by a set of hyperparameters that describe the model configuration. These sets
of acoustic tokens carry different characteristics of the given corpus and the
language behind it, and thus can be mutually reinforcing. The multiple sets of token
labels are then used as the targets of a Multi-target Deep Neural Network
(MDNN) trained on low-level acoustic features. Bottleneck features extracted
from the MDNN are then used as the feedback input to the MAT and the MDNN
itself in the next iteration. We call this iterative deep learning framework
the Multi-layered Acoustic Tokenizing Deep Neural Network (MAT-DNN), which
generates both high-quality speech features for Track 1 of the Challenge
and acoustic tokens for Track 2. In addition, we performed
extra experiments on the same corpora on the application of query-by-example
spoken term detection. The experimental results showed that the iterative deep
learning framework of MAT-DNN improved detection performance due to better
underlying speech features and acoustic tokens.
Comment: arXiv admin note: text overlap with arXiv:1506.0232
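A compressed, single-token-set sketch of such an iterative loop is given below, with k-means standing in for the acoustic tokenizer and a small feed-forward network standing in for the multi-target DNN; all sizes, the number of iterations, and the training schedule are arbitrary assumptions.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def iterate_tokenizer_dnn(frames, n_tokens=50, bottleneck=32, iters=3, epochs=30):
    """Minimal single-set sketch of the iterative loop: cluster frames into
    acoustic tokens, train a small DNN to predict the tokens, take its
    bottleneck activations as new features, and repeat. The real framework
    uses multiple token sets and a multi-target DNN."""
    feats = frames
    for _ in range(iters):
        tokens = KMeans(n_clusters=n_tokens, n_init=10).fit_predict(feats)
        net = nn.Sequential(
            nn.Linear(feats.shape[1], 128), nn.ReLU(),
            nn.Linear(128, bottleneck), nn.ReLU(),     # bottleneck layer
            nn.Linear(bottleneck, n_tokens))
        opt = torch.optim.Adam(net.parameters(), lr=1e-3)
        x = torch.tensor(feats, dtype=torch.float32)
        y = torch.tensor(tokens, dtype=torch.long)
        for _ in range(epochs):                        # full-batch training
            opt.zero_grad()
            loss = nn.functional.cross_entropy(net(x), y)
            loss.backward()
            opt.step()
        with torch.no_grad():                          # extract bottleneck features
            feats = net[:4](x).numpy()                 # layers up to the bottleneck ReLU
    return feats, tokens

# Toy usage: 2000 frames of 39-dim features.
rng = np.random.default_rng(3)
bnf, tok = iterate_tokenizer_dnn(rng.normal(size=(2000, 39)).astype(np.float32))
print(bnf.shape, np.bincount(tok).max())
```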