Word Searching in Scene Image and Video Frame in Multi-Script Scenario using Dynamic Shape Coding
Retrieval of text information from natural scene images and video frames is a
challenging task due to inherent problems such as complex character shapes,
low resolution, and background noise. Available OCR systems often fail to
retrieve such information from scene/video frames. Keyword spotting, an
alternative way to retrieve information, performs efficient text searching in
such scenarios. However, current word spotting techniques for scene/video images
are script-specific and have mainly been developed for Latin script. This paper
presents a novel word spotting framework using dynamic shape coding for text
retrieval in natural scene images and video frames. The framework is designed to
search a query keyword across multiple scripts with the help of on-the-fly
script-wise keyword generation for the corresponding script. We use a
two-stage word spotting approach based on a Hidden Markov Model (HMM) to detect
the translated keyword in a given text line after identifying the script of the
line. A novel unsupervised dynamic shape coding scheme is used to group
characters of similar shape, avoiding confusion and improving text alignment.
Next, the hypothesis locations are verified to improve retrieval performance.
To evaluate the proposed system for keyword search in natural scene images
and video frames, we consider two popular Indic scripts, Bangla (Bengali)
and Devanagari, along with English. Inspired by the zone-wise
recognition approach in Indic scripts [1], zone-wise text information is
used to improve traditional word spotting performance in Indic scripts. For
our experiments, a dataset consisting of scene images and video frames in
English, Bangla, and Devanagari was used. The results obtained show the
effectiveness of the proposed word spotting approach.
Comment: Multimedia Tools and Applications, Springer
Annotation-free Learning of Deep Representations for Word Spotting using Synthetic Data and Self Labeling
Word spotting is a popular tool for supporting the first exploration of
historic, handwritten document collections. Today, the best performing methods
rely on machine learning techniques, which require a large amount of annotated
training material. As training data is usually not available in the application
scenario, annotation-free methods aim to solve the retrieval task without
representative training samples. In this work, we present an annotation-free
method that still employs machine learning techniques and therefore outperforms
other learning-free approaches. The weakly supervised training scheme relies on
a lexicon that does not need to precisely fit the dataset. In combination with
a confidence-based selection of pseudo-labeled training samples, we achieve
state-of-the-art query-by-example performance. Furthermore, our method allows
performing query-by-string, which is usually not possible for other
annotation-free methods.
Comment: Accepted to Workshop on Document Analysis Systems (DAS) 2020
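The confidence-based selection of pseudo-labeled training samples described in this abstract can be sketched as follows. This is a minimal illustration under assumed inputs: it presumes precomputed similarity scores between word-image embeddings and lexicon entries, which is a hypothetical representation, not the paper's actual feature pipeline.

```python
import numpy as np

def select_pseudo_labels(word_images, lexicon_scores, threshold=0.8):
    """Keep only pseudo-labeled samples whose best lexicon match is confident.

    lexicon_scores: (n_samples, n_lexicon_words) similarity scores in [0, 1]
    between each word image and each lexicon entry (hypothetical input format).
    Returns (word_image, lexicon_index) pairs for confident samples only.
    """
    best = lexicon_scores.argmax(axis=1)   # best-matching lexicon word per image
    conf = lexicon_scores.max(axis=1)      # its matching confidence
    keep = conf >= threshold               # confidence-based selection
    return [(word_images[i], int(best[i])) for i in np.flatnonzero(keep)]
```

Samples that survive this filter would then serve as training material in place of manual annotations, which is the core of the weakly supervised scheme.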
Fast ASR-free and almost zero-resource keyword spotting using DTW and CNNs for humanitarian monitoring
We use dynamic time warping (DTW) as supervision for training a convolutional
neural network (CNN) based keyword spotting system using a small set of spoken
isolated keywords. The aim is to allow rapid deployment of a keyword spotting
system in a new language to support urgent United Nations (UN) relief
programmes in parts of Africa where languages are extremely under-resourced and
the development of annotated speech resources is infeasible. First, we use 1920
recorded keywords (40 keyword types, 34 minutes of speech) as exemplars in a
DTW-based template matching system and apply it to untranscribed broadcast
speech. Then, we use the resulting DTW scores as targets to train a CNN on the
same unlabelled speech. In this way we use just 34 minutes of labelled speech,
but leverage a large amount of unlabelled data for training. While the
resulting CNN keyword spotter cannot match the performance of the DTW-based
system, it substantially outperforms a CNN classifier trained only on the
keywords, improving the area under the ROC curve from 0.54 to 0.64. Because our
CNN system is several orders of magnitude faster at runtime than the DTW
system, it represents the most viable keyword spotter on this extremely limited
dataset.
Comment: 5 pages, 4 figures, 3 tables, accepted at Interspeech 2018
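The DTW template matching that supplies the CNN's training targets can be sketched as a standard dynamic-programming alignment between two feature sequences. This is a simplified illustration: the paper's exact features, search windowing, and normalisation may differ.

```python
import numpy as np

def dtw_cost(query, search):
    """Normalised DTW alignment cost between two feature sequences
    (frames x dims). A low cost for a speech segment acts as a soft
    supervision target meaning "probably contains the keyword"."""
    n, m = len(query), len(search)
    # Pairwise Euclidean frame distances
    dist = np.linalg.norm(query[:, None, :] - search[None, :, :], axis=2)
    # Accumulated-cost matrix with the usual three-way recursion
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j],      # insertion
                acc[i, j - 1],      # deletion
                acc[i - 1, j - 1],  # match
            )
    return acc[n, m] / (n + m)  # length normalisation
```

Sliding this cost over untranscribed broadcast speech yields per-segment scores, which the abstract describes using as regression targets for the much faster CNN.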
JavaScript Convolutional Neural Networks for Keyword Spotting in the Browser: An Experimental Analysis
Used for simple commands recognition on devices from smart routers to mobile
phones, keyword spotting systems are everywhere. Ubiquitous as well are web
applications, which have grown in popularity and complexity over the last
decade with significant improvements in usability under cross-platform
conditions. However, despite their obvious advantage in natural language
interaction, voice-enabled web applications are still few and far between. In
this work, we attempt to bridge this gap by bringing keyword spotting
capabilities directly into the browser. To our knowledge, we are the first to
demonstrate a fully-functional implementation of convolutional neural networks
in pure JavaScript that runs in any standards-compliant browser. We also apply
network slimming, a model compression technique, to explore the
accuracy-efficiency tradeoffs, reporting latency measurements on a range of
devices and software. Overall, our robust, cross-device implementation for
keyword spotting realizes a new paradigm for serving neural network
applications, and one of our slim models reduces latency by 66% with a minimal
decrease in accuracy of 4% (from 94% to 90%).
Comment: 5 pages, 3 figures
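Network slimming, the compression technique referenced above, ranks convolutional channels by the magnitude of their batch-norm scaling factors and discards the weakest ones. A minimal sketch of the channel-selection step follows; function names are hypothetical, and the full method also fine-tunes the slimmed network, which is not shown.

```python
import numpy as np

def slim_channels(bn_gammas, keep_ratio=0.5):
    """Network-slimming channel selection: per layer, keep the channels
    whose batch-norm scaling factors (gamma) have the largest magnitude.

    bn_gammas: list of 1-D arrays, one per conv layer.
    Returns the sorted indices of retained channels for each layer.
    """
    keep = []
    for gammas in bn_gammas:
        k = max(1, int(round(len(gammas) * keep_ratio)))
        order = np.argsort(-np.abs(gammas))  # largest |gamma| first
        keep.append(np.sort(order[:k]))      # preserve original channel order
    return keep
```

Varying `keep_ratio` is one way to trace the accuracy-efficiency tradeoff the abstract reports, with smaller ratios yielding lower-latency models.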
Sequence Discriminative Training for Deep Learning based Acoustic Keyword Spotting
Speech recognition is a sequence prediction problem. Besides the use of
various deep learning approaches for frame-level classification, sequence-level
discriminative training has proven indispensable for achieving
state-of-the-art performance in large vocabulary continuous speech recognition
(LVCSR). However, keyword spotting (KWS), one of the most common speech
recognition tasks, has benefited almost exclusively from frame-level deep
learning due to the difficulty of obtaining competing sequence hypotheses. The
few studies on sequence discriminative training for KWS are limited to
fixed-vocabulary or LVCSR-based methods and have not been compared with
state-of-the-art deep learning based KWS approaches. In this paper, a sequence
discriminative training framework is proposed for both fixed-vocabulary and
unrestricted acoustic KWS. Sequence discriminative training for both
sequence-level generative and discriminative models is systematically
investigated. By introducing word-independent phone lattices or non-keyword
blank symbols to construct competing hypotheses, feasible and efficient
sequence discriminative training approaches are proposed for acoustic KWS.
Experiments show that the proposed approaches obtain consistent and significant
improvements in both fixed-vocabulary and unrestricted KWS tasks, compared to
previous frame-level deep learning based acoustic KWS methods.
Comment: accepted by Speech Communication, 08/02/201
DONUT: CTC-based Query-by-Example Keyword Spotting
Keyword spotting--or wakeword detection--is an essential feature for
hands-free operation of modern voice-controlled devices. With such devices
becoming ubiquitous, users might want to choose a personalized custom wakeword.
In this work, we present DONUT, a CTC-based algorithm for online
query-by-example keyword spotting that enables custom wakeword detection. The
algorithm works by recording a small number of training examples from the user,
generating a set of label sequence hypotheses from these training examples, and
detecting the wakeword by aggregating the scores of all the hypotheses given a
new audio recording. Our method combines the generalization and
interpretability of CTC-based keyword spotting with the user-adaptation and
convenience of a conventional query-by-example system. DONUT has low
computational requirements and is well-suited for both learning and inference
on embedded systems without requiring private user data to be uploaded to the
cloud.
Comment: Accepted to NeurIPS 2018 Workshop on Interpretability and Robustness for Audio, Speech, and Language
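The scoring step this abstract describes, aggregating CTC scores of several label-sequence hypotheses on a new recording, can be illustrated with the standard CTC forward algorithm. This is a simplified sketch: DONUT's actual hypothesis generation, normalization, and thresholding details are not reproduced here, and the log-sum-exp aggregation is one plausible reading of "aggregating the scores".

```python
import numpy as np

def ctc_log_prob(log_probs, labels, blank=0):
    """Log P(labels | acoustics) under CTC via the forward algorithm.
    log_probs: (T, V) per-frame log posteriors over the label vocabulary."""
    ext = [blank]
    for l in labels:
        ext += [l, blank]                 # interleave blanks: -a-b-...
    S, T = len(ext), len(log_probs)
    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [alpha[t - 1, s]]                 # stay
            if s > 0:
                cands.append(alpha[t - 1, s - 1])     # advance one state
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[t - 1, s - 2])     # skip a blank
            alpha[t, s] = np.logaddexp.reduce(cands) + log_probs[t, ext[s]]
    return np.logaddexp(alpha[-1, -1], alpha[-1, -2]) if S > 1 else alpha[-1, -1]

def keyword_score(log_probs, hypotheses):
    """Aggregate the CTC scores of all wakeword label-sequence hypotheses
    (log-sum-exp combination; a sketch of the aggregation idea)."""
    return np.logaddexp.reduce([ctc_log_prob(log_probs, h) for h in hypotheses])
```

A detection would fire when `keyword_score` on a sliding window exceeds a tuned threshold; the per-hypothesis scores also make the decision inspectable, matching the interpretability claim.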
Streaming Small-Footprint Keyword Spotting using Sequence-to-Sequence Models
We develop streaming keyword spotting systems using a recurrent neural
network transducer (RNN-T) model: an all-neural, end-to-end trained,
sequence-to-sequence model which jointly learns acoustic and language model
components. Our models are trained to predict either phonemes or graphemes as
subword units, thus allowing us to detect arbitrary keyword phrases, without
any out-of-vocabulary words. In order to adapt the models to the requirements
of keyword spotting, we propose a novel technique which biases the RNN-T system
towards a specific keyword of interest.
Our systems are compared against a strong sequence-trained, connectionist
temporal classification (CTC) based "keyword-filler" baseline, which is
augmented with a separate phoneme language model. Overall, our RNN-T system
with the proposed biasing technique significantly improves performance over the
baseline system.
Comment: To appear in Proceedings of IEEE ASRU 2017
Online Keyword Spotting with a Character-Level Recurrent Neural Network
In this paper, we propose a context-aware keyword spotting model employing a
character-level recurrent neural network (RNN) for spoken term detection in
continuous speech. The RNN is end-to-end trained with connectionist temporal
classification (CTC) to generate the probabilities of character and
word-boundary labels. There is no need for phonetic transcriptions, senone
modeling, or a system dictionary in training or testing. Also, keywords can
easily be added or modified by editing the text-based keyword list without
retraining the RNN. Moreover, the unidirectional RNN processes infinitely
long input audio streams without pre-segmentation, and keywords are detected
with low latency before the utterance is finished. Experimental results show
that the proposed keyword spotter significantly outperforms the deep neural
network (DNN) and hidden Markov model (HMM) based keyword-filler model even
with less computation.
Domain Aware Training for Far-field Small-footprint Keyword Spotting
In this paper, we focus on the task of small-footprint keyword spotting under
the far-field scenario. Far-field environments are commonly encountered in
real-life speech applications, causing severe degradation of performance due to
room reverberation and various kinds of noise. Our baseline system is built on
the convolutional neural network trained with pooled data of both far-field and
close-talking speech. To cope with the distortions, we develop three domain
aware training systems, including the domain embedding system, the deep CORAL
system, and the multi-task learning system. These methods incorporate domain
knowledge into network training and improve the performance of the keyword
classifier on far-field conditions. Experimental results show that our proposed
methods manage to maintain the performance on the close-talking speech and
achieve significant improvement on the far-field test set.
Comment: Submitted to INTERSPEECH 2020
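The deep CORAL system mentioned above penalizes the distance between the second-order statistics of close-talking (source) and far-field (target) features. A minimal sketch of the CORAL penalty follows, using the standard CORAL formulation; how it is weighted and wired into the keyword classifier's training loss is not shown.

```python
import numpy as np

def coral_loss(source_feats, target_feats):
    """CORAL penalty: squared Frobenius distance between the feature
    covariances of a source batch and a target batch, scaled by 1/(4 d^2).
    Inputs: (batch, d) feature matrices from the two domains."""
    def cov(x):
        xm = x - x.mean(axis=0, keepdims=True)
        return xm.T @ xm / (len(x) - 1)
    d = source_feats.shape[1]
    diff = cov(source_feats) - cov(target_feats)
    return np.sum(diff ** 2) / (4 * d * d)
```

Adding this term to the classification loss pushes the network toward domain-invariant features, which is one way to obtain the reported far-field gains without hurting close-talking performance.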
Zone-based Keyword Spotting in Bangla and Devanagari Documents
In this paper we present a word spotting system for text lines in offline
Indic scripts, namely Bangla (Bengali) and Devanagari. Recently, it was shown
that a zone-wise recognition method improves word recognition performance
over a conventional full-word recognition system in Indic scripts. Inspired by
this idea, we adopt the zone segmentation approach and use middle-zone
information to improve traditional word spotting performance. To avoid the
problems of heuristic zone segmentation, we propose an HMM-based approach to
segment the upper- and lower-zone components from text line images. Candidate
keywords are searched for in a line without segmenting characters or words. We
also propose a novel feature that combines foreground and background
information of text line images for keyword spotting with character filler
models. A significant improvement in performance is observed when using both
foreground and background information rather than either individually. The
Pyramid Histogram of Oriented Gradients (PHOG) feature has been used in our
word spotting framework. The experiments show that the proposed
zone-segmentation-based system outperforms traditional word spotting
approaches.
Comment: Preprint Submitted