Word Searching in Scene Image and Video Frame in Multi-Script Scenario using Dynamic Shape Coding
Retrieval of text information from natural scene images and video frames is a
challenging task due to its inherent problems like complex character shapes,
low resolution, background noise, etc. Available OCR systems often fail to
retrieve such information in scene/video frames. Keyword spotting, an
alternative way to retrieve information, performs efficient text searching in
such scenarios. However, current word spotting techniques in scene/video images
are script-specific and have mainly been developed for Latin script. This paper
presents a novel word spotting framework using dynamic shape coding for text
retrieval in natural scene images and video frames. The framework is designed to
search for a query keyword across multiple scripts with the help of on-the-fly
script-wise keyword generation for the corresponding script. We have used a
two-stage word spotting approach using a Hidden Markov Model (HMM) to detect the
translated keyword in a given text line by identifying the script of the line.
A novel unsupervised scheme based on dynamic shape coding has been used to group
similarly shaped characters to avoid confusion and to improve text alignment.
Next, the hypothesized locations are verified to improve retrieval performance.
To evaluate the proposed system for keyword searching in natural scene images
and video frames, we have considered two popular Indic scripts, Bangla
(Bengali) and Devanagari, along with English. Inspired by the zone-wise
recognition approach in Indic scripts[1], zone-wise text information has been
used to improve the traditional word spotting performance in Indic scripts. For
our experiment, a dataset consisting of images of different scenes and video
frames in English, Bangla and Devanagari scripts was considered. The results
obtained showed the effectiveness of our proposed word spotting approach.
Comment: Multimedia Tools and Applications, Springe
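The idea of grouping similarly shaped characters under a shared code can be illustrated with a minimal sketch. It is not the paper's method: the abstract describes an unsupervised, dynamic scheme used with an HMM, whereas the sketch below assumes a fixed, hand-written table of confusable Latin characters and plain substring matching. The group contents and the names `SHAPE_GROUPS`, `build_code_table`, `shape_encode` and `spot_keyword` are all hypothetical.

```python
# Hypothetical shape groups: characters that look alike share one code.
# These groups are illustrative assumptions, not groups learned by the paper.
SHAPE_GROUPS = [
    {"o", "O", "0"},
    {"l", "I", "1"},
    {"s", "S", "5"},
]


def build_code_table(groups):
    """Map every character in a group to one representative code."""
    table = {}
    for group in groups:
        code = min(group)  # deterministic representative of the group
        for ch in group:
            table[ch] = code
    return table


def shape_encode(text, table):
    """Replace each character by its shape code; others pass through."""
    return "".join(table.get(ch, ch) for ch in text)


def spot_keyword(keyword, line, table):
    """Return start indices where keyword matches line in shape-code space."""
    coded_keyword = shape_encode(keyword, table)
    coded_line = shape_encode(line, table)
    hits = []
    start = coded_line.find(coded_keyword)
    while start != -1:
        hits.append(start)
        start = coded_line.find(coded_keyword, start + 1)
    return hits


table = build_code_table(SHAPE_GROUPS)
print(spot_keyword("SOLO", "the S0L0 album", table))
```

Matching in code space lets a keyword be found even when a recognizer confuses characters within a group, at the cost of some extra false matches.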
Phonetic Searching
An improved method and apparatus are disclosed that use probabilistic techniques to map an input search string to a prestored audio file and to recognize certain portions of a search string phonetically. An improved interface is disclosed which permits users to input search strings as linguistics, phonetics, or a combination of both, and also allows logic functions to be specified by indicating how far separated specific phonemes are in time.
Georgia Tech Research Corporatio
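A query that constrains how far apart phonemes may be in time could look roughly like the sketch below. It is deterministic and greedy (it takes the earliest occurrence of each phoneme), whereas the abstract describes a probabilistic method, so it only illustrates the time-separation constraint. The stream format (phoneme label, time in seconds) and the function name `find_phoneme_sequence` are assumptions.

```python
def find_phoneme_sequence(stream, query, max_gap):
    """Return (start_time, end_time) pairs where the query phonemes occur
    in order, each at most max_gap seconds after the previous one.

    stream: list of (phoneme, time_in_seconds), sorted by time (assumed).
    Greedy: always takes the earliest next occurrence of each phoneme.
    """
    if not query:
        return []
    matches = []
    for i, (phoneme, start_time) in enumerate(stream):
        if phoneme != query[0]:
            continue
        prev_time, j, found = start_time, i + 1, True
        for target in query[1:]:
            # Skip ahead to the next occurrence of the target phoneme.
            while j < len(stream) and stream[j][0] != target:
                j += 1
            if j == len(stream) or stream[j][1] - prev_time > max_gap:
                found = False
                break
            prev_time = stream[j][1]
            j += 1
        if found:
            matches.append((start_time, prev_time))
    return matches


stream = [("k", 0.0), ("ae", 0.1), ("t", 0.2), ("s", 1.0), ("k", 2.0), ("t", 3.5)]
print(find_phoneme_sequence(stream, ["k", "t"], max_gap=0.5))
```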
Spoken command recognition for robotics
In this thesis, I investigate spoken command recognition technology for robotics. While high
robustness is expected, the distant and noisy conditions in which the system has to operate
make the task very challenging. Unlike commercial systems, which all rely on a "wake-up"
word to initiate the interaction, the pipeline proposed here directly detects and recognizes
commands from the continuous audio stream. In order to keep the task manageable despite
low-resource conditions, I propose to focus on a limited set of commands, thus trading off
flexibility of the system against robustness.
Domain and speaker adaptation strategies based on a multi-task regularization paradigm
are first explored. More precisely, two different methods are proposed, both relying on a tied
loss function that penalizes the distance between the outputs of several networks. The first
method considers each speaker or domain as a task. A canonical task-independent network is
jointly trained with task-dependent models, allowing both types of networks to improve by
learning from one another. While an improvement of 3.2% in the frame error rate (FER) of
the task-independent network is obtained, this only partially carried over to the phone error
rate (PER), which improved by 1.5%. Similarly, a second method explored the parallel
training of the canonical network with a privileged model having access to i-vectors. This
method proved less effective, with an improvement of only 1.2% in FER.
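A tied loss of the kind described above can be sketched as each network's own classification loss plus a penalty on the distance between their outputs. The abstract does not give the exact form, so the squared Euclidean distance, the cross-entropy terms, the single weight `tie_weight`, and all function names below are assumptions made for illustration.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target_index])


def squared_distance(p, q):
    """Squared Euclidean distance between two output vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def tied_loss(canonical_probs, task_probs, target_index, tie_weight):
    """Both networks' classification losses plus a penalty that grows
    as their outputs move apart (assumed form, see lead-in)."""
    return (
        cross_entropy(canonical_probs, target_index)
        + cross_entropy(task_probs, target_index)
        + tie_weight * squared_distance(canonical_probs, task_probs)
    )


canonical = [0.7, 0.2, 0.1]  # task-independent network output
speaker = [0.5, 0.4, 0.1]    # task-dependent network output
print(tied_loss(canonical, speaker, target_index=0, tie_weight=1.0))
```

With `tie_weight` set to zero the two networks train independently; larger values pull their outputs together.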
In order to make the developed technology more accessible, I also investigated the use
of a sequence-to-sequence (S2S) architecture for command classification. The use of an
attention-based encoder-decoder model reduced the classification error by 40% relative to a
strong convolutional neural network (CNN)-hidden Markov model (HMM) baseline, showing
the relevance of S2S architectures in such a context. In order to improve the flexibility of the
trained system, I also explored strategies for few-shot learning, which allow the set of
commands to be extended with minimal data requirements. Retraining a model on the
combination of original and new commands, I managed to achieve 40.5% accuracy on the
new commands with only 10 examples for each of them. This score goes up to 81.5%
accuracy with a larger set of 100 examples per new command. An alternative strategy, based
on model adaptation, achieved even better scores, with 68.8% and 88.4% accuracy with 10
and 100 examples respectively, while being faster to train. This high performance is obtained
at the expense of the original categories though, on which the accuracy deteriorated. These
results are very promising, as the methods make it easy to extend an existing S2S model with
minimal resources.
Finally, a full spoken command recognition system (named iCubrec) has been developed
for the iCub platform. The pipeline relies on a voice activity detection (VAD) system to
offer a fully hands-free experience. By segmenting only regions that are likely to contain
commands, the VAD module also greatly reduces the computational cost of the
pipeline. Command candidates are then passed to the deep neural network (DNN)-HMM
command recognition system for transcription. The VoCub dataset has been specifically
gathered to train a DNN-based acoustic model for our task. Through multi-condition training
with the CHiME4 dataset, an accuracy of 94.5% is reached on the VoCub test set. A filler model,
complemented by a rejection mechanism based on a confidence score, is finally added to the
system to reject non-command speech in a live demonstration of the system.
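The two gating steps of such a pipeline (segmenting likely speech, then rejecting non-command hypotheses) can be sketched as follows. This is not the iCubrec implementation: the energy-threshold VAD, the `"<filler>"` label, the thresholds, and the function names are all assumptions, and the DNN-HMM recognizer that would sit between the two steps is omitted.

```python
def vad_segments(frame_energies, threshold, min_frames=2):
    """Return (start, end) frame ranges, end exclusive, where energy stays
    above threshold for at least min_frames frames (assumed energy VAD)."""
    segments, start = [], None
    for i, energy in enumerate(frame_energies):
        if energy > threshold and start is None:
            start = i
        elif energy <= threshold and start is not None:
            if i - start >= min_frames:
                segments.append((start, i))
            start = None
    # Close a segment that runs to the end of the signal.
    if start is not None and len(frame_energies) - start >= min_frames:
        segments.append((start, len(frame_energies)))
    return segments


def accept_command(hypothesis, confidence, min_confidence, filler_label="<filler>"):
    """Return the command, or None for filler output or low confidence."""
    if hypothesis == filler_label or confidence < min_confidence:
        return None
    return hypothesis


energies = [0.1, 0.9, 0.8, 0.7, 0.1, 0.9, 0.1, 0.6, 0.7]
print(vad_segments(energies, threshold=0.5))
print(accept_command("go forward", 0.9, min_confidence=0.6))
```

Only the frame ranges returned by `vad_segments` would be sent to the recognizer, which is where the computational saving comes from.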