3,311 research outputs found
A hypothesize-and-verify framework for Text Recognition using Deep Recurrent Neural Networks
Deep LSTM is an ideal candidate for text recognition. However text
recognition involves some initial image processing steps like segmentation of
lines and words which can induce error to the recognition system. Without
segmentation, learning very long range context is difficult and becomes
computationally intractable. Therefore, alternative soft decisions are needed
at the pre-processing level. This paper proposes a hybrid text recognizer using
a deep recurrent neural network with multiple layers of abstraction and long
range context along with a language model to verify the performance of the deep
neural network. In this paper we construct a multi-hypotheses tree architecture
with candidate segments of line sequences from different segmentation
algorithms at its different branches. The deep neural network is trained on
perfectly segmented data and tests each of the candidate segments, generating
unicode sequences. In the verification step, these unicode sequences are
validated using a sub-string match with the language model and best first
search is used to find the best possible combination of alternative hypothesis
from the tree structure. Thus the verification framework using language models
eliminates wrong segmentation outputs and filters recognition errors
Temporal Localization of Fine-Grained Actions in Videos by Domain Transfer from Web Images
We address the problem of fine-grained action localization from temporally
untrimmed web videos. We assume that only weak video-level annotations are
available for training. The goal is to use these weak labels to identify
temporal segments corresponding to the actions, and learn models that
generalize to unconstrained web videos. We find that web images queried by
action names serve as well-localized highlights for many actions, but are
noisily labeled. To solve this problem, we propose a simple yet effective
method that takes weak video labels and noisy image labels as input, and
generates localized action frames as output. This is achieved by cross-domain
transfer between video frames and web images, using pre-trained deep
convolutional neural networks. We then use the localized action frames to train
action recognition models with long short-term memory networks. We collect a
fine-grained sports action data set FGA-240 of more than 130,000 YouTube
videos. It has 240 fine-grained actions under 85 sports activities. Convincing
results are shown on the FGA-240 data set, as well as the THUMOS 2014
localization data set with untrimmed training videos.Comment: Camera ready version for ACM Multimedia 201
Show and Tell: A Neural Image Caption Generator
Automatically describing the content of an image is a fundamental problem in
artificial intelligence that connects computer vision and natural language
processing. In this paper, we present a generative model based on a deep
recurrent architecture that combines recent advances in computer vision and
machine translation and that can be used to generate natural sentences
describing an image. The model is trained to maximize the likelihood of the
target description sentence given the training image. Experiments on several
datasets show the accuracy of the model and the fluency of the language it
learns solely from image descriptions. Our model is often quite accurate, which
we verify both qualitatively and quantitatively. For instance, while the
current state-of-the-art BLEU-1 score (the higher the better) on the Pascal
dataset is 25, our approach yields 59, to be compared to human performance
around 69. We also show BLEU-1 score improvements on Flickr30k, from 56 to 66,
and on SBU, from 19 to 28. Lastly, on the newly released COCO dataset, we
achieve a BLEU-4 of 27.7, which is the current state-of-the-art
- …