6,283 research outputs found
Visual Re-ranking with Natural Language Understanding for Text Spotting
Many scene text recognition approaches are based on purely visual information
and ignore the semantic relation between scene and text. In this paper, we
tackle this problem from natural language processing perspective to fill the
gap between language and vision. We propose a post-processing approach to
improve scene text recognition accuracy by using occurrence probabilities of
words (unigram language model), and the semantic correlation between scene and
text. For this, we initially rely on an off-the-shelf deep neural network,
already trained with a large amount of data, which provides a series of text
hypotheses per input image. These hypotheses are then re-ranked using word
frequencies and semantic relatedness with objects or scenes in the image. As a
result of this combination, the performance of the original network is boosted
with almost no additional cost. We validate our approach on ICDAR'17 dataset.Comment: Accepted by ACCV 2018. arXiv admin note: substantial text overlap
with arXiv:1810.0977
Visual re-ranking with natural language understanding for text spotting
The final publication is available at link.springer.comMany scene text recognition approaches are based on purely visual information and ignore the semantic relation between scene and text. In this paper, we tackle this problem from natural language processing perspective to fill the gap between language and vision. We propose a post processing approach to improve scene text recognition accuracy by using occurrence probabilities of words (unigram language model), and the semantic correlation between scene and text. For this, we initially rely on an off-the-shelf deep neural network, already trained with large amount of data, which provides a series of text hypotheses per input image. These hypotheses are then re-ranked using word frequencies and semantic relatedness with objects or scenes in the image. As a result of this combination, the performance of the original network is boosted with almost no additional cost. We validate our approach on ICDAR'17 dataset.Peer ReviewedPostprint (author's final draft
Sequence to Sequence -- Video to Text
Real-world videos often have complex dynamics; and methods for generating
open-domain video descriptions should be sensitive to temporal structure and
allow both input (sequence of frames) and output (sequence of words) of
variable length. To approach this problem, we propose a novel end-to-end
sequence-to-sequence model to generate captions for videos. For this we exploit
recurrent neural networks, specifically LSTMs, which have demonstrated
state-of-the-art performance in image caption generation. Our LSTM model is
trained on video-sentence pairs and learns to associate a sequence of video
frames to a sequence of words in order to generate a description of the event
in the video clip. Our model naturally is able to learn the temporal structure
of the sequence of frames as well as the sequence model of the generated
sentences, i.e. a language model. We evaluate several variants of our model
that exploit different visual features on a standard set of YouTube videos and
two movie description datasets (M-VAD and MPII-MD).Comment: ICCV 2015 camera-ready. Includes code, project page and LSMDC
challenge result
- …