1,521 research outputs found
Character-Based Handwritten Text Transcription with Attention Networks
The paper approaches the task of handwritten text recognition (HTR) with
attentional encoder-decoder networks trained on sequences of characters, rather
than words. We experiment on lines of text from popular handwriting datasets
and compare different activation functions for the attention mechanism used for
aligning image pixels and target characters. We find that softmax attention
focuses heavily on individual characters, while sigmoid attention focuses on
multiple characters at each step of the decoding. When the sequence alignment
is one-to-one, softmax attention is able to learn a more precise alignment at
each step of the decoding, whereas the alignment generated by sigmoid attention
is much less precise. When a linear function is used to obtain attention
weights, the model predicts a character by looking at the entire sequence of
characters and performs poorly because it lacks a precise alignment between the
source and target. Future research may explore HTR in natural scene images,
since the model is capable of transcribing handwritten text without the need
to produce segmentations or bounding boxes of the text in images.
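The contrast between the three activations is easy to prototype. Below is a minimal sketch (not the paper's code; names and shapes are illustrative) of how each activation could turn one decoding step's alignment scores into attention weights:

```python
import torch

def attention_weights(scores: torch.Tensor, activation: str) -> torch.Tensor:
    """scores: (batch, seq_len) raw alignment scores for one decoding step."""
    if activation == "softmax":
        # Normalized weights: mass concentrates on a few positions,
        # encouraging a sharp, near one-to-one character alignment.
        return torch.softmax(scores, dim=-1)
    if activation == "sigmoid":
        # Independent per-position gates: several positions can be "on"
        # at once, so attention spreads over multiple characters.
        return torch.sigmoid(scores)
    if activation == "linear":
        # Unbounded, unnormalized weights: the context mixes the whole
        # sequence and the model loses a precise source-target alignment.
        return scores
    raise ValueError(f"unknown activation: {activation}")

scores = torch.randn(2, 50)                      # e.g. 50 image-feature columns
weights = attention_weights(scores, "softmax")   # (2, 50) attention weights
```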
Recurrent Calibration Network for Irregular Text Recognition
Scene text recognition has received increased attention in the research
community. Text in the wild often possesses irregular arrangements, typically
including perspective text, curved text, and oriented text. Most existing
methods struggle with irregular text, especially with severely distorted
text. In this paper, we propose a novel Recurrent Calibration Network (RCN) for
irregular scene text recognition. The RCN progressively calibrates the
irregular text to boost the recognition performance. By decomposing the
calibration process into multiple steps, the irregular text can be calibrated
into a normal form step by step. In addition, to avoid accumulating the
information loss caused by inaccurate transformations, we further design a
fiducial-point refinement structure to keep the integrity of text during the
recurrent process. Instead of the calibrated images, the coordinates of
fiducial points are tracked and refined, which implicitly models the
transformation information. Based on the refined fiducial points, we estimate
the transformation parameters and sample from the original image at each step.
In this way, the original character information is preserved until the final
transformation. Such designs lead to optimal calibration results to boost the
performance of succeeding recognition. Extensive experiments on challenging
datasets demonstrate the superiority of our method, especially on irregular
benchmarks.
Comment: 10 pages, 4 figures.
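To make the recurrent calibration loop concrete, here is a minimal sketch rather than the authors' implementation: the tiny regressors and the affine sampler are stand-ins for the paper's localization networks and thin-plate-spline transformation. The structural point it illustrates is that fiducial coordinates are tracked and refined across steps while pixels are always resampled from the original image:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentCalibration(nn.Module):
    def __init__(self, num_points: int = 20, steps: int = 3):
        super().__init__()
        self.steps = steps
        # Pools image evidence used to refine the fiducial points.
        self.encode = nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(),
                                    nn.Linear(3 * 8 * 8, 64))
        # Predicts per-point coordinate offsets (illustrative architecture).
        self.refine = nn.Linear(num_points * 2 + 64, num_points * 2)
        # Maps fiducial points to 6 affine parameters (a TPS stand-in).
        self.to_theta = nn.Linear(num_points * 2, 6)

    def forward(self, image: torch.Tensor, points: torch.Tensor):
        b = image.size(0)
        calibrated = image
        for _ in range(self.steps):
            # Track and refine the coordinates, not the warped image itself.
            feat = self.encode(calibrated)
            points = points + self.refine(torch.cat([points, feat], dim=1))
            theta = self.to_theta(points).view(b, 2, 3)
            grid = F.affine_grid(theta, list(image.shape), align_corners=False)
            # Always resample from the ORIGINAL image, never the warped one,
            # so no pixel information is lost across steps.
            calibrated = F.grid_sample(image, grid, align_corners=False)
        return calibrated, points

img = torch.randn(2, 3, 32, 100)            # batch of text images
init = torch.zeros(2, 40)                   # 20 (x, y) fiducial points
out, pts = RecurrentCalibration()(img, init)
```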
Scan, Attend and Read: End-to-End Handwritten Paragraph Recognition with MDLSTM Attention
We present an attention-based model for end-to-end handwriting recognition.
Our system does not require any segmentation of the input paragraph. The model
is inspired by the differentiable attention models presented recently for
speech recognition, image captioning or translation. The main difference is the
covert and overt attention, implemented as a multi-dimensional LSTM network.
Our principal contribution to handwriting recognition is automatic
transcription without a prior segmentation into lines, which was crucial in
previous approaches. To the best of our knowledge, this is the first successful
attempt at end-to-end multi-line handwriting recognition. We carried out
experiments on the well-known IAM Database. The results are encouraging and
suggest that full-paragraph transcription may be achievable in the near future.
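A rough sketch of the attention idea follows, with a generic feature map standing in for the MDLSTM encoder (all names and sizes are assumptions): at each output step the decoder attends over the entire 2D feature map of the paragraph, which is what removes the need for prior line segmentation:

```python
import torch
import torch.nn as nn

class Attend2D(nn.Module):
    def __init__(self, channels: int = 64, state: int = 128):
        super().__init__()
        self.score = nn.Linear(channels + state, 1)

    def forward(self, fmap: torch.Tensor, h: torch.Tensor):
        # fmap: (B, C, H, W) paragraph features; h: (B, state) decoder state.
        b, c, hh, ww = fmap.shape
        flat = fmap.flatten(2).transpose(1, 2)          # (B, H*W, C)
        hexp = h.unsqueeze(1).expand(-1, hh * ww, -1)   # broadcast decoder state
        scores = self.score(torch.cat([flat, hexp], dim=-1)).squeeze(-1)
        alpha = torch.softmax(scores, dim=-1)           # weights over the page
        context = (alpha.unsqueeze(-1) * flat).sum(dim=1)
        return context, alpha

fmap = torch.randn(2, 64, 16, 32)       # encoded paragraph image
h = torch.randn(2, 128)                 # current decoder state
context, alpha = Attend2D()(fmap, h)    # context feeds the character decoder
```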
Hierarchical Photo-Scene Encoder for Album Storytelling
In this paper, we propose a novel model with a hierarchical photo-scene
encoder and a reconstructor for the task of album storytelling. The photo-scene
encoder contains two sub-encoders, namely the photo and scene encoders, which
are stacked together and behave hierarchically to fully exploit the structural
information of the photos within an album. Specifically, the photo encoder
generates semantic representation for each photo while exploiting temporal
relationships among them. The scene encoder, relying on the obtained photo
representations, is responsible for detecting the scene changes and generating
scene representations. Subsequently, the decoder dynamically and attentively
summarizes the encoded photo and scene representations to generate a sequence
of album representations, based on which a story consisting of multiple
coherent sentences is generated. In order to fully extract the useful semantic
information from an album, a reconstructor is employed to reproduce the
summarized album representations based on the hidden states of the decoder. The
proposed model can be trained in an end-to-end manner, which results in an
improved performance over state-of-the-art methods on the public visual
storytelling (VIST) dataset. Ablation studies further demonstrate the
effectiveness of the proposed hierarchical photo-scene encoder and
reconstructor.
Comment: 8 pages, 4 figures.
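The hierarchy and the reconstruction signal can be sketched as follows; this is an illustrative toy version with assumed sizes, not the authors' code, and it omits the scene-change detection itself:

```python
import torch
import torch.nn as nn

class PhotoSceneEncoder(nn.Module):
    def __init__(self, feat: int = 512, hid: int = 256):
        super().__init__()
        self.photo_rnn = nn.LSTM(feat, hid, batch_first=True)  # temporal photo links
        self.scene_rnn = nn.LSTM(hid, hid, batch_first=True)   # scene-level summary

    def forward(self, photos: torch.Tensor):
        # photos: (B, N, feat) CNN features for the N photos of an album.
        photo_states, _ = self.photo_rnn(photos)
        scene_states, _ = self.scene_rnn(photo_states)          # stacked encoder
        return photo_states, scene_states

class Reconstructor(nn.Module):
    def __init__(self, hid: int = 256):
        super().__init__()
        self.proj = nn.Linear(hid, hid)

    def forward(self, decoder_states: torch.Tensor, album_repr: torch.Tensor):
        # Reproduce the summarized album representation from the decoder's
        # hidden states; the gap becomes an auxiliary training loss.
        recon = self.proj(decoder_states.mean(dim=1))
        return torch.mean((recon - album_repr) ** 2)

photos = torch.randn(2, 5, 512)                  # a 5-photo album
p_states, s_states = PhotoSceneEncoder()(photos)
loss = Reconstructor()(torch.randn(2, 12, 256), s_states.mean(dim=1))
```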
A Compositional Textual Model for Recognition of Imperfect Word Images
Printed text recognition is an important problem for industrial OCR systems.
Printed text is constructed in a standard procedural fashion in most settings.
We develop a mathematical model for this process that can be applied to the
backward inference problem of text recognition from an image. Through ablation
experiments we show that this model is realistic and that a multi-task
objective setting can help to stabilize estimation of its free parameters,
enabling use of conventional deep learning methods. Furthermore, by directly
modeling the geometric perturbations of text synthesis we show that our model
can help recover missing characters from incomplete text regions, the bane of
multicomponent OCR systems, enabling recognition even when the detection stage
returns incomplete information.
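As a rough illustration of the multi-task setting described above (the particular tasks and weighting are assumptions, not the paper's recipe), an auxiliary geometric-perturbation loss would simply be added to the recognition objective so that it constrains the shared parameters:

```python
import torch

def multi_task_loss(recognition_loss: torch.Tensor,
                    geometry_loss: torch.Tensor,
                    aux_weight: float = 0.5) -> torch.Tensor:
    # The auxiliary geometric term regularizes parameters shared with the
    # recognition head, which is what stabilizes their estimation.
    return recognition_loss + aux_weight * geometry_loss

total = multi_task_loss(torch.tensor(1.2), torch.tensor(0.4))
```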
Class label autoencoder for zero-shot learning
Existing zero-shot learning (ZSL) methods usually learn a projection function
between a feature space and a semantic embedding space (a text or attribute
space), applied to the seen classes during training and the unseen classes
during testing. However, a single projection function cannot bridge the feature
space and multiple semantic embedding spaces, which differ in how they describe
the semantic information of the same class. To deal with this issue, we present
a novel ZSL method based on learning a class label autoencoder (CLA). CLA not
only builds a uniform framework that adapts to multiple semantic embedding
spaces, but also constructs an encoder-decoder mechanism that constrains the
bidirectional projection between the feature space and the class label space.
Moreover, CLA can jointly consider the relationships among feature classes and
the relevance of the semantic classes to improve zero-shot classification. The
CLA solution provides both unseen class labels and the relations between the
representations (feature or semantic information) of different classes, which
encode the intrinsic structure of the classes. Extensive experiments
demonstrate that CLA outperforms state-of-the-art methods on four benchmark
datasets: AwA, CUB, Dogs, and ImNet-2.
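A minimal sketch of the encoder-decoder idea, not the paper's exact formulation: the code space of the autoencoder is the class-label space, and a reconstruction term enforces the bidirectional projection between features and labels:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassLabelAutoencoder(nn.Module):
    def __init__(self, feat_dim: int = 2048, num_classes: int = 50):
        super().__init__()
        self.encoder = nn.Linear(feat_dim, num_classes)  # feature -> label space
        self.decoder = nn.Linear(num_classes, feat_dim)  # label space -> feature

    def forward(self, x: torch.Tensor, labels: torch.Tensor):
        logits = self.encoder(x)
        recon = self.decoder(torch.softmax(logits, dim=-1))
        cls_loss = F.cross_entropy(logits, labels)       # predict class labels
        recon_loss = F.mse_loss(recon, x)                # bidirectional constraint
        return cls_loss + recon_loss

x = torch.randn(4, 2048)              # image features
y = torch.randint(0, 50, (4,))        # seen-class labels
loss = ClassLabelAutoencoder()(x, y)
```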
Generating Images from Captions with Attention
Motivated by the recent progress in generative models, we introduce a model
that generates images from natural language descriptions. The proposed model
iteratively draws patches on a canvas, while attending to the relevant words in
the description. After training on Microsoft COCO, we compare our model with
several baseline generative models on image generation and retrieval tasks. We
demonstrate that our model produces higher quality samples than other
approaches and generates images with novel scene compositions corresponding to
previously unseen captions in the dataset.
Comment: Published as a conference paper at ICLR 2016.
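The iterative drawing loop can be caricatured as follows; the real model is a DRAW-style variational network, whereas this toy version (with assumed sizes) only shows the attend-then-paint canvas structure:

```python
import torch
import torch.nn as nn

class CanvasPainter(nn.Module):
    def __init__(self, word_dim: int = 128, hid: int = 256,
                 side: int = 32, steps: int = 8):
        super().__init__()
        self.hid, self.side, self.steps = hid, side, steps
        self.key = nn.Linear(word_dim, hid)         # word keys for attention
        self.rnn = nn.LSTMCell(word_dim, hid)
        self.patch = nn.Linear(hid, side * side)    # one painted patch per step

    def forward(self, words: torch.Tensor):
        # words: (B, T, word_dim) caption embeddings.
        b = words.size(0)
        h = words.new_zeros(b, self.hid)
        c = words.new_zeros(b, self.hid)
        canvas = words.new_zeros(b, self.side * self.side)
        keys = self.key(words)                      # (B, T, hid)
        for _ in range(self.steps):
            # Attend to the caption words relevant to this drawing step.
            alpha = torch.softmax(torch.einsum('bth,bh->bt', keys, h), dim=-1)
            ctx = torch.einsum('bt,btd->bd', alpha, words)
            h, c = self.rnn(ctx, (h, c))
            canvas = canvas + self.patch(h)         # accumulate onto the canvas
        return torch.sigmoid(canvas).view(b, self.side, self.side)

image = CanvasPainter()(torch.randn(2, 10, 128))    # (2, 32, 32) toy image
```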
End-to-End Multimodal Speech Recognition
Transcription or sub-titling of open-domain videos remains challenging for
Automatic Speech Recognition (ASR) due to difficult acoustics, variable signal
processing, and the essentially unrestricted domain of the content. In previous
work, we have shown that the visual channel --
specifically object and scene features -- can help to adapt the acoustic model
(AM) and language model (LM) of a recognizer, and we are now expanding this
work to end-to-end approaches. In the case of a Connectionist Temporal
Classification (CTC)-based approach, we retain the separation of AM and LM,
while for a sequence-to-sequence (S2S) approach, both information sources are
adapted together, in a single model. This paper also analyzes the behavior of
CTC and S2S models on noisy video data (How-To corpus), and compares it to
results on the clean Wall Street Journal (WSJ) corpus, providing insight into
the robustness of both approaches.
Comment: 5 pages, 5 figures. Accepted at the IEEE International Conference on Acoustics, Speech and Signal Processing 2018 (ICASSP 2018).
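One simple way the visual channel could adapt an end-to-end recognizer is sketched below; the fusion point, sizes, and names are assumptions rather than the paper's architecture. Pooled object/scene features for the clip bias the acoustic encoder's input before per-frame output logits are produced:

```python
import torch
import torch.nn as nn

class MultimodalEncoder(nn.Module):
    def __init__(self, audio_dim: int = 40, visual_dim: int = 512,
                 hid: int = 256, vocab: int = 32):
        super().__init__()
        self.adapt = nn.Linear(visual_dim, audio_dim)  # clip-level feature shift
        self.rnn = nn.LSTM(audio_dim, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)               # per-frame label logits

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio: (B, T, audio_dim) filterbank frames;
        # visual: (B, visual_dim) pooled object/scene features for the clip.
        shifted = audio + self.adapt(visual).unsqueeze(1)
        states, _ = self.rnn(shifted)
        return self.out(states)                        # would feed a CTC loss

logits = MultimodalEncoder()(torch.randn(2, 100, 40), torch.randn(2, 512))
```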
DAVANet: Stereo Deblurring with View Aggregation
Nowadays, stereo cameras are increasingly adopted in emerging devices such as
dual-lens smartphones and unmanned aerial vehicles. However, they also suffer
from blurry images in dynamic scenes, which cause visual discomfort and hamper
further image processing. Previous works have succeeded in monocular
deblurring, yet there are few studies on deblurring for stereoscopic images. By
exploiting the two-view nature of stereo images, we propose a novel stereo
image deblurring network with Depth Awareness and View Aggregation, named
DAVANet. In our proposed network, 3D scene cues from the depth and varying
information from two views are incorporated, which help to remove complex
spatially-varying blur in dynamic scenes. Specifically, with our proposed
fusion network, we integrate bidirectional disparity estimation and
deblurring into a unified framework. Moreover, we present a large-scale
multi-scene dataset for stereo deblurring, containing 20,637 blurry-sharp
stereo image pairs from 135 diverse sequences and their corresponding
bidirectional disparities. The experimental results on our dataset demonstrate
that DAVANet outperforms state-of-the-art methods in terms of accuracy, speed,
and model size.
Comment: CVPR 2019 (Oral).
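The view-aggregation step can be illustrated with a disparity-guided feature warp (a generic sketch, not DAVANet's architecture): features from one view are shifted horizontally by the estimated disparity and concatenated with the other view's features before deblurring:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp_by_disparity(feat: torch.Tensor, disp: torch.Tensor) -> torch.Tensor:
    # feat: (B, C, H, W) features from one view;
    # disp: (B, 1, H, W) horizontal disparity in pixels.
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    grid = torch.stack([xs, ys], dim=-1).float().to(feat)       # (H, W, 2)
    grid = grid.unsqueeze(0).expand(b, -1, -1, -1).clone()
    grid[..., 0] = grid[..., 0] + disp.squeeze(1)               # shift x by disparity
    # Normalize coordinates to [-1, 1] as grid_sample expects.
    grid[..., 0] = 2 * grid[..., 0] / (w - 1) - 1
    grid[..., 1] = 2 * grid[..., 1] / (h - 1) - 1
    return F.grid_sample(feat, grid, align_corners=True)

left = torch.randn(2, 16, 32, 32)
right = torch.randn(2, 16, 32, 32)
disp = torch.randn(2, 1, 32, 32)            # disparity estimate for the left view
fused = torch.cat([left, warp_by_disparity(right, disp)], dim=1)
# `fused` would feed the deblurring decoder for the left view.
```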