We integrate the recently proposed spatial transformer network (SPN)
[Jaderberg et al., 2015] into a recurrent neural network (RNN) to form an
RNN-SPN model. We use the RNN-SPN to classify digits in cluttered MNIST
sequences. The proposed model achieves a single-digit error of 1.5%, compared to
2.9% for a convolutional network and 2.0% for a convolutional network with SPN
layers. The SPN outputs a zoomed, rotated and skewed version of the input
image. We investigate different down-sampling factors (the ratio of pixels in the input
and output) for the SPN and show that the RNN-SPN model is able to down-sample
the input images without deteriorating performance. The down-sampling in
RNN-SPN can be thought of as adaptive down-sampling that minimizes the
information loss in the regions of interest. We attribute the superior
performance of the RNN-SPN to its ability to attend to a sequence of
regions of interest.
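To make the down-sampling factor concrete, the sketch below (an assumption for illustration, not the paper's code) generates an affine sampling grid in the normalized [-1, 1] coordinates of Jaderberg et al. [2015] and bilinearly resamples an input image at a lower output resolution; with an identity transform and a 3x down-sampling factor per axis, a 60x60 input becomes a 20x20 output (a 9:1 pixel ratio).

```python
import numpy as np

def affine_grid(theta, out_h, out_w):
    # theta: 2x3 affine matrix mapping normalized output coordinates
    # to normalized input coordinates, as in the spatial transformer.
    ys, xs = np.meshgrid(np.linspace(-1, 1, out_h),
                         np.linspace(-1, 1, out_w), indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(out_h * out_w)])
    return (theta @ coords).reshape(2, out_h, out_w)  # (x_src, y_src)

def bilinear_sample(img, grid):
    # Differentiable bilinear sampling of img at the grid locations.
    h, w = img.shape
    x = (grid[0] + 1) * (w - 1) / 2  # un-normalize to pixel coordinates
    y = (grid[1] + 1) * (h - 1) / 2
    x0 = np.clip(np.floor(x).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, h - 2)
    dx, dy = x - x0, y - y0
    return (img[y0, x0] * (1 - dx) * (1 - dy) +
            img[y0, x0 + 1] * dx * (1 - dy) +
            img[y0 + 1, x0] * (1 - dx) * dy +
            img[y0 + 1, x0 + 1] * dx * dy)

# Identity transform with a 3x per-axis down-sampling factor:
# a 60x60 input is resampled to a 20x20 output (9 input pixels per output pixel).
img = np.arange(60 * 60, dtype=float).reshape(60, 60)
theta = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
out = bilinear_sample(img, affine_grid(theta, 20, 20))
print(out.shape)  # (20, 20)
```

In the RNN-SPN, theta is predicted by the network at each step, so the same fixed output budget can be spent on different regions of interest over the sequence.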