Spatial transformer networks have been used as layers in conjunction with convolutional networks to enable a model to transform its input data spatially.
In this paper, we propose a model combining a spatial transformer network (STN) with a
Long Short-Term Memory network (LSTM) to classify digits in sequences formed from
MNIST elements. This LSTM-STN model gains a top-down attention mechanism from the
LSTM layer, so that the STN layer can spatially transform each element of the
sequence independently rather than the sequence as a whole, thus avoiding the
distortion that may be introduced when the entire sequence is spatially
transformed. This also removes the influence of such distortion on the
subsequent convolutional classification stage. The proposed model
achieves a single-digit error rate of 1.6\%, compared with 2.2\% for a
convolutional neural network with an STN layer.
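The following is a minimal sketch, in PyTorch, of the LSTM-STN idea summarized above: an LSTM hidden state drives a spatial transformer that extracts and normalizes one sequence element at a time before a small convolutional classifier labels it. The layer sizes, module names (`LSTMSTN`, `encoder`, `loc`, `classifier`), and number of steps are illustrative assumptions, not the authors' exact configuration.

\begin{verbatim}
# Hedged sketch of an LSTM-driven spatial transformer; sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMSTN(nn.Module):
    def __init__(self, hidden=256, n_classes=10, n_steps=3):
        super().__init__()
        self.n_steps = n_steps
        # Image encoder feeding the LSTM (assumed structure).
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 4 * 4, hidden),
        )
        self.lstm = nn.LSTMCell(hidden, hidden)
        # Localization head: LSTM state -> 2x3 affine parameters,
        # initialized to the identity transform.
        self.loc = nn.Linear(hidden, 6)
        self.loc.weight.data.zero_()
        self.loc.bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))
        # Per-digit classifier applied to each transformed glimpse.
        self.classifier = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.Linear(32 * 14 * 14, n_classes),
        )

    def forward(self, x):
        # x: (B, 1, H, W) image containing a sequence of digits.
        feat = self.encoder(x)
        h = torch.zeros_like(feat)
        c = torch.zeros_like(feat)
        logits = []
        for _ in range(self.n_steps):
            h, c = self.lstm(feat, (h, c))
            # One affine transform per step: the STN attends to one element.
            theta = self.loc(h).view(-1, 2, 3)
            grid = F.affine_grid(theta, (x.size(0), 1, 28, 28),
                                 align_corners=False)
            glimpse = F.grid_sample(x, grid, align_corners=False)
            logits.append(self.classifier(glimpse))
        return torch.stack(logits, dim=1)  # (B, n_steps, n_classes)
\end{verbatim}

Transforming each glimpse separately, rather than warping the whole sequence image at once, is what the abstract refers to as avoiding sequence-level distortion.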