We integrate the recently proposed spatial transformer network (SPN)
[Jaderberg et al., 2015] into a recurrent neural network (RNN) to form an
RNN-SPN model. We use the RNN-SPN to classify digits in cluttered MNIST
sequences. The proposed model achieves a single-digit error of 1.5%, compared to
2.9% for a convolutional network and 2.0% for a convolutional network with SPN
layers. The SPN outputs a zoomed, rotated and skewed version of the input
image. We investigate different down-sampling factors (the ratio of pixels in the input
and output) for the SPN and show that the RNN-SPN model is able to down-sample
the input images without deteriorating performance. The down-sampling in
RNN-SPN can be thought of as adaptive down-sampling that minimizes the
information loss in the regions of interest. We attribute the superior
performance of the RNN-SPN to its ability to attend to a sequence of
regions of interest.
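To make the down-sampling factor concrete, the sketch below (an assumption for illustration, not the paper's code) generates an affine sampling grid in the normalized [-1, 1] coordinates of Jaderberg et al. [2015] and bilinearly resamples an input image at a lower output resolution; with an identity transform and a 3x down-sampling factor per axis, a 60x60 input becomes a 20x20 output (a 9:1 pixel ratio).

```python
import numpy as np

def affine_grid(theta, out_h, out_w):
    # theta: 2x3 affine matrix mapping normalized output coordinates
    # to normalized input coordinates, as in the spatial transformer.
    ys, xs = np.meshgrid(np.linspace(-1, 1, out_h),
                         np.linspace(-1, 1, out_w), indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(out_h * out_w)])
    return (theta @ coords).reshape(2, out_h, out_w)  # (x_src, y_src)

def bilinear_sample(img, grid):
    # Differentiable bilinear sampling of img at the grid locations.
    h, w = img.shape
    x = (grid[0] + 1) * (w - 1) / 2  # un-normalize to pixel coordinates
    y = (grid[1] + 1) * (h - 1) / 2
    x0 = np.clip(np.floor(x).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, h - 2)
    dx, dy = x - x0, y - y0
    return (img[y0, x0] * (1 - dx) * (1 - dy) +
            img[y0, x0 + 1] * dx * (1 - dy) +
            img[y0 + 1, x0] * (1 - dx) * dy +
            img[y0 + 1, x0 + 1] * dx * dy)

# Identity transform with a 3x per-axis down-sampling factor:
# a 60x60 input is resampled to a 20x20 output (9 input pixels per output pixel).
img = np.arange(60 * 60, dtype=float).reshape(60, 60)
theta = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
out = bilinear_sample(img, affine_grid(theta, 20, 20))
print(out.shape)  # (20, 20)
```

In the RNN-SPN, theta is predicted by the network at each step, so the same fixed output budget can be spent on different regions of interest over the sequence.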