Recently, models based on deep neural networks have dominated the fields of
scene text detection and recognition. In this paper, we investigate the problem
of scene text spotting, which aims at simultaneous text detection and
recognition in natural images. An end-to-end trainable neural network model for
scene text spotting is proposed. The proposed model, named Mask TextSpotter,
is inspired by the newly published work Mask R-CNN. Unlike previous
methods that also accomplish text spotting with end-to-end trainable deep
neural networks, Mask TextSpotter takes advantage of a simple and smooth
end-to-end learning procedure, in which precise text detection and recognition
are achieved via semantic segmentation. Moreover, it is superior to previous
methods in handling text instances of irregular shapes, for example, curved
text. Experiments on ICDAR2013, ICDAR2015 and Total-Text demonstrate that the
proposed method achieves state-of-the-art results in both scene text detection
and end-to-end text recognition tasks.

Comment: To appear in ECCV 201