Document dewarping from a distorted camera-captured image is of great value
for OCR and document understanding. The document boundary, which is more
evident than the inner region, plays an important role in document dewarping.
Current learning-based methods mainly focus on complete-boundary cases, leading
to poor correction of documents with incomplete boundaries. In
contrast to these methods, this paper proposes MataDoc, the first method
focusing on arbitrary-boundary document dewarping with margin- and text-aware
regularizations. Specifically, we design the margin regularization by
explicitly considering background consistency to enhance boundary perception.
Moreover, we introduce word position consistency to keep text lines straight in
rectified document images. To evaluate MataDoc comprehensively, we
propose a novel benchmark, ArbDoc, consisting mainly of document images with
arbitrary boundaries in four typical scenarios. Extensive experiments confirm
the superiority of MataDoc in handling incomplete boundaries on
ArbDoc and also demonstrate the effectiveness of the proposed method on
the DocUNet, DIR300, and WarpDoc datasets.

Comment: 12 pages