Unsupervised Code-Switching for Multilingual Historical Document Transcription

Abstract

Transcribing documents from the printing press era, a challenge in its own right, is more complicated when documents interleave multiple languages, a common feature of 16th-century texts. Additionally, many of these documents precede consistent orthographic conventions, making the task even harder. We extend the state-of-the-art historical OCR model of Berg-Kirkpatrick et al. (2013) to handle word-level code-switching between multiple languages. Further, we enable our system to handle spelling variability, including now-obsolete shorthand systems used by printers. Our results show average relative character error reductions of 14% across a variety of historical texts.
