Unsupervised Code-Switching for Multilingual Historical Document Transcription

Dan Garrette; Dan Klein; Hannah Alpert-Abrams; Taylor Berg-Kirkpatrick

Authors: Dan Garrette
Dan Klein
Hannah Alpert-Abrams
Taylor Berg-Kirkpatrick
Publication date: 1 January 2015
Publisher: Association for Computational Linguistics (ACL)

Abstract

Transcribing documents from the printing press era, a challenge in its own right, is more complicated when documents interleave multiple languages, a common feature of 16th-century texts. Additionally, many of these documents precede consistent orthographic conventions, making the task even harder. We extend the state-of-the-art historical OCR model of Berg-Kirkpatrick et al. (2013) to handle word-level code-switching between multiple languages. Further, we enable our system to handle spelling variability, including now-obsolete shorthand systems used by printers. Our results show average relative character error reductions of 14% across a variety of historical texts.
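
The headline figure above is a relative reduction in character error rate (CER). The sketch below shows one common way such a number can be computed. It assumes CER is the Levenshtein edit distance between the system output and a reference transcription, divided by the reference length; the abstract does not spell out the exact evaluation protocol, so this definition, the function names, and the example rates are illustrative assumptions rather than the authors' method.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from ref
                            curr[j - 1] + 1,      # insert h into ref
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Assumed definition: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_error_reduction(baseline_cer: float, new_cer: float) -> float:
    """Fraction of the baseline error removed by the new system."""
    if baseline_cer <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline_cer - new_cer) / baseline_cer


# Hypothetical rates, chosen only to illustrate the arithmetic:
# a drop from 20.0% CER to 17.2% CER is a relative reduction of about 14%.
reduction = relative_error_reduction(0.200, 0.172)
```

Note that a relative reduction of 14% is not a drop of 14 percentage points: in the hypothetical example the absolute CER falls by only 2.8 points.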

Available Versions

CiteSeerX

oai:CiteSeerX.psu:10.1.1.698.6...

Last time updated on 29/10/2017

Crossref

info:doi/10.3115%2Fv1%2Fn15-11...

Last time updated on 01/04/2019