Posting Paper on the Web

Abstract

We present a document processing system that accepts scanned images of paper documents as input and outputs hyperlinked electronic documents. The system segments document images, separating text from graphics, recognizes text, and creates hypertext links between document components (text, images, graphics). By (1) limiting input to the popular Times-Roman and Helvetica fonts found in first-generation scans of columnated magazines and tabloids, (2) using gray-scale attributes, (3) using multiple character prototypes to recognize kerned and touching characters, (4) using a lexicon to find and correct recognition errors, and (5) providing user interaction to recognize problem words, we achieve OCR accuracies of up to 99.8%. This is close to our measured accuracy for human proofreading (99.94%), although human proofreading takes six times longer. A simple method for automated selection of important words in a document and creation of hypertext links from those words to other document components is develope..
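As an illustration of steps (4) and (5) only, the following is a minimal sketch of lexicon-based correction with a fallback to user review. It is not the system's actual method, which the abstract does not detail: the toy `LEXICON`, the function name `correct_word`, the similarity `cutoff`, and the use of Python's standard `difflib` matcher are all assumptions made for this example.

```python
import difflib

# Hypothetical toy lexicon; the lexicon used by the system is not given in the abstract.
LEXICON = ["document", "image", "hypertext", "recognition", "character", "graphics"]


def correct_word(word, lexicon=LEXICON, cutoff=0.8):
    """Return (word_or_correction, needs_review) for one OCR output word.

    A word found in the lexicon is accepted unchanged. Otherwise the closest
    lexicon entry at or above the similarity cutoff replaces it. If no entry
    is close enough, the word is returned unchanged and flagged for the user.
    """
    lower = word.lower()
    if lower in lexicon:
        return word, False
    matches = difflib.get_close_matches(lower, lexicon, n=1, cutoff=cutoff)
    if matches:
        return matches[0], False
    return word, True


if __name__ == "__main__":
    for ocr_word in ["docurnent", "image", "xyzzy"]:
        print(ocr_word, correct_word(ocr_word))
```

Here a typical OCR confusion such as "rn" for "m" is repaired by the lexicon lookup, while a word with no close lexicon entry is passed to the user rather than changed silently, mirroring the division of labor between steps (4) and (5).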
