
    Reading in the mist: high-quality optical character recognition based on freely available early modern digitized books

    In this paper, we present a workflow for reworking digitized versions of early modern books, freely available in the public domain, so that they yield high-quality optical character recognition (OCR) results suitable for computational text mining. Testing our method, we observed that anything above 90% OCR accuracy is sufficient for semantic analysis. In addition, overall homogeneity in OCR accuracy across the corpus proved more important than having a few works at higher accuracy and the rest at lower quality. In terms of the OCR process, this paper illustrates how it was possible to reduce the processing time at maximum quality of a single book of average length (ca. 500 pages) from a minimum of 20 hrs to an average of about 3 hrs (though theoretically nearly infinitely reducible). This was achieved by replacing a step-by-step OCR process with a fully automated pipeline system run on an arbitrary number of servers, breaking up the full process of OCRing one book into minimal tasks that can be handled simultaneously by multiple servers.
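The idea of splitting one book into minimal, independent tasks can be illustrated with a small sketch. This is not the authors' pipeline: it assumes that a single page is the minimal task, uses local worker threads as a stand-in for multiple servers, and `ocr_page` is a hypothetical placeholder for whatever OCR engine is actually called.

```python
from concurrent.futures import ThreadPoolExecutor


def ocr_page(page_image: str) -> str:
    """Hypothetical stand-in for a real OCR call on one page image."""
    return f"text of {page_image}"


def ocr_book(page_images, max_workers=8):
    """OCR every page as an independent task and return the text in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order, even if pages finish out of order.
        return list(pool.map(ocr_page, page_images))


# Assumed example: a book of 500 page images with made-up file names.
pages = [f"page_{i:04d}.png" for i in range(1, 501)]
texts = ocr_book(pages)
```

Because each page is processed independently, adding workers shortens the wall-clock time per book until per-task overhead dominates, which is consistent with the abstract's remark that the processing time is in theory almost arbitrarily reducible. A real deployment across servers would more likely use a distributed task queue than threads in one process.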

    Modelling Medieval Hands: Practical OCR for Caroline Minuscule

    This article presents the results of a series of experiments with open-source neural network OCR software on a total of 88 medieval manuscripts ranging from the ninth through thirteenth centuries.[5] Our experiments focused mainly on manuscripts written in Caroline minuscule, as well as a handful of test cases toward the end of our date range written in what may be called “Late Caroline” and “Early Gothic” scripts (termed “transitional” when taken together).[6] In the following, we discuss the possibilities and challenges of using OCR on medieval manuscripts, neural network technology and its use in OCR software, the process and results of our experiments, and how these results offer a baseline for future research. Our results show potential for contributing not only to text recognition as such but also to other areas of bibliography such as paleographical analysis. In all of this, we want to emphasize the use of open-source software and the sharing of data for decentralized, large-scale OCR with manuscripts in order to open up new collaborative avenues for innovation in the digital humanities and medieval studies.