Transcribing a 17th-century botanical manuscript: Longitudinal evaluation of document layout detection and interactive transcription
We present a process for cost-effective transcription of cursive handwritten text
images that has been tested on a 1,000-page 17th-century book about botanical
species. The process comprised two main tasks, namely: (1) preprocessing: page
layout analysis, text line detection, and extraction; and (2) transcription of the
extracted text line images. Both tasks were carried out with semiautomatic procedures,
aimed at incrementally minimizing user correction effort, by means of
computer-assisted line detection and interactive handwritten text recognition
technologies. The contribution derived from this work is three-fold. First, we
provide a detailed human-supervised transcription of a relatively large historical
handwritten book, ready to be searchable, indexable, and accessible to cultural
heritage scholars as well as the general public. Second, we have conducted the
first longitudinal study to date on interactive handwriting text recognition, for
which we provide a very comprehensive user assessment of the real-world performance
of the technologies involved in this work. Third, as a result of this
process, we have produced a detailed transcription and document layout information
(i.e. high-quality labeled data) ready to be used by researchers working on
automated technologies for document analysis and recognition.
This work is supported by the European Commission through the EU projects HIMANIS (JPICH program, Spanish, grant Ref. PCIN-2015-068) and READ (Horizon-2020 program, grant Ref. 674943); and the Universitat Politecnica de Valencia (grant number SP20130189). This work was also part of the Valorization and I+D+i Resources program of VLC/CAMPUS and has been funded by the Spanish MECD as part of the International Excellence Campus program.
Toselli, AH.; Leiva, LA.; Bordes-Cabrera, I.; Hernández-Tornero, C.; Bosch Campos, V.; Vidal, E. (2018). Transcribing a 17th-century botanical manuscript: Longitudinal evaluation of document layout detection and interactive transcription. Digital Scholarship in the Humanities. 33(1):173-202. https://doi.org/10.1093/llc/fqw064
A mixed approach for handwritten documents structural analysis
In this paper we propose a new method for document page segmentation. Designed primarily for handwritten documents, the method extracts the different text zones, paragraphs, and fragments in unconstrained documents. The proposed approach is a mixed one that combines the advantages of top-down and bottom-up approaches. We evaluate the method on a database of 183 documents taken from a 19th-century handwritten corpus: the "dossiers de Bouvard et Pécuchet" by Flaubert. This evaluation demonstrates that combining the top-down and bottom-up approaches improves the results.