On the modification of binarization algorithms to retain grayscale information for handwritten text recognition
The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-19390-8_24
Abstract:
The number of digitized legacy documents has been rising over recent years, mainly due to the increasing number of online digital libraries publishing this kind of document. The vast majority of them are still waiting to be transcribed, which would give historians and other researchers new ways of indexing, consulting and querying them. However, the accuracy of state-of-the-art Handwritten Text Recognition techniques decreases dramatically when they are applied to these historical documents, mainly because of typical paper degradation problems. Robust pre-processing is therefore an important step for helping the subsequent recognition steps. This paper proposes to take existing binarization techniques, in order to retain their advantages, and modify them in such a way that some of the original grayscale information is preserved and can be considered by the subsequent recognizer. Results are reported with the publicly available ESPOSALLES database.
The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement No. 600707 - tranScriptorium and the Spanish MEC under the STraDA project (TIN2012-37475-C02-01).
Villegas, M.; Romero Gómez, V.; Sánchez Peiró, JA. (2015). On the modification of binarization algorithms to retain grayscale information for handwritten text recognition. In Pattern Recognition and Image Analysis: 7th Iberian Conference, IbPRIA 2015, Santiago de Compostela, Spain, June 17-19, 2015, Proceedings. Springer International Publishing. 208-215. https://doi.org/10.1007/978-3-319-19390-8_24
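The general idea in the abstract above, keeping some grayscale information around a binarization threshold instead of forcing every pixel to pure black or white, can be illustrated with a minimal sketch. This is not the authors' algorithm. It assumes Sauvola's local threshold T = m(1 + k(s/R - 1)) as the underlying binarization, and the linear ramp of half-width `ramp` around T is purely an illustrative assumption. The names `local_stats`, `soft_binarize`, `half` and `ramp` are hypothetical.

```python
from math import sqrt

def local_stats(img, y, x, half):
    """Mean and standard deviation of the window centred on (y, x), clipped at borders."""
    rows = range(max(0, y - half), min(len(img), y + half + 1))
    cols = range(max(0, x - half), min(len(img[0]), x + half + 1))
    vals = [img[r][c] for r in rows for c in cols]
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    return mean, sqrt(var)

def sauvola_threshold(mean, std, k=0.5, dyn_range=128.0):
    """Sauvola's local threshold: T = m * (1 + k * (s / R - 1))."""
    return mean * (1.0 + k * (std / dyn_range - 1.0))

def soft_binarize(img, half=1, k=0.5, ramp=20.0):
    """Map each pixel to 0..255 with a linear ramp around its local threshold.

    Assumption for illustration only: pixels darker than T - ramp become 0,
    pixels brighter than T + ramp become 255, and pixels in between keep a
    proportional gray level instead of being forced to black or white.
    """
    out = []
    for y, row in enumerate(img):
        out_row = []
        for x, p in enumerate(row):
            mean, std = local_stats(img, y, x, half)
            t = sauvola_threshold(mean, std, k)
            frac = (p - (t - ramp)) / (2.0 * ramp)
            out_row.append(round(255 * min(1.0, max(0.0, frac))))
        out.append(out_row)
    return out

# A dark stroke pixel (50) on a bright background (200) is pushed to black,
# while a faint pixel (120) close to the local threshold keeps a gray level.
strong = soft_binarize([[200, 200, 200], [200, 50, 200], [200, 200, 200]])
faint = soft_binarize([[200, 200, 200], [200, 120, 200], [200, 200, 200]])
```

As `ramp` shrinks towards zero the output approaches an ordinary hard Sauvola binarization, so the ramp width controls how much of the original grayscale information reaches the recognizer in this sketch.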
Modern vs Diplomatic Transcripts for Historical Handwritten Text Recognition
Abstract:
The transcription of handwritten documents is useful to make their contents accessible to the general public. However, so far automatic transcription of historical documents has mostly focused on producing diplomatic transcripts, even though such transcripts are often understandable only by experts. The main difficulties come from the heavy use of extremely abridged and tangled abbreviations and of archaic or outdated word forms. Here we study different approaches to training optical models that can recognize historical document images containing archaic and abbreviated handwritten text and produce modernized transcripts with expanded abbreviations. Experiments comparing the performance of the proposed approaches are carried out on a document collection related to Spanish naval commerce during the XV–XIX centuries, which includes extremely difficult handwritten text image