
    Text-line and Word Detection in Document Images Using Optimization Methods

    Doctoral dissertation, Seoul National University Graduate School, Department of Electrical and Computer Engineering, August 2015. Advisor: Nam Ik Cho.

    Locating text-lines and segmenting words in a document image are important steps in many document image processing applications, such as optical character recognition, document rectification, layout analysis, and document image compression. There has therefore been a great deal of research in this area, and the segmentation of machine-printed documents scanned on flatbed scanners has matured to some extent. Handwritten documents, however, remain a challenging problem, since their features are irregular and vary with the writer and the language. To address this problem, this dissertation presents new segmentation algorithms that extract text-lines and words from a document image, based on a new super-pixel representation method and a new energy minimization framework derived from the document's characteristics. An overview of the proposed algorithms follows.

    First, the dissertation presents a text-line extraction algorithm for handwritten documents based on an energy minimization framework with a new super-pixel representation scheme. To handle documents in various languages, a language-independent text-line extraction algorithm is developed on top of a super-pixel representation built from normalized connected components (CCs). Owing to this normalization, the proposed method can estimate the states of super-pixels across a range of languages and writing styles. From the estimated states, an energy function is formulated whose minimization yields the text-lines. Experimental results show that the proposed method achieves state-of-the-art performance on various handwritten databases.

    Second, a preprocessing method for text-line detection in historical documents is presented. Unlike modern handwritten documents, historical documents suffer from various types of degradation.
To alleviate these problems, a preprocessing algorithm comprising robust binarization and noise removal is introduced in this dissertation. For the robust binarization of historical documents, global and local thresholding methods are combined to cope with degradations such as stains and faded characters. The energy minimization framework is also modified to fit the characteristics of historical documents. Experimental results on two historical databases show that the proposed preprocessing method, together with the text-line detection algorithm, achieves the best detection performance on severely degraded historical documents.

    Third, the dissertation presents a word segmentation algorithm based on a structured learning framework. Word segmentation is formulated as a labeling problem that assigns a label (intra-word or inter-word gap) to each gap between the characters in a given text-line. To address feature irregularities, especially in handwritten documents, the problem is cast as a binary quadratic assignment problem that considers pairwise correlations between the gaps as well as the likelihoods of individual gaps, computed from the proposed text-line extraction results. Although many parameters are involved in this formulation, all of them are estimated within a structured SVM framework, so the proposed method works well regardless of writing style and written language, without user-defined parameters.
Experimental results on the ICDAR 2009/2013 handwriting segmentation databases show that the proposed method achieves state-of-the-art performance on Latin-based and Indian languages.
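    The combined global-and-local binarization described in this abstract can be sketched in plain NumPy: a global Otsu threshold plus a Niblack-style local mean threshold, fused by requiring both tests to agree so that large dark stains (rejected locally) and paper texture (rejected globally) are suppressed. This is an illustrative sketch, not the dissertation's actual rule: the window size `win`, the factor `k`, and the AND-fusion are assumptions for demonstration.

    ```python
    import numpy as np

    def otsu_threshold(img):
        """Global Otsu threshold for a uint8 grayscale image."""
        hist = np.bincount(img.ravel(), minlength=256).astype(float)
        total = hist.sum()
        cum = np.cumsum(hist)                        # pixels with value <= t
        cum_mean = np.cumsum(hist * np.arange(256))
        best_t, best_var = 0, -1.0
        for t in range(255):
            w0, w1 = cum[t], total - cum[t]
            if w0 == 0 or w1 == 0:
                continue
            m0 = cum_mean[t] / w0
            m1 = (cum_mean[-1] - cum_mean[t]) / w1
            var = w0 * w1 * (m0 - m1) ** 2           # between-class variance
            if var > best_var:
                best_var, best_t = var, t
        return best_t

    def local_mean_threshold(img, win=15, k=0.9):
        """Niblack-style threshold map: k times the mean of each win x win window."""
        pad = win // 2
        padded = np.pad(img.astype(float), pad, mode='edge')
        ii = np.pad(padded.cumsum(0).cumsum(1), ((1, 0), (1, 0)))  # integral image
        h, w = img.shape
        sums = (ii[win:win + h, win:win + w] - ii[:h, win:win + w]
                - ii[win:win + h, :w] + ii[:h, :w])
        return k * sums / (win * win)

    def combined_binarize(img, win=15, k=0.9):
        """Mark ink only where the global and local tests agree: the local test
        suppresses broad stains, the global test rejects paper texture."""
        g = img <= otsu_threshold(img)               # globally dark
        loc = img < local_mean_threshold(img, win, k)  # locally darker than surroundings
        return g & loc
    ```

    On a synthetic page (bright paper with a dark stroke), the stroke pixels pass both tests while background pixels fail both; real historical pages would need the noise-removal stage on top of this.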
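    The gap-labeling formulation from the word segmentation part can likewise be illustrated with a toy solver: each inter-character gap gets a binary label (intra-word vs. inter-word), and a binary quadratic energy combining unary likelihood costs and pairwise coupling terms is minimized. Exhaustive search over labelings stands in for the dissertation's optimizer, and the cost matrices here are hypothetical, hand-set rather than learned by the structured SVM.

    ```python
    import itertools
    import numpy as np

    def segment_gaps(unary, pairwise):
        """Label each gap 0 (intra-word) or 1 (inter-word) by exhaustively
        minimizing a binary quadratic energy.
        unary:    (n, 2) array, cost of each label for each gap
        pairwise: (n, n) array, penalty added when gaps i and j disagree
        (a simple coupling chosen for illustration).
        """
        n = unary.shape[0]
        best, best_e = None, np.inf
        for labels in itertools.product((0, 1), repeat=n):
            e = sum(unary[i, labels[i]] for i in range(n))
            e += sum(pairwise[i, j]
                     for i in range(n) for j in range(i + 1, n)
                     if labels[i] != labels[j])
            if e < best_e:
                best, best_e = labels, e
        return list(best)
    ```

    With zero pairwise terms this reduces to per-gap classification; nonzero couplings let confidently labeled gaps pull ambiguous neighbors toward consistent word boundaries, which is the point of the quadratic formulation.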

    Transcribing a 17th-century botanical manuscript: Longitudinal evaluation of document layout detection and interactive transcription

    [EN] We present a process for cost-effective transcription of cursive handwritten text images, tested on a 1,000-page 17th-century book about botanical species. The process comprises two main tasks, namely: (1) preprocessing: page layout analysis, text-line detection, and extraction; and (2) transcription of the extracted text-line images. Both tasks were carried out with semiautomatic procedures, aimed at incrementally minimizing user correction effort, by means of computer-assisted line detection and interactive handwritten text recognition technologies. The contribution derived from this work is three-fold. First, we provide a detailed human-supervised transcription of a relatively large historical handwritten book, ready to be searchable, indexable, and accessible to cultural heritage scholars as well as the general public. Second, we have conducted the first longitudinal study to date on interactive handwritten text recognition, for which we provide a very comprehensive user assessment of the real-world performance of the technologies involved in this work. Third, as a result of this process, we have produced a detailed transcription and document layout information (i.e., high-quality labeled data) ready to be used by researchers working on automated technologies for document analysis and recognition.

    This work is supported by the European Commission through the EU projects HIMANIS (JPICH program, Spanish grant Ref. PCIN-2015-068) and READ (Horizon 2020 program, grant Ref. 674943), and by the Universitat Politecnica de Valencia (grant number SP20130189). This work was also part of the Valorization and I+D+i Resources program of VLC/CAMPUS and has been funded by the Spanish MECD as part of the International Excellence Campus program.

    Toselli, A. H.; Leiva, L. A.; Bordes-Cabrera, I.; Hernández-Tornero, C.; Bosch Campos, V.; Vidal, E. (2018). Transcribing a 17th-century botanical manuscript: Longitudinal evaluation of document layout detection and interactive transcription. Digital Scholarship in the Humanities, 33(1), 173-202. https://doi.org/10.1093/llc/fqw064

    Text Line Segmentation of Historical Documents: a Survey

    There is a huge amount of historical documents in libraries and in various national archives that have not been exploited electronically. Although automatic reading of complete pages remains, in most cases, a long-term objective, tasks such as word spotting, text/image alignment, authentication, and extraction of specific fields are in use today. For all these tasks, a major step is document segmentation into text lines. Because of the low quality and the complexity of these documents (background noise, artifacts due to aging, interfering lines), automatic text line segmentation remains an open research field. The objective of this paper is to present a survey of existing methods, developed during the last decade, dedicated to documents of historical interest.

    Comment: 25 pages, submitted version; to appear in the International Journal on Document Analysis and Recognition. Online version available at http://www.springerlink.com/content/k2813176280456k3

    Baseline Detection in Historical Documents using Convolutional U-Nets

    Baseline detection is still a challenging task for heterogeneous collections of historical documents. We present a novel approach to baseline extraction in such settings, which won the ICDAR 2017 Competition on Baseline Detection (cBAD). It utilizes deep convolutional neural networks (CNNs) both for the actual extraction of baselines and for a simple form of layout analysis in a preprocessing step. To the best of our knowledge, it is the first CNN-based system for baseline extraction to apply a U-Net architecture with sliding-window detection, profiting from the high local accuracy of the candidate lines extracted. A final baseline post-processing step complements our approach, compensating for inaccuracies mainly due to missing context information during sliding-window detection. We experimentally evaluate the components of our system individually on the cBAD dataset. Moreover, we investigate how it generalizes to different data by means of the dataset used for the baseline extraction task of the ICDAR 2017 Competition on Layout Analysis for Challenging Medieval Manuscripts (HisDoc). A comparison with the results reported for HisDoc shows that it also outperforms the contestants of the latter.

    Comment: 6 pages, accepted to DAS 201
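    The sliding-window stage can be illustrated independently of the network itself: patches are scored by a patch-wise predictor (a stand-in for the trained U-Net, which is not reproduced here) and overlapping predictions are averaged into a page-level probability map. A simplified sketch under assumed parameters `win` and `stride`; the real system adds layout-analysis masking, baseline post-processing, and more careful handling of page borders.

    ```python
    import numpy as np

    def sliding_window_predict(page, model, win=256, stride=128):
        """Run a patch-wise baseline-pixel predictor over a grayscale page.
        `model` maps a (win, win) patch to a same-size probability map.
        Overlapping windows are averaged; in this simplified sketch the
        right/bottom margins may stay unscored if win does not tile the page.
        """
        h, w = page.shape
        prob = np.zeros((h, w))
        count = np.zeros((h, w))
        for y in range(0, max(h - win, 0) + 1, stride):
            for x in range(0, max(w - win, 0) + 1, stride):
                prob[y:y + win, x:x + win] += model(page[y:y + win, x:x + win])
                count[y:y + win, x:x + win] += 1
        return prob / np.maximum(count, 1)
    ```

    Averaging overlapping windows is what lets the local accuracy of each patch prediction carry over to long baselines that cross window boundaries; thresholding the resulting map would feed the post-processing step the abstract describes.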

    Semantics-Based Content Extraction in Typewritten Historical Documents

    This paper presents a flexible approach to extracting content from scanned historical documents using semantic information. The final electronic document is the result of a "digital historical document lifecycle" process, in which the expert knowledge of the historian/archivist user is incorporated at different stages. Results show that such a conversion strategy, aided by (expert) user-specified semantic information and able to process individual parts of the document in a specialised way, produces results that are superior (in a variety of significant ways) to those of document analysis and understanding techniques devised for contemporary documents.