19 research outputs found

    READ-BAD: A New Dataset and Evaluation Scheme for Baseline Detection in Archival Documents

    Full text link
    Text line detection is crucial for any application associated with Automatic Text Recognition or Keyword Spotting. Modern algorithms perform well on well-established datasets, since these either comprise clean data or simple/homogeneous page layouts. We have collected and annotated 2036 archival document images from different locations and time periods. The dataset contains varying page layouts and degradations that challenge text line segmentation methods. Well-established text line segmentation evaluation schemes such as the Detection Rate or Recognition Accuracy require binarized data that is annotated on a pixel level. Producing ground truth by these means is laborious and not needed to determine a method's quality. In this paper we propose a new evaluation scheme that is based on baselines. The proposed scheme needs no binarization and can handle skewed as well as rotated text lines. The ICDAR 2017 Competition on Baseline Detection and the ICDAR 2017 Competition on Layout Analysis for Challenging Medieval Manuscripts used this evaluation scheme. Finally, we present results achieved by a recently published text line detection algorithm.
    Comment: Submitted to DAS201
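    The core idea of a baseline-based evaluation can be sketched as follows: treat each baseline as a polyline and measure how much of the ground truth lies within a pixel tolerance of the detections (and vice versa). This is a minimal illustration of the idea, not the scheme's exact matching rules; the tolerance value and the point sampling are assumptions.

```python
from math import hypot

def point_segment_dist(p, a, b):
    # distance from point p to the line segment a-b
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))

def covered(points, polyline, tol):
    # fraction of points lying within tol pixels of any segment of polyline
    hits = sum(
        1 for p in points
        if any(point_segment_dist(p, a, b) <= tol
               for a, b in zip(polyline, polyline[1:]))
    )
    return hits / len(points)

# hypothetical example: one ground-truth baseline, one detected baseline
gt = [(0, 10), (100, 10)]
det = [(0, 12), (100, 12)]
recall = covered(gt, det, tol=5)     # how much of the GT baseline is matched
precision = covered(det, gt, tol=5)  # how much of the detection lies on GT
```

    In practice the points would be sampled densely along each polyline; here the endpoints suffice for straight lines.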

    Detecting Articles in a Digitized Finnish Historical Newspaper Collection 1771–1929: Early Results Using the PIVAJ Software

    Get PDF
    This paper describes the first large-scale article detection and extraction efforts on the Finnish Digi newspaper material of the National Library of Finland (NLF), using data of one newspaper, Uusi Suometar 1869–1898. The historical digital newspaper archive environment of the NLF is based on commercial docWorks software. The software is capable of article detection and extraction, but our material does not seem to behave well in the system in this respect. Therefore, we have been in search of an alternative article segmentation system and have now focused our efforts on PIVAJ, a machine-learning-based platform developed at the LITIS laboratory of the University of Rouen Normandy. As training and evaluation data for PIVAJ we chose one newspaper, Uusi Suometar. We established a data set that contains 56 issues of the newspaper from the years 1869–1898 with 4 pages each, i.e. 224 pages in total. Given the selected set of 56 issues, our first data annotation and experiment phase consisted of annotating a subset of 28 issues (112 pages) and conducting preliminary experiments. After the preliminary experiments we revised the annotation of the first 28 issues accordingly. Subsequently, we annotated the remaining 28 issues. We then divided the annotated set into training and evaluation sets of 168 and 56 pages. We trained PIVAJ successfully and evaluated the results using the layout evaluation software developed by the PRImA research laboratory of the University of Salford. The results of our experiments show that PIVAJ achieves success rates of 67.9, 76.1, and 92.2 for the whole data set of 56 pages with three different evaluation scenarios introduced in [6]. On the whole, the results seem reasonable considering the varying layouts of the different issues of Uusi Suometar along the time scale of the data.
    Peer reviewed

    Multiple Document Datasets Pre-training Improves Text Line Detection With Deep Neural Networks

    Full text link
    In this paper, we introduce a fully convolutional network for the document layout analysis task. While state-of-the-art methods use models pre-trained on natural scene images, our method Doc-UFCN relies on a U-shaped model trained from scratch for detecting objects in historical documents. We treat the line segmentation task, and more generally the layout analysis problem, as a pixel-wise classification task, so our model outputs a pixel labeling of the input images. We show that Doc-UFCN outperforms state-of-the-art methods on various datasets and also demonstrate that parts pre-trained on natural scene images are not required to reach good results. In addition, we show that pre-training on multiple document datasets can improve performance. We evaluate the models using various metrics to provide a fair and complete comparison between the methods.
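    Casting layout analysis as pixel-wise classification means the network's label map must still be turned into discrete objects. A minimal, hypothetical post-processing sketch, extracting 4-connected components of pixels labeled as text line (a generic step, not Doc-UFCN's exact pipeline):

```python
from collections import deque

def connected_components(label_map, target=1):
    # group 4-connected pixels carrying the target label into components
    h, w = len(label_map), len(label_map[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for y in range(h):
        for x in range(w):
            if label_map[y][x] == target and not seen[y][x]:
                comp, queue = [], deque([(y, x)])
                seen[y][x] = True
                while queue:
                    cy, cx = queue.popleft()
                    comp.append((cy, cx))
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and label_map[ny][nx] == target
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                comps.append(comp)
    return comps

# toy pixel labeling: two separate "text line" blobs
page = [
    [1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1],
]
lines = connected_components(page)
```

    Each component would then be reduced to a line polygon or bounding shape for downstream recognition.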

    You Actually Look Twice At it (YALTAi): using an object detection approach instead of region segmentation within the Kraken engine

    Full text link
    Layout Analysis (the identification of zones and their classification) is the first step, along with line segmentation, in Optical Character Recognition and similar tasks. The ability to distinguish the main body of text from marginal text or running titles makes the difference between extracting the full text of a digitized book and noisy output. We show that most segmenters focus on pixel classification, and that polygonization of this output has not been used as a target in the latest competitions on historical documents (ICDAR 2017 and onwards), despite being the focus in the early 2010s. We propose to shift, for efficiency, from pixel-classification-based polygonization to object detection using isothetic rectangles. We compare the output of Kraken and YOLOv5 in terms of segmentation and show that the latter severely outperforms the former on small datasets (1110 samples and below). We release two datasets for training and evaluation on historical documents, as well as a new package, YALTAi, which injects YOLOv5 into the segmentation pipeline of Kraken 4.1.
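    Detection with isothetic (axis-aligned) rectangles is typically scored by intersection over union between predicted and ground-truth boxes. A minimal sketch; the `(x1, y1, x2, y2)` coordinate convention is an assumption:

```python
def iou(a, b):
    # a, b: isothetic rectangles as (x1, y1, x2, y2) with x1 < x2, y1 < y2
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

# two boxes overlapping on half their width
score = iou((0, 0, 10, 10), (5, 0, 15, 10))
```

    A prediction usually counts as correct when its IoU with a ground-truth box of the same class exceeds a threshold such as 0.5.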

    A Comparative Study of Two State-of-the-Art Feature Selection Algorithms for Texture-Based Pixel-Labeling Task of Ancient Documents

    Get PDF
    Recently, texture features have been widely used for historical document image analysis. However, few studies have focused exclusively on feature selection algorithms for historical document image analysis. Indeed, an important need has emerged to use feature selection algorithms in data mining and machine learning tasks, since they help to reduce the data dimensionality and to increase the performance of algorithms such as pixel classification. Therefore, in this paper we propose a comparative study of two conventional feature selection algorithms, the genetic algorithm and the ReliefF algorithm, using a classical pixel-labeling scheme based on analyzing and selecting texture features. The two assessed feature selection algorithms have been applied to a training set of the HBR dataset in order to deduce the most frequently selected texture features of each analyzed texture-based feature set. The evaluated feature sets consist of numerous state-of-the-art texture features (Tamura, local binary patterns, gray-level run-length matrix, auto-correlation function, gray-level co-occurrence matrix, Gabor filters, three-level Haar wavelet transform, three-level wavelet transform using the 3-tap Daubechies filter, and three-level wavelet transform using the 4-tap Daubechies filter). In our experiments, a public corpus of historical document images provided in the context of the historical book recognition contest (HBR2013 dataset: PRImA, Salford, UK) has been used. Qualitative and numerical results are given in order to provide a set of comprehensive guidelines on the strengths and weaknesses of each assessed feature selection algorithm according to the texture feature set used.
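    ReliefF scores each feature by how well it separates nearby instances of different classes. A simplified, hypothetical sketch of the Relief family's weight update (binary classes, one nearest hit and one nearest miss, unnormalized weights — not the full ReliefF used in the paper):

```python
import random

def relief(X, y, n_iter=100, seed=0):
    # reward features that differ on the nearest miss (other class),
    # penalize features that differ on the nearest hit (same class)
    rng = random.Random(seed)
    n_feat = len(X[0])
    w = [0.0] * n_feat

    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    for _ in range(n_iter):
        i = rng.randrange(len(X))
        xi, yi = X[i], y[i]
        hit = min((j for j in range(len(X)) if j != i and y[j] == yi),
                  key=lambda j: dist(xi, X[j]))
        miss = min((j for j in range(len(X)) if y[j] != yi),
                   key=lambda j: dist(xi, X[j]))
        for f in range(n_feat):
            w[f] += abs(xi[f] - X[miss][f]) - abs(xi[f] - X[hit][f])
    return w

# toy data: feature 0 separates the classes, feature 1 is noise
X = [(0.0, 0.3), (0.1, 0.9), (1.0, 0.2), (0.9, 0.8)]
y = [0, 0, 1, 1]
weights = relief(X, y, n_iter=50)
```

    Features whose weights fall below a threshold are then discarded; the genetic-algorithm alternative instead searches over feature subsets directly.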

    A HYBRID PARAGRAPH-LEVEL PAGE SEGMENTATION

    Get PDF
    Automatic transformation of paper documents into electronic forms requires geometric document layout analysis at the first stage. However, variations in character font sizes, text-line spacing, and layout structures have made it difficult to design a general-purpose method. Page segmentation algorithms usually segment text blocks using global separation objects, or local relations among connected components such as distance and orientation, but typically do not consider information other than the local components' size. As a result, they cannot separate blocks that are very close to each other, including text of different font sizes and paragraphs in the same column. To overcome this limitation, we propose to use both separation objects at the whole-page level and context analysis at the text-line level to segment document images into paragraphs. The introduced hybrid paragraph-level page segmentation (HP2S) algorithm can handle difficult cases where purely top-down and bottom-up approaches are not sufficient. Experimental results on the ICDAR2009 competition test set and the UW-III dataset show that our algorithm boosts performance significantly compared to state-of-the-art algorithms.
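    The bottom-up, text-line-level side of such a hybrid approach can be illustrated with a toy rule: start a new paragraph when the inter-line gap is large relative to the line height, or when the line height (a font-size cue) changes markedly. The thresholds and the interval representation here are assumptions for illustration, not the HP2S algorithm itself:

```python
def group_lines(lines, gap_ratio=1.5, size_ratio=1.3):
    # lines: (top, bottom) vertical extents of text lines in reading order
    paragraphs, current = [], [lines[0]]
    for prev, cur in zip(lines, lines[1:]):
        gap = cur[0] - prev[1]
        h_prev, h_cur = prev[1] - prev[0], cur[1] - cur[0]
        # break on an unusually large gap or an abrupt font-size change
        if (gap > gap_ratio * h_prev
                or max(h_prev, h_cur) > size_ratio * min(h_prev, h_cur)):
            paragraphs.append(current)
            current = [cur]
        else:
            current.append(cur)
    paragraphs.append(current)
    return paragraphs

# toy column: three tightly spaced lines, a gap, then two larger-font lines
lines = [(0, 10), (12, 22), (24, 34), (60, 75), (78, 93)]
paras = group_lines(lines)
```

    The top-down side would add page-level separation objects (rulings, whitespace columns) as hard paragraph boundaries that such local rules cannot cross.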

    Page Segmentation of Structured Documents Using 2D Stochastic Context-Free Grammars

    Full text link
    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-38628-2_15
    In this paper we define a bidimensional extension of Stochastic Context-Free Grammars for page segmentation of structured documents. Two sets of text classification features are used to perform an initial classification of each zone of the page. Then, the page segmentation is obtained as the most likely hypothesis according to a grammar. This approach is compared to Conditional Random Fields, and results show significant improvements in several cases. Furthermore, the grammars provide a detailed segmentation that allowed a semantic evaluation, which also validates this model.
    Work partially supported by the Spanish MEC under the STraDA research project (TIN2012-37475-C02-01), the MITTRAL (TIN2009-14633-C03-01) project, the Spanish projects TIN2009-14633-C03-01/03 and 2010-CONES-00029, the FPU grant (AP2009-4363), by the Generalitat Valenciana under the grant Prometeo/2009/014, and through the EU 7th Framework Programme grant tranScriptorium (Ref: 600707).
    Álvaro Muñoz, F.; Cruz Fernández, F.; Sánchez Peiró, JA.; Ramos Terrades, O.; Benedí Ruiz, JM. (2013). Page Segmentation of Structured Documents Using 2D Stochastic Context-Free Grammars. In Pattern Recognition and Image Analysis, Springer, pp. 133–140. https://doi.org/10.1007/978-3-642-38628-2_15
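    The "most likely hypothesis according to a grammar" can be illustrated with a one-dimensional probabilistic CYK parse over a sequence of zone labels; the paper's bidimensional extension generalizes the spans to rectangular regions. The grammar below is hypothetical:

```python
from collections import defaultdict

def viterbi_cky(tokens, lexical, binary, start="PAGE"):
    # probabilistic CYK: best parse probability for a sequence of zone
    # observations under a stochastic CFG (1D simplification of the 2D case)
    n = len(tokens)
    best = defaultdict(float)  # (i, j, symbol) -> best probability of span
    for i, tok in enumerate(tokens):
        for sym, p in lexical.get(tok, {}).items():
            best[(i, i + 1, sym)] = max(best[(i, i + 1, sym)], p)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # split point
                for (lhs, l, r), p in binary.items():
                    score = p * best[(i, k, l)] * best[(k, j, r)]
                    if score > best[(i, j, lhs)]:
                        best[(i, j, lhs)] = score
    return best[(0, n, start)]

# hypothetical grammar: a page is a title zone followed by a body zone
lexical = {"title": {"TITLE": 1.0}, "text": {"BODY": 0.8, "TITLE": 0.2}}
binary = {("PAGE", "TITLE", "BODY"): 1.0}
score = viterbi_cky(["title", "text"], lexical, binary)
```

    Keeping back-pointers alongside `best` would recover the segmentation itself rather than just its probability.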