12 research outputs found
User-driven Page Layout Analysis of historical printed Books
In this paper, based on a study of the specificities of historical printed books, we first explain the main error sources in classical methods used for page layout analysis. We show that each method (bottom-up and top-down) provides different types of useful information that should not be ignored if we want to obtain both a generic method and good segmentation results. Next, we propose a hybrid segmentation algorithm that builds two maps: a shape map that focuses on connected components and a background map, which provides information about white areas corresponding to block separations in the page. Using this first segmentation, a classification of the extracted blocks can be achieved according to scenarios produced by the user. These scenarios are defined very simply during an interactive stage. The user is able to build processing sequences adapted to the different kinds of images they are likely to meet, according to their needs. The proposed "user-driven approach" is capable of efficiently segmenting and labelling the high-level concepts required by the user, and has achieved above 93% accuracy over the different data sets tested. User feedback and experimental results demonstrate the effectiveness and usability of our framework, mainly because the extraction rules can be defined without difficulty and the parameters are not sensitive to page layout variation.
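The two-map idea described above can be sketched in a few lines of Python: connected-component labelling stands in for the shape map, and wide horizontal white runs stand in for the background map. This is a minimal illustration, not the paper's implementation; the function names and the `min_gap` threshold are assumptions.

```python
from collections import deque

def shape_map(img):
    """Label 4-connected components of foreground (1) pixels in a binary image."""
    h, w = len(img), len(img[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if img[y][x] == 1 and labels[y][x] == 0:
                count += 1
                q = deque([(y, x)])
                labels[y][x] = count
                while q:  # breadth-first flood fill of one component
                    cy, cx = q.popleft()
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w \
                           and img[ny][nx] == 1 and labels[ny][nx] == 0:
                            labels[ny][nx] = count
                            q.append((ny, nx))
    return labels, count

def background_map(img, min_gap=3):
    """Mark white pixels belonging to a horizontal run of at least min_gap
    columns: such wide gaps are candidate block separators."""
    h, w = len(img), len(img[0])
    bg = [[0] * w for _ in range(h)]
    for y in range(h):
        x = 0
        while x < w:
            if img[y][x] == 0:
                x0 = x
                while x < w and img[y][x] == 0:
                    x += 1
                if x - x0 >= min_gap:
                    for xx in range(x0, x):
                        bg[y][xx] = 1
            else:
                x += 1
    return bg
```

On a toy page with two text blocks separated by a wide gutter, the first map yields one label per block and the second flags the gutter columns.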
Segmentation de documents composites par une technique de recouvrement des espaces blancs
We present a method for the segmentation of composite documents. Unlike most publications, we focus on non-Manhattan layouts, which are usually created by compositing; the pages to be processed therefore contain several sub-documents that have to be isolated. We draw inspiration from the white space cover technique introduced by Baird et al., together with a suite of pre- and post-processing steps specific to these particular documents. The evaluations are made on administrative records coming from various sources, provided to us by our industrial partner. As we do not have any ground-truth documents, we compared our results with those obtained by a commercial OCR, which our method outperforms.
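Whitespace covers in the spirit of Baird's technique are commonly computed with a branch-and-bound search for the largest empty rectangle among the obstacle bounding boxes (this sketch follows Breuel's well-known formulation, not this paper's pipeline). Rectangles are `(x0, y0, x1, y1)` tuples; all names are illustrative.

```python
import heapq

def largest_whitespace(page, obstacles):
    """Return the largest axis-aligned rectangle inside `page` that
    overlaps none of the `obstacles` (branch-and-bound with a max-heap)."""
    def area(r):
        return max(0, r[2] - r[0]) * max(0, r[3] - r[1])

    def overlaps(r, o):
        return not (o[2] <= r[0] or o[0] >= r[2] or
                    o[3] <= r[1] or o[1] >= r[3])

    heap = [(-area(page), page, obstacles)]
    while heap:
        _, r, obs = heapq.heappop(heap)
        obs = [o for o in obs if overlaps(r, o)]
        if not obs:
            return r  # best-first order: first obstacle-free rect is maximal
        pivot = max(obs, key=area)
        x0, y0, x1, y1 = r
        px0, py0, px1, py1 = pivot
        # Split around the pivot obstacle into four candidate sub-rectangles.
        for sub in ((x0, y0, px0, y1), (px1, y0, x1, y1),
                    (x0, y0, x1, py0), (x0, py1, x1, y1)):
            if area(sub) > 0:
                heapq.heappush(heap, (-area(sub), sub, obs))
    return None
```

Repeatedly extracting the largest whitespace rectangle and adding it to the obstacle list yields a cover of the page background, whose tall or wide members separate sub-documents.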
Frame Removal For Mushaf Al-Quran Using Irregular Binary Region
Segmentation is the process of removing the frame or border that exists on each page of some releases of the mushaf Al-Quran. Faults in the segmentation process affect the sanctity of the Al-Quran. The difficulty of identifying the frame around text areas, as well as noisy black stripes, has caused the segmentation process to be carried out improperly. In this paper, an algorithm for detecting the frame on an Al-Quran page without affecting its content is proposed. Firstly, preprocessing is carried out using a binarisation method. This is followed by the process of detecting the frame on each page: the proposed algorithm calculates the percentage of black binary pixels along the vertical (column) and horizontal (row) directions. The results, based on experiments on several Al-Quran pages from different Al-Quran styles, demonstrate the effectiveness of the proposed technique.
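The column/row black-pixel percentages described above are classic projection profiles; a frame line shows up as a column or row whose black ratio is close to 1. A minimal sketch, with an illustrative `threshold` that is not taken from the paper:

```python
def black_ratio_profiles(img):
    """Fraction of black (1) pixels per column and per row of a binary image."""
    h, w = len(img), len(img[0])
    cols = [sum(img[y][x] for y in range(h)) / h for x in range(w)]
    rows = [sum(row) / w for row in img]
    return cols, rows

def frame_candidates(profile, threshold=0.9):
    """Indices whose black-pixel ratio exceeds threshold: likely frame lines."""
    return [i for i, ratio in enumerate(profile) if ratio >= threshold]
```

On a page whose border is a solid rectangle, the candidates are the outermost columns and rows, which can then be cropped away without touching the text area.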
Adaptive Algorithms for Automated Processing of Document Images
Large scale document digitization projects continue to motivate interesting document understanding technologies such as script and language identification, page classification, segmentation and enhancement. Typically, however, solutions are still limited to narrow domains or regular formats such as books, forms, articles or letters and operate best on clean documents scanned in a controlled environment. More general collections of heterogeneous documents challenge the basic assumptions of state-of-the-art technology regarding quality, script, content and layout. Our work explores the use of adaptive algorithms for the automated analysis of noisy and complex document collections.
We first propose, implement and evaluate an adaptive clutter detection and removal technique for complex binary documents. Our distance-transform-based technique aims to remove irregular and independent unwanted foreground content while leaving text content untouched. The novelty of this approach is in its determination of the best approximation to the clutter-content boundary with text-like structures.
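The intuition behind distance-transform-based clutter detection is that clutter (blobs, stains, stray marks) is much thicker than text strokes, so its interior pixels lie far from the background. The sketch below uses a textbook two-pass city-block distance transform and a simple thickness threshold; the thesis's actual boundary estimation is more refined, and `max_stroke` is an illustrative parameter.

```python
def distance_transform(img):
    """Two-pass city-block distance of each foreground (1) pixel to background."""
    h, w = len(img), len(img[0])
    INF = h + w
    d = [[0 if img[y][x] == 0 else INF for x in range(w)] for y in range(h)]
    for y in range(h):                      # forward pass: top-left neighbors
        for x in range(w):
            if d[y][x]:
                up = d[y - 1][x] if y else INF
                left = d[y][x - 1] if x else INF
                d[y][x] = min(d[y][x], up + 1, left + 1)
    for y in range(h - 1, -1, -1):          # backward pass: bottom-right neighbors
        for x in range(w - 1, -1, -1):
            down = d[y + 1][x] if y < h - 1 else INF
            right = d[y][x + 1] if x < w - 1 else INF
            d[y][x] = min(d[y][x], down + 1, right + 1)
    return d

def clutter_mask(img, max_stroke=2):
    """Flag foreground pixels thicker than typical text strokes as clutter cores."""
    d = distance_transform(img)
    return [[1 if v > max_stroke else 0 for v in row] for row in d]
```

Thin strokes never exceed `max_stroke`, while the core of a thick blob does; the flagged cores can then seed removal of the full clutter components.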
Second, we describe a page segmentation technique called Voronoi++ for complex layouts which builds upon the state-of-the-art method proposed by Kise [Kise1999]. Our approach does not assume structured text zones and is designed to handle multi-lingual text in both handwritten and printed form. Voronoi++ is a dynamically adaptive and contextually aware approach that considers components' separation features combined with Docstrum [O'Gorman1993] based angular and neighborhood features to form provisional zone hypotheses. These provisional zones are then verified based on the context built from local separation and high-level content features.
Finally, our research proposes a generic model to segment and recognize characters for any complex syllabic or non-syllabic script, using font models. This concept is based on the fact that font files contain all the information necessary to render text, and thus provide a model for how to decompose it. Instead of script-specific routines, this work is a step towards a generic character segmentation and recognition scheme for both Latin and non-Latin scripts.
Interactive segmentation and analysis of historical printed documents
In this paper, we first identify the main error sources of classical methods for structural page layout analysis, based on a study of the specificities of old printed books. We show that each type of method (bottom-up and top-down) provides different kinds of information that should not be ignored in order to obtain both a generic method and good segmentation results. Then, we propose a hybrid segmentation algorithm. We build two maps: a shape map that focuses on connected components and a background map that provides information on the white areas corresponding to block separations in the page. Using this first segmentation, a classification of the extracted blocks can be achieved according to scenarios built by the user. These scenarios are defined very simply during an interactive stage, allowing the users to produce processing sequences adapted to the different kinds of images they can meet and to their needs.
The method gives very good results: the setting of parameters is easy and not sensitive to small layout variations.
Optical Character Recognition of Printed Persian/Arabic Documents
Texts are an important representation of language. Due to the volume of texts generated and the historical value of some documents, it is imperative to use computers to read generated texts and make them editable and searchable. This task, however, is not trivial. Recreating human perception capabilities in artificial systems is one of the major goals of pattern recognition research. After decades of research and improvements in computing capabilities, humans' ability to read typed or handwritten text is hardly matched by machine intelligence. Although classical applications of Optical Character Recognition (OCR), like reading machine-printed addresses in a mail sorting machine, are considered solved, more complex scripts or handwritten texts push the limits of the existing technology. Moreover, many of the existing OCR systems are language dependent, so improvements in OCR technologies have been uneven across different languages. For Persian in particular, there has been limited research: despite the need to process many Persian historical documents and the use of OCR in a variety of applications, few Persian OCR systems achieve a good recognition rate. Consequently, the task of automatically reading Persian typed documents with close-to-human performance is still an open problem and the main focus of this dissertation. In this dissertation, after a literature survey of the existing technology, we propose new techniques for the two important preprocessing steps in any OCR system: skew detection and page segmentation. Then, rather than the usual practice of character segmentation, we propose segmentation of Persian documents into sub-words. Sub-word segmentation is chosen to avoid the challenges of segmenting highly cursive Persian text into isolated characters. For feature extraction, we propose a hybrid scheme combining three commonly used methods, and finally use a nonparametric classification method.
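Sub-words in cursive scripts such as Persian are groups of joined characters separated by small horizontal gaps, so a common first approximation is to split a text-line image at vertical white gaps. The following sketch is only an illustration of that idea (the dissertation's actual segmentation is more sophisticated); `min_gap` is an assumed parameter.

```python
def subword_spans(line_img, min_gap=2):
    """Split a binary text-line image into (start, end) column spans,
    cutting wherever at least min_gap consecutive columns contain no ink."""
    w = len(line_img[0])
    ink = [any(row[x] for row in line_img) for x in range(w)]
    spans, start, blank = [], None, 0
    for x, has_ink in enumerate(ink):
        if has_ink:
            if start is None:
                start = x          # open a new span at the first ink column
            blank = 0
        elif start is not None:
            blank += 1
            if blank >= min_gap:   # gap wide enough: close the current span
                spans.append((start, x - blank + 1))
                start, blank = None, 0
    if start is not None:
        spans.append((start, w - blank))
    return spans
```

Each returned span is one candidate sub-word image, which can then be passed on to feature extraction and classification.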
A large number of papers and patents advertise recognition rates near 100%. Such claims give the impression that automation problems have been solved. Although OCR is widely used, its accuracy today is still far from a child's reading skills. Failures of some real applications show that performance problems still exist on composite and degraded documents and that there is still room for progress.
High precision text extraction from PDF documents
This thesis addresses the problem of extracting information from documents stored in the PDF format, which is difficult because the information is stored visually and without a good structure.
It examines the use and adaptation of theory from OCR in an attempt to recover this lost structure.
Geometric Layout Analysis of Scanned Documents
Layout analysis--the division of page images into text blocks and lines, and the determination of their reading order--is a major performance-limiting step in large-scale document digitization projects. This thesis addresses this problem in several ways: it presents new performance measures to identify important classes of layout errors, evaluates the performance of state-of-the-art layout analysis algorithms, presents a number of methods to reduce the error rate and the catastrophic failures occurring during layout analysis, and develops a statistically motivated, trainable layout analysis system that addresses the needs of large-scale document analysis applications. An overview of the key contributions of this thesis is as follows. First, this thesis presents an efficient local adaptive thresholding algorithm that yields the same quality of binarization as state-of-the-art local binarization methods, but runs in time close to that of global thresholding methods, independent of the local window size. Tests on the UW-1 dataset demonstrate a 20-fold speedup compared to traditional local thresholding techniques. Then, this thesis presents a new perspective on document image cleanup. Instead of trying to explicitly detect and remove marginal noise, the approach focuses on locating the page frame, i.e. the actual page contents area. A geometric matching algorithm is presented to extract the page frame of a structured document. It is demonstrated that incorporating a page frame detection step into the document processing chain reduces OCR error rates from 4.3% to 1.7% (n=4,831,618 characters) on the UW-III dataset and layout-based retrieval error rates from 7.5% to 5.3% (n=815 documents) on the MARG dataset. The performance of six widely used page segmentation algorithms (x-y cut, smearing, whitespace analysis, constrained text-line finding, docstrum, and Voronoi) on the UW-III database is evaluated in this work using a state-of-the-art evaluation methodology.
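Local thresholding in time independent of the window size is typically achieved with an integral image, which makes every local mean a constant-time lookup. The sketch below shows that standard construction (binarizing against a biased local mean); it illustrates the general technique only, and the `win` and `bias` parameters are assumptions, not the thesis's values.

```python
def integral_image(gray):
    """Summed-area table: ii[y][x] = sum of gray[0:y][0:x]."""
    h, w = len(gray), len(gray[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += gray[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def local_mean_threshold(gray, win=15, bias=0.85):
    """Binarize: a pixel is foreground (1) if darker than bias * local mean.
    The integral image makes each local mean O(1), independent of win."""
    h, w = len(gray), len(gray[0])
    ii = integral_image(gray)
    out = [[0] * w for _ in range(h)]
    r = win // 2
    for y in range(h):
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        for x in range(w):
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            total = ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]
            mean = total / ((y1 - y0) * (x1 - x0))
            out[y][x] = 1 if gray[y][x] < bias * mean else 0
    return out
```

Because the per-pixel cost no longer depends on `win`, large windows cost the same as small ones, which is what closes the gap to global thresholding speeds.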
It is shown that current evaluation scores are insufficient for diagnosing specific errors in page segmentation and fail to identify some classes of serious segmentation errors altogether. Thus, a vectorial score is introduced that is sensitive to, and identifies, the most important classes of segmentation errors (over-, under-, and mis-segmentation) and which page components (lines, blocks, etc.) are affected. Unlike previous schemes, this evaluation method has a canonical representation of ground-truth data and guarantees pixel-accurate evaluation results for arbitrary region shapes. Based on a detailed analysis of the errors made by different page segmentation algorithms, this thesis presents a novel combination of the line-based approach by Breuel with the area-based approach of Baird, which solves the over-segmentation problem in area-based approaches. This new approach achieves a mean text-line extraction error rate of 4.4% (n=878 documents) on the UW-III dataset, the lowest among the analyzed algorithms. This thesis also describes a simple, fast, and accurate system for document image zone classification that results from a detailed comparative analysis of the performance of widely used features in document analysis and content-based image retrieval. Using a novel combination of known algorithms, an error rate of 1.46% (n=13,811 zones) is achieved on the UW-III dataset, in comparison to a state-of-the-art system that reports an error rate of 1.55% (n=24,177 zones) using more complicated techniques. In addition to the layout analysis of Roman-script documents, this work also presents the first high-performance layout analysis method for Urdu script. For that purpose, a geometric text-line model for Urdu script is presented. It is shown that the method can accurately extract Urdu text-lines from documents of different layouts such as prose books, poetry books, magazines, and newspapers.
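Over- and under-segmentation can be counted from region overlaps: a ground-truth region significantly covered by several detected segments is over-segmented, and a segment significantly covering several ground-truth regions is under-segmented. The sketch below uses pixel-coordinate sets and a simple overlap rule; it illustrates the idea only and is not the thesis's pixel-accurate vectorial score, whose overlap criteria and error classes are richer.

```python
def segmentation_errors(gt_regions, seg_regions, thresh=0.1):
    """Count over- and under-segmentation between ground-truth and detected
    regions, each given as a set of pixel coordinates. A pair counts as
    'significantly overlapping' when the intersection exceeds thresh of the
    ground-truth region's area (an illustrative rule)."""
    over = under = 0
    for g in gt_regions:
        hits = sum(1 for s in seg_regions if len(g & s) > thresh * len(g))
        if hits > 1:
            over += 1   # one ground-truth region split across several segments
    for s in seg_regions:
        hits = sum(1 for g in gt_regions if len(g & s) > thresh * len(g))
        if hits > 1:
            under += 1  # one segment merges several ground-truth regions
    return {"over": over, "under": under}
```

Reporting these counts separately, per component type, is what makes such a score diagnostic rather than a single opaque accuracy number.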
Finally, this thesis presents a novel algorithm for probabilistic layout analysis that specifically addresses the needs of large-scale digitization projects. The presented approach models known page layouts as a structural mixture model. A probabilistic matching algorithm is presented that gives multiple interpretations of the input layout with associated probabilities. An algorithm based on A* search is presented for finding the most likely layout of a page, given its structural layout model. For training layout models, an EM-like algorithm is presented that is capable of learning the geometric variability of layout structures from data, without the need for a page segmentation ground-truth. Evaluation of the algorithm on documents from the MARG dataset shows an accuracy above 95% for geometric layout analysis.