The Robust Reading Competition Annotation and Evaluation Platform
The ICDAR Robust Reading Competition (RRC), initiated in 2003 and
re-established in 2011, has become a de facto evaluation standard for robust
reading systems and algorithms. Concurrent with its second incarnation in 2011,
a continuous effort started to develop an on-line framework to facilitate the
hosting and management of competitions. This paper outlines the Robust Reading
Competition Annotation and Evaluation Platform, the backbone of the
competitions. The RRC Annotation and Evaluation Platform is a modular
framework, fully accessible through on-line interfaces. It comprises a
collection of tools and services for managing all processes involved with
defining and evaluating a research task, from dataset definition to annotation
management, evaluation specification and results analysis. Although the
framework has been designed with robust reading research in mind, many of the
provided tools are generic by design. All aspects of the RRC Annotation and
Evaluation Framework are available for research use.
Comment: 6 pages, accepted to DAS 201
ZoneMapAlt: An alternative to the ZoneMap metric for zone segmentation and classification
This paper proposes a new evaluation metric based on the existing ZoneMap metric. The ZoneMap method, designed to evaluate zone segmentation and classification, is considered in the context of OCR evaluation. Its limitations are identified and described, and a new algorithm, ZoneMapAlt (ZoneMap Alternative), is proposed to address them while preserving the properties of the original. To validate the new metric, experiments were conducted on a dataset of scientific articles. The results demonstrate that the ZoneMapAlt algorithm provides greater detail on segmentation errors and is able to detect critical segmentation errors.
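Zone-segmentation metrics of this family start by pairing hypothesis zones with reference zones according to their spatial overlap; unmatched reference zones become misses and unmatched hypothesis zones become false alarms. The following is a minimal, generic sketch of such overlap-based matching, not the actual ZoneMap or ZoneMapAlt algorithm; the IoU criterion, the greedy strategy, and the 0.5 threshold are illustrative assumptions.

```python
# Hypothetical sketch of overlap-based zone matching, the pairing step a
# zone-segmentation metric builds on. Boxes are (x1, y1, x2, y2) tuples.
# NOT the ZoneMap/ZoneMapAlt algorithm: the IoU score, greedy strategy,
# and threshold value are illustrative assumptions.

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def match_zones(reference, hypothesis, threshold=0.5):
    """Greedily pair each reference zone with its best-overlapping free
    hypothesis zone; leftovers are misses and false alarms."""
    matches, used = [], set()
    for i, ref in enumerate(reference):
        best, best_j = 0.0, None
        for j, hyp in enumerate(hypothesis):
            if j not in used and iou(ref, hyp) > best:
                best, best_j = iou(ref, hyp), j
        if best_j is not None and best >= threshold:
            matches.append((i, best_j, best))
            used.add(best_j)
    matched_refs = {m[0] for m in matches}
    misses = [i for i in range(len(reference)) if i not in matched_refs]
    false_alarms = [j for j in range(len(hypothesis)) if j not in used]
    return matches, misses, false_alarms
```

A classification-aware metric would additionally compare the labels (text, figure, table, ...) of each matched pair; a split or merge error shows up here as one reference zone overlapping several hypothesis zones, or vice versa.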
Metrics for Complete Evaluation of OCR Performance
In this paper, we study metrics for evaluating OCR performance both in terms of physical segmentation and in terms of textual content recognition. These metrics rely on the input formats of the OCR output (hypothesis) and the reference (also called ground truth). Two evaluation criteria are considered: the quality of the segmentation and the character recognition rate. Three reference-to-hypothesis pairs of input formats are selected from two types of input, text only (text) and text with spatial information (xml): 1) text-to-text, 2) xml-to-xml and 3) text-to-xml. For the text-to-text pair, we selected the RETAS method to perform experiments and show its limits. For text-to-xml, a new method based on unique word anchors is proposed to solve the problem of aligning texts that carry different information. We define the ZoneMapAltCnt metric for the xml-to-xml approach and show that it offers the most reliable and complete evaluation of the three. The open-source OCR engines Tesseract and OCRopus are selected for the experiments. The datasets used are a collection of documents from the ISTEX document database and from the French newspaper "Le Nouvel Observateur", as well as invoices and administrative documents gathered from different collaborations.
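The character-recognition side of such an evaluation is conventionally computed from the Levenshtein edit distance between the reference text and the OCR hypothesis. The sketch below shows this standard character error rate (CER) computation; it illustrates the general criterion only, not the specific ZoneMapAltCnt metric.

```python
# Standard character-error-rate computation for OCR evaluation:
# CER = (edit distance between reference and hypothesis) / |reference|,
# and the character recognition rate is 1 - CER.
# Illustrates the general criterion, not the ZoneMapAltCnt metric.

def edit_distance(ref, hyp):
    """Levenshtein distance via dynamic programming (two-row table)."""
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[-1]

def character_error_rate(ref, hyp):
    """Edits per reference character; 0.0 means a perfect transcription."""
    return edit_distance(ref, hyp) / max(len(ref), 1)
```

Text-only (text-to-text) evaluation reduces to exactly this computation after the two texts have been aligned; the harder problem, which the word-anchor method above addresses, is establishing that alignment when the hypothesis and reference do not cover the same content.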