208 research outputs found

    Information Preserving Processing of Noisy Handwritten Document Images

    Get PDF
    Many pre-processing techniques that normalize artifacts and clean noise induce anomalies due to discretization of the document image. Important information that could be used at later stages may be lost. A proposed composite-model framework takes into account pre-printed information, user-added data, and digitization characteristics. Its benefits are demonstrated by experiments with statistically significant results. Separating pre-printed ruling lines from user-added handwriting shows how ruling lines impact people\u27s handwriting and how they can be exploited for identifying writers. Ruling line detection based on multi-line linear regression reduces the mean error of counting them from 0.10 to 0.03, 6.70 to 0.06, and 0.13 to 0.02, com- pared to an HMM-based approach on three standard test datasets, thereby reducing human correction time by 50%, 83%, and 72% on average. On 61 page images from 16 rule-form templates, the precision and recall of form cell recognition are increased by 2.7% and 3.7%, compared to a cross-matrix approach. Compensating for and exploiting ruling lines during feature extraction rather than pre-processing raises the writer identification accuracy from 61.2% to 67.7% on a 61-writer noisy Arabic dataset. Similarly, counteracting page-wise skew by subtracting it or transforming contours in a continuous coordinate system during feature extraction improves the writer identification accuracy. An implementation study of contour-hinge features reveals that utilizing the full probabilistic probability distribution function matrix improves the writer identification accuracy from 74.9% to 79.5%

    Table detection in handwritten chemistry documents using conditional random fields

    Get PDF
    International audienceIn this paper, we present a new approach using conditional random fields (CRFs) to localize tabular compo- nents in an unconstrained handwritten compound document. Given a line-segmented document, the extraction of table is considered as a labeling task that consists in assigning a label to each line: TableRow label for a line which belongs to a table and LineText label for a line which belongs to a text block. To perform the labeling task, we use a CRF model to combine two classifiers: a local classifier which assigns a label to the line based on local features and a contextual classifier which uses features taking into account the neighborhood. The CRF model gives the global conditional probability of a given labeling of the line considering the outputs of the two classifiers. A set of chemistry documents is used for the evaluation of this approach. The obtained results are around 88% of table lines correctly detecte

    Adaptive Algorithms for Automated Processing of Document Images

    Get PDF
    Large scale document digitization projects continue to motivate interesting document understanding technologies such as script and language identification, page classification, segmentation and enhancement. Typically, however, solutions are still limited to narrow domains or regular formats such as books, forms, articles or letters and operate best on clean documents scanned in a controlled environment. More general collections of heterogeneous documents challenge the basic assumptions of state-of-the-art technology regarding quality, script, content and layout. Our work explores the use of adaptive algorithms for the automated analysis of noisy and complex document collections. We first propose, implement and evaluate an adaptive clutter detection and removal technique for complex binary documents. Our distance transform based technique aims to remove irregular and independent unwanted foreground content while leaving text content untouched. The novelty of this approach is in its determination of best approximation to clutter-content boundary with text like structures. Second, we describe a page segmentation technique called Voronoi++ for complex layouts which builds upon the state-of-the-art method proposed by Kise [Kise1999]. Our approach does not assume structured text zones and is designed to handle multi-lingual text in both handwritten and printed form. Voronoi++ is a dynamically adaptive and contextually aware approach that considers components' separation features combined with Docstrum [O'Gorman1993] based angular and neighborhood features to form provisional zone hypotheses. These provisional zones are then verified based on the context built from local separation and high-level content features. Finally, our research proposes a generic model to segment and to recognize characters for any complex syllabic or non-syllabic script, using font-models. This concept is based on the fact that font files contain all the information necessary to render text and thus a model for how to decompose them. Instead of script-specific routines, this work is a step towards a generic character and recognition scheme for both Latin and non-Latin scripts

    Extraction of Scores and Average From Algerian High-School Degree Transcripts

    Get PDF
    A system for extracting scores and average from Algerian High School Degree Transcripts is proposed. The system extracts the scores and the average based on the localization of the tables gathering this information and it consists of several stages. After preprocessing, the system locates the tables using ruling-lines information as well as other text information. Therefore, the adopted localization approach can work even in the absence of certain ruling-lines or the erasure and discontinuity of lines. After that, the localized tables are segmented into columns and the columns into information cells. Finally, cells labeling is done based on the prior knowledge of the tables structure allowing to identify the scores and the average. Experiments have been conducted on a local dataset in order to evaluate the performances of our system and compare it with three public systems at three levels, and the obtained results show the effectiveness of our system

    Finding Similarities between Structured Documents as a Crucial Stage for Generic Structured Document Classifier

    Get PDF
    One of the addressed problems of classifying structured documents is the definition of a similarity measure that is applicable in real situations, where query documents are allowed to differ from the database templates. Furthermore, this approach might have rotated [1], noise corrupted [2], or manually edited form and documents as test sets using different schemes, making direct comparison crucial issue [3]. Another problem is huge amount of forms could be written in different languages, for example here in Malaysia forms could be written in Malay, Chinese, English, etc languages. In that case text recognition (like OCR) could not be applied in order to classify the requested documents taking into consideration that OCR is considered more easier and accurate rather than the layout  detection. Keywords: Feature Extraction, Document processing, Document Classification
    corecore