
    Handwritten and Printed Text Separation in Real Document

    The paper aims to separate handwritten from printed text in real documents that contain noise and graphics, including annotations. Pseudo-lines and pseudo-words are extracted with the run-length smoothing algorithm (RLSA) and serve as the basic blocks for classification. A multi-class support vector machine (SVM) with a Gaussian kernel first labels each pseudo-word, taking its local neighbourhood into account. Context is then propagated between neighbours to correct possible labelling errors. To keep running time manageable, we propose linear-complexity methods based on k-NN with constraints; with a kd-tree, the running time is almost linearly proportional to the number of pseudo-words. The performance of our system is close to 90%, even with a very small training dataset whose samples consist mainly of complex administrative documents.
    Comment: Machine Vision Applications (2013)
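As a rough illustration of the smoothing step named in the abstract above, the sketch below applies horizontal run-length smoothing to a binary image stored as rows of 0/1 values. The function names `rlsa_row` and `rlsa_horizontal`, the `max_gap` threshold, and the convention that 1 means ink are assumptions made for this example. They are not details taken from the paper, which does not state its thresholds here.

```python
def rlsa_row(row, max_gap):
    """Fill background runs of length <= max_gap lying between two ink pixels.

    Assumption for this sketch: 1 = ink, 0 = background.
    """
    out = list(row)
    last_ink = None  # index of the most recent ink pixel seen so far
    for i, pixel in enumerate(row):
        if pixel:
            gap = i - last_ink - 1 if last_ink is not None else 0
            if 0 < gap <= max_gap:
                # Short gap: merge the two ink runs into one block.
                for j in range(last_ink + 1, i):
                    out[j] = 1
            last_ink = i
    return out


def rlsa_horizontal(image, max_gap):
    """Apply the row-wise smoothing to every row of a binary image."""
    return [rlsa_row(row, max_gap) for row in image]


# Hypothetical example row: a 2-pixel gap is filled, a 4-pixel gap is kept.
print(rlsa_row([1, 0, 0, 1, 0, 0, 0, 0, 1, 0], max_gap=2))
# -> [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
```

With a small threshold, nearby characters merge into word-like blocks; a larger threshold would merge words into line-like blocks. Leading and trailing background runs are left untouched in this variant.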

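    Separation of Handwritten and Printed Text in Complex Administrative Documents Using SVM and Clustering (original title: Séparation manuscrit et imprimé dans des documents administratifs complexes par utilisation de SVM et regroupement)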

    This paper proposes a methodology for segmenting printed and handwritten zones in document images. The documents are mainly administrative and come from an unconstrained industrial setting: a large volume of pages must be processed every day, and because they come from different clients, their content, layout and digitization quality vary widely. The goal is to isolate handwritten annotations from the rest of the page so that dedicated processing can then be applied separately to the printed and handwritten layers. To achieve this, we propose a four-step procedure: preprocessing, geometrical layout analysis at the pseudo-word level, classification with an SVM, and post-correction that integrates context to refine the segmentation. Classification rates are around 90% for separating printed, handwritten and noisy zones.
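The context-based post-correction step can be pictured as a neighbourhood vote: a pseudo-word whose label disagrees with most of its close neighbours is relabelled. The sketch below is one minimal way to do this, not the authors' implementation. The function name `correct_labels`, the parameters `k` and `max_dist`, the use of centroid coordinates, and the strict-majority rule are all assumptions for illustration, and the neighbour search is brute force for brevity.

```python
import math
from collections import Counter


def correct_labels(centroids, labels, k=3, max_dist=50.0):
    """Relabel a pseudo-word when a strict majority of its k nearest
    neighbours (within max_dist) share a different label.

    All votes use the original labels, so the result does not depend
    on the order in which pseudo-words are visited.
    """
    corrected = []
    for i, (xi, yi) in enumerate(centroids):
        # Brute-force distances to every other pseudo-word (O(n^2) overall).
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(centroids)
            if j != i
        )
        # Distance constraint: far-away blocks carry no context.
        votes = Counter(labels[j] for d, j in dists[:k] if d <= max_dist)
        new_label = labels[i]
        if votes:
            top, count = votes.most_common(1)[0]
            if count > k / 2:
                new_label = top
        corrected.append(new_label)
    return corrected


# Hypothetical layout: four pseudo-words on one line, one isolated block.
centroids = [(0, 0), (10, 0), (20, 0), (30, 0), (500, 0)]
labels = ["printed", "printed", "handwritten", "printed", "handwritten"]
print(correct_labels(centroids, labels))
# -> ['printed', 'printed', 'printed', 'printed', 'handwritten']
```

The isolated block keeps its label because no neighbour falls within `max_dist`, so isolated annotations are not overwritten by distant printed text.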