Handwritten and Printed Text Separation in Real Document
This paper aims to separate handwritten from printed text in real documents
that contain noise and graphics, including annotations. Pseudo-lines and
pseudo-words extracted with the run-length smoothing algorithm (RLSA) serve as
the basic blocks for classification. A multi-class support vector machine (SVM)
with a Gaussian kernel first labels each pseudo-word, taking its local
neighbourhood into account. Context is then propagated between neighbours to
correct possible labelling errors. To keep running time manageable, we propose
linear-complexity methods based on k-NN with a constraint; with a kd-tree, the
running time is almost linearly proportional to the number of pseudo-words. The
performance of our system is close to 90%, even with a very small learning
dataset whose samples consist mainly of complex administrative documents.
Comment: Machine Vision Applications (2013
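As an illustration of the run-length smoothing step mentioned in the abstract, here is a minimal sketch of horizontal RLSA on a binary image stored as lists of 0 (background) and 1 (ink). The function names, the `max_gap` parameter, and the choice to fill only gaps bounded by ink on both sides are assumptions made for this example; the paper's actual implementation and thresholds are not given here.

```python
def rlsa_row(row, max_gap):
    """Fill background gaps of at most max_gap pixels that lie between two ink pixels."""
    out = list(row)
    last_ink = None  # index of the most recent ink pixel seen so far
    for i, px in enumerate(row):
        if px == 1:
            gap = i - last_ink - 1 if last_ink is not None else 0
            if 0 < gap <= max_gap:
                for j in range(last_ink + 1, i):
                    out[j] = 1
            last_ink = i
    return out


def rlsa_horizontal(image, max_gap):
    """Apply the row-wise smoothing to every row of a binary image."""
    return [rlsa_row(row, max_gap) for row in image]


def row_blocks(row):
    """Return (start, end) index pairs of the ink runs in a row."""
    blocks, start = [], None
    for i, px in enumerate(row):
        if px == 1 and start is None:
            start = i
        elif px == 0 and start is not None:
            blocks.append((start, i - 1))
            start = None
    if start is not None:
        blocks.append((start, len(row) - 1))
    return blocks


row = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
print(row_blocks(rlsa_row(row, 2)))  # two blocks: the 4-pixel gap is kept
print(row_blocks(rlsa_row(row, 4)))  # one block: both gaps are filled
```

One plausible reading, not confirmed by the abstract, is that a small gap threshold merges characters into pseudo-words while a larger one merges words into pseudo-lines.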
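Separation of Handwritten and Printed Text in Complex Administrative Documents Using SVM and Clustering (original title: Séparation manuscrit et imprimé dans des documents administratifs complexes par utilisation de SVM et regroupement)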
This paper proposes a methodology for segmenting printed and handwritten zones in document images. The documents are mainly administrative and come from an unconstrained industrial setting: a large volume of pages must be processed each day, and because they come from different clients, their content, layout and digitization quality vary widely. The goal is to isolate all handwritten annotations from the rest of the page so that dedicated processing can then be applied separately to the printed and handwritten layers. To achieve this, we propose a four-step procedure: preprocessing, geometrical layout analysis at the pseudo-word level, classification using an SVM, and post-correction that integrates context to refine the segmentation. The classification rates are around 90% for segmenting printed, handwritten and noisy zones.
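The post-correction step can be pictured as a neighbourhood vote over pseudo-word labels. The sketch below is only an assumption about how such a step might look: the function names, the distance cap `max_dist` (standing in for the unspecified "constraint"), and the strict-majority rule are hypothetical. It uses a brute-force neighbour search, which is quadratic; a kd-tree would be needed to approach the near-linear behaviour the authors report.

```python
import math
from collections import Counter


def knn_indices(points, i, k, max_dist):
    """Indices of up to k nearest neighbours of points[i] within max_dist (brute force)."""
    xi, yi = points[i]
    candidates = []
    for j, (xj, yj) in enumerate(points):
        if j == i:
            continue
        d = math.hypot(xi - xj, yi - yj)
        if d <= max_dist:
            candidates.append((d, j))
    candidates.sort()
    return [j for _, j in candidates[:k]]


def correct_labels(points, labels, k=3, max_dist=50.0):
    """Relabel a pseudo-word when a strict majority of its neighbours disagrees with it."""
    corrected = list(labels)
    for i in range(len(points)):
        neighbours = knn_indices(points, i, k, max_dist)
        if not neighbours:
            continue  # isolated pseudo-word: keep the classifier's label
        # Votes use the original labels, so the result does not depend on visiting order.
        votes = Counter(labels[j] for j in neighbours)
        top, count = votes.most_common(1)[0]
        if top != labels[i] and count > len(neighbours) / 2:
            corrected[i] = top
    return corrected


# Hypothetical pseudo-word centroids and first-pass labels.
centroids = [(0, 0), (10, 0), (20, 0), (30, 0), (500, 500)]
first_pass = ["printed", "printed", "handwritten", "printed", "handwritten"]
print(correct_labels(centroids, first_pass))
```

In this toy input, the lone "handwritten" label inside a printed line is overturned by its neighbours, while the distant pseudo-word has no neighbours within range and keeps its label.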