The aim of this paper is to separate handwritten and printed text in real
documents that contain noise and graphics, including annotations. Pseudo-lines
and pseudo-words, extracted with the run-length smoothing algorithm (RLSA), are
used as the basic blocks for classification. A multi-class support vector
machine (SVM) with a Gaussian kernel performs a first labelling of each
pseudo-word. The local neighbourhood of each pseudo-word is then studied and
context is propagated between neighbours so that possible labelling errors can
be corrected. To keep the running time low, we propose linear-complexity
methods that use k-NN with constraints. When a kd-tree is used, the running
time is almost linearly proportional to the number of pseudo-words. The
recognition rate of our system is close to 90%, even when a very small learning
dataset is used, where the samples are complex administrative documents.

Comment: Machine Vision Applications (2013)

Belaïd, Abdel

D'Andecy, Vincent Poulain

Santosh, K. C.

English

arXiv


K.C., Santosh

Poulain d'Andecy, Vincent

INRIA a CCSD electronic archive server

HAL Id: hal-00799331
https://hal.inria.fr/hal-00799331v2
Submitted on 19 Mar 2013

Handwritten and Printed Text Separation in Real Document

Abdel Belaïd, K.C. Santosh
LORIA - Université de Lorraine
54506 Vandoeuvre-lès-Nancy, France
{abdel.belaid, santosh.kc}@loria.fr

Vincent Poulain d'Andecy
ITESOFT, Parc d'Andron, Le Séquoia,
30470, Aimargues, France
vincent.poulaindandecy@itesoft.com

To cite this version: Abdel Belaïd, Santosh K.C., Vincent Poulain d'Andecy. Handwritten and Printed Text Separation in Real Document. The Thirteenth IAPR International Conference on Machine Vision Applications - 2013, May 2013, Kyoto, Japan. hal-00799331v2

Abstract

The aim of this paper is to separate handwritten and printed text in real documents that contain noise and graphics, including annotations. Pseudo-lines and pseudo-words, extracted with the run-length smoothing algorithm (RLSA), are used as the basic blocks for classification. A multi-class support vector machine (SVM) with a Gaussian kernel performs a first labelling of each pseudo-word. The local neighbourhood of each pseudo-word is then studied and context is propagated between neighbours so that possible labelling errors can be corrected. To keep the running time low, we propose linear-complexity methods that use k-NN with constraints.
When a kd-tree is used, the running time is almost linearly proportional to the number of pseudo-words. The recognition rate of our system is close to 90%, even when a very small learning dataset is used, where the samples are complex administrative documents.

1 Introduction

Within document analysis and processing, this paper aims to separate handwritten and machine-printed text (H&P) so that further processing, such as document information exploitation and retrieval, becomes feasible. Such a separation is an important step because it allows retro-conversion to avoid heavy treatments and errors when transcribing the content.

Considering a continuous flow of administrative documents into our system, we face a variety of document types, content, quality and structure. Documents can be skewed, noisy and sometimes overlapped with graphics, i.e., lines and unconstrained annotations. In this context, most of the image samples need to be properly treated. Without integrating such tools, our system aims to extract the annotations whatever the language used in the document (French, German or English), the content (typed or handwritten) and the document structure (structured, e.g. tables; semi-structured, e.g. forms; or structure-free). Although segmentation has been studied for several years [1] and different methods have been proposed to solve particular aspects of the separation [2, 3], heterogeneous document separation remains an open problem. Another strong industrial constraint is to reduce running time so that the system can maintain speed. In addition, parameter-free methods are preferable since they can be applied more generally. In this paper, we are motivated by the work of Kandan et al. [4], where separation is made into two classes by using descriptors that are insensitive to translation, rotation and scaling.
Classifications using SVM and k-NN are first investigated, and a re-classification step is then performed using a Delaunay triangulation. Zheng et al. proposed two segmentation approaches and evaluated them on noisy documents [5]. The first one determines the most appropriate segmentation by comparing segmentation into words, lines and connected components. The second one deals with word classification by selecting 31 descriptors out of more than a hundred. They also introduce class information in order to take noise into account. A Fisher classifier is used to label the segmented blocks, and a Markov field then allows fine classification by considering the contextual information of each word.

Figure 1. Work-flow showing several consecutive stages, from pre-processing to output, i.e., H&P text separation: doc. image, pre-processing, pseudo-word segmentation, word model training, word classification, context propagation, output (information classes).

The rest of this paper is organised as follows. We start by detailing our proposed approach in Section 2. It mainly includes pre-processing, pseudo-word segmentation, word model training, word classification and pseudo-word grouping. Full experimental results and their analysis are reported in Section 3. The paper is concluded in Section 4, including a few perspectives.

2 The proposed approach

As illustrated in Fig. 1, our proposed approach consists of several consecutive steps: pre-processing, pseudo-word segmentation, word model training, word classification and context propagation. In what follows, we explain them one after another.

Preprocessing. Low-quality documents require significant pre-processing. Our pre-processing is composed of the following steps:
1. edge removal by using a rule system based on the shape and position of the connected components (CC);
2. noise filtering by using a modified k-fill [6];
3. slope detection by using the RAST method [7]; and
4. filtering by using the modified k-fill on the de-skewed document.

Figure 2. An example showing pre-processing: (a) input sample and (b) its corresponding output.

Pseudo-word segmentation. In this step, we create regular and stable areas that will be used to label the H&P zones in the document image. To handle this, we use a double RLSA, as presented in Algorithm 1, which aims to provide fine word segmentation.

In each of the extracted lines, smearing is performed first and the distances between the bounding boxes of adjacent CCs are then calculated. This allows us to construct a histogram of these distances, which generally contains two dominant peaks:
1. the first peak corresponds to the most frequent gap between CCs, which can be considered as the distance between characters of the same word; and
2. the second peak corresponds to the most frequent gap between words belonging to the same row.

The first peak thus gives the distance between the letters within a word, while the second peak determines the threshold to be used in pseudo-word segmentation. We can therefore apply a second smearing that allows a finer segmentation: because handwritten and printed words do not respect the same distances between letters and words, the segmentation is adapted to the content of each row. Fig. 3 illustrates the comparison between the classical and the double RLSA. Note that words are well segmented when the double RLSA is used, in contrast to the text blocks (sometimes containing several words) produced by the classical RLSA.

Figure 3. Segmentation comparison: (a) classical RLSA and (b) double RLSA. Extracted pseudo-words are framed and lines are identified by the color of the pseudo-words.

Word model training. As said before, we need reliable models to separate H&P information.
To obtain these models, we perform two-class learning from samples by taking representative words. We then select several specific descriptors belonging to four different categories:
1. morphological descriptors (local properties of pseudo-words such as height, width and number of pixels);
2. CC descriptors (11 descriptors as proposed in [5]);
3. pixel distribution (global descriptors such as invariant Hu moments and the variance of the projection profiles [4, 8]); and
4. other local properties such as run length, crossing count and bi-level co-occurrences, as described in [5].

Algorithm 1 Segmentation by double smearing
    lines ← smearing(image)
    for all line L in lines do
        liste_distances ← ∅
        for all CC c in line L do
            dmin ← distmin(liste_ccx, c)
            liste_distances ← add(liste_distances, dmin)
        end for
        compt ← bincount(liste_distances)
        histo ← compt[2::2] + compt[3::2]
        i ← argmax(histo)
        repeat
            previous ← histo[i]
            i ← i + 1
        until histo[i] > previous
        dhs ← i + 2
    end for

Classification. To handle pseudo-word classification, we employ an SVM. Although the initial goal is to separate only the H&P information, we use a multi-class SVM so that an additional class, i.e., noise, can be taken into account. Two approaches are commonly used for this: 1) the combination of bi-class SVMs and 2) the learning of a unique multi-class SVM (MSVM). MSVM is based on a principle similar to one-vs-all [9], where each class has its own decision function and the class corresponding to the function giving the highest value wins. The difference is that, for an MSVM with Q classes, the Q functions are learnt at the same time under the same constraints. A single optimization problem is solved by maximizing the sum of the margins for each class. There are four different methods, which differ in how the penalty is applied. We use the method presented by Weston and Watkins [10], which accumulates the penalty with respect to the margins of each class.
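As a rough illustration of the decision rule described above (each class has its own decision function and the class with the highest value wins), the following Python sketch evaluates Gaussian-kernel decision functions for three classes. It does not reproduce the joint training of Weston and Watkins; the support vectors, coefficients, biases and the `gamma` value in `MODELS` are hypothetical toy values chosen only for illustration.

```python
import math


def gaussian_kernel(u, v, gamma=0.5):
    """Gaussian (RBF) kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def decision_value(x, support_vectors, coefficients, bias, gamma=0.5):
    """Decision function of one class: sum_i coef_i * K(sv_i, x) + bias."""
    return sum(c * gaussian_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, coefficients)) + bias


def predict(x, models, gamma=0.5):
    """Return (winning class, all scores); the highest decision value wins."""
    scores = {label: decision_value(x, svs, coefs, bias, gamma)
              for label, (svs, coefs, bias) in models.items()}
    return max(scores, key=scores.get), scores


# Hypothetical toy models: (support vectors, coefficients, bias) per class.
# Real models would come from training on pseudo-word descriptors.
MODELS = {
    "handwritten": ([(0.0, 0.0)], [1.0], 0.0),
    "printed": ([(4.0, 0.0)], [1.0], 0.0),
    "noise": ([(0.0, 4.0)], [1.0], 0.0),
}

label, scores = predict((0.5, 0.2), MODELS)
print(label, scores)
```

With these toy values, a descriptor vector close to a class prototype receives the largest decision value for that class, which is the only behaviour the sketch is meant to show.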
The implementation is carried out on the Weka platform with the SMO classifier, with the extension of the problem to three classes by the one-vs-one method, as described in Mayoraz et al. [11].

Pseudo-word grouping. This re-grouping method uses spatial proximity to re-group elementary units. For each component, the k nearest neighbours are found and the label of the component is compared with those of its neighbours. If more than 50% of the neighbours share the same label, this label is assigned to the central component.

Generally speaking, since text is written horizontally, horizontal proximity between components is preferred to vertical proximity. We therefore define the distance as

$$ d(e_1, e_2) = \sqrt{(x_1 - x_2)^2 w_x^2 + (y_1 - y_2)^2 w_y^2} \tag{1} $$

where $x_i, y_i$ are the coordinates of the center of gravity of CC $e_i$, and $w_x, w_y$ are the weights corresponding to each axis. In a similar manner, a second distance is computed from the borders of the bounding boxes. Based on this framework, we explain three different algorithms, A1 to A3, in what follows.

Algorithm 2 k-NN grouping with constraints
    Require: ∀c ∈ C, old_label(c) ∈ L
    for all c ∈ C do
        Neighb ← k_nearest_neighbour(k, c, max_dist)
        n ← card(Neighb)
        new_label[c] ← old_label[c]
        for all class ∈ L do
            Nc ← {x | x ∈ Neighb, old_label[x] = class}
            if card(Nc) > n/2 then
                if Σ_{x ∈ Nc} area(x) > area(c)/2 then
                    new_label[c] ← class
                end if
                break
            end if
        end for
    end for

A1. Grouping by k-NN.
It employs a classical k-NN algorithm whose parameters are k and a threshold, max_dist. The k nearest neighbours are taken into account only if they are closer than the pre-defined max_dist. This distance parameter prevents far-away neighbours from interfering with the component. In our case, max_dist has been fixed to 1) 300 pixels for distance 1, and 2) 100 pixels for distance 2, with images at 300 dots per inch (dpi). Note that distance 2 is lower than distance 1 and depends on the relative positioning of the bounding boxes and their sizes. These thresholds, however, are image-resolution dependent.

A2. Grouping by k-NN with constraints.
The algorithm can be improved by preventing big components from being corrupted by small ones (such as noise). Before flipping the label of a component, we check whether the accumulated number of pixels of the neighbours contributing to the change of label is significant in comparison to the number of pixels of the tested component: in our tests, the sum should be at least 50% of the tested component. Note that the opposite does not apply: small components can still be regrouped with big ones, which helps to gather the main text with small components such as commas, apostrophes or accents. Moreover, big components contain more information, so they are generally more reliable and their classification is more accurate. An overall idea is presented in Algorithm 2.

A3. Grouping by confidence voting.
The classifier confidence helps to maintain the decision. Based on the idea of grouping via nearest neighbours with some specific constraints, we examine the confidence of the nearest neighbour of a selected pseudo-word. If the latter is stronger than that of the pseudo-word, the pseudo-word takes the class of its neighbour. A Gaussian or polynomial law can weight the neighbour confidence by its distance to the pseudo-word.

3 Experiments

3.1 Dataset and evaluation metric

Dataset. To perform the tests, we have selected 75 documents for learning and 300 documents for testing. As a reminder, these samples are taken from a real-world industrial problem.

Evaluation metric. Our evaluation of H&P separation is performed according to the measure proposed in [12].
All test documents have been perfectly labelled at pixel level, and performance is evaluated in terms of recognition rate:

$$ \text{Recognition rate} = \frac{\#\text{ of pixels correctly labelled}}{\#\text{ of pixels used}} \tag{2} $$

3.2 Results and analysis

Table 1 shows recognition rates for four grouping methods. The k-NN uses k = 2. The confidence-based methods use $f_{gauss}$, $f_{poly2}$ and $f_{poly4}$, respectively, as weighting functions:

$$ f_{gauss}(conf, dist) = conf \times \exp\left(-10^{-3} \cdot \frac{dist^2}{conf^2}\right) \tag{3} $$

$$ f_{poly2}(conf, dist) = -5 \cdot 10^{-4} \left(\frac{dist - 1}{conf}\right)^2 + conf \tag{4} $$

$$ f_{poly4}(conf, dist) = -10^{-6} \left(\frac{dist - 1}{conf}\right)^4 + conf \tag{5} $$

Table 1. Evaluation of four grouping methods (recognition rate, %).

Method                   Hand.   Print.   Noise   Average
Double smearing          96.1    98.5     35.7    89.48
k-NN                     93.4    98.3     27.3    89.54
k-NN with constraints    99.3    99.0     27.9    90.68
Gaussian confidence      94.5    97.7     27.2    87.49
Poly confidence 2 & 4    93.5    97.7     14.2    86.06

Based on the results reported in Table 1, we observe the following:
1. As expected, classification with k-NN grouping provides better results than double smearing alone, i.e., segmentation without re-grouping. In contrast, methods based on confidence degrade performance. This is mainly due to the fact that only the local vicinity (a single neighbour) is taken into account, which makes misclassification possible.
2. We have found cases where handwritten text mixes with printed text, and other cases where grouping changes the label of isolated handwritten annotations (e.g., a figure or a symbol). In these situations, more contextual information and a better interpretation are required, which is beyond the scope of the current work.

On the whole, for visual understanding, we provide a few examples of H&P text separation in Fig. 4. Furthermore, Fig. 5 shows a comparison between four classifiers: SVM, Tree C4.5 (J48 implementation), REPTree and NN. In this comparison, we have found that SVM performs the best, with only a marginal difference from NN. This means that MLP can still be applied.

Figure 4. A few examples of H&P text separation, illustrating the robustness of the proposed approach.

Figure 5. Evaluation of four classifiers.

4 Conclusion

In this paper, we have presented an approach to separate handwritten and machine-printed text, in addition to noise, in a scanned document. The method is based on a double smearing technique to obtain the pseudo-words, which serve as a basis for classification. For these pseudo-words, descriptors are extracted, all of which have linear complexity in the number of pixels. The descriptors are then fed into a multi-class SVM with a Gaussian kernel, which provides the first label of each pseudo-word. A second analysis is carried out by studying the local vicinity of each pseudo-word, which can change label if its neighbours are from another class. This integration of context allows several possible errors to be corrected. In our tests, we have found that the best method is k-NN with constraints, where a kd-tree has been used.

Considering our small learning database, the results are fairly encouraging and suggest that an appropriate commercial application is feasible. Based on our reported results, a long-term approach based on incremental learning is one of the further issues.

Acknowledgements

The authors would like to thank Didier Grzejszczak and Yves Rangoni for their help and Hervé Locteau for his high-level implementation to make the concept commercially useful. The work was done during their stay with us at LORIA - Université de Lorraine.

References

[1] Kang W.-X., Yang Q.-Q., Liang R.-P., The Comparative Research on Image Segmentation Algorithms, in: Proceedings of the ECTS, 2009, pp. 703-707.
[2] Chanda S., Franke K., Pal U., Structural handwritten and machine print classification for sparse content and arbitrary oriented document fragments, in: Proceedings of SAC, 2010, pp. 18-22.
[3] Peng X., Setlur S., Govindaraju V., Sitaram R., Handwritten text separation from annotated machine printed documents using Markov random fields, IJDAR, 16(1):1-16, 2011.
[4] Kandan R., N.Kumar R., Arvind K. R., Ramakrishnan A. G., A robust two level classification algorithm for text localization in documents, in: Proceedings of the Advances in visual computing, 2007, pp. 96-105.
[5] Zheng Y., Li H., Doermann D., The segmentation and identification of handwriting in noisy document images, in: Proceedings of DAS, 2002, pp. 95-105.
[6] Chinnasarn K., Rangsanseri Y., Thitimajshima P., Removing Salt-and-Pepper Noise in Text/Graphics Images, The Asia-Pacific Conference on Circuits and Systems, 1998, pp. 459-462.
[7] van Beusekom J., Shafait F., Breuel T. M., Combined orientation and skew detection using geometric text-line modeling, in: Proceedings of the ICDAR, 2010, pp. 79-92.
[8] da Silva L. F., Conci A., Sanchez A., Automatic Discrimination between Printed and Handwritten Text in Documents, in: Proceedings of the Brazilian Symposium on CGIP, 2009, pp. 261-267.
[9] Vapnik V. N., The nature of statistical learning theory, Springer-Verlag New York, Inc., New York, NY, USA, 1995.
[10] Weston J., Watkins C., Multi-class Support Vector Machines, Technical report, Royal Holloway, University of London, 1998.
[11] Mayoraz E., Alpaydin E., Support Vector Machines for Multi-class Classification, in: Proceedings of the ANN, 1999, pp. 833-842.
[12] Shafait F., Keysers D., Breuel T. M., Performance Evaluation and Benchmarking of Six-Page Segmentation Algorithms, IEEE-PAMI, 30(6):941-954, 2008.
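The gap-histogram analysis of Algorithm 1 (Segmentation by double smearing) can be sketched in Python as follows. This is a minimal reading of the pseudocode under stated assumptions, not the authors' implementation: the function name `word_gap_threshold` is hypothetical, gaps are assumed to be non-negative integers in pixels, the guards for empty input and for running off the end of the histogram are additions, and the final `i + 2` is taken literally from the line `dhs ← i + 2`.

```python
from collections import Counter


def word_gap_threshold(gaps):
    """Estimate the second-smearing threshold for one text line.

    `gaps` holds the horizontal distances (integer pixels) between adjacent
    connected components of the line.
    """
    if not gaps:
        return None
    counts = Counter(gaps)
    # compt: occurrences of each integer gap value (a bincount).
    compt = [counts.get(d, 0) for d in range(max(gaps) + 1)]
    # histo: merge bins pairwise, (2,3), (4,5), ..., to smooth the histogram.
    even, odd = compt[2::2], compt[3::2]
    odd = odd + [0] * (len(even) - len(odd))  # pad so both halves align
    histo = [a + b for a, b in zip(even, odd)]
    if not histo:
        return None
    # Start at the dominant peak (typical gap between characters).
    i = max(range(len(histo)), key=histo.__getitem__)
    # Walk right until the histogram rises again (or its end is reached).
    while True:
        previous = histo[i]
        i += 1
        if i >= len(histo) or histo[i] > previous:
            break
    return i + 2  # "dhs <- i + 2" in Algorithm 1


# Toy line: many small inter-character gaps, fewer large inter-word gaps.
example_gaps = [2] * 10 + [3] * 8 + [4] * 3 + [6] + [10] * 5 + [11] * 4
print(word_gap_threshold(example_gaps))
```

On the toy input, the returned value falls between the cluster of inter-character gaps (2 to 4 pixels) and the cluster of inter-word gaps (10 to 11 pixels), which is the role the text assigns to the second-smearing threshold.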

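Algorithm 2 (k-NN grouping with constraints), together with the weighted distance of Eq. (1), can be sketched in Python as below. Several choices are assumptions made for illustration only: the `Component` type and the `regroup` function are hypothetical names, the neighbour search is brute force rather than the kd-tree mentioned in the paper, and the weights `wx = 1.0`, `wy = 2.0` are arbitrary values that merely favour horizontal proximity.

```python
import math
from dataclasses import dataclass


@dataclass
class Component:
    x: float     # center of gravity, horizontal coordinate
    y: float     # center of gravity, vertical coordinate
    area: float  # number of pixels
    label: str   # label given by the first classification


def distance(a, b, wx=1.0, wy=2.0):
    """Weighted distance between centers of gravity, as in Eq. (1)."""
    return math.sqrt((a.x - b.x) ** 2 * wx ** 2 + (a.y - b.y) ** 2 * wy ** 2)


def regroup(components, k=2, max_dist=300.0, wx=1.0, wy=2.0):
    """Return new labels; all decisions use the old labels, as in Algorithm 2."""
    new_labels = []
    for c in components:
        ranked = sorted((distance(c, o, wx, wy), idx)
                        for idx, o in enumerate(components) if o is not c)
        # Keep at most k neighbours, and only those closer than max_dist.
        neigh = [components[idx] for d, idx in ranked[:k] if d < max_dist]
        label = c.label
        for cls in sorted({n.label for n in neigh}):
            same = [n for n in neigh if n.label == cls]
            if len(same) > len(neigh) / 2:           # strict majority
                # Constraint: the majority must weigh enough in pixels.
                if sum(n.area for n in same) > 0.5 * c.area:
                    label = cls
                break
        new_labels.append(label)
    return new_labels


components = [
    Component(0, 0, 400, "printed"),
    Component(50, 0, 400, "handwritten"),  # isolated label between printed words
    Component(100, 0, 400, "printed"),
    Component(1000, 0, 900, "printed"),    # big component next to small specks
    Component(1005, 3, 10, "noise"),
    Component(995, -3, 10, "noise"),
]
print(regroup(components))
```

In this toy layout the isolated "handwritten" word is relabelled as printed because both of its neighbours agree and are large enough, whereas the big printed component keeps its label because the two noise specks together cover far less than half of its area.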