918 research outputs found

    Separation of Overlapping and Touching Lines within Handwritten Arabic Documents

    The original publication is available at www.springerlink.com. In this paper, we propose an approach for separating overlapping and touching lines in handwritten Arabic documents. Our approach is based on morphological analysis of the terminal letters of Arabic words. Starting from four categories of possible endings, we use the angular variance to follow the connection and separate the endings. The proposed separation scheme was evaluated on 100 documents containing 640 overlapping and touching occurrences, reaching an accuracy of about 96.88%.

    Text Line Segmentation of Historical Documents: a Survey

    There is a huge number of historical documents in libraries and in various National Archives that have not been exploited electronically. Although automatic reading of complete pages remains, in most cases, a long-term objective, tasks such as word spotting, text/image alignment, authentication and extraction of specific fields are in use today. For all these tasks, a major step is document segmentation into text lines. Because of the low quality and the complexity of these documents (background noise, artifacts due to aging, interfering lines), automatic text line segmentation remains an open research field. The objective of this paper is to present a survey of existing methods, developed during the last decade, and dedicated to documents of historical interest. Comment: 25 pages, submitted version. To appear in International Journal on Document Analysis and Recognition. Online version available at http://www.springerlink.com/content/k2813176280456k3

    Detection of Text Lines of Handwritten Arabic Manuscripts using Markov Decision Processes

    In character recognition systems, the segmentation phase is critical since the accuracy of the recognition depends strongly on it. In this paper, we present an approach based on Markov Decision Processes to extract text lines from binary images of Arabic handwritten documents. The proposed approach detects the connected components belonging to the same line by making use of knowledge about the features and arrangement of those components. The initial results show that the system is promising for extracting Arabic handwritten lines.
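
The abstract above works on connected components of a binary image. As a minimal sketch of that preprocessing step only (not the Markov Decision Process itself, whose details the abstract does not give), the following labels 8-connected foreground regions. The function names, the list-of-rows image format, and the choice of 8-connectivity are all assumptions made for illustration.

```python
from collections import deque


def connected_components(image):
    """Return the 8-connected foreground (value 1) regions of a binary image.

    `image` is a list of equal-length rows; each region is a list of (row, col).
    This is a generic labelling routine, assumed here for illustration.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < height and 0 <= nx < width
                        if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def bounding_box(pixels):
    """Return (top, left, bottom, right) of a component."""
    rows = [p[0] for p in pixels]
    cols = [p[1] for p in pixels]
    return min(rows), min(cols), max(rows), max(cols)


# Hypothetical toy page: 1 = ink, 0 = background.
page = [
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 1, 1],
]
components = connected_components(page)
boxes = [bounding_box(pixels) for pixels in components]
```

Bounding boxes like these are one plausible source of the "features and arrangement" that a line-grouping stage could consume.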

    Segmentation of Arabic Handwritten Documents into Text Lines using Watershed Transform

    A crucial task in character recognition systems is the segmentation of the document into text lines, especially if it is handwritten. When dealing with non-Latin documents such as Arabic, the challenge becomes greater since, in addition to the variability of writing, the presence of diacritical points and the high number of ascender and descender characters further complicate the segmentation process. To cope with this complexity, and even to turn this difficulty into an advantage since the focus is on the Arabic language, which is semi-cursive in nature, a method based on the Watershed Transform technique is proposed. Tested on the «Handwritten Arabic Proximity Datasets», a segmentation rate of 93% for a matching score of 95% is achieved.

    Adaptive Algorithms for Automated Processing of Document Images

    Large scale document digitization projects continue to motivate interesting document understanding technologies such as script and language identification, page classification, segmentation and enhancement. Typically, however, solutions are still limited to narrow domains or regular formats such as books, forms, articles or letters and operate best on clean documents scanned in a controlled environment. More general collections of heterogeneous documents challenge the basic assumptions of state-of-the-art technology regarding quality, script, content and layout. Our work explores the use of adaptive algorithms for the automated analysis of noisy and complex document collections. We first propose, implement and evaluate an adaptive clutter detection and removal technique for complex binary documents. Our distance transform based technique aims to remove irregular and independent unwanted foreground content while leaving text content untouched. The novelty of this approach is in its determination of the best approximation of the clutter-content boundary with text-like structures. Second, we describe a page segmentation technique called Voronoi++ for complex layouts which builds upon the state-of-the-art method proposed by Kise [Kise1999]. Our approach does not assume structured text zones and is designed to handle multi-lingual text in both handwritten and printed form. Voronoi++ is a dynamically adaptive and contextually aware approach that considers components' separation features combined with Docstrum [O'Gorman1993] based angular and neighborhood features to form provisional zone hypotheses. These provisional zones are then verified based on the context built from local separation and high-level content features. Finally, our research proposes a generic model to segment and to recognize characters for any complex syllabic or non-syllabic script, using font-models. This concept is based on the fact that font files contain all the information necessary to render text and thus a model for how to decompose them. Instead of script-specific routines, this work is a step towards a generic character and recognition scheme for both Latin and non-Latin scripts.

    Junction Point Detection And Identification Of Broken Character In Touching Arabic Handwritten Text Using Overlapping Set Theory

    Touching characters are formed when two or more characters share the same space. Segmenting these touching characters is therefore a very challenging research topic, especially for degraded handwritten Arabic documents, and it is one of the key issues in the recognition of handwritten Arabic text. To make a recognition system more effective, the segmentation of touching handwritten Arabic characters is considered a very important research area. In this research, a new method is proposed that identifies the junction or common point of an Arabic touching word image by applying an overlapping (intersection) set theory operation. This helps to trace the correct boundary of the touching characters, identify the broken characters and segment the touching handwritten text efficiently. The proposed method has been evaluated on Arabic touching handwritten characters taken from handwritten datasets. The results show the efficiency of the proposed method. The proposed method is applicable to both degraded handwritten documents and printed documents.

    Segmentation of Touching Component in Arabic Manuscripts

    Touching components are connection zones occurring between text lines or words of the same line and are one of the problems that make unconstrained handwritten text segmentation very hard. In this paper, we propose a recognition-based method to separate these components once they are localized in Arabic manuscript images. For a given touching component, it first identifies a similar model stored in a dictionary with its correct segmentation, using a shape context descriptor and an interpolation function. Then, it segments the touching component based on the distance from the midpoints of the identified model's parts. Tests are performed using a database of touching components and two metrics: Manhattan and Euclidean distances. Experimental results show the effectiveness of the proposed segmentation method.
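
The abstract above segments a touching component by distance to the midpoints of the matched model's parts, under Manhattan and Euclidean metrics. The sketch below illustrates one plausible reading: each pixel goes to the part whose midpoint is nearest. The nearest-midpoint rule, the function names, and the sample coordinates are assumptions, not details confirmed by the abstract.

```python
import math


def manhattan(p, q):
    """City-block distance between two (row, col) points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def euclidean(p, q):
    """Straight-line distance between two (row, col) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def assign_to_parts(pixels, midpoints, metric):
    """Label each pixel with the index of its nearest part midpoint.

    Assumed rule for illustration; ties go to the lower index.
    """
    return [
        min(range(len(midpoints)), key=lambda i: metric(pixel, midpoints[i]))
        for pixel in pixels
    ]


# Hypothetical pixels of a touching component and two part midpoints.
pixels = [(0, 0), (4, 4), (0, 6)]
midpoints = [(3, 3), (0, 5)]

manhattan_labels = assign_to_parts(pixels, midpoints, manhattan)
euclidean_labels = assign_to_parts(pixels, midpoints, euclidean)
```

In this toy data the pixel (0, 0) is labelled differently by the two metrics (distance 6 vs 5 under Manhattan, about 4.24 vs 5 under Euclidean), which shows why comparing both metrics can change the resulting cut.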

    Segmentation Of Touching Arabic Characters In Handwritten Documents By Overlapping Set Theory And Contour Tracing

    Segmentation of handwritten words into characters is one of the challenging problems in the field of OCR. The presence of touching characters makes this problem even more difficult. There are many obstacles in the segmentation of touching Arabic handwritten text. Although researchers are working on the segmentation of touching characters, unsolved problems remain in the segmentation of touching offline Arabic handwritten characters, owing to the large variety of characters and their shapes. In this research, a new method for the segmentation of touching Arabic handwritten characters has been developed. The main idea of the proposed method is to segment the touching characters by identifying the touching point using overlapping set theory and the ending points of the Arabic word using standard morphological operations. After all the points are identified, a segmentation method is applied to trace the boundaries of the characters and separate the touching characters. Experiments were conducted on touching characters taken from different data sets. The results show the accuracy of the proposed method.

    General text line extraction approach based on locally orientation estimation

    ISBN: 978-0-8194-7927-3. This paper presents a novel approach for multi-oriented text line extraction from historical handwritten Arabic documents. Because of the multi-orientation of lines and their dispersion in the page, we use an image paving that allows us to progressively and locally determine the lines. The paving is initialized with a small window and then its size is corrected by extension until enough lines and connected components are found. We use the Snake for line extraction. Once the paving is established, the orientation is determined using the Wigner-Ville distribution on the histogram projection profile. This local orientation is then enlarged to limit the orientation in the neighborhood. Afterwards, the text lines are extracted locally in each zone based on the follow-up of the baselines and the proximity of connected components. Finally, the connected components that overlap and touch in adjacent lines are separated. The morphology analysis of the terminal letters of Arabic words is considered here. The proposed approach has been tested on 100 documents, reaching an accuracy of about 98.6%.
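
The abstract above relies on a histogram projection profile. As a minimal sketch of that building block only, the following computes a horizontal projection profile for a binary window and reads off bands of consecutive non-empty rows. It covers the classical horizontal case and does not implement the Wigner-Ville orientation estimation, the paving, or the Snake; the function names and the threshold parameter are assumptions.

```python
def horizontal_projection(image):
    """Count foreground (value 1) pixels in each row of a binary image."""
    return [sum(row) for row in image]


def text_line_bands(profile, threshold=0):
    """Return (first_row, last_row) for each run of rows above `threshold`.

    `threshold` is a hypothetical noise cutoff; 0 keeps every inked row.
    """
    bands = []
    start = None
    for row_index, count in enumerate(profile):
        if count > threshold and start is None:
            start = row_index  # a new band begins
        elif count <= threshold and start is not None:
            bands.append((start, row_index - 1))  # the band ended on the previous row
            start = None
    if start is not None:
        bands.append((start, len(profile) - 1))  # band runs to the bottom edge
    return bands


# Hypothetical toy window with two horizontal text lines.
window = [
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
profile = horizontal_projection(window)
bands = text_line_bands(profile)
```

A profile with clear gaps like this one only arises when lines are roughly horizontal within the window, which is one reason a local orientation estimate is needed for multi-oriented pages.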