7 research outputs found

    A Lexicon of Connected Components for Arabic Optical Text Recognition

    Arabic is a cursive script that does not lend itself to easy character segmentation. Hence, we suggest a unit that is discrete in nature, viz. the connected component, for Arabic text recognition. A lexicon listing valid Arabic connected components is necessary for any system that uses such a unit. Here, we produce and analyze a comprehensive lexicon of connected components. A lexicon can be extracted from corpora or synthesized from morphemes. We follow both approaches and merge their results. In addition, generating a lexicon of connected components involves extra tokenization and point-normalization steps to keep the size of the lexicon tractable. We produce a lexicon of surface words, reduce it to a lexicon of connected components, and finally to a lexicon of point-normalized connected components. The lexicon of point-normalized connected components contains 684,743 entries, a decrease of 97.17% relative to the word lexicon.
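
    The reduction from words to point-normalized connected components can be illustrated with a minimal Python sketch. It is not the authors' procedure: the set of letters that do not join to the following letter is a simplified assumption, the dotless mapping covers only a few letter groups and ignores position-dependent shapes, and the function names are hypothetical.

```python
# Letters that never join to the FOLLOWING letter (simplified assumption):
# alef variants, dal, thal, reh, zain, waw, waw with hamza, teh marbuta.
RIGHT_JOINING = set("\u0627\u0622\u0623\u0625\u062F\u0630\u0631\u0632\u0648\u0624\u0629")
NON_JOINING = {"\u0621"}  # isolated hamza joins to neither side

# Hypothetical, partial point normalization: dotted letters -> dotless skeleton.
_DOTLESS = {
    "\u0628\u062A\u062B": "\u066E",  # beh, teh, theh -> dotless beh
    "\u062C\u062E": "\u062D",        # jeem, khah -> hah
    "\u0630": "\u062F",              # thal -> dal
    "\u0632": "\u0631",              # zain -> reh
    "\u0634": "\u0633",              # sheen -> seen
    "\u0636": "\u0635",              # dad -> sad
    "\u0638": "\u0637",              # zah -> tah
    "\u063A": "\u0639",              # ghain -> ain
}
POINT_MAP = str.maketrans(
    {src: dst for group, dst in _DOTLESS.items() for src in group}
)


def connected_components(word):
    """Split a word wherever a letter does not join to the next one."""
    parts, current = [], ""
    for ch in word:
        if ch in NON_JOINING:
            # Close the running component, then emit the letter on its own.
            if current:
                parts.append(current)
                current = ""
            parts.append(ch)
            continue
        current += ch
        if ch in RIGHT_JOINING:
            # This letter ends the component it belongs to.
            parts.append(current)
            current = ""
    if current:
        parts.append(current)
    return parts


def point_normalize(component):
    """Replace dotted letters by their dotless skeleton (partial mapping)."""
    return component.translate(POINT_MAP)


def component_lexicon(words):
    """Set of point-normalized connected components over a word list."""
    return {point_normalize(c) for w in words for c in connected_components(w)}
```

    Under these assumptions, the three words باب, تاب and ثاب collapse to only two lexicon entries, which hints at why both reduction steps shrink the lexicon so sharply.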

    A Fast Method for Recognition of Handwritten Arabic Script (Une méthode rapide de reconnaissance de l'écriture arabe manuscrite)

    We describe a fast method for offline recognition of handwritten Arabic script. The recognition problem is divided into sub-tasks that are distributed among several agents. Arabic words are segmented into graphemes by analyzing the upper contour of the connected components, which serves as the signal for detecting the primary segmentation points (PSP). A local analysis then determines the decisive segmentation points (PSD). The primitives associated with each word are analyzed in a first recognition module, where the decision is made by maximum likelihood. A second module labels the HMM observations with respect to the characters. The results of the two modules are analyzed

    Arabic Text Steganography Using Multiple Diacritics

    Steganography techniques are concerned with hiding the existence of data in other cover media. Today, text steganography has become particularly popular. This paper presents a new idea for using Arabic text in steganography. The main idea is to superimpose multiple invisible instances of Arabic diacritic marks over each other. This is possible because of the way in which diacritic marks are displayed on screen and printed on paper. Two approaches and several scenarios are proposed. The main advantage is arbitrary capacity. The approach was compared to other similar methods in terms of overhead on capacity, and was shown to exceed any of them easily, provided the correct scenario is chosen.
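
    One way to picture the idea is the Python sketch below, which hides a list of small integers by repeating each fatha in the cover text that many extra times. This is a hypothetical scheme for illustration only: the abstract does not specify the paper's two approaches or its scenarios, and whether repeated marks really render on top of each other depends on the font and the text renderer.

```python
FATHA = "\u064E"  # Arabic fatha, a combining diacritic mark


def embed(cover, values):
    """Repeat the i-th fatha of `cover` values[i] extra times.

    Fathas beyond len(values) are left untouched; values beyond the number
    of fathas in the cover are silently dropped (a limitation of this sketch).
    """
    out = []
    remaining = iter(values)
    for ch in cover:
        out.append(ch)
        if ch == FATHA:
            out.append(FATHA * next(remaining, 0))
    return "".join(out)


def extract(stego):
    """Return, for each run of fathas, the number of extra repetitions."""
    values = []
    i = 0
    while i < len(stego):
        if stego[i] == FATHA:
            j = i
            while j < len(stego) and stego[j] == FATHA:
                j += 1
            values.append(j - i - 1)  # run length minus the original mark
            i = j
        else:
            i += 1
    return values
```

    In this sketch the capacity per diacritic is bounded only by how many repetitions the sender is willing to add, which mirrors the arbitrary capacity claimed in the abstract, at the cost of a longer stego text.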

    Character Segmentation of Sindhi, an Arabic Style Scripting Language, using Height Profile Vector

    In this paper, the problem of segmenting sub-words of printed Sindhi, an Arabic-style scripting language, into characters is addressed. Printed or handwritten Sindhi text is cursive in nature: in cursive writing, successive characters in a word are mostly joined to each other. In the proposed segmentation algorithm, the Height Profile Vector (HPV) of the thinned primary stroke of a sub-word is first calculated and analyzed to segment the sub-word into its constituent characters. The number and locations of possible segmentation points (PSPs) are determined. The number of PSPs gives a rough estimate of the number of characters in the sub-word. The data around the last PSP is further analyzed to determine the exact number of characters in the sub-word. As the Sindhi character set is a superset of the Arabic character set, the proposed segmentation algorithm may be used to segment text written in other Arabic-script languages.
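
    A height profile and candidate segmentation points might be computed along the following lines. This Python sketch rests on assumptions the abstract does not confirm: the profile is taken as the per-column height of the topmost stroke pixel, and a candidate point is the centre of a run of columns at a given baseline height with taller columns on both sides. The function names and the `baseline_height` parameter are hypothetical.

```python
def height_profile(image):
    """Per-column height of the topmost foreground pixel.

    `image` is a list of rows (top to bottom) of 0/1 values for a thinned
    stroke. Height is measured from the bottom row; empty columns give 0.
    """
    rows, cols = len(image), len(image[0])
    profile = []
    for c in range(cols):
        height = 0
        for r in range(rows):
            if image[r][c]:
                height = rows - r
                break
        profile.append(height)
    return profile


def candidate_points(profile, baseline_height):
    """Centre column of each baseline-height run flanked by taller columns."""
    points = []
    n = len(profile)
    c = 0
    while c < n:
        if profile[c] != baseline_height:
            c += 1
            continue
        start = c
        while c < n and profile[c] == baseline_height:
            c += 1
        end = c - 1
        # Keep the run only if it lies strictly between two taller columns.
        if (start > 0 and end < n - 1
                and profile[start - 1] > baseline_height
                and profile[end + 1] > baseline_height):
            points.append((start + end) // 2)
    return points
```

    In this sketch the number of returned points plays the role of the rough character-count estimate mentioned above; the further analysis around the last PSP is not modelled.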

    Geometric correction of historical Arabic documents

    Geometric deformations in historical documents significantly influence the success of both Optical Character Recognition (OCR) techniques and human readability. They may have been introduced at any time during the life cycle of a document, from when it was first printed to the time it was digitised by an imaging device. This thesis focuses on the challenging domain of geometric correction of Arabic historical documents, where background research has highlighted that existing approaches for geometric correction of Latin-script historical documents are not sensitive to the characteristics of text in Arabic documents and therefore cannot be applied successfully. Text line segmentation and baseline detection algorithms have been investigated in order to propose a new, more suitable algorithm for warped Arabic historical document images. Advanced ideas for performing dewarping and geometric restoration on historical Arabic documents, as dictated by the specific characteristics of the problem, have been implemented. In addition to developing an algorithm to detect accurate baselines of historical printed Arabic documents, the research also contributes a new dataset consisting of historical Arabic documents with different degrees of warping severity. Overall, a new dewarping system, the first for historical Arabic documents, has been developed, taking into account both global and local features of the text image and the patterns of the smooth distortion between text lines. By using the results of the proposed line segmentation and baseline detection methods, it can cope with a variety of distortions, such as page curl, arbitrary warping and fold.