5 research outputs found

Human Reading Based Strategies for off-line Arabic Word Recognition

Author: Belaïd Abdel
Choisy Christophe
Publication venue: HAL CCSD
Publication date: 27/09/2006
This paper summarizes some techniques proposed for off-line Arabic word recognition. The point of view developed here concerns human reading, which favors an interactive mechanism between global memorization and local checking that eases the recognition of complex scripts such as Arabic. Following this view, some specific papers are analyzed and their strategies commente

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

Arabic natural language processing: handwriting recognition

Author: Belaïd Abdel
Publication venue: HAL CCSD
Publication date: 16/12/2008
The automatic recognition of Arabic writing is a very young research discipline with challenging and significant problems. In the era of the Internet and multimedia, Arabic recognition can contribute, like its neighboring disciplines (Latin writing recognition, speech recognition and vision processing), to current applications around digital libraries, document security and numerical data processing in general. Arabic is a Semitic language spoken and understood in various forms by millions of people throughout the Middle East and Africa, and it is used by 234 million people worldwide. Furthermore, the Arabic script gave rise to several other alphabets, such as those of Farsi and Urdu, which greatly increases interest in this script. Farsi is the main language used in Iran and Afghanistan; it is spoken by more than 110 million people, including some people in Tajikistan and Pakistan. Urdu is an Indo-Aryan language with about 104 million speakers. It is the national language of Pakistan and is closely related to Hindi, though a lot of Urdu vocabulary comes from Persian and Arabic, which is not the case for Hindi. Urdu has been written with a version of the Perso-Arabic script since the 12th century and is normally written in Nastaliq style.

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

A Novel Approach for the Recognition of a wide Arabic Handwritten Word Lexicon

Author: Belaïd Abdel
Ben Cheikh Imen
Kacem Afef
Publication venue: HAL CCSD
Publication date: 07/12/2008
This paper introduces a novel approach for the recognition of a wide vocabulary of Arabic handwritten words. There is an essential difference between the global and analytic approaches in pattern recognition. While the global approach is limited to a reduced vocabulary, the analytic approach can recognize a wide vocabulary but runs into the problems of word segmentation, especially for Arabic. By combining a neural approach with some linguistic characteristics of Arabic, the method is expected to improve recognition and to handle a large vocabulary of Arabic handwritten words. The proposed approach invokes two transparent neural networks, TNN_1 and TNN_2, to recognize, respectively, roots, schemes and the elements of conjugation from the structural primitives of the words. The approach was evaluated using real examples from a database established for this purpose. The results are promising, and suggestions for improvements are proposed.

Crossref

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

End-Shape Analysis for Automatic Segmentation of Arabic Handwritten Texts

Author: Jamal Amani
Publication date: 30/07/2015
Word segmentation is an important task for many document understanding methods, especially word spotting and word recognition. Several word segmentation approaches have been proposed for Latin-based languages, while only a few have been introduced for Arabic texts. Because Arabic writing is cursive by nature and unconstrained, with no clear boundaries between words, processing Arabic handwritten text is a more challenging problem. In this thesis, the design and implementation of an End-Shape Letter (ESL) based segmentation system for Arabic handwritten text is presented. This incorporates four novel aspects: (i) removal of secondary components, (ii) baseline estimation, (iii) ESL recognition, and (iv) the creation of a new off-line CENPARMI ESL database. Arabic texts include small connected components, also called secondary components. Removing these components can improve the performance of several systems such as baseline estimation. Thus, a robust method to remove secondary components that takes into consideration the challenges of Arabic handwriting is introduced. The method reconstructs the image based on some criteria. The results of this method were subsequently compared with those of two other methods that used the same database. The results show that the proposed method is effective. Baseline estimation is a challenging task for Arabic texts since they include ligatures, overlapping, and secondary components. Therefore, we propose a learning-based approach that addresses these challenges. Our method analyzes the image and extracts baseline-dependent features. Then, the baseline is estimated using a classifier. Algorithms dealing with text segmentation usually analyze the gaps between connected components. These algorithms are based on metric calculation, threshold finding, and/or gap classification. We use two well-known metrics, bounding box and convex hull, to test the metric-based method on Arabic handwritten texts and to include this technique in our approach. To determine the threshold, an unsupervised learning approach, known as the Gaussian Mixture Model, is used. Our ESL-based segmentation approach extracts the final letter of a word using a rule-based technique and recognizes these letters using the implemented ESL classifier. To demonstrate the benefit of text segmentation, a holistic word spotting system is implemented. For this system, a word recognition system is implemented. A series of experiments with different sets of features are conducted. The system shows promising results.
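
The abstract mentions using a Gaussian Mixture Model to find the threshold that separates gaps between words from gaps within words. The thesis's actual procedure is not given here, so the following is only a minimal sketch under stated assumptions: the gap metric is a one-dimensional width (for instance, the horizontal distance between bounding boxes), the mixture has exactly two components, and a gap is labelled a word gap when the wider component has the higher posterior. The function names and the sample widths are hypothetical.

```python
import math


def fit_two_gaussians(values, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with plain EM.

    Assumption: two components (within-word gaps and between-word gaps).
    """
    lo, hi = min(values), max(values)
    means = [lo, hi]                      # start the components at the extremes
    spread = max((hi - lo) / 4.0, 1e-3)
    variances = [spread ** 2, spread ** 2]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: posterior probability of each component for each value
        resp = []
        for x in values:
            p = [
                weights[k]
                * math.exp(-((x - means[k]) ** 2) / (2 * variances[k]))
                / math.sqrt(2 * math.pi * variances[k])
                for k in range(2)
            ]
            total = sum(p) or 1e-300      # guard against division by zero
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean and variance of each component
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            variances[k] = max(var, 1e-6)  # floor keeps the density finite
    return weights, means, variances


def classify_gaps(values):
    """Return True for each gap that the wider component explains best."""
    weights, means, variances = fit_two_gaussians(values)
    wide = 0 if means[0] > means[1] else 1
    labels = []
    for x in values:
        p = [
            weights[k]
            * math.exp(-((x - means[k]) ** 2) / (2 * variances[k]))
            / math.sqrt(2 * math.pi * variances[k])
            for k in range(2)
        ]
        labels.append(p[wide] > p[1 - wide])
    return labels


# Hypothetical gap widths in pixels: small values inside words, large between words.
gaps = [2, 3, 2, 4, 3, 15, 18, 16, 3, 2, 17]
word_gap = classify_gaps(gaps)
```

Comparing posteriors avoids hand-picking a fixed pixel threshold, which is the appeal of an unsupervised model when handwriting size varies between writers.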

Concordia University Research Repository

A high level approach to Arabic sentence recognition

Author: Krayem AG
Publication date: 01/09/2013
The aim of this work is to develop a sentence recognition system inspired by the human reading process. Cognitive studies have observed that humans tend to read a word as a whole. The reader considers the global word shape and uses contextual knowledge to infer and discriminate a word among other possible words. The sentence recognition system is a fully integrated system: a word-level recogniser (the baseline system) integrated with a linguistic knowledge post-processing module. The presented baseline system is a holistic word-based recognition approach characterised as a probabilistic ranking task. The output of the system is multiple recognition hypotheses (an N-best word lattice). The basic unit is the word rather than the character; the system does not rely on any segmentation or require baseline detection. The linguistic knowledge used to re-rank the output of the baseline system is the standard n-gram Statistical Language Model (SLM). The candidates are re-ranked by exploiting a phrase perplexity score. The system is an OCR system based on HMMs built with the HTK Toolkit. The baseline system is supported by global transformation features extracted from binary word images. The adopted feature extraction technique is the block-based Discrete Cosine Transform (DCT) applied to the whole word image. Feature vectors are extracted using block-based DCT with non-overlapping sub-blocks of size 8x8 pixels. The HMMs applied to the task are mono-model discrete one-dimensional HMMs (Bakis model). A balanced database of actual scanned and synthetic word images has been constructed to ensure an even distribution of word samples. The Arabic words are typewritten in five fonts at a size of 14 points in a plain style. The statistical language models and lexicon words are extracted from The Holy Qur'an. The systems are applied to word images with no overlap between the training and testing datasets. The actual scanned database is used to evaluate the word recogniser. The synthetic database is a large amount of data acquired for reliable training of sentence recognition systems. The word recogniser is evaluated in mono-font and multi-font contexts. The two types of word recogniser achieve a final recognition accuracy of 99.30% and 73.47% in mono-font and multi-font, respectively. The average accuracy achieved by the sentence recogniser is 67.24%, improved to 78.35% on average when using 5-gram post-processing. The complexity and accuracy of the post-processing module are evaluated, and 4-gram is found to be more suitable than 5-gram; it is much faster, at an average improvement of 76.89%.
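
The feature extraction step described above (block-based DCT over non-overlapping 8x8 sub-blocks) can be illustrated with a short sketch. This is not the thesis's implementation: the orthonormal DCT-II scaling, the choice to keep only the top-left `keep` x `keep` low-frequency coefficients of each block, and the function names are all assumptions made for illustration, since the abstract does not say which coefficients are retained.

```python
import math


def dct2_block(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of lists."""
    n = len(block)

    def alpha(k):
        # Scaling that makes the transform orthonormal (an assumed convention).
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (
                        block[x][y]
                        * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                        * math.cos((2 * y + 1) * v * math.pi / (2 * n))
                    )
            out[u][v] = alpha(u) * alpha(v) * s
    return out


def block_dct_features(image, block_size=8, keep=3):
    """Split the image into non-overlapping blocks and keep the
    top-left keep x keep DCT coefficients of each block.

    `keep` is a hypothetical parameter; partial blocks at the right and
    bottom edges are skipped in this sketch.
    """
    rows, cols = len(image), len(image[0])
    features = []
    for r in range(0, rows - block_size + 1, block_size):
        for c in range(0, cols - block_size + 1, block_size):
            block = [row[c:c + block_size] for row in image[r:r + block_size]]
            coeffs = dct2_block(block)
            features.append([coeffs[u][v] for u in range(keep) for v in range(keep)])
    return features


# A uniform 16x16 "image" yields four blocks whose energy sits in the DC term.
image = [[1.0] * 16 for _ in range(16)]
vectors = block_dct_features(image, block_size=8, keep=2)
```

In a discrete-HMM pipeline such as the one described, vectors like these would normally be quantised into symbols before training; that step is omitted here.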

Nottingham Trent Institutional Repository (IRep)