13 research outputs found
Development of Comprehensive Devnagari Numeral and Character Database for Offline Handwritten Character Recognition
In handwritten character recognition, benchmark database plays an important
role in evaluating the performance of various algorithms and the results
obtained by various researchers. In Devnagari script, there is lack of such
official benchmark. This paper focuses on the generation of offline benchmark
database for Devnagari handwritten numerals and characters. The present work
generated 5137 and 20305 isolated samples for numeral and character database,
respectively, from 750 writers of all ages, sex, education, and profession. The
offline sample images are stored in TIFF image format as it occupies less
memory. Also, the data is presented in binary level so that memory requirement
is further reduced. It will facilitate research on handwriting recognition of
Devnagari script through free access to the researchers.Comment: 5 pages, 8 figures, journal pape
Handwritten Character Recognition of South Indian Scripts: A Review
Handwritten character recognition is always a frontier area of research in
the field of pattern recognition and image processing and there is a large
demand for OCR on hand written documents. Even though, sufficient studies have
performed in foreign scripts like Chinese, Japanese and Arabic characters, only
a very few work can be traced for handwritten character recognition of Indian
scripts especially for the South Indian scripts. This paper provides an
overview of offline handwritten character recognition in South Indian Scripts,
namely Malayalam, Tamil, Kannada and Telungu.Comment: Paper presented on the "National Conference on Indian Language
Computing", Kochi, February 19-20, 2011. 6 pages, 5 figure
A Large Multi-Target Dataset of Common Bengali Handwritten Graphemes
Latin has historically led the state-of-the-art in handwritten optical
character recognition (OCR) research. Adapting existing systems from Latin to
alpha-syllabary languages is particularly challenging due to a sharp contrast
between their orthographies. The segmentation of graphical constituents
corresponding to characters becomes significantly hard due to a cursive writing
system and frequent use of diacritics in the alpha-syllabary family of
languages. We propose a labeling scheme based on graphemes (linguistic segments
of word formation) that makes segmentation in-side alpha-syllabary words linear
and present the first dataset of Bengali handwritten graphemes that are
commonly used in an everyday context. The dataset contains 411k curated samples
of 1295 unique commonly used Bengali graphemes. Additionally, the test set
contains 900 uncommon Bengali graphemes for out of dictionary performance
evaluation. The dataset is open-sourced as a part of a public Handwritten
Grapheme Classification Challenge on Kaggle to benchmark vision algorithms for
multi-target grapheme classification. The unique graphemes present in this
dataset are selected based on commonality in the Google Bengali ASR corpus.
From competition proceedings, we see that deep-learning methods can generalize
to a large span of out of dictionary graphemes which are absent during
training. Dataset and starter codes at www.kaggle.com/c/bengaliai-cv19.Comment: 15 pages, 12 figures, 6 Tables, Submitted to CVPR-2
Advances in Character Recognition
This book presents advances in character recognition, and it consists of 12 chapters that cover wide range of topics on different aspects of character recognition. Hopefully, this book will serve as a reference source for academic research, for professionals working in the character recognition field and for all interested in the subject
Recognition of off-line arabic handwritten dates and numeral strings
In this thesis, we present an automatic recognition system for CENPARMI off-line Arabic handwritten dates collected from Arabic Nationalities. This system consists of modules that segment and recognize an Arabic handwritten date image. First, in the segmentation module, the system explicitly segments a date image into a sequence of basic constituents or segments. As a part of this module, a special sub-module was developed to over-segment any constituent that is a candidate for a touching pair. The proposed touching pair segmentation submodule has been tested on three different datasets of handwritten numeral touching pairs: The CENPARMI Arabic [6], Urdu, and Dari [24] datasets. The final recognition rates of 92.22%, 90.43%, and 86.10% were achieved for Arabic, Urdu and Dari, respectively. Afterwards, the segments are preprocessed and sent to the classification module. In this stage, feature vectors are extracted and then recognized by an isolated numeral classifier. This recognition system has been tested in five different isolated numeral databases: The CENPARMI Arabic [6], Urdu, Dari [24], Farsi, and Pashto databases with overall recognition rates of 97.29% 97.75%, 97.75%, 97.95% and 98.36%, respectively. Finally, a date post processing module is developed to improve the recognition results. This post processing module is used in two different stages. First, in the date stage, to verify that the segmentation/recognition output represents a valid date image and it chooses the best date format to be assigned to this image. Second, in the sub-field stage, to evaluate the values for the date three parts: day, month and year. Experiments on two different databases of Arabic handwritten dates: CENPARMI Arabic database [6] and the CENPARMI Arabic Bank Cheques database [7], show encouraging results with overall recognition rates of 85.05% and 66.49, respectively
Design of an Offline Handwriting Recognition System Tested on the Bangla and Korean Scripts
This dissertation presents a flexible and robust offline handwriting recognition system which is tested on the Bangla and Korean scripts. Offline handwriting recognition is one of the most challenging and yet to be solved problems in machine learning. While a few popular scripts (like Latin) have received a lot of attention, many other widely used scripts (like Bangla) have seen very little progress. Features such as connectedness and vowels structured as diacritics make it a challenging script to recognize. A simple and robust design for offline recognition is presented which not only works reliably, but also can be used for almost any alphabetic writing system. The framework has been rigorously tested for Bangla and demonstrated how it can be transformed to apply to other scripts through experiments on the Korean script whose two-dimensional arrangement of characters makes it a challenge to recognize.
The base of this design is a character spotting network which detects the location of different script elements (such as characters, diacritics) from an unsegmented word image. A transcript is formed from the detected classes based on their corresponding location information. This is the first reported lexicon-free offline recognition system for Bangla and achieves a Character Recognition Accuracy (CRA) of 94.8%. This is also one of the most flexible architectures ever presented. Recognition of Korean was achieved with a 91.2% CRA. Also, a powerful technique of autonomous tagging was developed which can drastically reduce the effort of preparing a dataset for any script. The combination of the character spotting method and the autonomous tagging brings the entire offline recognition problem very close to a singular solution.
Additionally, a database named the Boise State Bangla Handwriting Dataset was developed. This is one of the richest offline datasets currently available for Bangla and this has been made publicly accessible to accelerate the research progress. Many other tools were developed and experiments were conducted to more rigorously validate this framework by evaluating the method against external datasets (CMATERdb 1.1.1, Indic Word Dataset and REID2019: Early Indian Printed Documents). Offline handwriting recognition is an extremely promising technology and the outcome of this research moves the field significantly ahead