58 research outputs found

    Benchmark Classification of Handwritten Dataset by New Operator

    Get PDF
    In recent years, many new classifiers and feature extraction algorithms were proposed and tested on various OCR databases and these techniques were used in wide applications. Various systematic papers and inventions in OCR were reported in the literature. We can say that OCR is one of the most important and active research areas in the pattern recognition. Today, research OCR is dealing with diverse a character of complex problems. Important research in OCR includes the text degraded (heavy noise) and analysis/recognition of complex documents (including texts, images, graphs, tables and video documents). In this proposed system we are suing a new operator Recognition of Devnagari handwritten Characters one of the biggest problem in present scenario. Devnagari characters are not recognized efficiently and truthfully by electronic device. Many researchers and algorithm have been proposed for recognizing of characters. For recognizing of characters, many processes have to be performed but no single technique or algorithm can perform that recognition and give more accurate result. objective of this dissertation work is to propose a new operator, the name of this operator is Kirsch Operator and algorithm for getting accurate result

    A Study of Techniques and Challenges in Text Recognition Systems

    Get PDF
    The core system for Natural Language Processing (NLP) and digitalization is Text Recognition. These systems are critical in bridging the gaps in digitization produced by non-editable documents, as well as contributing to finance, health care, machine translation, digital libraries, and a variety of other fields. In addition, as a result of the pandemic, the amount of digital information in the education sector has increased, necessitating the deployment of text recognition systems to deal with it. Text Recognition systems worked on three different categories of text: (a) Machine Printed, (b) Offline Handwritten, and (c) Online Handwritten Texts. The major goal of this research is to examine the process of typewritten text recognition systems. The availability of historical documents and other traditional materials in many types of texts is another major challenge for convergence. Despite the fact that this research examines a variety of languages, the Gurmukhi language receives the most focus. This paper shows an analysis of all prior text recognition algorithms for the Gurmukhi language. In addition, work on degraded texts in various languages is evaluated based on accuracy and F-measure

    uTHCD: A New Benchmarking for Tamil Handwritten OCR

    Full text link
    Handwritten character recognition is a challenging research in the field of document image analysis over many decades due to numerous reasons such as large writing styles variation, inherent noise in data, expansive applications it offers, non-availability of benchmark databases etc. There has been considerable work reported in literature about creation of the database for several Indic scripts but the Tamil script is still in its infancy as it has been reported only in one database [5]. In this paper, we present the work done in the creation of an exhaustive and large unconstrained Tamil Handwritten Character Database (uTHCD). Database consists of around 91000 samples with nearly 600 samples in each of 156 classes. The database is a unified collection of both online and offline samples. Offline samples were collected by asking volunteers to write samples on a form inside a specified grid. For online samples, we made the volunteers write in a similar grid using a digital writing pad. The samples collected encompass a vast variety of writing styles, inherent distortions arising from offline scanning process viz stroke discontinuity, variable thickness of stroke, distortion etc. Algorithms which are resilient to such data can be practically deployed for real time applications. The samples were generated from around 650 native Tamil volunteers including school going kids, homemakers, university students and faculty. The isolated character database will be made publicly available as raw images and Hierarchical Data File (HDF) compressed file. With this database, we expect to set a new benchmark in Tamil handwritten character recognition and serve as a launchpad for many avenues in document image analysis domain. Paper also presents an ideal experimental set-up using the database on convolutional neural networks (CNN) with a baseline accuracy of 88% on test data.Comment: 30 pages, 18 figures, in IEEE Acces

    A novel image matching approach for word spotting

    Get PDF
    Word spotting has been adopted and used by various researchers as a complementary technique to Optical Character Recognition for document analysis and retrieval. The various applications of word spotting include document indexing, image retrieval and information filtering. The important factors in word spotting techniques are pre-processing, selection and extraction of proper features and image matching algorithms. The Correlation Similarity Measure (CORR) algorithm is considered to be a faster matching algorithm, originally defined for finding similarities between binary patterns. In the word spotting literature the CORR algorithm has been used successfully to compare the GSC binary features extracted from binary word images, i.e., Gradient, Structural and Concavity (GSC) features. However, the problem with this approach is that binarization of images leads to a loss of very useful information. Furthermore, before extracting GSC binary features the word images must be skew corrected and slant normalized, which is not only difficult but in some cases impossible in Arabic and modified Arabic scripts. We present a new approach in which the Correlation Similarity Measure (CORR) algorithm has been used innovatively to compare Gray-scale word images. In this approach, binarization of images, skew correction and slant normalization of word images are not required at all. The various features, i.e., projection profiles, word profiles and transitional features are extracted from the Gray-scale word images and converted into their binary equivalents, which are compared via CORR algorithm with greater speed and higher accuracy. The experiments have been conducted on Gray-scale versions of newly created handwritten databases of Pashto and Dari languages, written in modified Arabic scripts. For each of these languages we have used 4599 words relating to 21 different word classes collected from 219 writers. The average precision rates achieved for Pashto and Dari languages were 93.18 % and 93.75 %, respectively. The time taken for matching a pair of images was 1.43 milli-seconds. In addition, we will present the handwritten databases for two well-known Indo- Iranian languages, i.e., Pashto and Dari languages. These are large databases which contain six types of data, i.e., Dates, Isolated Digits, Numeral Strings, Isolated Characters, Different Words and Special Symbols, written by native speakers of the corresponding languages

    Advances in Character Recognition

    Get PDF
    This book presents advances in character recognition, and it consists of 12 chapters that cover wide range of topics on different aspects of character recognition. Hopefully, this book will serve as a reference source for academic research, for professionals working in the character recognition field and for all interested in the subject
    • …
    corecore