Offline Pashto Characters Dataset for OCR Systems

Abstract

In computer vision and artificial intelligence, image-based text recognition and analysis play a key role in text retrieval. Training a machine learning model to recognize the handwritten characters of a specific language requires a standard dataset. Acceptable handwritten character datasets are available for many languages, including English and Arabic. However, the lack of a dataset for handwritten Pashto characters hinders the application of suitable machine learning algorithms to their recognition. To address this issue, this study presents the first handwritten Pashto characters image dataset (HPCID) for scientific research. The dataset consists of 14,784 samples: 336 samples for each of the 44 characters in the Pashto character set. The samples were collected on A4-sized paper from students of the Pashto Department at the University of Peshawar, Khyber Pakhtunkhwa, Pakistan. In total, 336 students and faculty members contributed to the data collection phase. The dataset contains characters of multiple sizes, fonts, and styles, with varying structures.
