16 research outputs found
A review of Arabic text recognition dataset
Building a robust Optical Character Recognition (OCR) system for languages, such as Arabic with cursive scripts,
has always been challenging. These challenges increase if the text contains diacritics of different sizes for
characters and words. Apart from the complexity of the used font, these challenges must be addressed in
recognizing the text of the Holy Quran. To solve these challenges, the OCR system would have to undergo
different phases. Each problem would have to be addressed using different approaches, thus, researchers are
studying these challenges and proposing various solutions. This has motivate this study to review Arabic OCR
dataset because the dataset plays a major role in determining the nature of the OCR systems. State-of-the-art
approaches in segmentation and recognition are discovered with the implementation of Recurrent Neural
Networks (Long Short-Term Memory-LSTM and Gated Recurrent Unit-GRU) with the use of the Connectionist
Temporal Classification (CTC). This also includes deep learning model and implementation of GRU in the Arabic
domain. This paper has contribute in profiling the Arabic text recognition dataset thus determining the nature of
OCR system developed and has identified research direction in building Arabic text recognition dataset
Sub-sampling Approach for Unconstrained Arabic Scene Text Analysis by Implicit Segmentation based Deep Learning Classifier
The text extraction from the natural scene image is still a cumbersome task to perform. This paper presents a novel contribution and suggests the solution for cursive scene text analysis notably recognition of Arabic scene text appeared in the unconstrained environment. The hierarchical sub-sampling technique is adapted to investigate the potential through sub-sampling the window size of the given scene text sample. The deep learning architecture is presented by considering the complexity of the Arabic script. The conducted experiments present 96.81% accuracy at the character level. The comparison of the Arabic scene text with handwritten and printed data is outlined as well
A Study of Techniques and Challenges in Text Recognition Systems
The core system for Natural Language Processing (NLP) and digitalization is Text Recognition. These systems are critical in bridging the gaps in digitization produced by non-editable documents, as well as contributing to finance, health care, machine translation, digital libraries, and a variety of other fields. In addition, as a result of the pandemic, the amount of digital information in the education sector has increased, necessitating the deployment of text recognition systems to deal with it. Text Recognition systems worked on three different categories of text: (a) Machine Printed, (b) Offline Handwritten, and (c) Online Handwritten Texts. The major goal of this research is to examine the process of typewritten text recognition systems. The availability of historical documents and other traditional materials in many types of texts is another major challenge for convergence. Despite the fact that this research examines a variety of languages, the Gurmukhi language receives the most focus. This paper shows an analysis of all prior text recognition algorithms for the Gurmukhi language. In addition, work on degraded texts in various languages is evaluated based on accuracy and F-measure
Evaluation of handwritten Urdu text by integration of MNIST dataset learning experience
© 2019 IEEE. The similar nature of patterns may enhance the learning if the experience they attained during training is utilized to achieve maximum accuracy. This paper presents a novel way to exploit the transfer learning experience of similar patterns on handwritten Urdu text analysis. The MNIST pre-trained network is employed by transferring it's learning experience on Urdu Nastaliq Handwritten Dataset (UNHD) samples. The convolutional neural network is used for feature extraction. The experiments were performed using deep multidimensional long short term (MDLSTM) memory networks. The obtained result shows immaculate performance on number of experiments distinguished on the basis of handwritten complexity. The result of demonstrated experiments show that pre-trained network outperforms on subsequent target networks which enable them to focus on a particular feature learning. The conducted experiments presented astonishingly good accuracy on UNHD dataset
Urdu Handwritten Characters Data Visualization and Recognition Using Distributed Stochastic Neighborhood Embedding and Deep Network
This study was supported by the China University of Petroleum-Beijing and Fundamental Research Funds for Central Universities under Grant no. 2462020YJRC001.Peer reviewedPublisher PD
UTRNet: High-Resolution Urdu Text Recognition In Printed Documents
In this paper, we propose a novel approach to address the challenges of
printed Urdu text recognition using high-resolution, multi-scale semantic
feature extraction. Our proposed UTRNet architecture, a hybrid CNN-RNN model,
demonstrates state-of-the-art performance on benchmark datasets. To address the
limitations of previous works, which struggle to generalize to the intricacies
of the Urdu script and the lack of sufficient annotated real-world data, we
have introduced the UTRSet-Real, a large-scale annotated real-world dataset
comprising over 11,000 lines and UTRSet-Synth, a synthetic dataset with 20,000
lines closely resembling real-world and made corrections to the ground truth of
the existing IIITH dataset, making it a more reliable resource for future
research. We also provide UrduDoc, a benchmark dataset for Urdu text line
detection in scanned documents. Additionally, we have developed an online tool
for end-to-end Urdu OCR from printed documents by integrating UTRNet with a
text detection model. Our work not only addresses the current limitations of
Urdu OCR but also paves the way for future research in this area and
facilitates the continued advancement of Urdu OCR technology. The project page
with source code, datasets, annotations, trained models, and online tool is
available at abdur75648.github.io/UTRNet.Comment: Accepted at The 17th International Conference on Document Analysis
and Recognition (ICDAR 2023
A Novel Dataset for English-Arabic Scene Text Recognition (EASTR)-42K and Its Evaluation Using Invariant Feature Extraction on Detected Extremal Regions
© 2019 IEEE. The recognition of text in natural scene images is a practical yet challenging task due to the large variations in backgrounds, textures, fonts, and illumination. English as a secondary language is extensively used in Gulf countries along with Arabic script. Therefore, this paper introduces English-Arabic scene text recognition 42K scene text image dataset. The dataset includes text images appeared in English and Arabic scripts while maintaining the prime focus on Arabic script. The dataset can be employed for the evaluation of text segmentation and recognition task. To provide an insight to other researchers, experiments have been carried out on the segmentation and classification of Arabic as well as English text and report error rates like 5.99% and 2.48%, respectively. This paper presents a novel technique by using adapted maximally stable extremal region (MSER) technique and extracts scale-invariant features from MSER detected region. To select discriminant and comprehensive features, the size of invariant features is restricted and considered those specific features which exist in the extremal region. The adapted MDLSTM network is presented to tackle the complexities of cursive scene text. The research on Arabic scene text is in its infancy, thus this paper presents benchmark work in the field of text analysis
Advanced document data extraction techniques to improve supply chain performance
In this thesis, a novel machine learning technique to extract text-based information from scanned images has been developed. This information extraction is performed in the context of scanned invoices and bills used in financial transactions. These financial transactions contain a considerable amount of data that must be extracted, refined, and stored digitally before it can be used for analysis. Converting this data into a digital format is often a time-consuming process. Automation and data optimisation show promise as methods for reducing the time required and the cost of Supply Chain Management (SCM) processes, especially Supplier Invoice Management (SIM), Financial Supply Chain Management (FSCM) and Supply Chain procurement processes. This thesis uses a cross-disciplinary approach involving Computer Science and Operational Management to explore the benefit of automated invoice data extraction in business and its impact on SCM. The study adopts a multimethod approach based on empirical research, surveys, and interviews performed on selected companies.The expert system developed in this thesis focuses on two distinct areas of research: Text/Object Detection and Text Extraction. For Text/Object Detection, the Faster R-CNN model was analysed. While this model yields outstanding results in terms of object detection, it is limited by poor performance when image quality is low. The Generative Adversarial Network (GAN) model is proposed in response to this limitation. The GAN model is a generator network that is implemented with the help of the Faster R-CNN model and a discriminator that relies on PatchGAN. The output of the GAN model is text data with bonding boxes. For text extraction from the bounding box, a novel data extraction framework consisting of various processes including XML processing in case of existing OCR engine, bounding box pre-processing, text clean up, OCR error correction, spell check, type check, pattern-based matching, and finally, a learning mechanism for automatizing future data extraction was designed. Whichever fields the system can extract successfully are provided in key-value format.The efficiency of the proposed system was validated using existing datasets such as SROIE and VATI. Real-time data was validated using invoices that were collected by two companies that provide invoice automation services in various countries. Currently, these scanned invoices are sent to an OCR system such as OmniPage, Tesseract, or ABBYY FRE to extract text blocks and later, a rule-based engine is used to extract relevant data. While the system’s methodology is robust, the companies surveyed were not satisfied with its accuracy. Thus, they sought out new, optimized solutions. To confirm the results, the engines were used to return XML-based files with text and metadata identified. The output XML data was then fed into this new system for information extraction. This system uses the existing OCR engine and a novel, self-adaptive, learning-based OCR engine. This new engine is based on the GAN model for better text identification. Experiments were conducted on various invoice formats to further test and refine its extraction capabilities. For cost optimisation and the analysis of spend classification, additional data were provided by another company in London that holds expertise in reducing their clients' procurement costs. This data was fed into our system to get a deeper level of spend classification and categorisation. This helped the company to reduce its reliance on human effort and allowed for greater efficiency in comparison with the process of performing similar tasks manually using excel sheets and Business Intelligence (BI) tools.The intention behind the development of this novel methodology was twofold. First, to test and develop a novel solution that does not depend on any specific OCR technology. Second, to increase the information extraction accuracy factor over that of existing methodologies. Finally, it evaluates the real-world need for the system and the impact it would have on SCM. This newly developed method is generic and can extract text from any given invoice, making it a valuable tool for optimizing SCM. In addition, the system uses a template-matching approach to ensure the quality of the extracted information