Optical Character Recognition of Amharic Documents
In Africa around 2,500 languages are spoken. Some of these languages have their own indigenous scripts. Accordingly, there is a bulk of printed documents available in libraries, information centers, museums and offices. Digitization of these documents makes it possible to harness already available information technologies for local information needs and developments. This paper presents an Optical Character Recognition (OCR) system for converting digitized documents in local languages. An extensive literature survey reveals that this is the first attempt to report the challenges in recognizing indigenous African scripts and a possible solution for Amharic script. Research in the recognition of African indigenous scripts faces major challenges due to (i) the use of a large number of characters in the writing and (ii) the existence of a large set of visually similar characters. In this paper, we propose a novel feature extraction scheme using principal component and linear discriminant analysis, followed by a decision directed acyclic graph based support vector machine classifier. Recognition results are presented on real-life degraded documents such as books, magazines and newspapers to demonstrate the performance of the recognizer.
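The decision directed acyclic graph (DDAG) named in the abstract above can be illustrated with a minimal sketch. It is not the paper's implementation: the pairwise decider here is a nearest-centroid stand-in for the pairwise support vector machines, and the class labels, centroids and function names are hypothetical. The sketch only shows how a DDAG reaches a decision among N classes in N - 1 pairwise evaluations by eliminating one candidate class per step.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
PairwiseDecider = Callable[[str, str, Vector], str]


def centroid_pairwise(centroids: Dict[str, Vector]) -> PairwiseDecider:
    """Build a stand-in pairwise decider (NOT an SVM).

    For two candidate classes it returns the one whose centroid is
    closer to the sample. A real system would call a trained binary
    classifier for that pair of classes instead.
    """
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def decide(class_a: str, class_b: str, x: Vector) -> str:
        if sq_dist(x, centroids[class_a]) <= sq_dist(x, centroids[class_b]):
            return class_a
        return class_b

    return decide


def ddag_predict(classes: Sequence[str], decide: PairwiseDecider, x: Vector) -> str:
    """Walk the DDAG: compare the first and last remaining candidates,
    drop the loser, and repeat until a single class is left."""
    candidates: List[str] = list(classes)
    while len(candidates) > 1:
        first, last = candidates[0], candidates[-1]
        if decide(first, last, x) == first:
            candidates.pop()      # the last candidate is eliminated
        else:
            candidates.pop(0)     # the first candidate is eliminated
    return candidates[0]


if __name__ == "__main__":
    # Hypothetical 2-D feature space with four character classes.
    centroids = {"A": (0.0, 0.0), "B": (5.0, 0.0), "C": (0.0, 5.0), "D": (5.0, 5.0)}
    decide = centroid_pairwise(centroids)
    print(ddag_predict(list(centroids), decide, (4.6, 4.8)))
```

With N classes the loop always runs exactly N - 1 times, which is why a DDAG is attractive when the character set is large.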
A novel image matching approach for word spotting
Word spotting has been adopted and used by various researchers as a complementary technique to Optical Character Recognition for document analysis and retrieval. The various applications of word spotting include document indexing, image retrieval and information filtering. The important factors in word spotting techniques are pre-processing, selection and extraction of proper features and image matching algorithms. The Correlation Similarity Measure (CORR) algorithm is considered to be a faster matching algorithm, originally defined for finding similarities between binary patterns. In the word spotting literature the CORR algorithm has been used successfully to compare the GSC binary features extracted from binary word images, i.e., Gradient, Structural and Concavity (GSC) features. However, the problem with this approach is that binarization of images leads to a loss of very useful information. Furthermore, before extracting GSC binary features the word images must be skew corrected and slant normalized, which is not only difficult but in some cases impossible in Arabic and modified Arabic scripts. We present a new approach in which the Correlation Similarity Measure (CORR) algorithm has been used innovatively to compare Gray-scale word images. In this approach, binarization of images, skew correction and slant normalization of word images are not required at all. The various features, i.e., projection profiles, word profiles and transitional features are extracted from the Gray-scale word images and converted into their binary equivalents, which are compared via CORR algorithm with greater speed and higher accuracy. The experiments have been conducted on Gray-scale versions of newly created handwritten databases of Pashto and Dari languages, written in modified Arabic scripts. For each of these languages we have used 4599 words relating to 21 different word classes collected from 219 writers. 
The average precision rates achieved for Pashto and Dari languages were 93.18 % and 93.75 %, respectively. The time taken for matching a pair of images was 1.43 milliseconds. In addition, we present the handwritten databases for two well-known Indo-Iranian languages, i.e., Pashto and Dari. These are large databases which contain six types of data, i.e., Dates, Isolated Digits, Numeral Strings, Isolated Characters, Different Words and Special Symbols, written by native speakers of the corresponding languages.
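The matching step described in this abstract can be sketched as follows. The correlation measure below is the standard correlation (phi) coefficient for binary vectors, which is how the correlation similarity measure is usually defined in the binary-feature literature. The abstract does not say how grey-scale features are converted to binary, so thresholding a projection profile at its mean is an assumption made only for illustration, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def corr_similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Correlation between two equal-length binary vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("binary vectors must have the same length")
    s11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    s00 = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    s10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    sigma = math.sqrt((s10 + s11) * (s01 + s00) * (s11 + s01) * (s00 + s10))
    if sigma == 0:
        # One vector is constant, so the correlation is undefined.
        # Returning 0.0 here is an assumption of this sketch.
        return 0.0
    return (s11 * s00 - s10 * s01) / sigma


def corr_dissimilarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Map the correlation to a distance in [0, 1]; 0 means identical."""
    return 0.5 - corr_similarity(a, b) / 2.0


def vertical_projection(image: Sequence[Sequence[int]]) -> List[int]:
    """Column sums of a grey-scale image given as rows of intensities."""
    return [sum(column) for column in zip(*image)]


def binarize_profile(profile: Sequence[float]) -> List[int]:
    """Assumed binarization: 1 where the value exceeds the profile mean."""
    mean = sum(profile) / len(profile)
    return [1 if value > mean else 0 for value in profile]


if __name__ == "__main__":
    # Two tiny hypothetical grey-scale word images (rows of intensities).
    query = [[0, 255, 0, 200], [0, 255, 10, 180]]
    candidate = [[5, 240, 0, 210], [0, 250, 0, 190]]
    q_bits = binarize_profile(vertical_projection(query))
    c_bits = binarize_profile(vertical_projection(candidate))
    print(corr_dissimilarity(q_bits, c_bits))
```

A query word image would be compared against every candidate in this way and the candidates ranked by ascending dissimilarity.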
Advanced document data extraction techniques to improve supply chain performance
In this thesis, a novel machine learning technique to extract text-based information from scanned images has been developed. This information extraction is performed in the context of scanned invoices and bills used in financial transactions. These financial transactions contain a considerable amount of data that must be extracted, refined, and stored digitally before it can be used for analysis. Converting this data into a digital format is often a time-consuming process. Automation and data optimisation show promise as methods for reducing the time required and the cost of Supply Chain Management (SCM) processes, especially Supplier Invoice Management (SIM), Financial Supply Chain Management (FSCM) and Supply Chain procurement processes. This thesis uses a cross-disciplinary approach involving Computer Science and Operational Management to explore the benefit of automated invoice data extraction in business and its impact on SCM. The study adopts a multimethod approach based on empirical research, surveys, and interviews performed on selected companies. The expert system developed in this thesis focuses on two distinct areas of research: Text/Object Detection and Text Extraction. For Text/Object Detection, the Faster R-CNN model was analysed. While this model yields outstanding results in terms of object detection, it is limited by poor performance when image quality is low. The Generative Adversarial Network (GAN) model is proposed in response to this limitation. The GAN model is a generator network that is implemented with the help of the Faster R-CNN model and a discriminator that relies on PatchGAN. The output of the GAN model is text data with bounding boxes. 
For text extraction from the bounding box, a novel data extraction framework was designed, consisting of various processes including XML processing in the case of an existing OCR engine, bounding box pre-processing, text clean up, OCR error correction, spell check, type check, pattern-based matching, and finally, a learning mechanism for automating future data extraction. Whichever fields the system can extract successfully are provided in key-value format. The efficiency of the proposed system was validated using existing datasets such as SROIE and VATI. Real-time data was validated using invoices that were collected by two companies that provide invoice automation services in various countries. Currently, these scanned invoices are sent to an OCR system such as OmniPage, Tesseract, or ABBYY FRE to extract text blocks and later, a rule-based engine is used to extract relevant data. While the system’s methodology is robust, the companies surveyed were not satisfied with its accuracy. Thus, they sought out new, optimized solutions. To confirm the results, the engines were used to return XML-based files with text and metadata identified. The output XML data was then fed into this new system for information extraction. This system uses the existing OCR engine and a novel, self-adaptive, learning-based OCR engine. This new engine is based on the GAN model for better text identification. Experiments were conducted on various invoice formats to further test and refine its extraction capabilities. For cost optimisation and the analysis of spend classification, additional data were provided by another company in London that holds expertise in reducing their clients' procurement costs. This data was fed into our system to get a deeper level of spend classification and categorisation. 
This helped the company to reduce its reliance on human effort and allowed for greater efficiency in comparison with the process of performing similar tasks manually using Excel sheets and Business Intelligence (BI) tools. The intention behind the development of this novel methodology was twofold. First, to test and develop a novel solution that does not depend on any specific OCR technology. Second, to increase the information extraction accuracy factor over that of existing methodologies. Finally, it evaluates the real-world need for the system and the impact it would have on SCM. This newly developed method is generic and can extract text from any given invoice, making it a valuable tool for optimizing SCM. In addition, the system uses a template-matching approach to ensure the quality of the extracted information.
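The pattern-based matching and type check steps listed in this abstract, with successfully extracted fields returned in key-value format, might look roughly like the sketch below. It is an illustration only and not the thesis's framework: the field names, the regular expressions and the day/month/year date format are all assumptions, and the input is assumed to be plain text already produced by an OCR engine.

```python
import re
from datetime import datetime
from decimal import Decimal, InvalidOperation
from typing import Dict, Optional

# Hypothetical patterns for three invoice fields (assumed, not from the thesis).
PATTERNS = {
    "invoice_number": re.compile(
        r"\bInvoice\s*(?:No\.?|Number|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
    ),
    "date": re.compile(r"\bDate\s*[:\-]?\s*(\d{2}/\d{2}/\d{4})", re.IGNORECASE),
    "total": re.compile(r"\bTotal\s*[:\-]?\s*[£$€]?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def check_type(field: str, raw: str) -> Optional[str]:
    """Type check: return a normalised value, or None if the value is invalid."""
    if field == "date":
        try:
            # Assumed day/month/year order.
            return datetime.strptime(raw, "%d/%m/%Y").date().isoformat()
        except ValueError:
            return None
    if field == "total":
        try:
            return str(Decimal(raw.replace(",", "")))
        except InvalidOperation:
            return None
    return raw


def extract_fields(ocr_text: str) -> Dict[str, str]:
    """Return only the fields that match a pattern and pass the type check."""
    result: Dict[str, str] = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(ocr_text)
        if match is None:
            continue
        value = check_type(field, match.group(1))
        if value is not None:
            result[field] = value
    return result


if __name__ == "__main__":
    sample = "ACME Ltd\nInvoice No: INV-2041\nDate: 12/03/2021\nTotal: £1,250.50\n"
    print(extract_fields(sample))
```

Fields that fail either step are simply left out of the result, mirroring the abstract's statement that only successfully extracted fields are reported.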
Advances in Image Processing, Analysis and Recognition Technology
For many decades, researchers have been trying to make computers’ analysis of images as effective as the system of human vision is. For this purpose, many algorithms and systems have previously been created. The whole process covers various stages, including image processing, representation and recognition. The results of this work can be applied to many computer-assisted areas of everyday life. They improve particular activities and provide handy tools, which are sometimes only for entertainment, but quite often, they significantly increase our safety. In fact, the practical implementation of image processing algorithms is particularly wide. Moreover, the rapid growth of computational complexity and computer efficiency has allowed for the development of more sophisticated and effective algorithms and tools. Although significant progress has been made so far, many issues still remain, resulting in the need for the development of novel approaches
Improving Search via Named Entity Recognition in Morphologically Rich Languages – A Case Study in Urdu
University of Minnesota Ph.D. dissertation. February 2018. Major: Computer Science. Advisors: Vipin Kumar, Blake Howald. 1 computer file (PDF); xi, 236 pages. Search is not a solved problem even in the world of Google and Bing's state-of-the-art engines. Google and similar search engines are keyword based. Keyword-based searching suffers from the vocabulary mismatch problem -- the terms in the document and the user's information request don't overlap, for example, cars and automobiles. This phenomenon is called synonymy. Similarly, the user's term may be polysemous -- a user is inquiring about a river's bank, but documents about financial institutions are matched. Vocabulary mismatch is exacerbated when the search occurs in a Morphologically Rich Language (MRL). Concept search techniques like dimensionality reduction do not improve search in Morphologically Rich Languages. Names frequently occur in news text and determine the "what," "where," "when," and "who" in the news text. Named Entity Recognition (NER) attempts to recognize names automatically in text, but these techniques are far from mature in MRLs, especially in Arabic script languages. Urdu is one of the focus MRLs of this dissertation, along with Arabic, Farsi, Hindi, and Russian, but it does not have the enabling technologies for NER and search. A corpus, a stop word generation algorithm, a light stemmer, a baseline, and an NER algorithm are created so that NER-aware search can be accomplished for Urdu. This dissertation demonstrates that NER-aware search on Arabic, Russian, Urdu, and English shows significant improvement over the baseline. Furthermore, this dissertation highlights the challenges of research in low-resource MRLs.
Introduction to Development Engineering
This open access textbook introduces the emerging field of Development Engineering and its constituent theories, methods, and applications. It is both a teaching text for students and a resource for researchers and practitioners engaged in the design and scaling of technologies for low-resource communities. The scope is broad, ranging from the development of mobile applications for low-literacy users to hardware and software solutions for providing electricity and water in remote settings. It is also highly interdisciplinary, drawing on methods and theory from the social sciences as well as engineering and the natural sciences. The opening section reviews the history of “technology-for-development” research, and presents a framework that formalizes this body of work and begins its transformation into an academic discipline. It identifies common challenges in development and explains the book’s iterative approach of “innovation, implementation, evaluation, adaptation.” Each of the next six thematic sections focuses on a different sector: energy and environment; market performance; education and labor; water, sanitation and health; digital governance; and connectivity. These thematic sections contain case studies from landmark research that directly integrates engineering innovation with technically rigorous methods from the social sciences. Each case study describes the design, evaluation, and/or scaling of a technology in the field and follows a single form, with common elements and discussion questions, to create continuity and pedagogical consistency. Together, they highlight successful solutions to development challenges, while also analyzing the rarely discussed failures. The book concludes by reiterating the core principles of development engineering illustrated in the case studies, highlighting common challenges that engineers and scientists will face in designing technology interventions that sustainably accelerate economic development. 
Development Engineering provides, for the first time, a coherent intellectual framework for attacking the challenges of poverty and global climate change through the design of better technologies. It offers the rigorous discipline needed to channel the energy of a new generation of scientists and engineers toward advancing social justice and improved living conditions in low-resource communities around the world