228 research outputs found

    GROUNDTRUTH GENERATION AND DOCUMENT IMAGE DEGRADATION

    The problem of generating synthetic data for the training and evaluation of document analysis systems has been widely addressed in recent years. With the increased interest in processing multilingual sources, however, there is a tremendous need to rapidly generate data in new languages and scripts without developing specialized systems. We have developed a system that uses the language support of the MS Windows operating system, combined with custom print drivers, to render TIFF images simultaneously with Windows Enhanced Metafile directives. The metafile information is parsed to generate zone, line, word, and character ground truth, including location, font information, and content, in any language supported by Windows. The resulting images can be physically or synthetically degraded by our degradation modules and used for training and evaluating Optical Character Recognition (OCR) systems. Our document image degradation methodology incorporates several often-encountered types of noise at the page and pixel levels. Examples of OCR evaluation and synthetically degraded document images are given to demonstrate the effectiveness
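
As a rough illustration of pixel-level synthetic degradation, the sketch below flips pixels of a binary image at random. This is only one simple noise model, assumed here for illustration; the abstract does not specify which noise types the authors' degradation modules implement, and the function name `degrade_pixels` and its parameters are hypothetical.

```python
import random


def degrade_pixels(image, flip_prob, seed=0):
    """Return a copy of a binary image with each pixel flipped with
    probability flip_prob (an assumed salt-and-pepper style noise model).

    image is a list of rows, each row a list of 0/1 pixel values.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    return [
        [1 - px if rng.random() < flip_prob else px for px in row]
        for row in image
    ]


# A tiny hypothetical "clean" rendering of a glyph.
clean = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 1, 0],
]
noisy = degrade_pixels(clean, flip_prob=0.1)
```

Because the ground truth is generated from the clean rendering, it stays valid for the degraded copy, which is what makes such images usable for OCR training and evaluation.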

    Arabic Handwriting Synthesis

    Training and testing data for optical character recognition are cumbersome to obtain. If large amounts of data can be produced from small amounts, much time and effort can be saved. This paper presents an approach to synthesizing Arabic handwriting. We segment word images into labeled characters and then use these to synthesize arbitrary words. The synthesized text should look natural; hence, we define criteria for deciding what is acceptable as natural-looking. For evaluation, text synthesized under the natural-looking constraint is compared with text synthesized without it.

    Application of the Hidden Markov Model for Innovative Projects "Viability" Analysis

    This master's thesis addresses how to determine the "viability" of innovative projects, where "viability" is the probability that an innovative project will be implemented. Hidden Markov Models are used to evaluate this factor. The research solves the problem of determining the model parameters that produce a given data sequence with the highest probability. Data on innovative projects contained in reports of the Russian programs "UMNIK" and "START", together with additional data obtained during the study, are used as input for determining the model parameters. The Baum-Welch algorithm, an implementation of the expectation-maximization algorithm, is used to calculate the model parameters. The final part of the thesis gives the mathematical basis for practical implementation (in particular, a mathematical description of the algorithm and implementation methods for Markov models).
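
For readers unfamiliar with Baum-Welch, the following is a minimal sketch of one re-estimation step for a discrete-observation HMM, not the thesis's implementation. The two-state model, the starting parameters, and the coding of observations (0 and 1 as hypothetical project report outcomes) are all assumptions made for illustration. The code is unscaled, so it is only suitable for short sequences.

```python
def forward(obs, pi, A, B):
    """alpha[t][i] = P(o_1..o_t, state_t = i)."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha


def backward(obs, A, B):
    """beta[t][i] = P(o_{t+1}..o_T | state_t = i)."""
    n = len(A)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def baum_welch_step(obs, pi, A, B):
    """One EM update; returns (new_pi, new_A, new_B, likelihood before update)."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    likelihood = sum(alpha[-1])
    # gamma[t][i]: posterior probability of being in state i at time t.
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    # xi[t][i][j]: posterior probability of the transition i -> j at time t.
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    new_pi = gamma[0]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(m)] for i in range(n)]
    return new_pi, new_A, new_B, likelihood


# Hypothetical data: 0 = unfavourable report, 1 = favourable report.
obs = [0, 1, 0, 0, 1, 1, 0, 1, 1, 1]
pi = [0.6, 0.4]                      # assumed initial state distribution
A = [[0.7, 0.3], [0.4, 0.6]]         # assumed transition probabilities
B = [[0.8, 0.2], [0.3, 0.7]]         # assumed emission probabilities
new_pi, new_A, new_B, likelihood = baum_welch_step(obs, pi, A, B)
```

Repeating the step until the likelihood stops improving yields parameters that locally maximize the probability of the observed sequence, which is the estimation problem the abstract describes.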

    Réseaux Bayésiens Dynamiques pour la reconnaissance des caractères imprimés dégradés (Dynamic Bayesian Networks for the Recognition of Degraded Printed Characters)

    The aim of this work is to present a new approach to the recognition of degraded printed characters. Our approach consists of building two hidden Markov models (HMMs) using dynamic Bayesian networks, called the vertical HMM and the horizontal HMM. A vertical HMM (respectively, horizontal HMM) is a model that takes as its input sequence the pixel columns of the character (respectively, the pixel rows). We then couple these chains according to two coupling models using dynamic Bayesian networks. Experimental results show that the coupling models increase the recognition rate by 8% to 10% relative to the recognition system using the uncoupled models.
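
The two input sequences can be pictured with a small sketch: the same character image is read once column by column and once row by row. The function names and the tiny glyph below are hypothetical, and the coupling of the two chains through dynamic Bayesian networks is not shown.

```python
def column_sequence(glyph):
    """Observation sequence for the vertical HMM: one tuple per pixel column."""
    return [tuple(col) for col in zip(*glyph)]


def row_sequence(glyph):
    """Observation sequence for the horizontal HMM: one tuple per pixel row."""
    return [tuple(row) for row in glyph]


# Hypothetical 3x2 binary character image (rows of 0/1 pixels).
glyph = [
    [0, 1],
    [1, 1],
    [0, 1],
]
columns = column_sequence(glyph)
rows = row_sequence(glyph)
```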

    Search of Method for Analyzing "Viability" of Innovative Projects

    This article considers the evaluation of the "viability" of innovation projects. Hidden Markov Models are used as the evaluation method. The problem solved is that of determining the model parameters that reproduce test data with the highest accuracy. Statistical data on the implementation of innovative projects are used to train the model, and the Baum-Welch algorithm is used as the training algorithm.

    Columbia Chronicle (05/18/1998)

    Student newspaper from May 18, 1998 entitled The Chronicle of Columbia College Chicago. This issue is 20 pages and is listed as Volume XXXI, Number 24. Cover story: Textbooks - Expensive and unwanted. Editor in Chief: Merma Ayi. https://digitalcommons.colum.edu/cadc_chronicle/1399/thumbnail.jp

    Advanced document data extraction techniques to improve supply chain performance

    In this thesis, a novel machine learning technique to extract text-based information from scanned images has been developed. This information extraction is performed in the context of scanned invoices and bills used in financial transactions. These financial transactions contain a considerable amount of data that must be extracted, refined, and stored digitally before it can be used for analysis. Converting this data into a digital format is often a time-consuming process. Automation and data optimisation show promise as methods for reducing the time required and the cost of Supply Chain Management (SCM) processes, especially Supplier Invoice Management (SIM), Financial Supply Chain Management (FSCM) and Supply Chain procurement processes. This thesis uses a cross-disciplinary approach involving Computer Science and Operational Management to explore the benefit of automated invoice data extraction in business and its impact on SCM. The study adopts a multimethod approach based on empirical research, surveys, and interviews performed on selected companies. The expert system developed in this thesis focuses on two distinct areas of research: Text/Object Detection and Text Extraction. For Text/Object Detection, the Faster R-CNN model was analysed. While this model yields outstanding results in terms of object detection, it is limited by poor performance when image quality is low. The Generative Adversarial Network (GAN) model is proposed in response to this limitation. The GAN model is a generator network that is implemented with the help of the Faster R-CNN model and a discriminator that relies on PatchGAN. The output of the GAN model is text data with bounding boxes.
For text extraction from the bounding box, a novel data extraction framework was designed. It consists of several processes: XML processing in the case of an existing OCR engine, bounding box pre-processing, text clean-up, OCR error correction, spell check, type check, pattern-based matching, and finally a learning mechanism for automating future data extraction. Whichever fields the system can extract successfully are provided in key-value format. The efficiency of the proposed system was validated using existing datasets such as SROIE and VATI. Real-time data was validated using invoices that were collected by two companies that provide invoice automation services in various countries. Currently, these scanned invoices are sent to an OCR system such as OmniPage, Tesseract, or ABBYY FRE to extract text blocks and later, a rule-based engine is used to extract relevant data. While the system’s methodology is robust, the companies surveyed were not satisfied with its accuracy. Thus, they sought out new, optimized solutions. To confirm the results, the engines were used to return XML-based files with text and metadata identified. The output XML data was then fed into this new system for information extraction. This system uses the existing OCR engine and a novel, self-adaptive, learning-based OCR engine. This new engine is based on the GAN model for better text identification. Experiments were conducted on various invoice formats to further test and refine its extraction capabilities. For cost optimisation and the analysis of spend classification, additional data were provided by another company in London that holds expertise in reducing their clients' procurement costs. This data was fed into our system to get a deeper level of spend classification and categorisation.
This helped the company to reduce its reliance on human effort and allowed for greater efficiency in comparison with the process of performing similar tasks manually using Excel sheets and Business Intelligence (BI) tools. The intention behind the development of this novel methodology was twofold: first, to test and develop a novel solution that does not depend on any specific OCR technology; second, to increase information extraction accuracy over that of existing methodologies. Finally, the thesis evaluates the real-world need for the system and the impact it would have on SCM. This newly developed method is generic and can extract text from any given invoice, making it a valuable tool for optimizing SCM. In addition, the system uses a template-matching approach to ensure the quality of the extracted information.
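
The text clean-up and pattern-based matching stages that end in key-value output might look roughly like the sketch below. It is an illustration only: the field names, the regular expressions, and the sample text are hypothetical and are not taken from the thesis, and the OCR, error-correction, and learning stages are omitted.

```python
import re

# Hypothetical field patterns; a real system would need many more variants.
PATTERNS = {
    "invoice_number": re.compile(
        r"invoice\s*(?:no|number|#)\.?\s*[:\-]?\s*([A-Z0-9\-]+)", re.I),
    "date": re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"),
    "total": re.compile(r"\btotal\s*[:\-]?\s*\$?\s*(\d+(?:\.\d{2})?)", re.I),
}


def extract_fields(ocr_text):
    """Return the fields found in OCR text as a key-value dictionary."""
    # Clean-up step: collapse line breaks and repeated whitespace.
    text = re.sub(r"\s+", " ", ocr_text).strip()
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:  # only fields that were matched are reported
            fields[name] = match.group(1)
    return fields


# Hypothetical OCR output for one invoice.
sample = "INVOICE No: INV-2041\nDate 2021-03-15\nSubtotal 90.00\nTotal: 99.50"
result = extract_fields(sample)
```

Fields that no pattern matches are simply left out of the dictionary, mirroring the statement that only successfully extracted fields are returned.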