707 research outputs found

    IMPROVING THE EFFICIENCY OF TESSERACT OCR ENGINE

    This project investigates the principles of optical character recognition (OCR) used in the Tesseract OCR engine and techniques for improving its efficiency and runtime. OCR has been used to convert printed text into editable text in a variety of applications on devices such as scanners, computers, and tablets. Mobile devices are now overtaking computers in most domains, yet OCR remains a field they have not conquered, so programmers need to improve the efficiency of OCR systems for them to run properly on mobile devices. This paper focuses on improving Tesseract's efficiency for the Hindi language on mobile devices, because there are not many applications for this purpose and most of them are either not open source or not built for mobile devices. Improving Hindi text extraction will increase Tesseract's performance in mobile phone apps and, in turn, draw developers to contribute to Hindi OCR. The paper presents a preprocessing technique applied to the Tesseract engine to improve character recognition while keeping the runtime low, so that the system runs as smoothly and efficiently on mobile devices (Android) as it does on larger machines.
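
    The abstract above does not say which preprocessing technique is applied before Tesseract. As an illustration only, the sketch below assumes a common choice, global binarization with Otsu's threshold, written in plain Python so that it runs without an image library. The names otsu_threshold and binarize are hypothetical and do not come from the paper.

```python
def otsu_threshold(pixels):
    """Return Otsu's threshold for a flat list of 8-bit grayscale values.

    Assumption: Otsu binarization stands in for the paper's (unspecified)
    preprocessing step; this is not the authors' method.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    sum_bg = 0       # sum of intensities at or below the candidate threshold
    weight_bg = 0    # number of pixels at or below the candidate threshold
    best_t, best_var = 0, -1.0

    for t in range(256):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance (up to a constant factor).
        var_between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image):
    """Map a grayscale image (list of rows) to pure black (0) and white (255)."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[255 if p > t else 0 for p in row] for row in image]


if __name__ == "__main__":
    page = [[10, 12, 200], [11, 210, 205]]
    for row in binarize(page):
        print(row)
```

    A binarized page of this kind would then be handed to the OCR engine; whether this particular step lowers runtime on a phone is not something the abstract states.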

    Anonimização automatizada de contratos jurídicos em português (Automated anonymization of legal contracts in Portuguese)

    With the introduction of the General Data Protection Regulation, many organizations were left with a large number of documents containing public information that should have been private. Given how many documents are involved, editing them manually would be a waste of resources. The objective of this dissertation is to develop an autonomous system for anonymizing sensitive information in contracts written in Portuguese. The system uses Google Cloud Vision, an OCR API, to extract any text present in a document. Because these documents may be poorly legible, the images are first preprocessed with the OpenCV library to increase the readability of the text they contain. Binarization, skew correction, and noise removal algorithms, among others, were explored. Once extracted, the text is interpreted by an NLP library. For this project we chose spaCy, which provides a Portuguese pipeline trained on the WikiNer and UD Portuguese Bosque datasets. This library not only allows a very thorough part-of-speech identification, but its model also includes four categories of named entity recognition. In addition to the processing done with spaCy, and because support for the Portuguese language is limited, some rule-based algorithms were implemented to identify more specific types of information, such as identification numbers and postal codes. Finally, the information considered confidential is covered by a black rectangle, drawn with OpenCV at the coordinates returned by Google Cloud Vision OCR, and a new PDF is generated. Master's degree in Computer and Telematics Engineering.
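
    The rule-based step mentioned in the abstract above can be sketched with regular expressions. This is a minimal illustration, not the dissertation's code. It assumes the Portuguese postal code format NNNN-NNN and, purely for the example, treats any standalone 9-digit number as an identification number. The names find_sensitive_spans and redact are hypothetical. The described system draws black rectangles on the page image, whereas this sketch only masks characters in the extracted text.

```python
import re

# Assumed patterns, for illustration only.
POSTAL_CODE = re.compile(r"\b\d{4}-\d{3}\b")   # Portuguese postal code, e.g. 1000-001
ID_NUMBER = re.compile(r"\b\d{9}\b")           # hypothetical 9-digit identification number


def find_sensitive_spans(text):
    """Return sorted (start, end, label) tuples for every rule match in text."""
    spans = []
    for label, pattern in (("POSTAL_CODE", POSTAL_CODE), ("ID_NUMBER", ID_NUMBER)):
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def redact(text, mask="#"):
    """Replace every matched span with mask characters of the same length."""
    chars = list(text)
    for start, end, _label in find_sensitive_spans(text):
        chars[start:end] = mask * (end - start)
    return "".join(chars)


if __name__ == "__main__":
    print(redact("Morada: 1000-001 Lisboa, NIF 123456789."))
```

    In a pipeline like the one described, the character spans would be mapped back to the bounding boxes returned by the OCR service before anything is drawn on the image.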

    Character Recognition

    Character recognition is one of the most widely used pattern recognition technologies in practical applications. This book presents recent advances relevant to character recognition, from technical topics such as image processing, feature extraction, and classification to new applications, including human-computer interfaces. The goal of this book is to provide a reference source for academic research and for professionals working in the character recognition field.

    Preprocessing for Images Captured by Cameras

    Advances in Character Recognition

    This book presents advances in character recognition. It consists of 12 chapters that cover a wide range of topics on different aspects of character recognition. It is intended to serve as a reference source for academic research, for professionals working in the character recognition field, and for anyone interested in the subject.

    Mathematical Formula Recognition and Automatic Detection and Translation of Algorithmic Components into Stochastic Petri Nets in Scientific Documents

    A large percentage of documents in scientific and engineering disciplines include mathematical formulas, algorithms, or both. Exploring the mathematical formulas in technical documents, we focused on the associations between mathematical operations, their syntactic correctness, and the mapping of these components into attributed graphs and Stochastic Petri Nets (SPN). We also introduce a formal language to generate mathematical formulas and evaluate their syntactic correctness. The main contribution of this work is the automatic segmentation of mathematical documents for the parsing and analysis of detected algorithmic components. To achieve this, we combine several methods: string parsing according to mathematical rules, formal language modeling, optical analysis of technical documents in the form of images, structural analysis of text in images, and graph and Stochastic Petri Net mapping. Finally, for the recognition of algorithms, we enriched our rule-based model with machine learning techniques to obtain better results.

    Skip Trie Matching for Real-Time OCR Output Error Correction on Smartphones

    Many visually impaired individuals manage their daily activities with the help of smartphones. While there are many vision-based mobile applications for identifying products, there is a relative dearth of applications for extracting useful nutrition information. In this report, we study the performance of existing OCR systems available for the Android platform and choose the best one to extract nutrition facts from U.S. grocery store packages. We then provide approaches to improve the text strings produced by the Tesseract OCR engine on image segments of nutrition tables, which are automatically extracted by an Android 2.3.6 smartphone application from real-time video streams of grocery products. We also present an algorithm, called Skip Trie Matching (STM), for real-time OCR output error correction on smartphones. The algorithm's performance is compared with that of Apache Lucene's spell checker. Our evaluation indicates that the average run time of the STM algorithm is lower than Lucene's. (68 pages)
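
    The abstract above names Skip Trie Matching but does not describe how it works, so the sketch below is not STM. It shows a generic, well-known relative of the idea: a dictionary stored in a trie and searched with a bounded Levenshtein distance, which can map a noisy OCR token to nearby vocabulary words. The names TrieNode, build_trie, and fuzzy_lookup, and the sample vocabulary, are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # maps a character to the child TrieNode
        self.word = None     # set to the full word at the node that ends it


def build_trie(words):
    """Insert every word into a fresh trie and return its root."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word
    return root


def fuzzy_lookup(root, query, max_dist):
    """Return sorted (distance, word) pairs within max_dist edits of query.

    Generic trie search with one Levenshtein row per trie level;
    this is NOT the Skip Trie Matching algorithm from the report.
    """
    results = []
    first_row = list(range(len(query) + 1))

    def walk(node, ch, prev_row):
        # Extend the edit-distance table by one row for character ch.
        row = [prev_row[0] + 1]
        for i in range(1, len(query) + 1):
            cost = 0 if query[i - 1] == ch else 1
            row.append(min(row[i - 1] + 1, prev_row[i] + 1, prev_row[i - 1] + cost))
        if node.word is not None and row[-1] <= max_dist:
            results.append((row[-1], node.word))
        # Prune: descend only if some prefix alignment is still within budget.
        if min(row) <= max_dist:
            for next_ch, child in node.children.items():
                walk(child, next_ch, row)

    for ch, child in root.children.items():
        walk(child, ch, first_row)
    return sorted(results)


if __name__ == "__main__":
    vocab = ["calories", "carbohydrate", "cholesterol", "protein", "sodium"]
    trie = build_trie(vocab)
    print(fuzzy_lookup(trie, "ca1ories", 1))
```

    Sharing prefixes in the trie means the edit-distance rows for a common prefix are computed once, which is one reason trie-based correction can be fast on constrained devices.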