70 research outputs found

Detection and recognition of textual information from drug box images using deep learning and computer vision

The aim of this thesis is to implement an OCR pipeline capable of detecting and recognizing text instances in an input image. The pipeline is divided into two stages: a detector, which locates the regions where text is present, and a recognizer, which reads the detected words and numbers. The work was initially developed during an internship at the start-up PatchAI, now an Alira Health company. In this context, the algorithm is applied to recognizing textual information on drug boxes. The idea is to deploy the pipeline in an app so that patients can take a picture of the box and receive information about the medicine, in particular its posology. The use of a voice assistant that reads the recognized text aloud is also explored, as an interesting application for elderly or visually impaired people.

Padua Thesis and Dissertation Archive

Advanced document data extraction techniques to improve supply chain performance

Author: Sharma Vikash
Publication date: 01/07/2021

In this thesis, a novel machine learning technique to extract text-based information from scanned images has been developed. This information extraction is performed in the context of scanned invoices and bills used in financial transactions. These financial transactions contain a considerable amount of data that must be extracted, refined, and stored digitally before it can be used for analysis. Converting this data into a digital format is often a time-consuming process. Automation and data optimisation show promise as methods for reducing the time required and the cost of Supply Chain Management (SCM) processes, especially Supplier Invoice Management (SIM), Financial Supply Chain Management (FSCM) and Supply Chain procurement processes. This thesis uses a cross-disciplinary approach involving Computer Science and Operational Management to explore the benefit of automated invoice data extraction in business and its impact on SCM. The study adopts a multimethod approach based on empirical research, surveys, and interviews performed on selected companies.
The expert system developed in this thesis focuses on two distinct areas of research: Text/Object Detection and Text Extraction. For Text/Object Detection, the Faster R-CNN model was analysed. While this model yields outstanding results in terms of object detection, it performs poorly when image quality is low. A Generative Adversarial Network (GAN) model is proposed in response to this limitation. The GAN model consists of a generator network implemented with the help of the Faster R-CNN model and a discriminator that relies on PatchGAN. The output of the GAN model is text data with bounding boxes. For text extraction from the bounding boxes, a novel data extraction framework was designed. It consists of several processes: XML processing when an existing OCR engine is used, bounding box pre-processing, text clean-up, OCR error correction, spell check, type check, pattern-based matching, and finally a learning mechanism for automating future data extraction. The fields that the system extracts successfully are provided in key-value format.
The efficiency of the proposed system was validated using existing datasets such as SROIE and VATI. Real-time data was validated using invoices collected by two companies that provide invoice automation services in various countries. Currently, these scanned invoices are sent to an OCR system such as OmniPage, Tesseract, or ABBYY FRE to extract text blocks, and a rule-based engine is later used to extract relevant data. While this methodology is robust, the companies surveyed were not satisfied with its accuracy and therefore sought new, optimized solutions. To confirm the results, the engines were used to return XML-based files with the identified text and metadata. The output XML data was then fed into the new system for information extraction. This system uses the existing OCR engine and a novel, self-adaptive, learning-based OCR engine, which is based on the GAN model for better text identification. Experiments were conducted on various invoice formats to further test and refine its extraction capabilities. For cost optimisation and the analysis of spend classification, additional data were provided by another company in London that holds expertise in reducing its clients' procurement costs. This data was fed into the system to obtain a deeper level of spend classification and categorisation. This helped the company to reduce its reliance on human effort and allowed for greater efficiency compared with performing similar tasks manually using Excel sheets and Business Intelligence (BI) tools.
The intention behind the development of this novel methodology was twofold: first, to test and develop a novel solution that does not depend on any specific OCR technology; second, to increase information extraction accuracy over that of existing methodologies. Finally, the thesis evaluates the real-world need for the system and the impact it would have on SCM. The newly developed method is generic and can extract text from any given invoice, making it a valuable tool for optimizing SCM. In addition, the system uses a template-matching approach to ensure the quality of the extracted information.
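
The type-check, pattern-based-matching, and key-value steps named in the abstract above can be pictured with a minimal Python sketch. It is an illustration only and not the thesis's framework: the field names, regular expressions, and date format are all assumptions chosen for the example, and the learning mechanism and OCR error correction are left out.

```python
import re
from datetime import datetime

# Hypothetical field patterns (assumptions for illustration, not from the thesis).
# \b keeps "Total" from matching inside "Subtotal".
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"\binvoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
    ),
    "date": re.compile(r"\bdate\s*[:\-]?\s*(\d{2}/\d{2}/\d{4})", re.IGNORECASE),
    "total": re.compile(r"\btotal\s*[:\-]?\s*[£$€]?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def check_date(raw):
    """Type check: accept only real calendar dates in the assumed DD/MM/YYYY form."""
    try:
        return datetime.strptime(raw, "%d/%m/%Y").date().isoformat()
    except ValueError:
        return None


def check_amount(raw):
    """Type check: accept only values that parse as a decimal amount."""
    try:
        return float(raw.replace(",", ""))
    except ValueError:
        return None


TYPE_CHECKS = {
    "invoice_number": lambda raw: raw or None,
    "date": check_date,
    "total": check_amount,
}


def extract_fields(ocr_text):
    """Return only the fields that match a pattern and pass their type check."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        if match is None:
            continue
        value = TYPE_CHECKS[name](match.group(1))
        if value is not None:
            fields[name] = value
    return fields


if __name__ == "__main__":
    sample = "Invoice No: INV-2021-007\nDate: 01/07/2021\nSubtotal: 100.00\nTotal: $1,234.50"
    print(extract_fields(sample))
```

Fields that fail to match or fail their type check are omitted rather than guessed, which mirrors the abstract's statement that only successfully extracted fields are returned as key-value pairs.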

Repository@Hull - Worktribe

MIDV-2020: A Comprehensive Benchmark Dataset for Identity Document Analysis

Author: Arlazarov Vladimir V.; Bulatov Konstantin; Burie Jean-Christophe; Chernyshova Yulia; Emelianova Ekaterina; Luqman Muhammad Muzzamil; Ming Zuheng; Sheshkus Alexander; Skoryukina Natalya; Tropin Daniil; Usilin Sergey
Publication date: 01/07/2021

Identity document recognition is an important sub-field of document analysis. It deals with robust document detection, type identification, and text field recognition, as well as identity fraud prevention and document authenticity validation, given photos, scans, or video frames of a captured identity document. A significant amount of research has been published on this topic in recent years; however, a chief difficulty for such research is the scarcity of datasets, because the subject matter is protected by security requirements. The few identity document datasets that are available lack diversity of document types, capturing conditions, or variability of document field values. In addition, the published datasets were typically designed for only a subset of document recognition problems, not for complex identity document analysis. In this paper, we present the MIDV-2020 dataset, which consists of 1000 video clips, 2000 scanned images, and 1000 photos of 1000 unique mock identity documents, each with unique text field values and unique artificially generated faces, with rich annotation. For the presented benchmark dataset, baselines are provided for tasks such as document location and identification, text field recognition, and face detection. With 72409 annotated images in total, the proposed dataset is, as of the date of publication, the largest publicly available identity document dataset with variable artificially generated data, and we believe that it will prove invaluable for the advancement of the field of document analysis and recognition. The dataset is available for download at ftp://smartengines.com/midv-2020 and http://l3i-share.univ-lr.fr

arXiv.org e-Print Archive

Directory of Open Access Journals

Samara University

Knowledge Elicitation in Deep Learning Models

Author: Wagner Luiz da Cunha Ceulin
Publication date: 19/07/2023

Although deep learning has become a popular tool for solving modern problems across various domains, it presents a significant challenge: interpretability. This thesis journeys through a landscape of knowledge elicitation in deep learning models, shedding light on feature visualization, saliency maps, and model distillation techniques. These techniques were applied to two deep learning architectures: convolutional neural networks (CNNs) and a black-box package model (Google Vision). Our investigation provided valuable insights into their effectiveness in eliciting and interpreting the encoded knowledge. While they demonstrated potential, limitations were also observed, suggesting room for further development in this field. This work not only highlights the need for more transparent, more explainable deep learning models, but also encourages the development of innovative techniques to extract knowledge. It is about ensuring responsible deployment and emphasizing the importance of transparency and comprehension in machine learning. In addition to evaluating existing methods, this thesis also explores the potential for combining multiple techniques to enhance the interpretability of deep learning models. A blend of feature visualization, saliency maps, and model distillation techniques was used in a complementary manner to extract and interpret the knowledge from our chosen architectures. Experimental results highlight the utility of this combined approach, revealing a more comprehensive understanding of the models' decision-making processes. Furthermore, we propose a novel framework for systematic knowledge elicitation in deep learning, which cohesively integrates these methods. This framework showcases the value of a holistic approach toward model interpretability rather than relying on a single method. Lastly, we discuss the ethical implications of our work. As deep learning models continue to permeate various sectors, from healthcare to finance, ensuring their decisions are explainable and justified becomes increasingly crucial. Our research underscores this importance, laying the groundwork for creating more transparent, accountable AI systems in the future.

Repositório Aberto da Universidade do Porto

Trajectory Optimization of a Mobile Camera System for Maximizing Optical Character Recognition

Author: Zabaldo Alexander
Publication venue: Georgia Institute of Technology
Publication date: 15/09/2021

Camera systems in motion are subject to significant blurring effects that lead to a loss of information during the image capture. This is especially damaging for optical character recognition for which edge preservation is critical to achieving a high recognition rate. Using non-blind motion deblurring, a trajectory and point spread function can be designed to maximize the recognition rate while meeting endpoint constraints. Optimization through the use of radial basis function networks can therefore be used as a way to find ideal trajectories to reduce blurring effects and preserve text sharpness. This work investigates this problem using simulation of a blurred image capture process. The simulation is automated using radial basis function network optimization and a genetic algorithm to determine trajectories with the best recognition rate. Optimized trajectories yielded recognition scores with up to 57.3% improvement in simulation compared to an analogous linear profile. These results were then verified through physical experimentation with a real-world, controlled-blur image capture process that yielded up to 29.4% improvement across the same comparison. Results were then analyzed using spectral analysis to understand why the chosen trajectories preserve text edges. These findings can be applied to a wide variety of controlled mobile camera platforms, such as autonomous automobiles or unmanned aerial vehicles, to improve their ability to gather information from their environment.

M.S.
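
The link between a camera trajectory and its point spread function (PSF) can be illustrated with a small one-dimensional Python sketch. It is a simplified illustration under stated assumptions, not the thesis's method: it treats the PSF as the fraction of exposure time the camera spends at each pixel offset, uses a hypothetical cubic "dwell" profile as the non-linear trajectory, and omits deblurring, radial basis function networks, and the genetic algorithm entirely.

```python
def trajectory_psf(positions, num_bins, length):
    """Histogram of positions sampled at equal time steps during the exposure.

    Assumes every position lies in [0, length]; each bin is one pixel offset.
    """
    counts = [0] * num_bins
    for x in positions:
        idx = min(int(x / length * num_bins), num_bins - 1)
        counts[idx] += 1
    total = len(positions)
    return [c / total for c in counts]


def blur(signal, psf):
    """Full discrete convolution of a 1-D signal with the PSF."""
    out = [0.0] * (len(signal) + len(psf) - 1)
    for i, s in enumerate(signal):
        for j, p in enumerate(psf):
            out[i + j] += s * p
    return out


def max_edge_step(signal):
    """Largest jump between neighbouring samples, a crude edge-sharpness score."""
    return max(abs(b - a) for a, b in zip(signal, signal[1:]))


if __name__ == "__main__":
    n, length, bins = 1000, 4.0, 4
    # Both profiles cover the same span; only the velocity profile differs.
    linear = [length * k / n for k in range(n)]
    dwell = [length * (k / n) ** 3 for k in range(n)]  # hypothetical non-linear profile

    edge = [0.0] * 8 + [1.0] * 8  # an ideal step edge, such as a character stroke
    for name, traj in (("linear", linear), ("dwell", dwell)):
        psf = trajectory_psf(traj, bins, length)
        print(name, psf, max_edge_step(blur(edge, psf)))
```

A constant-velocity profile spreads exposure evenly over the travelled pixels, while a profile that dwells near one position concentrates the PSF there and leaves a steeper step in the blurred edge. Recognition rate after deblurring, which the thesis optimizes, is not modelled here.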

Scholarly Materials And Research @ Georgia Tech