2,405 research outputs found

    A* Path Planning for Line Segmentation of Handwritten Documents


    A Novel Pipeline for Improving Optical Character Recognition through Post-processing Using Natural Language Processing

    Optical Character Recognition (OCR) technology finds applications in digitizing books and unstructured documents, as well as in other domains such as mobility statistics, law enforcement, traffic, and security systems. State-of-the-art methods perform well on OCR of printed text such as license plates and shop names. However, existing techniques achieve limited accuracy on printed textbooks and handwritten text, which may be attributed to similar-looking characters and variations in handwritten characters. Since these issues are difficult to address with OCR technology alone, we propose a post-processing approach using Natural Language Processing (NLP) tools. This work presents an end-to-end pipeline that first performs OCR on the handwritten or printed text and then improves its accuracy using NLP.
    Comment: Accepted in IEEE GCON (IEEE Guwahati Subsection Conference) 202
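    As a rough illustration of such a pipeline (not the paper's actual components), the sketch below runs OCR and then corrects each token against a vocabulary; pytesseract as the OCR engine, the toy lexicon, and the stdlib fuzzy matcher are all illustrative assumptions.

```python
# Minimal sketch of an OCR + NLP post-correction pipeline, assuming
# pytesseract for the OCR step; the vocabulary and the fuzzy-matching
# heuristic are illustrative stand-ins for the paper's NLP tools.
import difflib
import pytesseract
from PIL import Image

VOCAB = ["recognition", "handwritten", "character", "document"]  # toy lexicon

def ocr_with_postprocessing(image_path: str) -> str:
    raw = pytesseract.image_to_string(Image.open(image_path))
    corrected = []
    for token in raw.split():
        # Replace a token with its closest vocabulary entry when one is
        # sufficiently similar; otherwise keep the OCR output unchanged.
        match = difflib.get_close_matches(token.lower(), VOCAB, n=1, cutoff=0.8)
        corrected.append(match[0] if match else token)
    return " ".join(corrected)
```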

    A Machine Learning-based Approach to Vietnamese Handwritten Medical Record Recognition

    Handwritten text recognition has been an active research topic within the computer vision field. Existing deep-learning solutions are practical; however, recognizing Vietnamese handwriting remains a challenge due to the presence of six extra distinctive tonal symbols and additional vowels. Vietnam is a developing country with a population of approximately 100 million that has focused on digital transformation only in recent years, and so it holds a significant number of physical documents that need to be digitized. This transformation is especially urgent in the public health sector, where medical records are still mostly handwritten and growing rapidly in number. Digitization would not only help current public health management but also support preparation for future public health emergencies, enabling efficient and precise care, especially in emergency units. We propose a solution for Vietnamese text recognition that is integrated into an end-to-end document-digitization system. We do so by performing segmentation to the word level and then leveraging an artificial neural network that combines a convolutional neural network (CNN) with a long short-term memory (LSTM) recurrent neural network to propagate sequence information. In experiments on records written by 12 doctors, we obtained encouraging character error rate (CER) and word error rate (WER) results of 6.47% and 19.14%, respectively.
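    The sketch below shows the general shape of such a CNN + LSTM recognizer in PyTorch; the layer sizes, input dimensions, and number of output classes are illustrative assumptions, not the paper's configuration.

```python
# Minimal CRNN sketch: a CNN extracts per-column features from a word image,
# a bidirectional LSTM propagates sequence information along the width, and a
# linear layer emits per-timestep class scores (e.g., for CTC decoding).
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, num_classes: int, img_height: int = 32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
        )
        feat_height = img_height // 4              # two 2x2 poolings
        self.rnn = nn.LSTM(128 * feat_height, 256,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):                           # x: (batch, 1, H, W)
        f = self.cnn(x)                             # (batch, C, H/4, W/4)
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # sequence along width
        out, _ = self.rnn(f)
        return self.fc(out)                         # (batch, W/4, num_classes)

model = CRNN(num_classes=100)                       # vocabulary size assumed
logits = model(torch.randn(2, 1, 32, 128))          # two 32x128 word crops
```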

    Advances in Document Layout Analysis

    Handwritten Text Segmentation (HTS) is a task within the Document Layout Analysis field that aims to detect and extract the different page regions of interest found in handwritten documents. HTS remains an active topic that has gained importance over the years, due to the increasing demand to provide textual access to the myriad handwritten document collections held by archives and libraries. This thesis treats HTS as a task that must be tackled in two specialized phases: detection and extraction. We see the detection phase fundamentally as a recognition problem that yields the vertical position of each region of interest as a by-product. The extraction phase consists of calculating the best contour coordinates of each region using the position information provided by the detection phase.

    Our proposed detection approach allows us to address both higher-level regions (paragraphs, diagrams, etc.) and lower-level regions such as text lines. In the case of text-line detection, we model the problem to ensure that the system's yielded vertical position approximates the fictitious line that connects the lower part of the grapheme bodies in a text line, commonly known as the baseline. One of the main contributions of this thesis is that the proposed modelling approach allows us to include prior information regarding the layout of the documents being processed. This is performed via a Vertical Layout Model (VLM). We develop a Hidden Markov Model (HMM) based framework to tackle both region detection and classification as an integrated task, and we study the performance and ease of use of the proposed approach on many corpora. We review the modelling simplicity of our approach for processing regions at different levels of information: text lines, paragraphs, titles, etc. We study the impact of adding deterministic and/or probabilistic prior information and restrictions via the VLM that our approach provides.

    Having a separate phase that accurately yields the detection position (baselines in the case of text lines) of each region greatly simplifies the problem that must be tackled during the extraction phase. In this thesis we propose to use a distance map that takes into consideration the grey-scale information in the image. This allows us to yield extraction frontiers that are equidistant to the adjacent text regions. We study how our approach scales its accuracy proportionally to the quality of the provided detection position, and our extraction approach gives near-perfect results when human-reviewed baselines are provided.
    Bosch Campos, V. (2020). Advances in Document Layout Analysis [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/138397
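    As a heavily simplified illustration of the distance-map idea (not the thesis's exact formulation, which works with the grey-scale information directly), the sketch below places the separating frontier between two detected baselines at the row that lies farthest from any ink.

```python
# Rough sketch: binarize the page, compute the distance of every pixel to the
# nearest ink pixel, and pick the frontier row between two baselines where
# that distance is, on average, maximal. Thresholds are assumptions.
import numpy as np
from scipy.ndimage import distance_transform_edt

def separation_row(gray, baseline_a, baseline_b, ink_threshold=128):
    ink = gray < ink_threshold                  # dark pixels count as ink
    dist = distance_transform_edt(~ink)         # distance of each pixel to ink
    band = dist[baseline_a:baseline_b, :]       # strip between the baselines
    return baseline_a + int(np.argmax(band.mean(axis=1)))

page = np.full((200, 400), 255, dtype=np.uint8)
page[40:60, 50:350] = 0                         # synthetic text line 1
page[120:140, 50:350] = 0                       # synthetic text line 2
print(separation_row(page, 60, 120))            # frontier between the lines
```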

    BN-DRISHTI: Bangla Document Recognition through Instance-level Segmentation of Handwritten Text Images

    Handwriting recognition remains challenging for some of the most widely spoken languages, like Bangla, due to the complexity of line and word segmentation brought about by the curvilinear nature of the writing and a lack of quality datasets. This paper addresses the segmentation problem by introducing a state-of-the-art method (BN-DRISHTI) that combines a deep learning-based object detection framework (YOLO) with Hough and affine transformations for skew correction. However, training deep learning models requires a massive amount of data. Thus, we also present an extended version of the BN-HTRd dataset comprising 786 full-page handwritten Bangla document images, line- and word-level annotations for segmentation, and corresponding ground truths for word recognition. Evaluation on the test portion of our dataset yielded an F-score of 99.97% for line segmentation and 98% for word segmentation. For comparative analysis, we used three external Bangla handwritten datasets, namely BanglaWriting, WBSUBNdb_text, and ICDAR 2013, on which our system outperformed existing approaches by a significant margin, further justifying the performance of our approach on completely unseen samples.
    Comment: Will be published under the Springer Lecture Notes in Computer Science (LNCS) series, as part of ICDAR WML 202
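    A minimal sketch of Hough-based skew estimation followed by an affine rotation, in the spirit of the skew-correction step described above; the Canny and Hough parameters are illustrative assumptions, not the paper's settings.

```python
# Estimate page skew from the median angle of detected line segments, then
# undo it with an affine rotation. Expects an 8-bit grayscale page image.
import cv2
import numpy as np

def deskew(image):
    edges = cv2.Canny(image, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                            minLineLength=image.shape[1] // 4, maxLineGap=20)
    if lines is None:
        return image                        # nothing detected: leave as-is
    angles = [np.degrees(np.arctan2(y2 - y1, x2 - x1))
              for x1, y1, x2, y2 in lines[:, 0]]
    skew = float(np.median(angles))         # robust to a few outlier segments
    h, w = image.shape[:2]
    # Rotate by the measured angle to bring text lines horizontal; the sign
    # convention may need flipping depending on the coordinate system.
    M = cv2.getRotationMatrix2D((w / 2, h / 2), skew, 1.0)
    return cv2.warpAffine(image, M, (w, h),
                          flags=cv2.INTER_LINEAR, borderValue=255)
```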

    Multi-script handwritten character recognition: Using feature descriptors and machine learning


    Standardizing, Segmenting and Tenderizing Letters and Improving the Quality of Envelope Images to Extract Postal Addresses

    In most mechanized postal systems, envelopes are scanned with mechanical instruments according to the postal standard. In the standard format, the envelope image has no tilt, lines run along the horizontal axis, and words are placed correctly and without slant. This article presents a new algorithm for rotating, segmenting, and tenderizing letters to standardize envelope images and increase their quality, which can serve as three successful pre-processing algorithms in any text-identification system. In the proposed algorithm, letters scanned with any form and tilt are rotated and standardized by a simple two-step procedure based on what is written on the envelope, without requiring the calculation of the tilt angle. After standardization, the main regions of the image are located using histogram information. Then, in a simple algorithm, candidate points are selected from the pixels belonging to the text on the envelope, and quality improvement and tenderization are applied to the main regions of the image. The advantages of the proposed algorithm include no need for additional mechanical equipment, less computation, simplicity, and consideration of the structure of words on the envelope in all preprocessing phases.
    DOI: http://dx.doi.org/10.11591/ijece.v2i3.34
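    As a hedged sketch of the histogram step described above, the function below finds horizontal text bands on a binarized envelope via a projection profile; the ink threshold and band heuristic are illustrative assumptions.

```python
# Projection-histogram region finding: count ink pixels per row and report
# contiguous row ranges whose counts exceed a small threshold.
import numpy as np

def text_bands(binary, min_ink=3):
    """Return (start, end) row ranges of likely text regions.
    binary: 2-D array, nonzero where ink is present."""
    profile = (binary > 0).sum(axis=1)      # ink pixels per row
    rows = profile > min_ink
    bands, start = [], None
    for i, on in enumerate(rows):
        if on and start is None:
            start = i                       # band opens
        elif not on and start is not None:
            bands.append((start, i))        # band closes
            start = None
    if start is not None:
        bands.append((start, len(rows)))
    return bands
```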

    Character Recognition

    Character recognition is one of the most widely used pattern recognition technologies in practical applications. This book presents recent advances relevant to character recognition, from technical topics such as image processing, feature extraction, and classification, to new applications including human-computer interfaces. The goal of this book is to provide a reference source for academic research and for professionals working in the character recognition field.