684 research outputs found

    Named Entity Recognition in multilingual handwritten texts

    In our work we present a single deep-learning-based model for the automatic transcription and Named Entity Recognition of handwritten texts. This model leverages the generalization capabilities of recognition systems, combining Artificial Neural Networks and character n-gram models. The evaluation of this system is discussed and, as a consequence, a new evaluation metric is proposed. To improve the results with respect to this metric, different error correction strategies are assessed.
    Villanova Aparisi, D. (2021). Named Entity Recognition in multilingual handwritten texts. Universitat Politècnica de València. http://hdl.handle.net/10251/174942 TFG
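
    The character n-gram component mentioned in this abstract can be illustrated with a minimal sketch. It assumes a bigram model with add-one smoothing and hypothetical boundary symbols `^` and `$`; the abstract does not state the n-gram order, the smoothing, or how the model is combined with the neural network, so none of this should be read as the author's implementation.

```python
import math
from collections import Counter


def train_char_bigram(texts):
    """Count character bigrams over texts padded with boundary symbols."""
    pair_counts = Counter()
    context_counts = Counter()
    vocab = set()
    for text in texts:
        # "^" and "$" are assumed not to occur in the texts themselves.
        padded = "^" + text + "$"
        vocab.update(padded)
        for left, right in zip(padded, padded[1:]):
            pair_counts[(left, right)] += 1
            context_counts[left] += 1
    return pair_counts, context_counts, vocab


def log_prob(text, model):
    """Log-probability of a string under the add-one smoothed bigram model."""
    pair_counts, context_counts, vocab = model
    padded = "^" + text + "$"
    total = 0.0
    for left, right in zip(padded, padded[1:]):
        numerator = pair_counts[(left, right)] + 1
        denominator = context_counts[left] + len(vocab)
        total += math.log(numerator / denominator)
    return total


model = train_char_bigram(["anna", "ana"])
# A string that follows the training patterns scores higher than one that does not.
print(log_prob("ana", model) > log_prob("aan", model))
```

    A model of this kind can rescore candidate transcriptions so that character sequences resembling the training text are preferred.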

    Design of an Offline Handwriting Recognition System Tested on the Bangla and Korean Scripts

    This dissertation presents a flexible and robust offline handwriting recognition system which is tested on the Bangla and Korean scripts. Offline handwriting recognition is one of the most challenging unsolved problems in machine learning. While a few popular scripts (like Latin) have received a lot of attention, many other widely used scripts (like Bangla) have seen very little progress. Features such as connectedness and vowels structured as diacritics make Bangla a challenging script to recognize. A simple and robust design for offline recognition is presented which not only works reliably, but can also be used for almost any alphabetic writing system. The framework has been rigorously tested for Bangla, and experiments on the Korean script, whose two-dimensional arrangement of characters makes it a challenge to recognize, demonstrate how it can be adapted to other scripts. The base of this design is a character spotting network which detects the location of different script elements (such as characters and diacritics) in an unsegmented word image. A transcript is formed from the detected classes based on their corresponding location information. This is the first reported lexicon-free offline recognition system for Bangla and achieves a Character Recognition Accuracy (CRA) of 94.8%. This is also one of the most flexible architectures ever presented. Recognition of Korean was achieved with a 91.2% CRA. Also, a powerful technique of autonomous tagging was developed which can drastically reduce the effort of preparing a dataset for any script. The combination of the character spotting method and the autonomous tagging brings the entire offline recognition problem very close to a singular solution. Additionally, a database named the Boise State Bangla Handwriting Dataset was developed. This is one of the richest offline datasets currently available for Bangla, and it has been made publicly accessible to accelerate research progress. Many other tools were developed and experiments were conducted to validate this framework more rigorously by evaluating the method against external datasets (CMATERdb 1.1.1, Indic Word Dataset and REID2019: Early Indian Printed Documents). Offline handwriting recognition is an extremely promising technology and the outcome of this research moves the field significantly ahead.
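
    The step of forming a transcript from detected classes and their locations can be sketched as follows. This is only an illustration under stated assumptions: the `Detection` record, its fields, and the `min_score` threshold are hypothetical, and ordering purely by horizontal position is a simplification that ignores the diacritic placement and two-dimensional layouts the dissertation deals with.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One spotted script element (hypothetical record layout)."""
    label: str       # predicted character or diacritic class
    x_center: float  # horizontal position in the word image, in pixels
    score: float     # detector confidence in [0, 1]


def assemble_transcript(detections, min_score=0.5):
    """Drop low-confidence detections, then read the rest left to right."""
    kept = [d for d in detections if d.score >= min_score]
    kept.sort(key=lambda d: d.x_center)
    return "".join(d.label for d in kept)


spotted = [
    Detection("c", 30.0, 0.9),
    Detection("a", 10.0, 0.8),
    Detection("x", 15.0, 0.2),  # below threshold, discarded
    Detection("b", 20.0, 0.7),
]
print(assemble_transcript(spotted))
```

    Because the transcript is built only from detected classes and positions, no lexicon is consulted, which matches the lexicon-free property the abstract describes.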

    OCR Post Correction for Endangered Language Texts

    There is little to no data available to build natural language processing models for most endangered languages. However, textual data in these languages often exists in formats that are not machine-readable, such as paper books and scanned images. In this work, we address the task of extracting text from these resources. We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages and present a systematic analysis of how general-purpose OCR tools are not robust to the data-scarce setting of endangered languages. We develop an OCR post-correction method tailored to ease training in this data-scarce setting, reducing the recognition error rate by 34% on average across the three languages.
    Comment: Accepted to EMNLP 202
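
    An error-rate reduction of this kind can be measured along the following lines. The sketch assumes a character-level error rate based on Levenshtein distance and reads the reported 34% as a relative reduction; the abstract specifies neither, and the example rates below are made up purely for illustration.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, keeping one row at a time."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_char == hyp_char else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def char_error_rate(reference, hypothesis):
    """Edit distance normalised by the length of the reference transcription."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def relative_reduction(rate_before, rate_after):
    """Fraction of the original error rate removed by post-correction."""
    return (rate_before - rate_after) / rate_before


# Illustrative numbers only, not results from the paper.
print(char_error_rate("abcd", "abxd"))
print(relative_reduction(0.50, 0.33))
```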

    GENDERED READINGS OF RITUAL: EXPLORING NARRATIVES OF CHINESE RELIGION THROUGH NINETEENTH CENTURY CHRISTIAN MISSIONARY WRITINGS

    This thesis presents gendered narratives of Chinese religion as revealed through the writings of late nineteenth-century Christian missionaries. Through a recontextualized, material and practical approach to these sources I uncover examples of non-elite ritual practice. I utilize the personal experiences and philological work of Protestant men and women to explore instances of religion at two well-known sites of Chinese Buddhism, Putuoshan and Wutaishan. I reveal how religious adherents, both lay and ordained, are classified and depicted through a Western Protestant lens. This exploration highlights how personal and non-elite narratives of Chinese religion produced by missionary women have been continually undervalued within the academic study of Chinese religion. I propose a means to overcome embedded Protestant biases within our own scholarly tradition through acknowledging the authority of ritual, of human action, within Chinese religion and within secondary missionary sources.

    Metaheuristic approach on feature extraction and classification algorithm for handwritten character recognition

    Handwritten Character Recognition (HCR) is the process of converting handwritten text into machine-readable form, and it comprises three stages: preprocessing, feature extraction and classification. This study addresses issues with HCR performance, particularly at the feature extraction and classification stages. At the feature extraction stage, the problem identified concerns extracting a continuous, minimum-length chain code, specifically its starting and revisit points, which arise from the branches of a handwritten character. At the classification stage, the problems identified concern the input features, which result in low classification accuracy, and the classification model, particularly the Artificial Neural Network (ANN) learning problem. Thus, the aim of this study is to extract a continuous chain code feature for handwritten characters while minimising its length, and then to develop and enhance an ANN classification model based on the extracted chain code in order to identify handwritten characters better. Four phases were involved in accomplishing this aim. First, a thinning algorithm was applied to remove redundant pixels in the binary image of the handwritten character. Second, a graph-based metaheuristic feature extraction algorithm was proposed to extract the continuous chain code feature of the handwritten character image while minimising the route length of the chain code. Graph theory was utilised as the solution representation, and two metaheuristic approaches were adopted: Harmony Search Algorithm (HSA) and Flower Pollination Algorithm (FPA). As a result, the HSA graph-based metaheuristic feature extraction algorithm was proposed to extract the continuous chain code feature for handwritten characters. The experiments conducted demonstrated that the HSA graph-based metaheuristic feature extraction algorithm generated the shortest chain code route length in less computational time than FPA. Furthermore, compared with previous works, the proposed algorithm showed notable performance in terms of the shortest chain code route length for handwritten characters. Third, a feature vector was derived to address the input feature issue. The feature vector was derived using two proposed formation rules, the Local Value Formation Rule (LVFR) and the Global Value Formation Rule (GVFR), to create the image features for classification. An ANN was applied to classify the handwritten characters based on the derived feature vector. Fourth, a hybrid Firefly Algorithm (FA) and ANN (FA-ANN) classification model was proposed to solve the ANN learning issue. A confusion matrix was generated to evaluate the performance of the model in terms of precision, sensitivity, specificity, F-score, accuracy and error rate. As a result, the proposed hybrid FA-ANN classification model is superior in classifying handwritten characters to the proposed feature vector-based ANN, with a 1.59 percent increase in accuracy. Furthermore, the proposed hybrid FA-ANN also performs better than previous related works on HCR.
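
    The evaluation measures listed in this abstract can be derived from a confusion matrix as sketched below. The sketch assumes rows are true classes, columns are predicted classes, and per-class scores are computed one-vs-rest; the study does not say how it aggregates over classes, and the example matrix is invented for illustration.

```python
def per_class_metrics(matrix, k):
    """One-vs-rest metrics for class k of a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                  # class k predicted as something else
    fp = sum(row[k] for row in matrix) - tp   # other classes predicted as k
    tn = total - tp - fn - fp
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    if precision + sensitivity:
        f_score = 2 * precision * sensitivity / (precision + sensitivity)
    else:
        f_score = 0.0
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_score": f_score,
    }


def accuracy(matrix):
    """Share of samples on the diagonal; the error rate is 1 minus this value."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


# Invented three-class example: rows are true classes, columns are predictions.
confusion = [
    [5, 1, 0],
    [2, 6, 2],
    [0, 1, 3],
]
print(per_class_metrics(confusion, 0))
print(accuracy(confusion), 1 - accuracy(confusion))
```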

    Text Recognition for Nepalese Manuscripts in Pracalit Script

    This dataset is a model for handwritten text recognition (HTR) of Sanskrit and Newar Nepalese manuscripts in Pracalit script. This paper introduces the state of the field in Newar literature, Newar manuscripts, and HTR engines. It explains our methodology for developing the requisite ground truth, consisting of manuscript images and corresponding transcriptions, and for training our model with a PyLaia engine, and it discusses this model’s limitations. This dataset, shared on Zenodo, can be used by anyone working with manuscripts in Pracalit script, which will benefit the fields of Indology and Newar studies, as well as historical and linguistic analysis.