4 research outputs found

    Data Centric Domain Adaptation for Historical Text with OCR Errors

    We propose new methods for in-domain and cross-domain Named Entity Recognition (NER) on historical data for Dutch and French. For the cross-domain case, we address domain shift by integrating unsupervised in-domain data via contextualized string embeddings, and OCR errors by injecting synthetic OCR errors into the source domain, a form of data-centric domain adaptation. We propose a general approach to imitate OCR errors in arbitrary input data. Both our cross-domain and our in-domain results outperform several strong baselines and establish state-of-the-art results. We publish preprocessed versions of the French and Dutch Europeana NER corpora.
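
    The idea of injecting synthetic OCR errors into clean source-domain text can be illustrated with a minimal sketch. The confusion table, the function name `inject_ocr_noise`, and the `error_rate` parameter are all assumptions made for illustration; the abstract does not describe how the authors' own noise model is built.

```python
import random

# Hypothetical confusion table of OCR look-alike substitutions.
# These pairs are illustrative assumptions, not taken from the paper.
CONFUSIONS = {
    "e": ["c", "o"],
    "l": ["1", "i"],
    "m": ["rn"],
    "s": ["f"],
    "n": ["u"],
}


def inject_ocr_noise(text, error_rate=0.1, seed=None):
    """Replace confusable characters with a look-alike at the given rate."""
    rng = random.Random(seed)
    noisy = []
    for ch in text:
        # Only characters listed in the table can be corrupted.
        if ch in CONFUSIONS and rng.random() < error_rate:
            noisy.append(rng.choice(CONFUSIONS[ch]))
        else:
            noisy.append(ch)
    return "".join(noisy)
```

    Applying such a function to the tokens of a clean training corpus, while leaving the entity labels unchanged, yields training data that looks more like OCR output.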

    Improving Text Recognition Accuracy Using Syntax-Based Techniques (Melhorando a precisão do reconhecimento de texto usando técnicas baseadas em sintaxe)

    Advisors: Guido Costa Souza de Araújo, Marcio Machado Pereira. Dissertation (master's), Universidade Estadual de Campinas, Instituto de Computação.
    Abstract: Owing to the large amount of visual information available today, text detection and recognition in natural scene images has been gaining importance. The goal is to locate the regions of an image that contain text and to recognize that text. The task is typically divided into two parts: text detection and text recognition. Although techniques for this problem have improved in recent years, their heavy use of hardware resources and high computational cost have considerably hindered their execution on highly constrained embedded systems (e.g., cellphones and smart TVs). Text detection and recognition methods that do run on such systems perform poorly compared with state-of-the-art solutions on other computing platforms. Although various post-correction methods currently exist that improve results on scanned historical documents, little effort has gone into applying them to scene images. In this work, we explored a set of post-correction methods and proposed new heuristics to improve results on scene images, using the Tesseract text recognition software as a prototyping base.
    We analyzed the main error-correction methods available in the literature and found that the best combination included substitution, elimination of the last characters, and compounder. In addition, results improved when we introduced a new heuristic based on how often the candidate results appear in frequency databases for categories such as magazines, newspapers, fiction texts, and the web. To locate errors and avoid overcorrection, different restrictions, obtained using Tesseract with the training database, were considered. We selected as the best restriction the certainty of the best result obtained by Tesseract. The experiments were carried out with seven databases used in text recognition and text detection/recognition competitions. In all databases, for both training and test data, the results of Tesseract with the proposed post-correction method improved considerably compared with the results obtained with Tesseract alone.
    Degree: Mestrado (master's), Ciência da Computação; Mestra em Ciência da Computação. 4716-1488887.335287/2019-00, 1774549. Funcamp. CAPE
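
    The post-correction idea described in the abstract above (candidate generation by substitution and by dropping trailing characters, ranking by corpus frequency, and a certainty gate to avoid overcorrection) can be sketched roughly as follows. The frequency table, the threshold value, and the function names are hypothetical; the sketch does not call Tesseract and is not the dissertation's implementation.

```python
import string

# Hypothetical word-frequency table; a real one would be built from corpora
# such as magazines, newspapers, fiction, and web text.
WORD_FREQ = {"stop": 900, "shop": 700, "step": 400, "exit": 800}


def candidates(word):
    """Yield the word, its single-character substitutions, and trailing deletions."""
    yield word
    for i in range(len(word)):
        for c in string.ascii_lowercase:
            if c != word[i]:
                yield word[:i] + c + word[i + 1:]
    # Elimination of the last one or two characters.
    for k in (1, 2):
        if len(word) > k:
            yield word[:-k]


def post_correct(word, confidence, threshold=0.8):
    """Correct a recognized word only when the recognizer was not confident."""
    if confidence >= threshold:
        return word  # trust the recognizer; avoids overcorrection
    known = [c for c in candidates(word.lower()) if c in WORD_FREQ]
    if not known:
        return word  # nothing plausible found; leave the output untouched
    return max(known, key=lambda c: WORD_FREQ[c])
```

    With this toy table, a low-confidence reading such as "st0p" has two known candidates ("step" and "stop"), and the more frequent one is chosen.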

    Article Segmentation in Digitised Newspapers

    Digitisation projects preserve and make available vast quantities of historical text. Among these, newspapers are an invaluable resource for the study of human culture and history. Article segmentation identifies each region in a digitised newspaper page that contains an article. Digital humanities, information retrieval (IR), and natural language processing (NLP) applications over digitised archives improve access to text and allow automatic information extraction. The lack of article segmentation impedes these applications. We contribute a thorough review of the existing approaches to article segmentation. Our analysis reveals divergent interpretations of the task, and inconsistent and often ambiguously defined evaluation metrics, making comparisons between systems challenging. We solve these issues by contributing a detailed task definition that examines the nuances and intricacies of article segmentation that are not immediately apparent. We provide practical guidelines on handling borderline cases and devise a new evaluation framework that allows insightful comparison of existing and future approaches. Our review also reveals that the lack of large datasets hinders meaningful evaluation and limits machine learning approaches. We solve these problems by contributing a distant supervision method for generating large datasets for article segmentation. We manually annotate a portion of our dataset and show that our method produces article segmentations over characters nearly as well as costly human annotators. We reimplement the seminal textual approach to article segmentation (Aiello and Pegoretti, 2006) and show that it does not generalise well when evaluated on a large dataset. We contribute a framework for textual article segmentation that divides the task into two distinct phases: block representation and clustering. 
We propose several techniques for block representation and contribute a novel, highly compressed semantic representation called similarity embeddings. We evaluate and compare different clustering techniques, and apply label propagation (Zhu and Ghahramani, 2002) to spread headline labels to similar blocks. Our similarity embeddings and label propagation approach substantially outperforms Aiello and Pegoretti but still falls short of human performance. Exploring visual approaches to article segmentation, we reimplement and analyse the state-of-the-art Bansal et al. (2014) approach. We contribute an innovative 2D Markov model approach that captures reading order dependencies and reduces the structured labelling problem to a Markov chain that we decode with Viterbi (1967). Our approach substantially outperforms Bansal et al., achieves accuracy as good as human annotators, and establishes a new state of the art in article segmentation. Our task definition, evaluation framework, and distant supervision dataset will encourage progress in the task of article segmentation. Our state-of-the-art textual and visual approaches will allow sophisticated IR and NLP applications over digitised newspaper archives, supporting research in the digital humanities.
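
    Decoding a Markov chain with the Viterbi algorithm, as mentioned in the abstract above, can be sketched generically as follows. This is a textbook first-order Viterbi decoder over log-probabilities, not the thesis's 2D Markov model; the state names, observation names, and probability tables in the usage note are invented for illustration.

```python
import math


def to_log(table):
    """Convert a nested dict (or flat dict) of probabilities to log-probabilities."""
    return {
        k: to_log(v) if isinstance(v, dict) else math.log(v)
        for k, v in table.items()
    }


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most probable state sequence for the observations."""
    # best[t][s]: best log-probability of any path ending in state s at step t.
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    return path[::-1]
```

    As a hypothetical usage, the states could be "new" (a block starts a new article) and "same" (a block continues the current article), with observations such as "headline" and "body" read off each block in reading order; the decoded path then gives one label per block.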