10 research outputs found

    Digitale Erschließung einer Sammlung von Volksliedern aus dem deutschsprachigen Raum

    This paper describes an ongoing project for the digital indexing of a large collection of folk songs from the German-speaking world, with the aim of later making it available through a public information system. In addition to the usual exploration of scanned facsimiles of the original song sheets, this information system is intended to provide quantitative access to the data, making them searchable and analysable by various parameters. The goal of the project is therefore not only to sustainably digitise and make accessible a collection of song sheets that is unique in this form, but also to search computationally for salient patterns, in the form of recurring phrases and themes or melodic universals, that are characteristic of particular regions or periods.

    A comprehensive dataset of environmentally contaminated sites in the state of São Paulo in Brazil

    In the Brazilian state of São Paulo, contaminated sites (CSs) pose threats to the health, environment and socioeconomic situation of local populations. Over the past two decades, the Environmental Agency of São Paulo (CETESB) has monitored these known CSs. This paper presents a dataset produced by digitising the CETESB reports and making them publicly accessible in English. The dataset reports on qualitative aspects of contamination within the registered sites (e.g., contamination type and spread) and their management status. The data was extracted from the CETESB reports using a machine-learning computer vision pipeline comprising two components: an optical character recognition (OCR) engine for text extraction and a convolutional neural network (CNN) image classifier that identifies checked boxes. The digitisation was followed by harmonisation and quality-assurance processes to ensure the consistency and validity of the data. Making this dataset accessible will allow future work on predictive analysis and decision-making and will inform the policy-making required to improve the management of CSs in Brazil.
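
    The two-component pipeline described in this abstract can be illustrated with a short, hypothetical sketch. The paper does not publish its code here, so pytesseract (as the OCR engine) and a small tf.keras CNN (as the checkbox classifier) are stand-in assumptions, and the crop size, language setting and training setup are invented for illustration.

        # Minimal sketch, not the authors' code: pytesseract stands in for the OCR
        # engine and a small tf.keras CNN for the checkbox classifier; crop size,
        # language setting and (untrained) weights are illustrative assumptions.
        import numpy as np
        import pytesseract
        import tensorflow as tf
        from PIL import Image

        def extract_text(report_page: Image.Image) -> str:
            """Component 1: OCR engine pulls free text from a scanned report page."""
            # Assumes the Portuguese language pack for Tesseract is installed.
            return pytesseract.image_to_string(report_page, lang="por")

        def build_checkbox_classifier() -> tf.keras.Model:
            """Component 2: small CNN labelling a cropped checkbox as checked/unchecked."""
            model = tf.keras.Sequential([
                tf.keras.layers.Input(shape=(32, 32, 1)),          # grayscale checkbox crop
                tf.keras.layers.Conv2D(16, 3, activation="relu"),
                tf.keras.layers.MaxPooling2D(),
                tf.keras.layers.Conv2D(32, 3, activation="relu"),
                tf.keras.layers.MaxPooling2D(),
                tf.keras.layers.Flatten(),
                tf.keras.layers.Dense(64, activation="relu"),
                tf.keras.layers.Dense(1, activation="sigmoid"),    # P(box is checked)
            ])
            model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
            return model  # would be trained on labelled checkbox crops before use

        def is_checked(model: tf.keras.Model, checkbox_crop: Image.Image) -> bool:
            """Classify one checkbox crop taken from a known position on the form."""
            x = np.asarray(checkbox_crop.convert("L").resize((32, 32)), dtype="float32") / 255.0
            return float(model.predict(x[None, ..., None], verbose=0)[0, 0]) > 0.5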

    Optimization of Image Processing Algorithms for Character Recognition in Cultural Typewritten Documents

    Linked Data is used in various fields as a new way of structuring and connecting data. Cultural heritage institutions have been using linked data to improve archival descriptions and facilitate the discovery of information. Most archival records have digital representations of physical artifacts in the form of scanned images that are non-machine-readable. Optical Character Recognition (OCR) recognizes text in images and translates it into machine-encoded text. This paper evaluates the impact of image processing methods and parameter tuning on OCR applied to typewritten cultural heritage documents. The approach uses a multi-objective problem formulation that minimizes Levenshtein edit distance and maximizes the number of correctly identified words, with a non-dominated sorting genetic algorithm (NSGA-II) tuning the methods' parameters. Evaluation results show that parameterization by digital representation typology benefits the performance of image pre-processing algorithms in OCR. Furthermore, our findings suggest that employing image pre-processing algorithms in OCR might be more suitable for typologies where the text recognition task without pre-processing does not produce good results. In particular, Adaptive Thresholding, Bilateral Filter, and Opening are the best-performing algorithms for the theatre plays' covers, letters, and overall dataset, respectively, and should be applied before OCR to improve its performance.
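
    The tuning idea in this abstract, NSGA-II searching the pre-processing parameter space so that OCR output minimizes Levenshtein edit distance and maximizes correctly identified words, can be sketched roughly as follows. pymoo, OpenCV and pytesseract are assumptions standing in for the authors' actual tooling; only the adaptive-thresholding parameters are tuned here, and the parameter ranges and word-match metric are illustrative simplifications.

        # Minimal sketch under stated assumptions: pymoo's NSGA-II tunes the two
        # parameters of OpenCV adaptive thresholding so that pytesseract output
        # minimises Levenshtein distance to a reference transcription and maximises
        # the number of correctly recognised words (negated, since pymoo minimises).
        import cv2
        import numpy as np
        import pytesseract
        from pymoo.algorithms.moo.nsga2 import NSGA2
        from pymoo.core.problem import ElementwiseProblem
        from pymoo.optimize import minimize

        def levenshtein(a: str, b: str) -> int:
            """Plain dynamic-programming edit distance."""
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                cur = [i]
                for j, cb in enumerate(b, 1):
                    cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
                prev = cur
            return prev[-1]

        class OcrTuning(ElementwiseProblem):
            """Decision variables: adaptive-threshold block size (encoded) and constant C."""

            def __init__(self, image, ground_truth):
                super().__init__(n_var=2, n_obj=2,
                                 xl=np.array([1.0, 0.0]), xu=np.array([30.0, 20.0]))
                self.image = image          # grayscale page as a NumPy array
                self.truth = ground_truth   # hand-corrected reference transcription

            def _evaluate(self, x, out, *args, **kwargs):
                block = 2 * int(round(x[0])) + 1   # block size must be odd and >= 3
                binarised = cv2.adaptiveThreshold(
                    self.image, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                    cv2.THRESH_BINARY, block, float(x[1]))
                text = pytesseract.image_to_string(binarised)
                edit = levenshtein(text, self.truth)
                correct = len(set(text.split()) & set(self.truth.split()))  # crude word match
                out["F"] = [edit, -correct]

        # Hypothetical usage on one scanned letter and its transcription:
        # img = cv2.imread("letter.png", cv2.IMREAD_GRAYSCALE)
        # res = minimize(OcrTuning(img, open("letter.txt").read()),
        #                NSGA2(pop_size=20), ("n_gen", 10), seed=1)
        # print(res.X, res.F)  # Pareto-optimal parameter settings and their objectives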

    An OCR Post-correction Approach using Deep Learning for Processing Medical Reports

    According to a recent Deloitte study, the COVID-19 pandemic continues to place a huge strain on the global health care sector. COVID-19 has also catalysed digital transformation across the sector for improving operational efficiencies. As a result, the amount of digitally stored patient data such as discharge letters, scan images, test results or free-text entries by doctors has grown significantly. In 2020, 2,314 exabytes of medical data were generated globally. This medical data does not conform to a generic structure and is mostly in the form of unstructured, digitally generated or scanned paper documents stored as part of a patient's medical reports. This unstructured data is digitised using an Optical Character Recognition (OCR) process. A key challenge here is that the accuracy of the OCR process varies due to the inability of current OCR engines to correctly transcribe scanned or handwritten documents in which text may be skewed, obscured or illegible. This is compounded by the fact that the processed text comprises specific medical terminology that does not necessarily form part of general language lexicons. The proposed work uses a deep neural network based self-supervised pre-training technique, Robustly Optimized Bidirectional Encoder Representations from Transformers (RoBERTa), which can learn to predict hidden (masked) sections of text to fill in the gaps of non-transcribable parts of the documents being processed. Evaluating the proposed method on domain-specific datasets, which include real medical documents, shows a significantly reduced word error rate, demonstrating the effectiveness of the approach.
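
    The masked fill-in step described in this abstract can be sketched with the Hugging Face transformers library. The generic roberta-base checkpoint used below is an assumption; the paper's model is pre-trained/fine-tuned on medical text, which is not reproduced here, and the example sentence and gap marker are hypothetical.

        # Minimal sketch, assuming the generic roberta-base checkpoint from Hugging
        # Face transformers; the paper's medical-domain adaptation is not shown,
        # and the example sentence is invented.
        from transformers import pipeline

        fill_mask = pipeline("fill-mask", model="roberta-base")

        def fill_gap(ocr_line_with_gap: str) -> str:
            """Replace one non-transcribable span (marked <mask>) with the top prediction."""
            candidates = fill_mask(ocr_line_with_gap)   # ranked predictions for the masked span
            return candidates[0]["sequence"]            # highest-scoring completed sentence

        # Hypothetical OCR output where one word was illegible in the scanned report:
        print(fill_gap("The patient was discharged with a prescription for <mask> medication."))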

    Перспективы развития фундаментальных наук. Т. 7 : IT-технологии и электроника

    This collection contains the contributions of participants in the XIX International Conference of Students, Postgraduates and Young Scientists "Prospects of Fundamental Sciences Development", presented in the section "IT Technologies and Electronics". It is intended for students, postgraduates, young scientists and lecturers specialising in intelligent control systems, automated information processing and control systems, information security, nanoelectronics, the production and study of nanomaterials, optoelectronics and nanophotonics, plasma emission electronics, intelligent power electronics, microwave electronics, radar, television and radio communication systems, radiometry and the propagation of radio-frequency and acoustic waves, as well as pulse and radio-frequency measurements.

    Corpus linguistics for History: the methodology of investigating place-name discourses in digitised nineteenth-century newspapers

    The increasing availability of historical sources in digital form has led to calls for new forms of reading in history. This thesis responds to these calls by exploring the potential of approaches from the field of corpus linguistics to be useful to historical research. Specifically, two sets of methodological issues are considered that arise when corpus linguistic methods are used on digitised historical sources. The first set of issues surrounds optical character recognition (OCR), computerised text transcription based on an image reproduction of the original printed source. This process is error-prone, which leads to potentially unreliable word counts. I find that OCR errors are very varied, and differ more from their corrections than natural spelling variants differ from a standard form. As a result of OCR errors, the test OCR corpus examined has a slightly inflated overall token count (as compared to a hand-corrected gold standard) and a vastly inflated type count. Not all spurious types are infrequent: around 7% of types occurring at least 10 times in my test OCR corpus are spurious. I also find evidence that real-word errors occur. Assessing the impact of OCR errors on two common collocation statistics, Mutual Information (MI) and Log-Likelihood (LL), I find that both are affected by OCR errors. This analysis also provides evidence that OCR errors are not homogeneously distributed throughout the corpus. Nevertheless, for small collocation spans, MI rankings are broadly reliable in OCR data, especially when used in combination with an LL threshold. Large spans are best avoided, as both statistics become increasingly less reliable in OCR data when used with larger spans. Both statistics attract non-negligible rates of false positives. Using a frequency floor will eliminate many OCR errors, but does not reduce the rates of MI and LL false positives. Assessing the potential of two post-OCR correction methods, I find that VARD, a program designed to standardise natural spelling variation, proves unpromising for dealing with OCR errors. By contrast, Overproof, a commercial system designed for OCR errors, is effective, and its application leads to substantial improvements in the reliability of MI and LL, particularly for large spans. The second set of issues relates to the effectiveness of approaches to analysing the discourses surrounding place-names in digitised nineteenth-century newspapers. I single out three approaches to identifying place-names mentioned in large amounts of text without the need for a geo-parser system. The first involves relying on USAS, a semantic tagger, which has a 'Z2' tag for geographic names. This approach cannot identify multi-word place-names, but is scalable. A difficulty is that frequency counts of place-names do not account for their possible polysemy; I suggest a procedure involving reading a random sample of concordance lines for each place-name, in order to obtain an estimate of the actual number of mentions of that place-name in reference to a specific place. This method is best used to identify the most frequent place-names. A second, related, approach is to automatically compare a list of words tagged 'Z2' with a gazetteer, a reference list of place-names. This method, however, suffers from the same difficulties as the previous one, and is best used when accurate frequency counts are not required.
A third approach involves starting from a principled, text-external list of place-names, such as a population table, then attempting to locate each place in the set of texts. The scalability of this method depends on the length of the list of place-names, but it can accommodate any quantity of text. Its advantage over the two other methods is that it helps to contextualise the findings and can help identify place-names which are not mentioned in the texts. Finally, I consider two approaches to investigating the discourses surrounding place-names in large quantities of text. Both are scalable operationalisations of proximity-based collocation. The first approach starts with the whole corpus, searching for the place-name of interest and generating a list of statistical collocates of the place-name; these collocates can then be further categorised and analysed via concordance analysis. The second approach starts with small samples of concordance lines for the place-name of interest, and involves analysing these concordance lines to develop a framework for describing the phraseologies within which place-names are mentioned. Both methods are useful and scalable; the findings they yield are, to some extent, overlapping, but also complementary. This suggests that both methods may be fruitfully used together, albeit neither is ideally suited for comparing results across corpora. Both approaches are well suited for exploratory research.
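
    The two collocation statistics assessed in this thesis, Mutual Information (MI) and Log-Likelihood (LL), can be sketched from a 2x2 contingency table for a node word (here, a place-name) and a candidate collocate. This is one common formulation (pointwise MI in base 2 and Dunning-style LL); the thesis's exact window-based expected-frequency calculation may differ, and the example counts below are invented.

        # Minimal sketch of the two statistics from a 2x2 contingency table; the counts
        # and the place-name "Preston" below are invented for illustration.
        import math

        def collocation_stats(cooccurrences, node_freq, coll_freq, corpus_size):
            """Return (MI, LL) for a node word (e.g. a place-name) and a collocate.

            cooccurrences -- times the collocate appears within the chosen span of the node
            node_freq     -- corpus frequency of the node word
            coll_freq     -- corpus frequency of the candidate collocate
            corpus_size   -- total number of tokens in the corpus
            """
            observed = [
                cooccurrences,                                        # node with collocate
                node_freq - cooccurrences,                            # node without collocate
                coll_freq - cooccurrences,                            # collocate without node
                corpus_size - node_freq - coll_freq + cooccurrences,  # neither
            ]
            rows = [node_freq, corpus_size - node_freq]
            cols = [coll_freq, corpus_size - coll_freq]
            expected = [rows[i // 2] * cols[i % 2] / corpus_size for i in range(4)]

            mi = math.log2(observed[0] / expected[0])                 # Mutual Information
            ll = 2 * sum(o * math.log(o / e)                          # Log-Likelihood (G2)
                         for o, e in zip(observed, expected) if o > 0)
            return mi, ll

        # "Preston" occurs 500 times, "market" 2,000 times, and they co-occur 60 times
        # in a 10-million-token newspaper corpus:
        print(collocation_stats(60, 500, 2000, 10_000_000))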

    An open-source OCR evaluation tool
