
    06491 Abstracts Collection -- Digital Historical Corpora - Architecture, Annotation, and Retrieval

    From 03.12.06 to 08.12.06, the Dagstuhl Seminar 06491 "Digital Historical Corpora - Architecture, Annotation, and Retrieval" was held in the International Conference and Research Center (IBFI), Schloss Dagstuhl. During the seminar, several participants presented their current research, and ongoing work and open problems were discussed. Abstracts of the presentations given during the seminar, as well as abstracts of seminar results and ideas, are put together in this paper. The first section describes the seminar topics and goals in general. Links to extended abstracts or full papers are provided where available.

    Analyzing and Improving the Quality of a Historical News Collection using Language Technology and Statistical Machine Learning Methods

    In this paper, we study how to analyze and improve the quality of a large historical newspaper collection. The National Library of Finland has digitized millions of newspaper pages, but the quality of the OCR output is limited, especially for the oldest parts of the collection. Approaches such as crowd-sourcing have been used in this field to improve the quality of the texts, but in this case the volume of the material makes it impossible to manually edit any substantial proportion of the texts. We therefore experiment with quality evaluation and improvement methods based on corpus statistics, language technology, and machine learning in order to find ways to automate the analysis and improvement process. The final objective is to achieve a clear reduction in the human effort needed in the post-processing of the texts. We present quantitative evaluations of the current quality of the corpus, describe challenges related to texts written in a morphologically complex language, and describe two different approaches to achieving quality improvements. Peer reviewed.
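The abstract does not spell out the paper's metrics, but one simple corpus-statistics signal of the kind it describes is the share of OCR tokens found in a reference word list. The sketch below is a minimal illustration of that idea, assuming a Finnish vocabulary is available; the toy vocabulary and sample sentence are invented, and for a morphologically complex language such as Finnish a plain word list is only a rough proxy for OCR quality.

```python
# Minimal sketch (assumption, not the paper's method): estimate OCR quality of a
# page as the fraction of alphabetic tokens found in a reference vocabulary.
import re
from typing import Set

def load_wordlist(path: str) -> Set[str]:
    """Load a reference vocabulary, one word per line (file path is hypothetical)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def dictionary_match_rate(page_text: str, vocabulary: Set[str]) -> float:
    """Fraction of alphabetic tokens that appear in the reference vocabulary."""
    tokens = [t.lower() for t in re.findall(r"[^\W\d_]+", page_text, re.UNICODE)]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in vocabulary)
    return hits / len(tokens)

if __name__ == "__main__":
    # Toy vocabulary and sentence, for illustration only.
    vocab = {"suomen", "kansalliskirjasto", "on", "digitoinut", "miljoonia"}
    sample = "Suomen Kansalliskirjasto on digitoinut miljoonia sanomalehtisivuja."
    print(f"match rate: {dictionary_match_rate(sample, vocab):.2%}")
```

In practice such a rate would be one feature among several (character n-gram statistics, language-model perplexity, learned classifiers), which is where the paper's machine-learning component comes in.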

    Digital Methods in the Humanities: Challenges, Ideas, Perspectives

    Digital Humanities is a transformational endeavor that changes not only the perception, storage, and interpretation of information but also research processes and questions. It also prompts new ways of interdisciplinary communication between humanities scholars and computer scientists. This volume offers a unique perspective on digital methods for and in the humanities. It comprises case studies from various fields that illustrate the challenge of matching existing textual research practices and digital tools. Problems and solutions with and for training tools, as well as the adjustment of research practices, are presented and discussed with an interdisciplinary focus.

    The Taming of the Shrew - non-standard text processing in the Digital Humanities

    Natural language processing (NLP) has focused on the automatic processing of newspaper texts for many years. With the growing importance of text analysis in areas such as spoken language understanding, social media processing, and the interpretation of text material from the humanities, techniques and methodologies have to be reviewed and redefined, since so-called non-standard texts pose challenges at the lexical and syntactic level, especially for machine-learning-based approaches. Automatic processing tools developed on the basis of newspaper texts show decreased performance on texts with divergent characteristics. Digital Humanities (DH), a field that has risen to prominence in recent decades, offers many examples of such texts. The computational analysis of the relationships between Shakespeare's dramatic characters, for instance, requires processing tools adjusted to 16th-century English texts in dramatic form; likewise, the investigation of narrative perspective in Goethe's ballads calls for methods that can handle 18th-century German verse.

    In this dissertation, we put forward a methodology for NLP in a DH environment. We investigate how an interdisciplinary context, in combination with the specific goals of a project, influences the general NLP approach. We suggest thoughtful collaboration and increased attention to the easy applicability of the resulting tools as a remedy for differences in the store of knowledge between project partners. Projects in DH are not only constituted by the automatic processing of texts but are usually framed by the investigation of a research question from the humanities. As a consequence, time limitations complicate the successful implementation of analysis techniques, especially since the diversity of texts impairs the transferability and reusability of tools beyond a specific project. We address this with modular and thus easily adjustable project workflows and system architectures.

    Several case studies illustrate our methodology at different levels. We discuss modular architectures that balance time-saving solutions and problem-specific implementations, using the example of automatic post-correction of the output of an optical character recognition system. We address the problem of data diversity and low-resource situations by investigating different approaches to non-standard text processing. We examine two main techniques: text normalization and tool adjustment. Text normalization aims at transforming non-standard text so as to assimilate it to the standard, whereas tool adjustment works in the opposite direction, enabling tools to successfully handle a specific kind of text. We focus on the task of part-of-speech tagging to illustrate various approaches to the processing of historical texts as an instance of non-standard texts, and we discuss how the level of deviation from the standard form influences the performance of the different methods. Our approaches shed light on the importance of data quality and quantity and emphasize the indispensability of annotations for effective machine learning. In addition, we highlight the advantages of problem-driven approaches in which the purpose of a tool is clearly formulated by the research question.

    Another significant outcome of this work is a summary of the experience and knowledge gained through collaborative projects between computer scientists and humanists. We reflect on various aspects of the elaboration and formalization of research questions in the DH and assess the limitations and possibilities of computationally modeling humanistic research questions. An emphasis is placed on the interplay between expert knowledge of the subject under investigation and the implementation of tools for that purpose, and on the resulting advantages, such as the targeted improvement of digital methods through purposeful manual correction and error analysis. We point out obstacles and opportunities and outline prospects and directions for future development in this area of interdisciplinary research.
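Of the two techniques the dissertation compares, text normalization is the easier one to sketch: map historical spellings to modern forms and then hand the normalized text to a standard tagger. The lexicon entries and character rules below are illustrative stand-ins for Early Modern German-style variants, not the dissertation's actual resources.

```python
# Hypothetical sketch of the normalization direction: lexicon lookup first,
# then simple character rewrite rules, applied before standard POS tagging.
import re

NORMALIZATION_LEXICON = {   # illustrative historical -> modern spellings
    "vnnd": "und",
    "vnd": "und",
    "seyn": "sein",
    "thun": "tun",
}

CHARACTER_RULES = [          # (pattern, replacement), applied in order
    (re.compile(r"^v(?=[nrl])"), "u"),   # word-initial 'v' before n/r/l -> 'u'
    (re.compile(r"th"), "t"),            # 'th' -> 't'
]

def normalize_token(token: str) -> str:
    lower = token.lower()
    if lower in NORMALIZATION_LEXICON:
        return NORMALIZATION_LEXICON[lower]
    for pattern, repl in CHARACTER_RULES:
        lower = pattern.sub(repl, lower)
    return lower

def normalize(text: str) -> str:
    return " ".join(normalize_token(t) for t in text.split())

if __name__ == "__main__":
    print(normalize("vnnd er wolte es thun"))  # -> "und er wolte es tun"
```

Tool adjustment, the contrary direction, would instead retrain or adapt the tagger itself on annotated non-standard data rather than changing the text.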

    Does color modalities affect handwriting recognition? An empirical study on Persian handwritings using convolutional neural networks

    Most handwriting recognition methods in the literature are developed and evaluated on black-and-white (BW) image databases. In this paper we try to answer a fundamental question in document recognition: using Convolutional Neural Networks (CNNs) as an eye simulator, we investigate whether the color modality of handwritten digits and words affects their recognition accuracy or speed. To the best of our knowledge, this question has so far not been answered, owing to the lack of handwriting databases that contain all three color modalities. To answer it, we selected 13,330 isolated digits and 62,500 words from a novel Persian handwriting database that covers three different color modalities and is unique in terms of size and variety. The selected datasets were divided into training, validation, and testing sets, and similar conventional CNN models were trained on the training samples. While the experimental results on the testing set show that the CNN performs better on the BW digit and word images than on the other two color modalities, in general there are no significant differences in network accuracy across the color modalities. Comparisons of training times in the three color modalities also show that recognizing handwritten digits and words in BW images with a CNN is much more efficient.
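The paper's exact architecture is not given in the abstract; the sketch below only illustrates the experimental setup it implies, namely training the same conventional CNN on 1-channel (BW/grayscale) and 3-channel (color) inputs and comparing accuracy and training time. The layer sizes and the 32x32 input resolution are assumptions made for the example.

```python
# Hypothetical sketch: a small CNN whose input layer is parameterized by the
# number of color channels, so the same model can be compared across modalities.
import torch
import torch.nn as nn

class DigitCNN(nn.Module):
    def __init__(self, in_channels: int = 1, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                 # 32x32 -> 16x16 (assuming 32x32 input)
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                 # 16x16 -> 8x8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

if __name__ == "__main__":
    bw_model = DigitCNN(in_channels=1)       # black-and-white / grayscale input
    rgb_model = DigitCNN(in_channels=3)      # color input
    bw_batch = torch.randn(8, 1, 32, 32)
    rgb_batch = torch.randn(8, 3, 32, 32)
    print(bw_model(bw_batch).shape, rgb_model(rgb_batch).shape)  # both (8, 10)
```

Keeping everything but the input channel count identical is what makes an accuracy and training-time comparison across modalities meaningful.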

    Melhorando a precisão do reconhecimento de texto usando técnicas baseadas em sintaxe (Improving text recognition accuracy using syntax-based techniques)

    Advisors: Guido Costa Souza de Araújo, Marcio Machado Pereira. Master's dissertation, Universidade Estadual de Campinas, Instituto de Computação.
    Abstract: Due to the large amount of visual information available today, text detection and recognition in natural scene images have begun to gain importance. The goal of these tasks is to locate the regions of an image that contain text and to recognize it; they are typically divided into two parts: text detection and text recognition. Although the techniques for solving this problem have improved in recent years, the excessive use of hardware resources and the corresponding high computational costs significantly affect the execution of such tasks on highly constrained embedded systems (e.g., cellphones and smart TVs). Although there are text detection and recognition methods that run on such systems, they do not perform well when compared to state-of-the-art solutions on other computing platforms. And while various post-correction methods currently improve results on scanned historical documents, little effort has gone into applying them to results from natural scene images. In this work, we explored a set of post-correction methods and proposed new heuristics to improve results on scene images, using the Tesseract text recognition software as a prototyping base. We analyzed the main error-correction methods available in the literature and found that the best combination included substitution, elimination of the last characters, and compounding. In addition, results improved when we introduced a new heuristic based on the frequency with which candidate results appear in frequency databases for categories such as magazines, newspapers, fiction texts, the web, etc. To locate errors and avoid overcorrection, we considered different restrictions obtained from the Tesseract training database, and selected as the best restriction the certainty of the best result obtained by Tesseract. The experiments were carried out on seven databases used in text recognition and text detection/recognition competitions. On all databases, for both training and test data, the results of Tesseract with the proposed post-correction method improved considerably over the results obtained with Tesseract alone.
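As an illustration of the heuristics described above, the sketch below generates candidate corrections by character substitution, trailing-character elimination, and compound splitting, reranks them by corpus frequency, and only fires when the recognizer's word confidence falls below a threshold (to avoid overcorrection). The confusion pairs, frequency table, and threshold are invented for the example and are not the thesis's actual resources or Tesseract API calls.

```python
# Hypothetical sketch of frequency-based OCR post-correction with a confidence gate.
from typing import Dict, List

CONFUSIONS = {"0": "o", "1": "l", "5": "s", "rn": "m", "vv": "w"}  # illustrative pairs

def candidates(word: str) -> List[str]:
    cands = {word}
    for wrong, right in CONFUSIONS.items():           # substitution
        if wrong in word:
            cands.add(word.replace(wrong, right))
    for k in (1, 2):                                  # eliminate trailing characters
        if len(word) > k + 2:
            cands.add(word[:-k])
    for i in range(2, len(word) - 1):                 # compound split
        cands.add(word[:i] + " " + word[i:])
    return list(cands)

def correct(word: str, confidence: float, freq: Dict[str, int],
            threshold: float = 0.80) -> str:
    """Only rerank when the recognizer is unsure, to avoid overcorrection."""
    if confidence >= threshold:
        return word
    def score(cand: str) -> int:
        return min(freq.get(part.lower(), 0) for part in cand.split())
    return max(candidates(word), key=score)

if __name__ == "__main__":
    toy_freq = {"house": 5000}                        # toy frequency table
    print(correct("h0use", confidence=0.42, freq=toy_freq))  # -> "house"
```

A real system would draw the frequencies from the per-category corpora the abstract mentions (magazines, newspapers, fiction, web) and take the confidence value from the recognizer's own output.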

    Rapid Resource Transfer for Multilingual Natural Language Processing

    Until recently, the focus of the Natural Language Processing (NLP) community has been on a handful of mostly European languages. However, the rapid changes taking place in the economic and political climate of the world precipitate a similar change in the relative importance given to various languages. The importance of rapidly acquiring NLP resources and computational capabilities in new languages is widely accepted. Statistical NLP models have a distinct advantage over rule-based methods in achieving this goal, since they require far less manual labor. However, statistical methods require two fundamental resources for training: (1) online corpora and (2) manual annotations. Creating these two resources can be as difficult as porting rule-based methods. This thesis demonstrates the feasibility of acquiring both corpora and annotations by exploiting existing resources for well-studied languages: basic resources for new languages can be acquired in a rapid and cost-effective manner by utilizing existing resources cross-lingually. Currently, the most viable method of obtaining online corpora is converting existing printed text into electronic form using Optical Character Recognition (OCR). Unfortunately, a language that lacks online corpora most likely lacks OCR as well. We tackle this problem by taking an existing OCR system that was designed for a specific language and using it for a language with a similar script. We present a generative OCR model that allows us to post-process the output of a non-native OCR system to achieve accuracy close to, or better than, that of a native one. Furthermore, we show that the performance of a native or trained OCR system can be improved by the same method. Next, we demonstrate the cross-utilization of annotations on treebanks. We present an algorithm that projects dependency trees across parallel corpora and show that a reasonable-quality treebank can be generated by combining projection with a small amount of language-specific post-processing. The projected treebank allows us to train a parser that performs comparably to a parser trained on manually generated data.
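The thesis's projection algorithm is not spelled out in the abstract; the sketch below shows only the core idea of direct projection under a 1-to-1 word alignment, copying each source dependency edge onto the aligned target tokens. Handling unaligned or many-to-many tokens and the language-specific post-processing the abstract mentions are left out, so this is an assumption-laden simplification rather than the thesis method.

```python
# Hypothetical sketch: project dependency heads across a 1-to-1 word alignment.
from typing import Dict, List, Optional

def project_dependencies(source_heads: List[int],
                         alignment: Dict[int, int]) -> List[Optional[int]]:
    """source_heads[i] is the head of source token i (-1 marks the root).
    alignment maps source token index -> target token index (1-to-1).
    Returns projected head indices on the target side (None if unrecoverable)."""
    target_len = max(alignment.values()) + 1 if alignment else 0
    target_heads: List[Optional[int]] = [None] * target_len
    for src_dep, src_head in enumerate(source_heads):
        if src_dep not in alignment:
            continue                                   # unaligned token: skip
        tgt_dep = alignment[src_dep]
        if src_head == -1:
            target_heads[tgt_dep] = -1                 # root stays root
        elif src_head in alignment:
            target_heads[tgt_dep] = alignment[src_head]
    return target_heads

if __name__ == "__main__":
    # "She reads books": 'reads' is the root, 'She' and 'books' attach to it.
    source_heads = [1, -1, 1]
    alignment = {0: 0, 1: 1, 2: 2}                     # toy monotone alignment
    print(project_dependencies(source_heads, alignment))  # [1, -1, 1]
```

The gaps left as None are exactly where the small amount of language-specific post-processing described in the abstract would come in.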