36 research outputs found

    Attributing Authorship in the Noisy Digitized Correspondence of Jacob and Wilhelm Grimm

    This article presents the results of a multidisciplinary project aimed at better understanding the impact of different digitization strategies in computational text analysis. More specifically, it describes an effort to automatically discern the authorship of Jacob and Wilhelm Grimm in a body of uncorrected correspondence processed by HTR (Handwritten Text Recognition) and OCR (Optical Character Recognition), reporting on the effect this noise has on the analyses necessary to computationally identify the different writing styles of the two brothers. In summary, our findings show that OCR digitization serves as a reliable proxy for the more painstaking process of manual digitization, at least when it comes to authorship attribution. Our results suggest that attribution is viable even when using training and test sets from different digitization pipelines. With regard to HTR, this research demonstrates that even though automated transcription significantly increases the risk of text misclassification when compared to OCR, a cleanliness above ≈ 20% is already sufficient to achieve a higher-than-chance probability of correct binary attribution.

    La stylométrie comme outil pour la recherche de l’élaboration des chartes médiévales. Le cas de Cambrai au douzième siècle (1131-1200)

    Introduction: the development of the methodology. The recent digital turn in the humanities has led to the development of a range of software and methods. In medieval history and diplomatics in particular, these have focused on digital palaeography, which has raised questions about the possibilities and limits of computers. While digital palaeography has evolved into an almost independent discipline, the study of dictamen has developed...

    Challenging stylometry: The authorship of the baroque play La Segunda Celestina

    The aim of this study was to verify the possibility that Sor Juana Inés de la Cruz authored the anonymous part of the baroque play La Segunda Celestina, commissioned from Agustín de Salazar and left unfinished after his death. This is the first systematic stylometric study of this problem and of a baroque Hispanic American text. In our study, we faced the challenge of building a balanced corpus from the few available resources, and took extensive evaluation measures to deal with unclear stylometric signals. We use a variety of established attribution and verification methods, and introduce a novel evaluation procedure for examining historic texts with scarce corpora. The results support Sor Juana's authorship and reveal new connections between her and other authors of the time, showing the powerful, still underestimated, impact of her works on the epoch. The solutions adopted in solving the methodological problems of such a complex task show how stylometry can overcome similar challenges.

    Digitális Bölcsészet 2021


    Digitális Bölcsészet 2021/5 (Special Issue)

    Introducing the Kraków-based Computational Stylistics Group. The special issue was edited by Szemes Botond.

    Sistemas de reconocimiento de textos e impresos hispánicos de la Edad Moderna. La creación de unos modelos de HTR para la transcripción automatizada de documentos en gótica y redonda (s. XV-XVII)

    The work presents recent achievements in the field of text recognition, carried out in 2021 thanks to the collaboration between the following projects: Progetto Mambrino (Univ. of Verona), BIDISO (Univ. of A Coruña) and COMEDIC (Univ. of Zaragoza). Specifically, the first part of the article describes the state of the art of automatic transcription systems in relation to the recognition of printed texts of the Modern Age, the first experiences carried out with the Transkribus platform (READ Coop) and the preliminary results obtained. In the second part, we present two HTR models that allow the automatic transcription of early printed texts in gothic and round scripts of the Modern Age (15th-17th centuries). In two final appendices, the documents used for the creation of both models are described according to current typobibliographical standards.

    Attributing Authorship in the Noisy Digitized Correspondence of Jacob and Wilhelm Grimm

    No full text
    This article presents the results of a multidisciplinary project aimed at better understanding the impact of different digitization strategies in computational text analysis. More specifically, it describes an effort to automatically discern the authorship of Jacob and Wilhelm Grimm in a body of uncorrected correspondence processed by HTR (Handwritten Text Recognition) and OCR (Optical Character Recognition), reporting on the effect this noise has on the analyses necessary to computationally identify the different writing styles of the two brothers. In summary, our findings show that OCR digitization serves as a reliable proxy for the more painstaking process of manual digitization, at least when it comes to authorship attribution. Our results suggest that attribution is viable even when using training and test sets from different digitization pipelines. With regard to HTR, this research demonstrates that even though automated transcription significantly increases the risk of text misclassification when compared to OCR, a cleanliness above ≈ 20% is already sufficient to achieve a higher-than-chance probability of correct binary attribution.