20 research outputs found

    Automated Error Detection in Digitized Cultural Heritage Documents

    The work reported in this paper aims at performance optimization in the digitization of documents pertaining to the cultural heritage domain. A hybrid method is proposed, combining statistical classification algorithms and linguistic knowledge to automate post-OCR error detection and correction. The current paper deals with the integration of linguistic modules and their impact on error detection.

    Plague Dot Text: Text mining and annotation of outbreak reports of the Third Plague Pandemic (1894-1952)

    The design of models that govern diseases in populations is commonly built on information and data gathered from past outbreaks. However, epidemic outbreaks are never captured in statistical data alone but are communicated by narratives, supported by empirical observations. Outbreak reports discuss correlations between populations, locations and the disease to infer insights into causes, vectors and potential interventions. The problem with these narratives is usually the lack of consistent structure or strong conventions, which prohibits their formal analysis in larger corpora. Our interdisciplinary research investigates more than 100 reports from the third plague pandemic (1894-1952), evaluating ways of building a corpus to extract and structure this narrative information through text mining and manual annotation. In this paper we discuss the progress of our ongoing exploratory project, how we enhance optical character recognition (OCR) methods to improve text capture, and our approach to structuring the narratives and identifying relevant entities in the reports. The structured corpus is made available via Solr, enabling search and analysis across the whole collection for future research dedicated, for example, to the identification of concepts. We show preliminary visualisations of the characteristics of causation and differences with respect to gender as a result of syntactic-category-dependent corpus statistics. Our goal is to develop structured accounts of some of the most significant concepts that were used to understand the epidemiology of the third plague pandemic around the globe. The corpus enables researchers to analyse the reports collectively, allowing for deep insights into the global epidemiological consideration of plague in the early twentieth century. Comment: Journal of Data Mining & Digital Humanities 202

    New technologies for Old Germanic: resources and research on parallel bibles in Older Continental Western Germanic

    We provide an overview of ongoing efforts to facilitate the study of older Germanic languages currently pursued at the Goethe-University Frankfurt, Germany. We describe created resources, such as a parallel corpus of Germanic Bibles and a morphosyntactically annotated corpus of Old High German (OHG) and Old Saxon, a lexicon of OHG in XML and a multilingual etymological database. We discuss NLP algorithms operating on this data, and their relevance for research in the Humanities. RDF and Linked Data represent new and promising aspects in our research, currently applied to establish cross-references between etymological dictionaries, to infer new information from their symmetric closure, and to formalize linguistic annotations in a corpus and grammatical categories in a lexicon in an interoperable way.

    Measuring Greekness: A novel computational methodology to analyze syntactical constructions and quantify the stylistic phenomenon of Attic oratory

    This study is the result of a compilation and interpretation of data that derive from Classical studies, but are studied and analyzed using computational linguistics, Treebank annotation, and the development and post-processing of metrics. More specifically, the purpose of this work is to employ computational methods so as to analyze a particular form of Ancient Greek language that is Attic Greek, “measure” its attributes, and explore the socio-political connotations that its usage had in the era of the High Roman Empire. During the first centuries CE, the landscape of the Roman Empire is polyvalent. It consists of native Romans who can be fluent in Latin and Greek, Greeks who are Roman citizens, other easterners who are potentially trilingual and have also assumed Roman citizenship, and even Christians, who identify themselves as Roman citizens but with a different religious identity. It comes as no surprise that language is politicized, and identity, both individual and civic, is constantly reshaped through it. The question I attempt to answer is whether we can quantify Greekness of native and bilingual speakers based on an analytic computational study of Attic dialect. Chapter 1 provides a discussion of the three aforementioned scholarly fields, which were pertinent for the study. I present the precepts of computational linguistics, corpus linguistics, and digital humanities so as to further explicate what prompts this work and how the confluence of three methodologies significantly enhances our apprehension of the issue at hand. In Chapter 2, I approach Greekness, Latinity, and Atticism through the writings of Greek and Roman grammarians and lexicographers and provide the complete list of all the occurrences of the aforementioned notions. Chapters 3 and 4 explicate further the reasoning behind the usage of the Perseids framework and the Prague annotation system. 
    They then proceed to relate the metrics developed, the computational methods, and their subsequent visualization to quantify and objectify the previously purely theoretical inferences. The metric system was developed after careful consideration of the stylistic attributes of Ancient Greek. Therefore, each metric “measures” something pertinent in the formation of the language. The visualizations then afford us a more understandable and interpretable format of the numerical results. For philologists, it is interesting to view the graphic presentation of humanistic ideas, and for computer scientists, to see the applicability of their methods to a topic that is predominantly philological and social. Finally, chapter 5 recontextualizes the numerical results and their interpretations, as obtained in chapters 3 and 4, and thus sets the parameters necessary to discuss them in conjunction with readings of literary texts of the period of the High Empire. My intention is to show how numbers are “translated” into a different “language,” the language of the humanist.

    Table of contents:
    Acknowledgments
    Chapter 1: Introduction
        1.1 Focus of the Study
        1.2 Classical Studies and Digital Humanities
        1.3 Corpus Linguistics
        1.4 Humanities Corpus and Corpus Linguistics
        1.5 Synopsis of the Project
    Chapter 2: Linguistic Purity as Ethnic and Educational Marker, or Greek and Roman Grammarians on Greek and Latin.
        2.1 Introduction
        2.2 Grammatical and Lexicographic Definitions
            2.2.1 Greek and Latin languages
            2.2.2 Grammatici Graeci
            2.2.3 Grammatici Latini.
        2.3 Greek and Attic in Greek Lexicographers
        2.4 Conclusion
    Chapter 3: Attic Oratory and its Imperial Revival: Quantifying Theory and Practice
        3.1 Introduction
        3.2 Atticism: Definition and Redefinitions
        3.3 Significance of Enhanced Linguistic and Computational Analysis of Atticism
            3.3.1 The Perseids Project, the Prague Mark-up Language, and Dependency Grammar
        3.4 Evaluating Atticism
            3.4.1 Dionysius’s of Halicarnassus Theoretical Framework
        3.5 Methods: Computational Quantification of Rhetorical Styles
            3.5.1 The Perseids 1.5 ALDT Schema
            3.5.2 Node-based Sentence Metrics
            3.5.3 Computer Implementation
        3.6 Conclusion
    Chapter 4: Experimental results, Analysis, and Topological Haar Wavelets
        4.1 Introduction
        4.2 Experimental Results
        4.3 Data Visualization
        4.4 Topological Metric Wavelets for Syntactical Quantification
            4.4.1 Wavelets
            4.4.2 Topological Metrics using Wavelets
            4.4.3 Experimental Results
        4.5 Conclusion
    Chapter 5: «Γαλάτης ὢν ἑλληνίζειν»: Greekness, Latinity, and Otherness in the World of the High Empire.
        5.1 Introduction
        5.2 The Multiethnical Constituents of an Imperial Citizen: Anacharsis, Favorinus, and Dionysius’s of Halicarnassus Ethnography.
        5.3 Conclusion
    Chapter 6: Conclusion
    References
    Appendix
    Curriculum Vitae
    Dissertation related Publications
    Declaration of Authorship (Selbständigkeitserklärung)

    Designing a Library of Components for Textual Scholarship

    This work addresses and describes topics related to the application of new technologies, computational methodologies, and software design aimed at developing innovative tools for the Digital Humanities (DH), a field of study characterized by strong interdisciplinarity and continuous evolution. In particular, this contribution defines some specific requirements relating to the domain of Literary Computing and the field of Digital Textual Scholarship. Accordingly, the main processing context concerns documents written in Latin, Greek, and Arabic, as well as texts in modern languages containing historical and philological topics. The research focuses on the design of a modular library (TSLib) able to operate on sources of high cultural value in order to edit, process, compare, analyze, visualize, and search them. The thesis is organized into five chapters. Chapter 1 summarizes the context of the application domain and provides an overview of the objectives and benefits of the research. Chapter 2 presents some important related works and initiatives, together with a brief overview of the most significant results obtained in the DH field. Chapter 3 carefully retraces and motivates the design process that was developed. It begins with a description of the technical principles adopted and shows how they are applied to the domain of interest. The chapter continues by defining the requirements, the architecture, and the model of the proposed method. Aspects concerning design patterns and the design of Application Programming Interfaces (APIs) are thus highlighted and discussed. The final part of the work (chapter 4) presents the results obtained from concrete research projects that, on the one hand, contributed to the design of the library and, on the other, were able to exploit its developments. Several topics are discussed: (a) text acquisition and encoding, (b) alignment and management of textual variants, (c) multi-level annotations. The thesis concludes with some reflections and considerations, also indicating possible directions for future investigation (chapter 5).