84 research outputs found

    Digitale Korpora in der Lehre - Anwendungsbeispiele aus der Theoretischen Linguistik und der Computerlinguistik

    Morphological and Part-of-Speech Tagging of Historical Language Data: A Comparison

    ANNIS: a linguistic database for exploring information structure

    In this paper, we discuss the design and implementation of our first version of the database "ANNIS" (ANNotation of Information Structure). For research based on empirical data, ANNIS provides a uniform environment for storing this data together with its linguistic annotations. A central database promotes standardized annotation, which facilitates interpretation and comparison of the data. ANNIS is used through a standard web browser and offers tier-based visualization of data and annotations, as well as search facilities that allow for cross-level and cross-sentential queries. The paper motivates the design of the system, characterizes its user interface, and provides an initial technical evaluation of ANNIS with respect to data size and query processing.

    Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin

    In this paper we describe a dataset of German and Latin ground truth (GT) for historical OCR in the form of printed text line images paired with their transcription. This dataset, called GT4HistOCR, consists of 313,173 line pairs covering a wide period of printing dates, from 15th-century incunabula to 19th-century books printed in Fraktur types, and is openly available under a CC-BY 4.0 license. The special form of GT as line image/transcription pairs makes it directly usable to train state-of-the-art recognition models for OCR software employing recurrent neural networks in LSTM architecture such as Tesseract 4 or OCRopus. We also provide some pretrained OCRopus models for subcorpora of our dataset yielding between 95% (early printings) and 98% (19th century Fraktur printings) character accuracy rates on unseen test cases, a Perl script to harmonize GT produced by different transcription rules, and give hints on how to construct GT for OCR purposes which has requirements that may differ from linguistically motivated transcriptions. Comment: Submitted to JLCL Volume 33 (2018), Issue 1: Special Issue on Automatic Text and Layout Recognition

    Abstract pronominal anaphors and label nouns in German and English: Selected case studies and quantitative investigations

    Abstract anaphors refer to abstract referents, such as facts or events. This paper presents a corpus-based comparative study of German and English abstract anaphors. Parallel bi-directional texts from the Europarl Corpus were annotated with functional and morpho-syntactic information, focusing on the pronouns ‘it’, ‘this’, and ‘that’, as well as demonstrative noun phrases headed by “label nouns”, such as ‘this event’, ‘that issue’, etc., and their German counterparts. We induce information about the cross-linguistic realization of abstract anaphors from the parallel texts. The contrastive findings are then controlled for translation-specific characteristics by examination of the differences between the original text and the translated text in each of the languages. In selected case studies, we investigate in detail “translation mismatches”, including changes in grammatical category (from pronouns to full noun phrases, and vice versa), grammatical function, or clausal position, addition or omission of modifying adjectives, changes in the lexical realization of head nouns, and transpositions of the demonstrative determiner. In some of these cases, the specificity of the abstract noun phrase is altered by the translation process

    ReM für Mediävist*innen. Perspektiven des Referenzkorpus Mittelhochdeutsch (1050–1350) für germanistisch-mediävistische Fragestellungen

    In this contribution we illustrate how the historical reference corpora can be used for research questions in German medieval studies. We do so by means of three example research questions, for which we analyze the Referenzkorpus Mittelhochdeutsch (Reference Corpus of Middle High German): (i) the ascription of characteristics through attributes attached to personal names; (ii) personification; (iii) metaphorization. The contribution shows how the Referenzkorpus Mittelhochdeutsch and its annotations (lemma, part of speech) can be searched with the corpus search tool ANNIS and how the resulting hits can also be evaluated quantitatively.

    Metaphors of Religion

    The CRC studies the role of metaphor in religious meaning-making. In metaphors, meaning is transferred from one semantic domain to another. By adopting conceptual metaphor theory, the CRC seeks to more thoroughly understand this process and to research its semantic forms empirically and comparatively. Through its multidisciplinary subprojects the CRC contributes to the historiography and the comparative study of religions. It covers various religious traditions from across the globe, working with texts from multiple languages and diverse genres, dating from 3,000 BCE to the present. To enable comparability and interoperability between its extremely heterogeneous subprojects, the CRC deliberately puts emphasis on a shared digital infrastructure (data repository, annotation tool, conceptual thesaurus), provided by the information infrastructure (INF) project. Utilizing this infrastructure, the subprojects annotate religious texts not only to mark the presence of metaphors, but also to include complex analysis of the structural functioning of the metaphor and the resulting domain mappings. A contribution to the 9. Tagung des Verbands "Digital Humanities im deutschsprachigen Raum" - DHd 2023 Open Humanities Open Culture.