84 research outputs found
ANNIS: a linguistic database for exploring information structure
In this paper, we discuss the design and implementation of our first version of the database "ANNIS" (ANNotation of Information Structure). For research based on empirical data, ANNIS provides a uniform environment for storing this data together with its linguistic annotations. A central database promotes standardized annotation, which facilitates interpretation and comparison of the data. ANNIS is used through a standard web browser and offers tier-based visualization of data and annotations, as well as search facilities that allow for cross-level and cross-sentential queries. The paper motivates the design of the system, characterizes its user interface, and provides an initial technical evaluation of ANNIS with respect to data size and query processing.
Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin
In this paper we describe a dataset of German and Latin ground truth
(GT) for historical OCR in the form of printed text line images paired with
their transcription. This dataset, called GT4HistOCR, consists of
313,173 line pairs covering a wide range of printing dates, from 15th-century
incunabula to 19th-century books printed in Fraktur types, and is
openly available under a CC-BY 4.0 license. The special form of GT as line
image/transcription pairs makes it directly usable to train state-of-the-art
recognition models for OCR software employing recurrent neural networks in LSTM
architecture such as Tesseract 4 or OCRopus. We also provide some pretrained
OCRopus models for subcorpora of our dataset yielding between 95% (early
printings) and 98% (19th century Fraktur printings) character accuracy rates
on unseen test cases, a Perl script to harmonize GT produced by different
transcription rules, and give hints on how to construct GT for OCR purposes,
which has requirements that may differ from linguistically motivated
transcriptions.
Comment: Submitted to JLCL Volume 33 (2018), Issue 1: Special Issue on
Automatic Text and Layout Recognition
Abstract pronominal anaphors and label nouns in German and English: Selected case studies and quantitative investigations
Abstract anaphors refer to abstract referents, such as facts or events. This paper presents a corpus-based comparative study of German and English abstract
anaphors. Parallel bi-directional texts from the Europarl Corpus were annotated
with functional and morpho-syntactic information, focusing on the pronouns "it",
"this", and "that", as well as demonstrative noun phrases headed by "label nouns",
such as "this event", "that issue", etc., and their German counterparts. We induce
information about the cross-linguistic realization of abstract anaphors from the
parallel texts. The contrastive findings are then controlled for translation-specific
characteristics by examination of the differences between the original text and the
translated text in each of the languages. In selected case studies, we investigate in
detail "translation mismatches", including changes in grammatical category (from
pronouns to full noun phrases, and vice versa), grammatical function, or clausal
position, addition or omission of modifying adjectives, changes in the lexical realization of head nouns, and transpositions of the demonstrative determiner. In
some of these cases, the specificity of the abstract noun phrase is altered by the
translation process.
ReM für Mediävist*innen. Perspektiven des Referenzkorpus Mittelhochdeutsch (1050–1350) für germanistisch-mediävistische Fragestellungen [ReM for Medievalists: Perspectives of the Reference Corpus of Middle High German (1050–1350) for Research Questions in German Medieval Studies]
In this contribution we illustrate how the historical reference corpora can be used for research questions in German medieval studies. We do so by means of three exemplary research questions, for which we analyze the Referenzkorpus Mittelhochdeutsch (Reference Corpus of Middle High German): (i) attribution of characteristics through the attributes attached to personal names; (ii) personification; (iii) metaphorization. The contribution shows how the Referenzkorpus Mittelhochdeutsch and its annotations (lemma, part of speech) can be searched with the corpus search tool ANNIS and how the resulting hits can also be evaluated quantitatively.
Metaphors of Religion
The CRC studies the role of metaphor in religious meaning-making. In metaphors, meaning is transferred from one semantic domain to another. By adopting conceptual metaphor theory, the CRC seeks to more thoroughly understand this process and to research its semantic forms empirically and comparatively. Through its multidisciplinary subprojects the CRC contributes to the historiography and the comparative study of religions. It covers various religious traditions from across the globe, working with texts from multiple languages and diverse genres, dating from 3,000 BCE to the present.
To enable comparability and interoperability between its extremely heterogeneous subprojects, the CRC deliberately puts emphasis on a shared digital infrastructure (data repository, annotation tool, conceptual thesaurus), provided by the information infrastructure (INF) project. Utilizing this infrastructure, the subprojects annotate religious texts not only to mark the presence of metaphors, but also to analyze in detail how each metaphor functions structurally and which domain mappings result.
A contribution to the 9th conference of the association "Digital Humanities im deutschsprachigen Raum" (DHd 2023): Open Humanities, Open Culture