8 research outputs found

    Corpus of the Czech language of the 2nd half of the 19th century

    The paper describes the principles and structure of the one-million-word DIA1900 Corpus built at the Institute of the Czech National Corpus (CNC) in Prague, focused on the language of Czech texts published in the years 1851 to 1900. The DIA1900 is planned for publication by June 2020 and is to be followed by the DIA1850, a corpus built around the same principles but focused on the first half of the 19th century. The DIA1900 observes both the balanced representation of the three major text types (belles lettres, journalistic texts, technical/scientific texts) and the system of morphological tagging implemented in the synchronic corpora included in the CNC project, thus facilitating the diachronic comparison of two stages in the development of Czech. A brief description is given of the structure of the morphological terminology used in the lemmatisation and tagging of the corpus, and of two tools designed to help search the 19th-century texts, whose fluctuating orthographic consistency is combined with the phonological and morphological variation characteristic of the language of the period: (1) a multiple select/suggest feature (reminding the user of the existence of non-standard orthographic and phonological variants of the lemma found in the corpus before the lemma search is started) and (2) the position attribute (informing the user of the ambiguous status of a word in the text, resulting from a misprint or misspelling, damaged page etc.).

    Lexicon of the early Old Czech prose

    The three main sources of this dissertation are three Old Czech translated works: Life of Christ the Lord, Passional and Life of Holy Fathers. All three belong to the earliest prose written in Old Czech. The object of research is their vocabulary, viewed from various angles. The analysis is based on a mutual comparison of the vocabulary of these three Old Czech texts. Attention is also given to the lexis of later paraphrases of Life of Christ the Lord and Passional, which, in comparison with the oldest preserved works, show the development of Old Czech during the 14th and 15th centuries. The examined lexis is also set against the developmental processes of Old Czech vocabulary. Special attention in particular texts is given to verb formation; the analysis also characterises emerging or already existing terminology; phraseology is treated separately; and lexemes that occur only once or sporadically (hapaxes, etc.) at the beginning of Old Czech written prose are not omitted either. The separately examined texts are placed in a more general cultural-historical framework.

    Challenges in Accessing Information in Digitized 19th-Century Czech Texts: Paper - iPRES 2012 - Digital Curation Institute, iSchool, Toronto

    This short paper describes problems arising in optical character recognition of, and information retrieval from, historical texts in languages with rich morphology, rather discontinuous lexical development and a long history of spelling reforms. In a work-in-progress manner, the problems and the proposed linguistic solutions are illustrated by a current project focused on improving access to digitized Czech prints from the 19th century and the first half of the 20th century.

    Diakorp v6: diachronic corpus of Czech

    A diachronic corpus of Czech comprising 3.45 million words (i.e. 4.1 million tokens). It contains 116 texts from the 14th to the 20th century. The texts are transcribed, not transliterated. Diakorp v6 is provided in a CoNLL-U-like vertical format used as input to the Manatee query engine. The data thus correspond to the corpus available to registered CNC users via the KonText query interface at http://www.korpus.c