
2 research outputs found

Uncertainty in the production of Czech noun and verb forms

Authors: Bermel N., Knittl L., Nikolaev A.
Publication venue: Edinburgh University Press

We examine the reactions of Czech native speakers to cues asking them to supply inflectional forms of nouns and verbs that are either canonical (non-variant), overabundant, or supposedly defective, to see what distinguishing characteristics these three conditions have for production. We find that respondents handle defective material differently from the other conditions, producing different sorts of forms at different frequencies and taking significantly longer to do so. Overabundant cells pattern at the individual level like canonical inflectional cells, but collectively display a significantly more varied and less focused spread of forms produced than our canonical cells. The individual dimension of uncertainty in production is thus limited to defective cells, but the collective dimension of uncertainty is evident across all three conditions.

White Rose Research Online

Czech Web Corpus 2017 (csTenTen17)

Author: Suchomel Vít
Publication venue: Lexical Computing CZ s.r.o.
Publication date: 07/12/2018

The Czech Web Corpus 2017 (csTenTen17) is a Czech corpus made up of texts collected from the Internet, mostly from the Czech national top-level domain ".cz". The data was crawled by the web crawler SpiderLing (https://corpus.tools/wiki/SpiderLing). It was cleaned by removing boilerplate (using https://corpus.tools/wiki/Justext), removing near-duplicate paragraphs (using https://corpus.tools/wiki/Onion) and discarding paragraphs not in the target language. The corpus was POS-annotated by the morphological analyser Majka using this POS tagset: https://www.sketchengine.eu/tagset-reference-for-czech/. Text sources: general web, Wikipedia. Time span of crawling: May, October and November 2017; October and November 2016; October and November 2015. The Czech Wikipedia part was downloaded in November 2017. Data format: plain text, vertical (one token per line), gzip compressed. The vertical contains the following structures: documents (usually corresponding to web pages), paragraphs, sentences and word join markers (a "glue" tag indicating that there was no space between the surrounding tokens in the original text). Document metadata: src (the source of the data), title (the title of the web page), url (the URL of the document), crawl_date (the date of downloading the document). Paragraph metadata: heading ("1" if the paragraph is a heading, usually corresponding to heading elements in the original HTML data). Block elements in the case of an HTML source, or double blank lines in the case of other source formats, were used as paragraph separators. An internal heuristic tool was used to mark sentence breaks. The tab-separated positional attributes are: word form, morphological annotation, lem-POS (the base form of the word, i.e. the lemma, with a part-of-speech suffix) and gender-respecting lemma (nouns and adjectives only). Please cite the following paper when using the corpus for your research: Suchomel, Vít. csTenTen17, a Recent Czech Web Corpus. In Recent Advances in Slavonic Natural Language Processing, pp. 111–123. 2018. (https://nlp.fi.muni.cz/raslan/raslan18.pdf#page=119)
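The vertical format described above could be read along the following lines. This is a minimal sketch and not an official reader for the corpus. It assumes the structure tags are spelled `<doc ...>`, `<p ...>`, `<s>` and `<g/>`, as is conventional for vertical files, although the tag names are not given in the record text itself. It also assumes the four positional attributes appear in the order listed in the description. The names `Token`, `parse_vertical`, `read_corpus` and `detokenize` are hypothetical helpers introduced here for illustration.

```python
import gzip
import re
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    word: str          # word form
    tag: str           # morphological annotation
    lempos: str        # lemma with a part-of-speech suffix
    gender_lemma: str  # gender-respecting lemma (may be empty)
    glued: bool        # True if no space preceded the token in the original text


# Assumed tag spellings: <doc ...>, <p ...>, <s>, <g/> (not confirmed by the record).
TAG_RE = re.compile(r"<(/?)(\w+)")
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_vertical(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per document: {"meta": {...}, "sentences": [[Token, ...], ...]}."""
    doc, sentence, glue = None, None, False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        # Structure lines carry no tab; token lines are tab-separated.
        if "\t" not in line and line.startswith("<") and line.endswith(">"):
            m = TAG_RE.match(line)
            name = m.group(2) if m else ""
            closing = bool(m and m.group(1))
            if name == "doc" and not closing:
                doc = {"meta": dict(ATTR_RE.findall(line)), "sentences": []}
            elif name == "doc" and closing:
                if doc is not None:
                    yield doc
                doc = None
            elif name == "s" and not closing:
                sentence = []
            elif name == "s" and closing:
                if doc is not None and sentence:
                    doc["sentences"].append(sentence)
                sentence = None
            elif name == "g":
                glue = True  # the next token attaches to the previous one
            continue
        fields = line.split("\t")
        fields += [""] * (4 - len(fields))  # pad missing trailing attributes
        if sentence is not None:
            sentence.append(Token(fields[0], fields[1], fields[2], fields[3], glue))
        glue = False


def read_corpus(path: str) -> Iterator[dict]:
    """Stream documents from a gzip-compressed vertical file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        yield from parse_vertical(fh)


def detokenize(sentence: List[Token]) -> str:
    """Rebuild the surface text of a sentence, honouring glue markers."""
    parts = []
    for i, tok in enumerate(sentence):
        if i > 0 and not tok.glued:
            parts.append(" ")
        parts.append(tok.word)
    return "".join(parts)
```

Streaming the file one line at a time keeps memory use flat, which matters for a web-scale corpus. Paragraph boundaries are ignored here for brevity; the `heading` attribute could be captured in the same way as the document metadata.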

LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University