Charles University
LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
2609 research outputs found
Uniform Meaning Representation 2.2
New version of UMR data used in the First Shared Task on UMR Parsing (including submitted systems' outputs)
CRAC 2026 Empty Nodes Baseline Model
The crac2026_empty_nodes_baseline is an XLM-RoBERTa-large–based multilingual model for the CRAC 2026 Empty Nodes Baseline system https://github.com/ufal/crac2026_empty_nodes_baseline for predicting empty nodes in input CoNLL-U files, trained on CorefUD 1.4 data. It was used to generate the baseline empty node predictions in the CRAC 2026 Shared Task on Multilingual Coreference Resolution https://ufal.mff.cuni.cz/corefud/crac26.
The model is language agnostic, so in theory it can be used to predict coreference in any XLM-RoBERTa language.
Compared to last year's CRAC 2025 Empty Nodes Baseline https://github.com/ufal/crac2025_empty_nodes_baseline, this year's baseline predicts all available information for the empty nodes, i.e., the forms, lemmas, UPOS, XPOS, and FEATS columns, in addition to the previously predicted word order and dependency relations of the empty nodes.
Instructions for running prediction, training, and intrinsic evaluation are all available in the repository CRAC 2026 Empty Nodes Baseline https://github.com/ufal/crac2026_empty_nodes_baseline
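For readers unfamiliar with how empty nodes appear in CoNLL-U, the following is a minimal sketch (not part of the baseline repository) that lists them in a file. It relies only on the CoNLL-U convention that empty nodes carry decimal IDs such as 1.1 and store their relations in the DEPS column. The function name `empty_nodes` and the sample sentence are invented for illustration.

```python
import re


def empty_nodes(conllu_text):
    """Collect empty nodes (decimal IDs such as 1.1) from CoNLL-U text.

    Returns tuples (sentence_index, id, form, lemma, upos, deps).
    """
    found = []
    # Sentences in CoNLL-U are separated by blank lines.
    blocks = re.split(r"\n\s*\n", conllu_text.strip())
    for sent_idx, block in enumerate(blocks):
        for line in block.splitlines():
            if line.startswith("#"):
                continue  # sentence-level comment
            cols = line.split("\t")
            if len(cols) != 10:
                continue  # not a well-formed token line
            # Decimal IDs mark empty nodes; ranges like 1-2 are multiword tokens.
            if re.fullmatch(r"\d+\.\d+", cols[0]):
                found.append((sent_idx, cols[0], cols[1], cols[2], cols[3], cols[8]))
    return found


# Invented sample: a one-word Czech sentence with a dropped subject pronoun.
SAMPLE = "\n".join([
    "# text = Přišel .",
    "1\tPřišel\tpřijít\tVERB\t_\t_\t0\troot\t0:root\t_",
    "1.1\t_\t#PersPron\tPRON\t_\t_\t_\t_\t1:nsubj\t_",
    "2\t.\t.\tPUNCT\t_\t_\t1\tpunct\t1:punct\t_",
    "",
])

if __name__ == "__main__":
    for node in empty_nodes(SAMPLE):
        print(node)
```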
LatinISE corpus (version 6) (2026-04-29)
The LatinISE corpus is a text corpus collected from the LacusCurtius, Intratext and Musisque Deoque websites. Corpus texts have rich metadata containing information such as genre, title, century or specific date.
This Latin corpus was built by Barbara McGillivray.
In versions 5 and 6 of the corpus, the author names and datings of texts before 600 CE have been manually corrected and duplicate texts have been removed. Thanks to Valentina Lunardi for this data curation.
Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography
This dataset contains data for testing machine translation and topic classification in Piedmontese.
It is based on FLORES+ (NLLB Team et al., 2024) and SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects (Adelani et al., EACL 2024).
DeriVallex 1.0
DeriVallex 1.0 is a valency lexicon of automatically generated valency frames of Czech noun and adjectival derivatives the valency of which exhibits systemic correspondences with the valency of their base words. It contains 10,220 derivatives corresponding to 17,288 lexical units (i.e., individual senses). In particular, DeriVallex describes 3,134 nouns corresponding to 5,089 lexical units and 7,086 adjectives corresponding to 12,199 lexical units. DeriVallex was created with the aim of providing information on the valency of nouns and adjectives, which is not sufficiently covered in existing lexical resources. Focusing on nominal and adjectival derivatives that exhibit systematic valency behavior in comparison with their base words, it captures the productive and systemic core of the Czech lexicon, thus laying the foundation for the further extension of current lexical resources. The following word-formation categories are covered: action nouns (e.g., dobytí města nepřáteli ‘conquering the city by enemies’), quality nouns (e.g., učitelova laskavost k dětem ‘the teacher’s kindness to children’), simultaneous action adjectives (e.g., lidé bojující proti bezpráví ‘people fighting against injustice’), anterior action adjectives (e.g., dluh narostlý na 400 milionů ‘a debt that has risen to 400 million’ and muži navrátivší se z války ‘men who have returned from the war’), passive action adjectives (e.g., úspory diktované Evropě konzervativní vládou ‘austerity measures dictated to Europe by a conservative government’), and potentiality adjectives (e.g., dužina oddělitelná od pecky ‘flesh separable from the pit’). In compiling the lexicon, data from the following lexical resources were used: NomVallex 2.6, VALLEX 4.5, and DeriNet 2.3. 
To satisfy different needs of potential users, the lexicon is distributed (i) online in an HTML version (providing a user-friendly interface allowing human users to search and filter the data) and (ii) in this distribution in a machine-readable form, so that the data can be used in NLP applications.
Authors:
Václava Kettnerová, Jiří Mírovský, Veronika Kolářová and Michal Olbrich
Acknowledgement:
The creation of the DeriVallex lexicon has been supported by the LINDAT/CLARIAH-CZ Research Infrastructure (https://lindat.cz), supported by the Ministry of Education, Youth and Sports of the Czech Republic (Project No. LM2023062); it has also used data and tools provided by this project.
License:
DeriVallex is publicly available under the Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA). Its non-commercial use is conditioned by appropriate citation:
Kettnerová, Václava and Mírovský, Jiří and Kolářová, Veronika and Olbrich, Michal. 2026. DeriVallex 1.0. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL). http://hdl.handle.net/11234/1-6109
Verbs annotated for morphemic structure in Czech, English, German, Spanish v2
A sample of verb lemmas in four languages: Czech (19,040 lemmas), English (9,969 lemmas), German (27,158 lemmas), Spanish (11,768 lemmas). Each verb lemma is annotated for its morphemic structure (i.e., segmented into the prefix(es), root(s), suffix(es) and ending(s) that the given lemma contains), the classification of its root morph to a root morpheme where needed (to facilitate grouping of verbs with the same root morpheme), and the frequency of the verb in a 100 M corpus. Two versions are available for each language: one with a more coarse-grained segmentation, which captures the morphemic structure that is synchronically available, and one with a more fine-grained segmentation, which also captures the word's etymology.
CooccurrenceFieldSampler (CFS)
The CooccurrenceFieldSampler (CFS) was developed for sampling from corpora to facilitate lexicographical data analysis. It works with corpora from different sources, text types or years. In random sentence sampling (random/opportunistic sampling), it can be observed that corpora containing different text types and lengths (per source) cannot always be mixed optimally, as they usually do not have the same size and have different topic weightings, for example. The CFS was designed to solve this problem.
The CFS first calculates all co-occurrences for all tokens within sentences – separately for each source. These corpora are then combined in a 1:1 mixture and the co-occurrences for the entire data set are recalculated. The tool evaluates which co-occurrences disappear and which new ones are created, resulting in quotas that control the random mixing of the corpora sentence by sentence.
The end result is a sentence-based corpus that (A) strives to retain the maximum number of co-occurrences from all sources (as accurately as possible) and (B) minimises the rejection of corpus data.
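The comparison step described above can be illustrated with a minimal sketch. It is not the CFS implementation: it uses a plain absolute-frequency threshold instead of the Poisson significance test, and it approximates the 1:1 mixture by truncating every source to the size of the smallest one (both are assumptions). The function names and the toy data are invented.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(sentences, min_frequency=1):
    """Unordered token pairs that share a sentence at least min_frequency times."""
    counts = Counter()
    for tokens in sentences:
        # Count each pair once per sentence, regardless of repetitions.
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return {pair for pair, n in counts.items() if n >= min_frequency}


def compare_mixture(sources, min_frequency=1):
    """Return (lost, new) co-occurrences when the sources are mixed 1:1.

    Assumption: "1:1" is approximated by taking the same number of
    sentences (the size of the smallest source) from every source.
    """
    per_source = set()
    for sentences in sources.values():
        per_source |= cooccurrences(sentences, min_frequency)

    size = min(len(sentences) for sentences in sources.values())
    mixed = [sent for sentences in sources.values() for sent in sentences[:size]]
    combined = cooccurrences(mixed, min_frequency)

    lost = sorted(per_source - combined)   # present in a source, gone in the mix
    new = sorted(combined - per_source)    # only emerge once sources are mixed
    return lost, new


# Toy corpora (invented): each sentence is a list of tokens.
SOURCES = {
    "A": [["a", "b"], ["c", "d"], ["a", "b"]],
    "B": [["c", "d"], ["e", "f"]],
}

if __name__ == "__main__":
    lost, new = compare_mixture(SOURCES, min_frequency=2)
    print("lost:", lost)
    print("new:", new)
```

In the tool itself, the lost and newly created co-occurrences feed into quotas for sentence-by-sentence mixing; that quota step is not reproduced here.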
---
To use the CFS tool, follow these steps:
1. Unzip the ZIP file containing the necessary files.
2. For Windows, Linux, and macOS, you will find precompiled binaries that run exclusively on x64 processors.
3. If you are using a different processor type, such as ARM or ARM64, please use the Universal folder.
4. Windows users should run "cfs.exe" directly.
5. For Linux and macOS users, you must mark the cfs file as executable.
6. If using the Universal version, ensure .NET 10.0 is installed for compiling. You can then run the program with "dotnet cfs.dll".
7. To display help information, use the --help parameter.
Help/Parameter:
--from (Default: cec / recommended: cec) import file format (valid: cec, bnc, catma, clan, conll, cora, cwd, dewac, dta, folia, fln, korap, leipzig, xces,
relannis, salt, json, sketch, speedy, tiger, tlv, treetagger, tsv, txm, weblicht)
--input (Default: input/) folder with input-files (mix per file)
--to (Default: cec / recommended: cec) export file format (valid: cec, catma, conll, cwd, csv, dta, folia, i5, korap, xces, plain, salt, json, sketch,
speedy, tlv, tsv, treetagger, txm, weblicht)
--layer (Default: Wort) use this layer to calculate the co-occurrences
--output (Default: output.cec6) output file (every round and logfile)
--minFrequency (Default: 1 / recommended: 5) min. absolute frequency
--minSignificance (Default: 1.0 / recommended: 1.0) min. significance (poisson distribution)
--minChangeRate (Default: 0.1 / recommended: 0.1) min. change rate
--maxRounds (Default: 10 / recommended: 5) max. number of rounds
--help Display this help screen.
--version Display version information.
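When CFS is called from a script, the command line can be assembled programmatically. The sketch below only builds the argument list and does not run anything. The program names (cfs.exe, cfs, dotnet cfs.dll) and option names come from the help text above; the helper name `cfs_command`, the "--name value" separator style, and the assumption that the binary sits in the current directory are not confirmed by the source.

```python
import sys


def cfs_command(options, universal=False):
    """Build an argument list for CFS from a dict of option names to values.

    Assumption: options are passed as "--name value" pairs.
    """
    if universal:
        base = ["dotnet", "cfs.dll"]      # Universal build, needs .NET installed
    elif sys.platform.startswith("win"):
        base = ["cfs.exe"]                # precompiled Windows binary
    else:
        base = ["./cfs"]                  # Linux/macOS binary, must be executable
    args = []
    for name, value in options.items():
        args += [f"--{name}", str(value)]
    return base + args


if __name__ == "__main__":
    # Recommended values taken from the help text above.
    command = cfs_command(
        {
            "from": "conll",
            "input": "input/",
            "to": "cec",
            "output": "output.cec6",
            "minFrequency": 5,
            "maxRounds": 5,
        },
        universal=True,
    )
    print(" ".join(command))
    # To actually run it: subprocess.run(command, check=True)
```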
Supported corpus formats (input/output):
cec - CorpusExplorer Corpus (v6) - http://corpusexplorer.de
bnc - British National Corpus - http://www.natcorp.ox.ac.uk/
catma - CATMA (Computer assisted text markup and analysis) - https://catma.de/
clan - CLAN/CHILDES - https://talkbank.org/childes/
conll - CoNLL-U https://universaldependencies.org/format.html
cora - CORA XML - https://cora.readthedocs.io/en/latest/coraxml/
cwd - IMS Open Corpus Workbench (CWB) - https://cwb.sourceforge.io/
dewac - https://wacky.sslmit.unibo.it/doku.php?id=corpora
dta - DTA TCF-XML - https://www.deutschestextarchiv.de/download
folia - FoLiA XML - https://proycon.github.io/folia/
fln - Folker/OrthoNormal - https://exmaralda.org/de/folker-de/
korap - KorAP - http://korap.ids-mannheim.de/
leipzig - Wortschatz Leipzig - https://wortschatz.uni-leipzig.de/en/download/
xces - XCes XML - http://www.xces.org/ / https://www.cs.vassar.edu/CES/
relannis - https://corpus-tools.org/annis/
salt - https://corpus-tools.org/archive-2015-2025/salt/
json - https://de.wikipedia.org/wiki/JSON
sketch - SketchEngine VERT - https://www.sketchengine.eu/glossary/vertical-file/
speedy - SPEEDy Annotation Editor - http://kups.ub.uni-koeln.de/id/eprint/55224
tiger - TiGER-XML - https://www.ims.uni-stuttgart.de/documents/ressourcen/werkzeuge/tigersearch/doc/html/TigerXML.html
tlv - TLV-XML
treetagger - TreeTagger - https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
tsv - Tab-separated values - https://en.wikipedia.org/wiki/Tab-separated_values
txm - TXM - https://txm.gitpages.huma-num.fr/textometrie/?lang=en
weblicht - Weblicht - https://weblicht.sfs.uni-tuebingen.de/weblichtwiki/Main_Page.html
csv - Comma-separated values - https://en.wikipedia.org/wiki/Comma-separated_values
i5 - i5-XML - https://www.ids-mannheim.de/en/digspra/pb-s1/projects/corpus-development/ids-text-model/
plain - Plaintext - https://en.wikipedia.org/wiki/Plain_tex
Projekt_ZDH_transkripce
Text written in Kurrent script, transcribed with Transkribus and then finished by hand.
Human Label Variation in Coreference (Hlava Cor)
Human Label Variation in Coreference (Hlava COR) is a collection of commented multiple annotations (three annotators) of coreferential relations in Czech, i.e. the annotation of expressions that refer to the same extra-linguistic entity, concept, or situation. Given an anaphoric expression, annotators were instructed to identify a coreferential expression in the preceding context (if one exists) and to comment on their decision. The main aim of the annotation is to capture variation in the interpretation of coreference among readers. The dataset includes both written and spoken contexts. For detailed and up-to-date information about the corpus, please visit: https://ufal.mff.cuni.cz/hvar/hlava-co
Datasets and R scripts for modelling Czech translation counterparts of Romance causative constructions
This repository contains the datasets and code used in the study “Predicting translation counterparts in causative constructions.”
The datasets consist of annotated examples of Italian and Spanish causative constructions and their Czech translation counterparts. The repository includes (i) full annotated datasets for Italian and Spanish, (ii) revised datasets used for statistical modelling, and (iii) the R script used to estimate Bayesian multinomial regression models using the brms package (Stan backend).
The models estimate the probability of selecting a Czech translation counterpart (TYPE) as a function of verb valency (VALENCY) and complement class (COMP_CLASS), with random effects for VERB and TRANSLATOR.
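One plausible way to write such a model down, assuming categorical predictors without interactions, random intercepts only, and the reference-category parameterisation that brms uses for its categorical family (none of which the description confirms), is the following, where i indexes observations and k = 1, ..., K indexes the Czech counterpart types:

```latex
\begin{align}
\Pr(\mathrm{TYPE}_i = k) &= \frac{\exp(\eta_{ik})}{\sum_{j=1}^{K} \exp(\eta_{ij})},
  \qquad \eta_{i1} = 0 \quad \text{(reference category)}, \\
\eta_{ik} &= \alpha_k
  + \beta_{k,\,\mathrm{VALENCY}[i]}
  + \gamma_{k,\,\mathrm{COMP\_CLASS}[i]}
  + u_{k,\,\mathrm{VERB}[i]}
  + v_{k,\,\mathrm{TRANSLATOR}[i]}, \qquad k = 2, \dots, K, \\
u_{k,\cdot} &\sim \mathcal{N}(0, \sigma_{u,k}^{2}), \qquad
v_{k,\cdot} \sim \mathcal{N}(0, \sigma_{v,k}^{2}).
\end{align}
```

The symbols α, β, γ, u, v and σ are illustrative notation, not taken from the study; the R script in the repository is the authoritative specification.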
The repository also contains summaries of the fitted models.