LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
2527 research outputs found
CEC6-Converter (2025-05-29)
This software converts *.cec6 files into 24 formats commonly used in corpus linguistics / NLProc. It runs on all modern operating systems (Windows, Linux, macOS). The binaries are compiled for the x64 architecture; if your processor (CPU) has an x86 or ARM architecture, please follow the instructions under "other operating systems or x86 / ARM / ARM64".
Uniform Meaning Representation 2.0
The goal of the Uniform Meaning Representation (UMR) project is to design a meaning representation that can be used to annotate the semantic content of a text. UMR is primarily based on Abstract Meaning Representation (AMR), an annotation framework initially designed for English, but also draws from other meaning representations. UMR extends AMR to other languages, particularly morphologically complex, low-resource languages. UMR also adds features to AMR that are critical to semantic interpretation, and enhances AMR by proposing a companion document-level representation that captures linguistic phenomena such as coreference as well as temporal and modal dependencies that potentially go beyond sentence boundaries. UMR is intended to be scalable, learnable, and cross-linguistically plausible. It is designed to support both lexical and logical inference.
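UMR inherits AMR's sentence-level graph annotation, conventionally written in PENMAN notation. A classic AMR example from the literature, for "The boy wants to go" (UMR layers further information, such as aspect and document-level links, on top of graphs like this):

```
(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g / go-01
            :ARG0 b))
```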
EduPo: Analysis and Generation of Czech Poetry, v0.5
A suite of tools for the analysis and generation of Czech poetry. This is a snapshot of the public GitHub repository at https://github.com/ufal/edupo -- the beta version of the tool suite, released together with a scientific paper at the NLP4DH 2025 conference.
Diadem Speech-Cognitive Dataset (DSCD-CZ)
The dataset was created to investigate the speech and cognitive performance of people with varying degrees of cognitive impairment, primarily dementia. The dataset contains a comprehensive set of data including the results of standardized neuropsychological tests (RBANS, ALBA, POBAV, MASTCZ), speech tasks focused on comprehension, memory, naming, and repetition, and demographic data (age, gender, education).
Participants were divided into four groups based on clinical assessment: healthy individuals, healthy individuals with possible mild cognitive impairment, patients with mild cognitive impairment, and patients with dementia. All recordings and examinations were managed as part of routine clinical practice in the neurological outpatient clinic – Memory Disorders Advisory Unit, at the Neurological Clinic of the Faculty Hospital Královské Vinohrady. The dataset containing 268 examinations was divided into a training and test part using stratification by clinical group, age, gender, and level of education to ensure an even distribution of these key characteristics in both parts of the data.
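A stratified split of this kind can be sketched in a few lines. The code below uses hypothetical record fields (only clinical group and gender; the real dataset additionally stratifies by age and education level) and approximates multi-variable stratification by treating the combination of variables as one composite key:

```python
import random
from collections import defaultdict

def stratified_split(rows, key, test_frac=0.25, seed=0):
    """Split rows into train/test parts, preserving the proportion of
    each stratum defined by key(row) in both parts."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    train, test = [], []
    for bucket in strata.values():
        rng.shuffle(bucket)                    # randomize within each stratum
        n_test = round(len(bucket) * test_frac)
        test.extend(bucket[:n_test])
        train.extend(bucket[n_test:])
    return train, test

# Hypothetical records; the composite key stratifies jointly on both fields.
records = [{"group": g, "gender": s}
           for g in ("healthy", "dementia") for s in ("m", "f")] * 10
train, test = stratified_split(records, key=lambda r: (r["group"], r["gender"]))
```

Each of the four strata contributes proportionally to both parts, which is the "even distribution of key characteristics" the dataset description refers to.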
The aim of the dataset is to support the development of methods for automated detection of cognitive disorders based on speech analysis and cognitive performance. The data are suitable for research in the areas of clinical neuropsychology, computational linguistics, and machine learning. The dataset is intended for non-commercial research purposes.
LatinISE corpus (version 5)
The LatinISE corpus is a text corpus collected from the LacusCurtius, Intratext and Musisque Deoque websites. Corpus texts have rich metadata containing information such as genre, title, century, or specific date.
This Latin corpus was built by Barbara McGillivray.
In version 5 of the corpus, the author names and datings of texts before 600 CE have been manually corrected, and duplicate texts have been removed. Thanks to Valentina Lunardi for this data curation.
Coreference in Universal Dependencies 1.3 (CorefUD 1.3)
CorefUD is a collection of previously existing datasets annotated with coreference, which we converted into a common annotation scheme. In total, CorefUD in its current version 1.3 consists of 28 datasets for 18 languages. The datasets are enriched with automatic morphological and syntactic annotations that are fully compliant with the standards of the Universal Dependencies project. All the datasets are stored in the CoNLL-U format, with coreference- and bridging-specific information captured by attribute-value pairs located in the MISC column.

The collection is divided into a public edition and a non-public (ÚFAL-internal) edition. The publicly available edition is distributed via LINDAT-CLARIAH-CZ and contains 24 datasets for 17 languages (1 dataset for Ancient Greek, 1 for Ancient Hebrew, 1 for Catalan, 2 for Czech, 3 for English, 2 for French, 2 for German, 1 for Hindi, 2 for Hungarian, 1 for Korean, 1 for Lithuanian, 2 for Norwegian, 1 for Old Church Slavonic, 1 for Polish, 1 for Russian, 1 for Spanish, and 1 for Turkish), excluding the test data. The non-public edition is available internally to ÚFAL members and contains 4 additional datasets for 2 languages (1 dataset for Dutch, and 3 for English), which we are not allowed to distribute due to their original license limitations. It also contains the test data portions for all datasets. When using any of the harmonized datasets, please get acquainted with its license (placed in the same directory as the data) and cite the original data resource too.

Compared to the previous version 1.2, version 1.3 adds new languages and corpora, namely French-ANCOR, Hindi-HDTB, and Korean-ECMT. In addition, English-GUM and Czech-PDT have been updated to newer versions, and the conversion of zeros in Hungarian-KorKor has been improved (a list of all changes in each dataset can be found in the corresponding README file).
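Since the coreference information lives in the MISC column as attribute-value pairs, it can be extracted from a CoNLL-U token line with ordinary string handling. A minimal sketch; the token line and the Entity value are illustrative, not copied from any specific CorefUD dataset:

```python
def parse_misc(misc):
    """Parse a CoNLL-U MISC column: pipe-separated Key=Value pairs
    (a lone underscore means the column is empty)."""
    if misc == "_":
        return {}
    return dict(item.split("=", 1) for item in misc.split("|"))

# A CoNLL-U token line has ten tab-separated columns; MISC is the tenth.
line = "1\tMary\tMary\tPROPN\tNNP\tNumber=Sing\t2\tnsubj\t_\tEntity=(e1-person-1)"
misc = parse_misc(line.split("\t")[9])  # {'Entity': '(e1-person-1)'}
```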
NameTag 3 Multilingual Model 250203
This is a trained model for the supervised machine learning tool NameTag 3 (https://ufal.mff.cuni.cz/nametag/3/). NameTag 3 is an open-source tool for both flat and nested named entity recognition (NER). NameTag 3 identifies proper names in text and classifies them into a set of predefined categories, such as names of persons, locations, organizations, etc. The model was trained jointly on 21 flat NE corpora in 17 languages: Arabic, Chinese, Croatian, Czech, Danish, Dutch, English, German, Maghrebi Arabic French, Norwegian Bokmål, Norwegian Nynorsk, Portuguese, Serbian, Slovak, Spanish, Swedish, and Ukrainian. The model documentation can be found at https://ufal.mff.cuni.cz/nametag/3/models#multilingual.
Czech Etymological Lexicon 1.0
The Czech Etymological Lexicon, version 1.0, contains 10,502 Czech words, each annotated with a sequence of ISO 639-3 language codes representing its etymological origin. The dataset is provided in a simple tab-separated format: the first column contains the lemma, the second lists the language codes separated by commas, and the third indicates whether the word is a loanword (marked as "loan") or a native word (marked as "native"). Note that "native" refers to inherited words as opposed to loanwords.
Example entry:
architekt	deu,lat,ell	loan
The word architekt originated from Greek, and came to Czech through Latin and German.
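Given the three tab-separated columns described above, a minimal reader for one entry might look as follows (a sketch that assumes exactly the stated layout):

```python
def parse_entry(line):
    """Parse one lexicon line: lemma, comma-separated ISO 639-3 codes
    (from the most recent donor language back to the ultimate origin),
    and a "loan"/"native" flag."""
    lemma, codes, status = line.rstrip("\n").split("\t")
    return lemma, codes.split(","), status

lemma, path, status = parse_entry("architekt\tdeu,lat,ell\tloan")
# path == ['deu', 'lat', 'ell']: borrowed via German and Latin from Greek.
```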
The language sequences were extracted from the printed dictionary REJZEK, Jiří. Český etymologický slovník [Czech etymological dictionary]. LEDA, 2015. The extraction of language sequences from the entries in the original dictionary was fully automated and may therefore contain imperfections. Please refer to the original dictionary for highly precise information.
Debiasing Algorithm through Model Adaptation
Debiasing Algorithm through Model Adaptation (DAMA) is based on guarding stereotypical gender signals and model editing. DAMA is applied to specific modules shown by causal tracing to be prone to conveying gender bias. Our novel method effectively reduces gender bias in LLaMA models in three diagnostic tests: generation, coreference (WinoBias), and stereotypical sentence likelihood (StereoSet). The method does not change the model’s architecture, parameter count, or inference cost. We have also shown that the model’s performance in language modeling and a diverse set of downstream tasks is almost unaffected. This package contains both the source code and the English, English-to-Czech, and English-to-German datasets.
Universal Dependencies 2.16
Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008).
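UD treebanks are released in the CoNLL-U format: one token per line with ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). An illustrative, simplified sentence (feature sets abbreviated):

```
# text = She left.
1	She	she	PRON	PRP	Case=Nom|Number=Sing|Person=3	2	nsubj	_	_
2	left	leave	VERB	VBD	Tense=Past|VerbForm=Fin	0	root	_	SpaceAfter=No
3	.	.	PUNCT	.	_	2	punct	_	_
```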