Common Language Resources and Technology Infrastructure - Slovenia
816 research outputs found
The "Mobile languages" corpus MoJezik 1.0 (audio)
The "Mobile Languages" corpus documents in-depth, semi-structured sociolinguistic interviews with speakers from two Slovene regions and distinctive dialects: Idrija (Cerkno dialect, Rovte group) and Ribnica (Dolenjska dialect, Dolenjska group), who study or work in the Slovenian capital, Ljubljana, and thus navigate daily between dialectal and standard language use. Interview topics include narratives of personal (linguistic) history, reflections on past and present language practices, attitudes towards their own dialects and other Slovene varieties, experiences of dialect perception in the Ljubljana context and of standard-like speech in local environments, linguistic identity, stereotypes and prejudices, intergenerational language use (especially with children), and language behaviour in educational settings.
The corpus includes:
– Idrija group: 4 speakers (2 women, 2 men; 2 adults, 2 secondary-school students), recorded between 2009 and 2013; total interview length: 5 hours, 37 minutes, 9 seconds.
– Ribnica group: 6 speakers (2 primary informants and 4 close contacts, including family members, friends, and colleagues), recorded between 2020 and 2022; total interview length: 4 hours, 37 minutes, 15 seconds.
The interviews were conducted within the framework of broader sociolinguistic research, which also encompassed informants’ self-recordings of spontaneous speech in diverse everyday situations and a quantitative variationist analysis of five dialect-specific phonological variables across various communicative contexts. The interview data enable comparisons between speakers’ metalinguistic commentary and their actual language use as documented in the recordings.
The findings of the Cerkno and Ribnica studies are comprehensively presented in two scientific publications:
* Bitenc, Maja, 2016: Z jezikom na poti med Idrijskim in Ljubljano [With Language on the Move Between Idrija and Ljubljana]. Ljubljana: Znanstvena založba Filozofske fakultete.
* Bitenc, Maja (in press): Govor v gibanju med Ribnico in Ljubljano [Speech in Motion Between Ribnica and Ljubljana]. Ljubljana: Znanstvena založba Filozofske fakultete.
This entry contains only audio recordings, and only for speakers who have consented to the publication of their recordings. The transcriptions are available in a separate entry: The "Mobile Languages" corpus MoJezik 1.0 (transcription), http://hdl.handle.net/11356/2037
Monitor corpus of Slovene Trendi 2025-05
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 57 publishers. Trendi 2025-05 covers the period from January 2019 to May 2025, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320).
The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to the Universal Dependencies scheme (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf).
An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics).
The corpus is currently not available as a downloadable dataset due to copyright restrictions but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]).
This version adds texts from May 2025.
The CLASSLA-Stanza model for UD dependency parsing of standard Slovenian 2.2
This model for UD dependency parsing of standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training corpus (http://hdl.handle.net/11356/1747) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1204) expanded with the MaCoCu-sl Slovene web corpus (http://hdl.handle.net/11356/1517). The estimated LAS of the parser is ~90.42.
Unlike the previous version, this model was trained on the improved SUK 1.1 version of the training corpus.
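For readers unfamiliar with the metric, the following is a minimal sketch of how a labelled attachment score (LAS) can be computed when gold and predicted analyses share the same tokenisation. The function name and the toy data are illustrative assumptions; how the figure reported for this model was actually computed is not stated in this entry.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency relation) for one token


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages for token-aligned gold and predicted arcs."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must cover the same tokens")
    if not gold:
        raise ValueError("no tokens to score")
    # UAS counts tokens whose head is correct, regardless of the relation label.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS counts tokens whose head and relation label are both correct.
    las_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100.0 * uas_hits / n, 100.0 * las_hits / n


# Invented four-token example (not taken from any corpus).
gold = [(2, "nsubj"), (0, "root"), (4, "amod"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (4, "det"), (3, "obj")]
uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")
```

In this toy case three of four heads are correct and two of four tokens have both the correct head and the correct label.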
The CLASSLA-Stanza model for lemmatisation of spoken Slovenian 2.2
This model for lemmatisation of spoken Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SST treebank of spoken Slovenian (https://github.com/UniversalDependencies/UD_Slovenian-SST) combined with the SUK training corpus (http://hdl.handle.net/11356/1959) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1791) that were expanded with the MaCoCu-sl Slovene web corpus (http://hdl.handle.net/11356/1517). The estimated F1 of the lemma annotations is ~99.23.
Frequency list of collocations from the Šolar 3.0 corpus
The frequency list of collocations from the developmental corpus Šolar 3.0 (http://hdl.handle.net/11356/1589), specifically from the original, uncorrected student texts ("solar-orig.conllu"), was extracted with the CORDEX library (https://github.com/clarinsi/cordex/). The extraction is based on 82 predefined syntactic structures (cf. Krek et al., 2021) using the MULTEXT-East morphosyntactic (https://wiki.cjvt.si/books/04-multext-east-morphosyntax) and JOS-SYN dependency parsing (https://wiki.cjvt.si/books/06-jos-syn-syntax) annotations, where the latter serves as a syntactic complement to the former. The formal description of the syntactic structures is included in the CORDEX library (see "structures_JOS.xml").
There are 3 output files:
- "solar-orig3.0_kolokacije.csv" contains the original output of collocations with absolute frequency 1 and above, corresponding to 81 (out of 82) predefined syntactic structures. The list is sorted by absolute frequency of collocations (Joint_representative_form) and includes frequency and POS information for each lemma of the collocation. The file also provides additional statistical measures (Delta_p12, Delta_p21, LogDice_core, LogDice_all) and shows the number of distinct forms in which the lemmas appear in the corpus for each collocation.
- "solar-orig3.0_kolokacije_collocation_sentence_mapper.csv" complements the file above by showing all occurrences of the extracted collocations in the corpus. Each row lists a collocation ID (matching the first file), identifies the sentence in which the collocation appears, and provides the exact tokens that form the collocation.
- "solar-orig3.0_kolokacije_collocation_sentence_mapper_metadata.csv" is an extension of the "solar-orig3.0_kolokacije_collocation_sentence_mapper.csv" file that includes school-text metadata.
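The relationship between the first two files can be sketched as a join on the collocation ID. This is a minimal illustration, not the documented layout: only "Joint_representative_form" is named in the description above, while the header names "Collocation_ID" and "Sentence_ID", the comma delimiter, and all sample rows are assumptions made for the example. The actual column names should be taken from "00README.txt".

```python
import csv
import io
from collections import defaultdict

# Invented sample data standing in for the two CSV files.
# "Collocation_ID" and "Sentence_ID" are hypothetical header names.
collocations_csv = """Collocation_ID,Joint_representative_form
c1,lep dan
c2,brati knjigo
"""
mapper_csv = """Collocation_ID,Sentence_ID
c1,s10
c1,s42
c2,s7
"""


def load_sentences_by_collocation(mapper_file):
    """Group sentence IDs by collocation ID from the mapper file."""
    sentences = defaultdict(list)
    for row in csv.DictReader(mapper_file):
        sentences[row["Collocation_ID"]].append(row["Sentence_ID"])
    return sentences


def attach_occurrences(collocation_file, sentences):
    """Yield (representative form, sentence IDs) for each collocation row."""
    for row in csv.DictReader(collocation_file):
        yield row["Joint_representative_form"], sentences.get(row["Collocation_ID"], [])


sentences = load_sentences_by_collocation(io.StringIO(mapper_csv))
joined = dict(attach_occurrences(io.StringIO(collocations_csv), sentences))
for form, sentence_ids in joined.items():
    print(form, sentence_ids)
```

With the real files, the `io.StringIO` objects would be replaced by files opened with `open(path, newline="", encoding="utf-8")`, and the delimiter adjusted if it is not a comma.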
The dataset can be used to analyse student writing in Slovene schools, especially in combination with comparable data (http://hdl.handle.net/11356/2012) from the Slovene textbook corpus Učbeniki 1.0, which presents the expected or desired scope of reception, to identify core student vocabulary.
The data was prepared in the following manner:
In the preprocessing phase, the MULTEXT-East morphosyntactic tags (MSD tags) in the CoNLL-U input corpus were converted from Slovene to their English equivalents because the library then in use did not support Slovene MSD tags.
Next, collocation data were extracted using the CORDEX library. Any collocations containing punctuation were excluded from the output. The lookup lexicon (https://www.clarin.si/repository/xmlui/handle/11356/1854) was used to improve collocation representations (applicable only when using the JOS system).
In the postprocessing phase, the MSD tags in the output were translated back into their original Slovene MSD tags.
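The tag conversion in the preprocessing and postprocessing phases amounts to a lookup in both directions. The sketch below shows the idea on a single CoNLL-U token line, assuming the MSD tag sits in the XPOS column. The two-entry mapping is a toy table for illustration only and should be checked against the official MULTEXT-East tables; the function name is hypothetical and is not part of CORDEX.

```python
# Toy Slovene-to-English MSD mapping; illustrative only, to be verified
# against the official MULTEXT-East specifications before any real use.
SL_TO_EN = {"Sozei": "Ncfsn", "Somei": "Ncmsn"}
EN_TO_SL = {en: sl for sl, en in SL_TO_EN.items()}

XPOS = 4  # zero-based index of the XPOS column in a CoNLL-U token line


def translate_xpos(line: str, mapping: dict) -> str:
    """Translate the XPOS field of one CoNLL-U token line; leave other lines untouched."""
    if not line.strip() or line.startswith("#"):
        return line  # blank separator or comment line
    fields = line.split("\t")
    if len(fields) != 10:
        return line  # not a well-formed token line
    # Unknown tags are passed through unchanged rather than dropped.
    fields[XPOS] = mapping.get(fields[XPOS], fields[XPOS])
    return "\t".join(fields)


original = "1\tknjiga\tknjiga\tNOUN\tSozei\t_\t0\troot\t_\t_"
english = translate_xpos(original, SL_TO_EN)   # preprocessing direction
restored = translate_xpos(english, EN_TO_SL)   # postprocessing direction
print(english)
print(restored)
```

The same reverse lookup would apply to MSD tags in the output files, provided the mapping is one-to-one so that the round trip restores the original tags.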
For more details, see "00README.txt".
---
KREK, Simon, GANTAR, Polona, KOSEM, Iztok, DOBROVOLJC, Kaja. Opis modela za pridobivanje in strukturiranje kolokacijskih podatkov iz korpusa. In: ARHAR HOLDT, Špela (ed.). Nova slovnica sodobne standardne slovenščine : viri in metode. 1st ed. Ljubljana: Znanstvena založba Filozofske fakultete, 2021. Pp. 160-194, illustrated. Series: Sporazumevanje. https://ebooks.uni-lj.si/ZalozbaUL/catalog/view/325/477/732
The CLASSLA-Stanza model for UD dependency parsing of spoken Slovenian 2.2
This model for UD dependency parsing of spoken Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SST treebank of spoken Slovenian (https://github.com/UniversalDependencies/UD_Slovenian-SST) combined with the SUK training corpus (http://hdl.handle.net/11356/1959) and using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1791) that were expanded with the MaCoCu-sl Slovene web corpus (http://hdl.handle.net/11356/1517). The estimated LAS of the parser is ~81.91.
Monitor corpus of Slovene Trendi 2025-01
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 77 publishers. Trendi 2025-01 covers the period from January 2019 to January 2025, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320).
The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to the Universal Dependencies scheme (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf).
An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics).
The corpus is currently not available as a downloadable dataset due to copyright restrictions but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]).
This version adds texts from January 2025.
Linguistically annotated multilingual comparable corpora of parliamentary debates in English ParlaMint-en.ana 5.0
ParlaMint-en.ana 5.0 is the English machine translation of the ParlaMint.ana 5.0 (http://hdl.handle.net/11356/2005) set of corpora of parliamentary debates across Europe. The translation keeps the structure and metadata of the original corpora and is linguistically annotated similarly to the original language corpora (but without UD syntax), and with the addition of USAS semantic tags (https://ucrel.lancs.ac.uk/usas/). Because of the addition of semantic tags the UK corpus (ParlaMint-GB) is also included, even though it has, of course, not been machine translated.
The translation to English was done with EasyNMT (https://github.com/UKPLab/EasyNMT) using OPUS-MT models (https://github.com/Helsinki-NLP/Opus-MT). Machine translation was done on the sentence level over both speeches and transcriber notes, including headings. Note that corpus metadata is mostly available both in the source language and in English. The linguistic annotation of the speeches, i.e. tokenisation, tagging with UD PoS and morphological features, lemmatisation, and NER annotation was done with Stanza (https://stanfordnlp.github.io/stanza/) using the conll03 model (4 classes). The annotation of MWEs (phrases) and tokens with USAS tags was done with pyMusas (https://github.com/ucrel/pymusas).
Note that the English in the corpora contains typical NMT errors, including factual errors even when high fluency is achieved, and any use of this corpus should take the machine translation limitations into account.
The files associated with this entry include the machine translated and linguistically annotated corpora in several formats: the corpora in the canonical ParlaMint TEI XML encoding; the corpora in the derived vertical format (for use with CQP-based concordancers, such as CWB, noSketch Engine or KonText); and the corpora in the CoNLL-U format with TSV speech metadata. The CoNLL-U files include pyMusas USAS tags. Also included is the 5.0 release of the sample data and scripts available at the GitHub repository of the ParlaMint project at https://github.com/clarin-eric/ParlaMint and the log files produced in the process of building the corpora for this release. The log files show e.g. known errors in the corpora, while more information about known problems is available in the (open) issues at the GitHub repository of the project.
Compared with the previous version (4.1), this version adds information on the topic of each speech and the sentence-level sentiment for all corpora, changes the IDs of the categories in corpus-specific taxonomies to prevent ID clashes, and corrects some other minor errors.
Monitor corpus of Slovene Trendi 2024-12
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 76 publishers. Trendi 2024-12 covers the period from January 2019 to December 2024, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320).
The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to the Universal Dependencies scheme (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf).
An important addition is the set of topics, or thematic categories, which have been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics).
The corpus is currently not available as a downloadable dataset due to copyright restrictions but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]).
This version adds texts from December 2024.
Service for querying dependency treebanks Drevesnik 1.2
Drevesnik (https://orodja.cjvt.si/drevesnik/) is an online service for querying Slovenian corpora parsed with the Universal Dependencies annotation scheme. It features an easy-to-use query language on the one hand and user-friendly graph visualizations on the other.
It is based on the open-source dep_search tool (https://github.com/TurkuNLP/dep_search), which was localized and modified so as to also support querying by JOS morphosyntactic tags, random distribution of results, and filtering by sentence length.
The source code and the documentation for the search backend and the web user interface are publicly available in the CLARIN.SI GitHub repository at https://github.com/clarinsi/drevesnik. Compared with the previous version (1.1), release 1.2 introduces a new front-end design and some improved interface features.