Common Language Resources and Technology Infrastructure - Slovenia
    816 research outputs found

    Frequency list of collocations from the Učbeniki 1.0 corpus

    The frequency list of collocations from the Slovene textbook corpus Učbeniki 1.0 was extracted with the CORDEX library (https://github.com/clarinsi/cordex/). The extraction is based on 82 predefined syntactic structures (cf. Krek et al., 2021) using the MULTEXT-East morphosyntactic (https://wiki.cjvt.si/books/04-multext-east-morphosyntax) and JOS-SYN dependency parsing (https://wiki.cjvt.si/books/06-jos-syn-syntax) annotations, where the latter serves as a syntactic complement to the former. The formal description of the syntactic structures is included in the CORDEX library (see "structures_JOS.xml"). There are two output files:
    - "ucbeniki1.0_kolokacije.csv" contains the original output of collocations with absolute frequency 1 and above, corresponding to the 82 predefined syntactic structures. The list is sorted by absolute frequency of collocations (Joint_representative_form) and includes frequency and POS information for each lemma of the collocation. The file also provides additional statistical measures (Delta_p12, Delta_p21, LogDice_core, LogDice_all) and shows, for each collocation, the number of distinct forms in which its lemmas appear in the corpus.
    - "ucbeniki1.0_kolokacije_collocation_sentence_mapper.csv" complements the file above by listing all occurrences of the extracted collocations in the corpus. Each row gives a collocation ID (matching the first file), identifies the sentence in which the collocation appears, and provides the exact tokens that form the collocation.
    The dataset can be used for analyses, especially in combination with comparable data (http://hdl.handle.net/11356/2011) from the developmental corpus Šolar 3.0 (http://hdl.handle.net/11356/1589), to identify core student vocabulary.
    The data were prepared as follows. In the preprocessing phase, all individual Slovene school textbooks were merged into a single CoNLL-U file. Because the library then in use did not support Slovene MULTEXT-East morphosyntactic tags (MSD tags), these tags were converted into their English equivalents. Next, collocation data were extracted using the CORDEX library; any collocations containing punctuation were excluded from the output. The lookup lexicon (https://www.clarin.si/repository/xmlui/handle/11356/1854) was used to improve collocation representations (applicable only when using the JOS system). In the postprocessing phase, the MSD tags in the output were translated back into Slovene MSD tags. For more details, see "00README.txt".
    Reference: KREK, Simon, GANTAR, Polona, KOSEM, Iztok, DOBROVOLJC, Kaja. Opis modela za pridobivanje in strukturiranje kolokacijskih podatkov iz korpusa [Description of a model for acquiring and structuring collocation data from a corpus]. In: ARHAR HOLDT, Špela (ed.). Nova slovnica sodobne standardne slovenščine: viri in metode. 1st ed. Ljubljana: Znanstvena založba Filozofske fakultete, 2021, pp. 160-194. Zbirka Sporazumevanje. https://ebooks.uni-lj.si/ZalozbaUL/catalog/view/325/477/732
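
    Below is a minimal sketch (not part of the dataset documentation) of how the two CSV files could be inspected and joined with pandas. The column names Joint_representative_form and LogDice_all are documented above; "Collocation_ID" and the exact CSV layout are assumptions, so check "00README.txt" for the real headers.

        # Hedged sketch: load the two documented CSV files and join them on the
        # shared collocation ID ("Collocation_ID" is a hypothetical column name).
        import pandas as pd

        colloc = pd.read_csv("ucbeniki1.0_kolokacije.csv")
        mapper = pd.read_csv("ucbeniki1.0_kolokacije_collocation_sentence_mapper.csv")

        # Top 20 collocations by the documented LogDice_all association score.
        print(colloc.sort_values("LogDice_all", ascending=False)
                    .loc[:, ["Joint_representative_form", "LogDice_all"]]
                    .head(20))

        # Attach each corpus occurrence to its collocation entry via the shared ID.
        merged = mapper.merge(colloc, on="Collocation_ID", how="left")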

    The "Mobile languages" corpus MoJezik 1.0 (transcription)

    The "Mobile Languages" corpus documents in-depth, semi-structured sociolinguistic interviews with speakers from two Slovene regions and distinctive dialects: Idrija (Cerkno dialect, Rovte dialect group) and Ribnica (Lower Carniola dialect, Lower Carniola dialect group), who study or work in the Slovenian capital, Ljubljana, and thus navigate daily between dialectal and standard language use. Interview topics include narratives of personal (linguistic) history, reflections on past and present language practices, attitudes towards their own dialects and other Slovene varieties, experiences of dialect perception in the Ljubljana context and of standard-like speech in local environments, linguistic identity, stereotypes and prejudices, intergenerational language use (especially with children), and language behaviour in educational settings. The corpus includes: – Idrija group: 5 speakers (3 women, 2 men; 3 adults, 2 secondary-school students), recorded between 2009 and 2013; 1,112 transcribed utterances, 31,506 transcribed words. – Ribnica group: 11 speakers (3 primary informants and 8 close contacts, including family members, friends, and colleagues), recorded between 2020 and 2022; 2,889 transcribed utterances, 47,364 transcribed words. The transcriptions are orthographic, with selected non-standard features preserved using special symbols to capture salient dialectal elements (e.g., the fricative [γ] and the bilabial glide [w] in the Cerkno variety). Speaker names have been anonymised. While transcription prioritised content and was performed by multiple transcribers, consistency in the phonetic rendering of dialectal features was not systematically verified. Users should be aware that detailed phonological analysis may require additional checking. The interviews were conducted within the framework of broader sociolinguistic research, which also encompassed informants’ self-recordings of spontaneous speech in diverse everyday situations and a quantitative variationist analysis of five phonological variables (dialect-specific) across various communicative contexts. The interview data enable comparisons between speakers’ metalinguistic commentary and their actual language use as documented in the recordings. The findings of the Cerkno and Ribnica studies are comprehensively presented in two scientific publications: * Bitenc, Maja, 2016: Z jezikom na poti med Idrijskim in Ljubljano [With Language on the Move Between Idrija and Ljubljana]. Ljubljana: Znanstvena založba Filozofske fakultete. * Bitenc, Maja (in press): Govor v gibanju med Ribnico in Ljubljano [Speech in Motion Between Ribnica and Ljubljana]. Ljubljana: Znanstvena založba Filozofske fakultete. The corpus speech files for speakers who have consented to the publication of their recordings are available as a separate entry: The "Mobile languages" corpus MoJezik 1.0 (audio), http://hdl.handle.net/11356/2042

    Syntactic Tree Inventories from Slovenian UD Corpora (v2.15)

    This dataset contains lists of delexicalized dependency trees and subtrees extracted from the Slovenian UD corpora SSJ (written) and SST (spoken), version 2.15 (http://hdl.handle.net/11234/1-5787), using the STARK tool (https://github.com/clarinsi/STARK). These lists represent a basic set of syntactic structures in Slovenian, useful for data-based investigations of syntactic patterns and their variation across the two modalities. Each structure is represented as a fixed-order labeled dependency tree or subtree with UPOS tags as nodes (e.g., ADJ <amod NOUN).
    Structures were extracted from three versions of each corpus:
    (1) The full version
    (2) A version excluding punctuation (i.e., branches labeled as punct)
    (3) A version excluding disfluencies (i.e., branches labeled as punct, reparandum, or discourse)
    The extracted structures are provided in tabular TSV format. Each row contains:
    * The delexicalized tree/subtree (e.g., ADJ <amod NOUN)
    * Its absolute and relative frequency in the target corpus (e.g., spoken SST)
    * An example (e.g., samostojna <amod država)
    * Frequency in the corresponding reference corpus (e.g., written SSJ)
    * Keyness measures for modality-based comparison (e.g., LL, Odds Ratio, %DIFF)
    The STARK configuration file used in the extraction process is included.
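
    A minimal sketch for browsing one of the TSV inventories with pandas; the file name and the header labels "Tree", "Absolute frequency" and "LL" are guesses based on the column description above and should be checked against the actual files and the bundled STARK configuration.

        # Hedged sketch: inspect a STARK-style TSV inventory with pandas.
        import pandas as pd

        trees = pd.read_csv("sst_no_punct.tsv", sep="\t")  # hypothetical file name
        print(trees.columns.tolist())                      # check the actual headers first

        # Example query: subtrees most characteristic of spoken Slovenian by keyness.
        top = trees.sort_values("LL", ascending=False).head(10)
        print(top[["Tree", "Absolute frequency", "LL"]])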

    Monitor corpus of Slovene Trendi 2025-06

    The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 57 publishers. Trendi 2025-06 covers the period from January 2019 to June 2025, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320). The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to Universal Dependencies (https://universaldependencies.org/sl/) and named entity annotation (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). An important addition is the topic, or thematic category, automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics).
    The corpus is currently not available as a downloadable dataset due to copyright restrictions, but we hope to make at least some of it available in the near future. The corpus is accessible through the CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]). This version adds texts from June 2025, plus texts from two sources that were missing from May 2025.
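
    For orientation, the sketch below shows how the linked Hugging Face checkpoint could be used to assign one of the 13 topic categories to a text. It assumes the transformers library and that cjvt/sloberta-trendi-topics loads with the standard text-classification pipeline; the printed label and score are illustrative only.

        # Hedged sketch: topic classification with the linked SloBERTa checkpoint.
        from transformers import pipeline

        classifier = pipeline("text-classification", model="cjvt/sloberta-trendi-topics")
        print(classifier("Slovenska nogometna reprezentanca je sinoči premagala Dansko."))
        # Illustrative output: [{'label': 'Sports', 'score': 0.98}]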

    Parallel sense-annotated corpus ELEXIS-WSD 1.2

    ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.2 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene.
    The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain satisfactory semantic coverage, sentences with fewer than 5 words or fewer than 2 polysemous words were filtered out. Subsequently, in order to obtain datasets in the other nine target languages, for each selected sentence in English, the corresponding WikiMatrix translation into each of the other languages was retrieved. If no translation was available, the English sentence was translated manually. The resulting corpus comprises 2,024 sentences for each language.
    The sentences were tokenized, lemmatized, and tagged with UPOS tags using UDPipe v2.6 (https://lindat.mff.cuni.cz/services/udpipe/). Senses were annotated using LexTag (https://elexis.babelscape.com/): each content word (noun, verb, adjective, and adverb) was assigned a sense from among the available senses in the sense inventory selected for the language (see below) or BabelNet. Sense inventories were also updated with new senses during annotation. Dependency relations were added with UDPipe 2.15 in version 1.2.
    List of sense inventories:
    BG: Dictionary of Bulgarian
    DA: DanNet – The Danish WordNet
    EN: Open English WordNet
    ES: Spanish Wiktionary
    ET: The EKI Combined Dictionary of Estonian
    HU: The Explanatory Dictionary of the Hungarian Language
    IT: PSC + Italian WordNet
    NL: Open Dutch WordNet
    PT: Portuguese Academy Dictionary (DACL)
    SL: Digital Dictionary Database of Slovene
    The corpus is available in the CoNLL-U tab-separated format. In order, the columns contain the token ID, its form, its lemma, its UPOS tag, its XPOS tag (if available), its morphological features (FEATS), the head of the dependency relation (HEAD), and the type of dependency relation (DEPREL); the ninth column (DEPS) is empty; the final MISC column contains the following: the token's whitespace information (whether the token is followed by a whitespace or not, e.g. SpaceAfter=No), the ID of the sense assigned to the token, the index of the multiword expression (if the token is part of an annotated multiword expression), and the index and type of the named entity annotation (currently only available in elexis-wsd-sl).
    Each language has a separate sense inventory containing all the senses (and their definitions) used for annotation in the corpus. Not all the senses from the sense inventory are necessarily included in the corpus annotations: for instance, all occurrences of the English noun "bank" in the corpus might be annotated with the sense of "financial institution", but the sense inventory also contains the sense "edge of a river" as well as all other possible senses to disambiguate between. For more information, please refer to 00README.txt.
    Updates in version 1.2:
    - Several tokenization errors with multiword tokens were fixed in all subcorpora (e.g. the order of subtokens was incorrect in many cases; the issue has now been resolved).
    - XPOS, FEATS, HEAD, and DEPREL columns were added automatically with UDPipe (except for elexis-wsd-sl and elexis-wsd-et: for Slovene, all columns were manually validated; for Estonian, HEAD and DEPREL were manually validated; all other languages contain automatic tags in these columns – for more information on the models used and their performance, see 00README.txt).
    - The entry now includes lists of potential errors in automatically assigned XPOS and FEATS values. In previous versions, only UPOS tags were manually annotated, while the XPOS and FEATS columns were left empty; XPOS and FEATS have now been added automatically with UDPipe. The list of potential errors identifies the lines in the corpus where the XPOS and FEATS values are potentially incorrect because the manually validated UPOS tag differs from the automatically assigned UPOS tag, which indicates that the automatically assigned XPOS and FEATS values are probably incorrect as well. This is meant as a reference for future validation efforts.
    - For Slovene, named entity annotations were added based on the annotations in the SUK 1.1 Training Corpus of Slovene (http://hdl.handle.net/11356/1959).
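
    As an illustration of the CoNLL-U layout described above, the sketch below extracts sense-annotated tokens by parsing the MISC column. The key name used for the sense identifier ("Sense_ID") is hypothetical; consult 00README.txt for the actual key.

        # Hedged sketch: read sense annotations from the MISC column of a CoNLL-U file.
        # The MISC key for the sense identifier ("Sense_ID") is a hypothetical name.
        def read_sense_annotations(path, sense_key="Sense_ID"):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    line = line.rstrip("\n")
                    if not line or line.startswith("#"):
                        continue  # skip blank lines and sentence-level comments
                    cols = line.split("\t")
                    if len(cols) != 10 or "-" in cols[0]:
                        continue  # skip malformed lines and multiword-token ranges
                    misc = dict(kv.split("=", 1) for kv in cols[9].split("|") if "=" in kv)
                    if sense_key in misc:
                        # form, lemma, UPOS, sense ID
                        yield cols[1], cols[2], cols[3], misc[sense_key]

        for form, lemma, upos, sense in read_sense_annotations("elexis-wsd-sl.conllu"):
            print(form, lemma, upos, sense)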

    The CLASSLA-Stanza model for morphosyntactic annotation of spoken Slovenian 2.2

    This model for morphosyntactic annotation of spoken Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SST treebank of spoken Slovenian (https://github.com/UniversalDependencies/UD_Slovenian-SST) combined with the SUK training corpus (http://hdl.handle.net/11356/1959), using the CLARIN.SI-embed.sl word embeddings (http://hdl.handle.net/11356/1791) expanded with the MaCoCu-sl Slovene web corpus (http://hdl.handle.net/11356/1517). The model simultaneously produces UPOS, FEATS, and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~96.76.
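
    A minimal usage sketch, assuming the classla Python package and that, as in Stanza, the Pipeline accepts a pos_model_path override pointing at the downloaded model file; the model file name below is hypothetical.

        # Hedged sketch: tag spoken Slovenian with a locally downloaded POS model.
        import classla

        classla.download("sl")  # fetch the default Slovenian resources first
        nlp = classla.Pipeline(
            "sl",
            processors="tokenize,pos",
            pos_model_path="spoken_slovenian_tagger_2.2.pt",  # hypothetical file name
        )

        doc = nlp("ja no saj to je čist odvisno od tega kok časa maš")
        for word in doc.sentences[0].words:
            print(word.text, word.upos, word.xpos, word.feats)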

    Multilingual comparable corpora of parliamentary debates ParlaMint 5.0

    ParlaMint 5.0 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and extending to mid-2022. The individual corpora comprise between 9 and 126 million words, and the complete set contains over 1.2 billion words. The transcriptions are divided by days, with information on the term, session and meeting, and contain speeches marked by the speaker and their role (e.g. chair, regular speaker) as well as by their automatically assigned CAP (Comparative Agendas Project) top-level topic. The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc.
    The corpora have extensive metadata, most importantly on speakers (name, gender, MP and minister status, party affiliation) and on their political parties and parliamentary groups (name, coalition/opposition status, Wikipedia-sourced left-to-right political orientation, and CHES variables, https://www.chesdata.eu/). Some corpora have further metadata, e.g. the year of birth of the speakers, links to their Wikipedia articles, their membership in various committees, etc. The transcriptions are also marked with the subcorpora they belong to ("reference", until 2020-01-30; "covid", from 2020-01-31; and "war", from 2022-02-24). An overview of the statistics of the corpora is available on GitHub in the folder Build/Metadata, in particular for release 5.0 at https://github.com/clarin-eric/ParlaMint/tree/v5.0/Build/Metadata. The corpora are encoded according to the ParlaMint encoding guidelines (https://clarin-eric.github.io/ParlaMint/) and schemas (included in the distribution).
    This entry contains the ParlaMint TEI-encoded corpora and their derived plain-text versions along with TSV metadata of the speeches. Also included is the 5.0 release of the sample data and scripts available at the GitHub repository of the ParlaMint project at https://github.com/clarin-eric/ParlaMint. Note that there also exists a linguistically marked-up version of the 5.0 ParlaMint corpus (http://hdl.handle.net/11356/2005) as well as a version machine-translated to English (http://hdl.handle.net/11356/2006); both are linked with the CLARIN.SI concordancers for on-line analysis. Compared to the previous version 4.1, this version adds information on the topic of each speech for all corpora, changes the IDs of the categories in corpus-specific taxonomies to prevent ID clashes, and corrects some other minor errors.
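
    The subcorpus boundaries quoted above can be expressed as a simple date rule; the helper below is only an illustration of that split, not part of the ParlaMint tooling.

        # Hedged sketch: map a sitting date to the ParlaMint subcorpus label,
        # using the date boundaries given in the description.
        from datetime import date

        def subcorpus(sitting_date: date) -> str:
            if sitting_date >= date(2022, 2, 24):
                return "war"
            if sitting_date >= date(2020, 1, 31):
                return "covid"
            return "reference"

        print(subcorpus(date(2019, 6, 1)))   # reference
        print(subcorpus(date(2021, 3, 15)))  # covid
        print(subcorpus(date(2022, 5, 2)))   # war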

    Syntactic Tree Inventories from English GUM UD Corpus (v2.15)

    This dataset contains lists of delexicalized dependency trees and subtrees extracted from the English UD GUM corpus, version 2.15 (http://hdl.handle.net/11234/1-5787), using the STARK tool (https://github.com/clarinsi/STARK). These lists represent a basic inventory of syntactic structures in English, supporting data-driven investigations into syntactic patterns and their variation across modalities. The GUM corpus was divided into spoken and written subsets based on the original genre classifications. The spoken subset includes interviews, conversations, podcasts, vlogs, courtroom transcripts, and speeches, while the written subset includes news articles, academic texts, fiction, how-to guides, biographies, essays, letters, textbooks, and travel guides. Each structure is represented as a fixed-order labeled dependency tree or subtree with UPOS tags as nodes (e.g., ADJ <amod NOUN).
    For each of the two subcorpora (spoken and written), structures were extracted in three versions:
    (1) The full version
    (2) A version excluding punctuation (i.e., branches labeled as punct)
    (3) A version excluding disfluencies (i.e., branches labeled as punct, reparandum, or discourse)
    The extracted structures are provided in tabular TSV format. Each row contains:
    * The delexicalized tree/subtree (e.g., ADJ <amod NOUN)
    * Its absolute and relative frequency in the target corpus (e.g., GUM-spoken)
    * An example (e.g., nice <amod example)
    * Frequency in the corresponding reference corpus (e.g., GUM-written)
    * Keyness measures for modality-based comparison (e.g., LL, Odds Ratio, %DIFF)
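
    For readers unfamiliar with the keyness measures listed above, the sketch below shows one standard formulation of log-likelihood and %DIFF computed from target and reference frequencies and corpus sizes; STARK's exact implementation may differ in details such as smoothing, so treat this as illustrative.

        # Hedged sketch: common formulations of two keyness measures (LL, %DIFF).
        import math

        def log_likelihood(a, b, c, d):
            """a, b: structure frequency in target/reference; c, d: corpus sizes."""
            e1 = c * (a + b) / (c + d)  # expected frequency in the target corpus
            e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
            ll = 0.0
            if a > 0:
                ll += a * math.log(a / e1)
            if b > 0:
                ll += b * math.log(b / e2)
            return 2 * ll

        def pct_diff(a, b, c, d):
            """Percentage difference of normalised frequencies (target vs. reference)."""
            return 100 * ((a / c) - (b / d)) / (b / d)

        print(log_likelihood(120, 40, 100_000, 150_000))  # illustrative frequencies
        print(pct_diff(120, 40, 100_000, 150_000))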

    Slovene Conformer CTC BPE E2E Automated Speech Recognition model PROTOVERB-ASR-E2E 1.0

    This Conformer CTC BPE E2E Automated Speech Recognition model was trained following the NVIDIA NeMo Conformer-CTC fine-tuning recipe (for details, see the official NVIDIA NeMo ASR documentation, https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/intro.html, and the NVIDIA NeMo GitHub repository, https://github.com/NVIDIA/NeMo). It provides functionality for transcribing Slovene speech to text. The starting point was the Conformer CTC BPE E2E Automated Speech Recognition model RSDO-DS2-ASR-E2E 2.0, which was fine-tuned on the Protoverb closed dataset. The model was fine-tuned for 20 epochs, which improved performance by 9.8% relative WER on the Protoverb test dataset and by 3.3% relative WER on the Slobench dataset.
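
    A minimal transcription sketch, assuming the NVIDIA NeMo toolkit is installed and that the model is distributed as a .nemo checkpoint; the checkpoint and audio file names below are hypothetical.

        # Hedged sketch: restore the Conformer-CTC-BPE checkpoint and transcribe audio.
        import nemo.collections.asr as nemo_asr

        asr_model = nemo_asr.models.EncDecCTCModelBPE.restore_from("protoverb_asr_e2e_1.0.nemo")

        # Transcribe one or more WAV files of Slovene speech (NeMo ASR models
        # typically expect 16 kHz mono audio).
        transcripts = asr_model.transcribe(["posnetek.wav"])
        print(transcripts[0])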

    Collection of Slovenian riddles Uganke 1.0

    The Uganke corpus collects 2,790 Slovenian riddles from the folklore collection of the Institute of Slovenian Ethnology. The riddles come from 171 sources: fieldwork, newspapers, journals, manuscripts and printed riddle collections from the 19th and 20th centuries. The material is categorised into eight types, depending on the content, semantics, length and presumed context of the riddle: true riddle, narrative true riddle, joking question, wisdom question, joking wisdom question, logical riddle, neck riddle, and sexual riddle. Each riddle is split into a question part and an answer part, and each is given both in a diplomatic transcription, mirroring the riddle in the source document, and in a critical transcription, which is brought closer to contemporary Slovenian standard orthography. The critical transcriptions have been automatically annotated with lemmas, MULTEXT-East morphosyntactic descriptions (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html) and Universal Dependencies (https://universaldependencies.org/) using the CLASSLA toolchain (https://github.com/clarinsi/classla). The canonical encoding of the corpus is TEI, but the corpus is also distributed in two derived encodings: one comprises the riddles and the bibliography as two TSV files, and the other is a vertical file with the linguistically annotated riddles, as used by CQP-type concordancers such as Sketch Engine.
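
    A minimal sketch for inspecting the derived riddles TSV with the Python standard library; the file name and the column name "type" used below are hypothetical, so check the actual header row of the distributed file first.

        # Hedged sketch: count riddles per category from the derived TSV file.
        import csv
        from collections import Counter

        with open("uganke.tsv", encoding="utf-8", newline="") as f:  # hypothetical file name
            rows = list(csv.DictReader(f, delimiter="\t"))

        print(list(rows[0].keys()))              # check the real column names first
        print(Counter(r["type"] for r in rows))  # riddles per type ("type" is hypothetical)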
