Common Language Resources and Technology Infrastructure - Slovenia

Not a member yet

699 research outputs found

Sort by

Corpus of scientific texts from the Open Science Slovenia portal OSS 1.0

Author: Žagar Kristjan
Ferme Marko
Ojsteršek Milan
Jemec Tomazin Mateja
Erjavec Tomaž
Publication venue: Faculty of Electrical Engineering and Computer Science, University of Maribor
Publication date: 26/02/2023
Field of study

OSS is a large collection of scientific writing in the Slovenian language gathered from the Open Science Slovenia portal (https://openscience.si). It consists of over 150 thousand monographs, articles, diploma, master's and doctoral theses, advanced textbooks, reviews etc. mostly published between 2000 and 2022 by Slovenian universities, research institutions, etc. Texts are accompanied by metadata, i.e. author, supervisor (for theses), year of publication, publisher (mostly faculties of the various universities), type of publication (according to SICRIS classification), keywords, and CERIF and UDC codes. The texts were obtained directly from PDFs, so it should be noted that they can contain various types of character noise. The texts are linguistically annotated with the CLASSLA pipeline (https://github.com/clarinsi/classla) on the levels lemmatisation, MULTEXT-East Version 6 morphosyntactic descriptions, Universal Dependencies part-of-spech and morphological features, and named entities. The corpus is distributed in CoNLL-U and vertical file formats, one file for each text. The text metadata is given as a TSV file. Note that there exist similar, but older and smaller corpora KAS 2.0 (http://hdl.handle.net/11356/1448) and KAS 1.0 (http://hdl.handle.net/11356/1244). These contain only theses and only up to 2018, but are cleaner and with more metadata. The repository also archives a number of KAS-derived datasets; pls. search for "KAS" to find them

The CLASSLA-Stanza model for morphosyntactic annotation of standard Serbian 2.1

Author: Terčon Luka
Ljubešić Nikola
Publication venue: Jožef Stefan Institute
Publication date: 10/05/2023
Field of study

The model for morphosyntactic annotation of standard Serbian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SETimes.SR training corpus (http://hdl.handle.net/11356/1200) combined with the Croatian hr500k training dataset (http://hdl.handle.net/11356/1792) to ensure sufficient representation of certain labels. The CLARIN.SI-embed.sr word embeddings (http://hdl.handle.net/11356/1789) were used during training. The model produces simultaneously UPOS, FEATS and XPOS (MULTEXT-East) labels. The estimated F1 of the XPOS annotations is ~96.19. The difference to the previous version of the model is that this version was trained on the SETimes.SR corpus expanded with the Croatian hr500k training dataset to ensure sufficient representation of certain labels. it was also trained using the new version of Serbian word embeddings

Montenegrin-English parallel corpus MaCoCu-cnr-en 1.0

Author: Bañón Marta
Chichirau Malina
Esplà-Gomis Miquel
Forcada Mikel L.
Galiano-Jiménez Aarón
García-Romero Cristian
Kuzman Taja
Ljubešić Nikola
van Noord Rik
Pla Sempere Leopoldo
Ramírez-Sánchez Gema
Rupnik Peter
Suchomel Vít
Toral Antonio
Zaragoza-Bernabeu Jaume
Publication venue: Universitat d'Alacant
Publication date: 26/04/2023
Field of study

The Montenegrin-English parallel corpus MaCoCu-cnr-en 1.0 was built by crawling the “.me” internet top-level domain in 2021 and 2022, extending the crawl dynamically to other domains as well. All the crawling process was carried out by the MaCoCu crawler (https://github.com/macocu/MaCoCu-crawler). Websites containing documents in both target languages were identified and processed using the tool Bitextor (https://github.com/bitextor/bitextor). Considerable effort was devoted into cleaning the extracted text to provide a high-quality parallel corpus. This was achieved by removing boilerplate and near-duplicated paragraphs and documents that are not in one of the targeted languages. Document and segment alignment as implemented in Bitextor were carried out, and Bifixer (https://github.com/bitextor/bifixer) and BicleanerAI (https://github.com/bitextor/bicleaner-ai) were used for fixing, cleaning, and deduplicating the final version of the corpus. The corpus is available in three formats: two sentence-level formats, TXT and TMX, and a document-level TXT format. In each format, the texts are separated based on the script into two files: a Latin and a Cyrillic subcorpus. TMX is an XML-based format and TXT is a tab-separated format. They both consist of pairs of source and target segments (one or several sentences) and additional metadata. The following metadata is included in both sentence-level formats: - source and target document URL; - paragraph ID which includes information on the position of the sentence in the paragraph and in the document (e.g., “p35:77s1/3” which means “paragraph 35 out of 77, sentence 1 out of 3”); - quality score as provided by the tool Bicleaner AI (a likelihood of a pair of sentences being mutual translations, provided with a score between 0 and 1); - similarity score as provided by the sentence alignment tool Bleualign (value between 0 and 1); - personal information identification (“biroamer-entities-detected”): segments containing personal information are flagged, so final users of the corpus can decide whether to use these segments; - translation direction and machine translation identification (“translation-direction”): the source segment in each segment pair was identified by using a probabilistic model (https://github.com/RikVN/TranslationDirection), which also determines if the translation has been produced by a machine-translation system; - a DSI class (“dsi”): information whether the segment is connected to any of Digital Service Infrastructure (DSI) classes (e.g., cybersecurity, e-health, e-justice, open-data-portal), defined by the Connecting Europe Facility (https://github.com/RikVN/DSI); - English language variant: the language variant of English (British or American, using a lexicon-based English variety classifier - https://pypi.org/project/abclf/) was identified on document and domain level. Furthermore, the sentence-level TXT format provides additional metadata: - web domain of the text; - source and target document title; - the date when the original file was retrieved; - the original type of the file (e.g., “html”), from which the sentence was extracted; - paragraph quality (labels, such as “short” or “good”, assigned based on paragraph length, URL and stopword density via the jusText tool - https://corpus.tools/wiki/Justext); - information whether the sentence is a heading or not in the original document. The document-level TXT format provides pairs of documents identified to contain parallel data. In addition to the parallel documents (in base64 format), the corpus includes the following metadata: source and target document URL, a DSI category and the English language variant (British or American). Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted. (2) Clearly identify the copyrighted work claimed to be infringed. (3) Clearly identify the material that is claimed to be infringing and information reasonably sufficient in order to allow us to locate the material. (4) Please write to the contact person for this resource whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus. This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains

The CLASSLA-Stanza model for lemmatisation of standard Bulgarian 2.1

Author: Terčon Luka
Ljubešić Nikola
Osenova Petya
Simov Kiril
Publication venue: IICT-BAS
Publication date: 27/06/2023
Field of study

The model for lemmatisation of standard Bulgarian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the BulTreeBank training corpus (https://clarino.uib.no/korpuskel/corpora) and using the Bulgarian inflectional lexicon (Popov, Simov, and Vidinska 1998). The estimated F1 of the lemma annotations is ~98.93. The difference to the previous version of the lemmatizer is that this version was trained using the new version of the Bulgarian word embeddings

Slovene learner corpus KOST 2.0

Author: Stritar Kučuk Mojca
Šter Helena
Pisek Staša
Petric Lasnik Ivana
Kete Matičič Jana
Pirih Svetina Nataša
Preglau Daniela
Arhar Holdt Špela
Krsnik Luka
Erjavec Tomaž
Pegan Jasmina
Huber Damjan
Publication venue: Centre for Language Resources and Technologies, University of Ljubljana
Publication date: 31/10/2023
Field of study

The corpus of Slovene as a foreign language KOST (Korpus slovenščine kot tujega jezika) contains 8,347 texts (almost 1.3 million words) written by adult speakers for whom Slovene is not their first language. This corpus offers insights into Slovene language as produced by those who are still learning it as a second or foreign language, and in particular into the most common errors that occur in this process. KOST therefore aims at all those working with Slovene as a second or foreign language. The texts were mainly written at lectorates and Slovene as a L2/FL courses. Most of the authors of these texts speak Serbian, Bosnian and Macedonian as their first language, but texts by speakers of other languages are also included. The authors are at different proficiency levels in Slovene, from beginners to advanced. For each contributor, information is available on gender, year of birth, country, first language and other languages they speak, employment status and education, and prior experience of learning Slovene. For each text, there is also information on the time and circumstances of creation (exam or homework), the programme in which it was produced, input type (digital or hand-written), language level and the grade. A part of the corpus has also texts available in their corrected version. The tokens of the original and corrected texts are linked (one group of link per paragraph) and the links categorised into 23 error types. The corpus is availabe in two formats: (1) TEI encoding of the complete corpus (texts, links), including contributor and text metadata in the TEI header, and (2) the corpus in the original and corrected variants as vertical and registry files, suitable for mounting on CQP-type concordancers. Note that the vertical format does not retain the connection between the original and corrected tokens

Database of the Western South Slavic Verb HyperVerb -- Derivation

Author: Milosavljević Stefan
Mišmaš Petra
Simonović Marko
Arsenijević Boban
Gomboc Čeh Katarina
Marušič Franc Lanko
Simić Jelena
Žaucer Rok
Publication venue: University of Graz
Publication date: 04/07/2023
Field of study

The verbal Western South Slavic database (WeSoSlaV) contains 3000 most frequent Slovenian and 5300 most frequent BCS verbs which are all coded for a number of properties related to verb derivation. The database is a table where each verb is given a row of its own. The coded properties are organized in columns. Verbs in the database are coded for the following properties: root information, whether or not the verb has prefixes and the identity of the included prefix(es), whether or not the verb has suffixes and the identity of the included suffix(es) etc. All coded properties are explained in the accompanying pdf file

Catalan web corpus MaCoCu-ca 1.0

Author: Bañón Marta
Chichirau Malina
Esplà-Gomis Miquel
Forcada Mikel L.
Galiano-Jiménez Aarón
García-Romero Cristian
Kuzman Taja
Ljubešić Nikola
van Noord Rik
Pla Sempere Leopoldo
Ramírez-Sánchez Gema
Rupnik Peter
Suchomel Vít
Toral Antonio
Zaragoza-Bernabeu Jaume
Publication venue: Universitat d'Alacant
Publication date: 24/05/2023
Field of study

The Catalan web corpus MaCoCu-ca 1.0 was built by crawling the ".cat", ".es", ".ad", ".fr", ".it" and ".eu" internet top-level domains in 2022, extending the crawl dynamically to other domains as well. The crawler is available at https://github.com/macocu/MaCoCu-crawler. Considerable effort was devoted into cleaning the extracted text to provide a high-quality web corpus. This was achieved by removing boilerplate (https://corpus.tools/wiki/Justext) and near-duplicated paragraphs (https://corpus.tools/wiki/Onion), discarding very short texts as well as texts that are not in the target language. The dataset is characterized by extensive metadata which allows filtering the dataset based on text quality and other criteria (https://github.com/bitextor/monotextor), making the corpus highly useful for corpus linguistics studies, as well as for training language models and other language technologies. In XML format, each document is accompanied by the following metadata: title, crawl date, url, domain, file type of the original document, distribution of languages inside the document, and a fluency score based on a language model. The text of each document is divided into paragraphs that are accompanied by metadata on the information whether a paragraph is a heading or not, metadata on the paragraph quality (labels, such as “short” or “good”, assigned based on paragraph length, URL and stopword density via the jusText tool - https://corpus.tools/wiki/Justext) and fluency (score between 0 and 1, assigned with the Monocleaner tool - https://github.com/bitextor/monocleaner), the automatically identified language of the text in the paragraph, and information whether the paragraph contains sensitive information (identified via the Biroamer tool - https://github.com/bitextor/biroamer). The corpus can be easily read with the prevert parser (https://pypi.org/project/prevert/). Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted. (2) Clearly identify the copyrighted work claimed to be infringed. (3) Clearly identify the material that is claimed to be infringing and information reasonably sufficient in order to allow us to locate the material. (4) Please write to the contact person for this resource whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus. This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains

The CLASSLA-Stanza model for lemmatisation of non-standard Croatian 2.1

Author: Terčon Luka
Ljubešić Nikola
Štefanec Vanja
Publication venue: Jožef Stefan Institute
Publication date: 10/05/2023
Field of study

The model for lemmatisation of non-standard Croatian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the hr500k training corpus (http://hdl.handle.net/11356/1792) and the ReLDI-NormTagNER-hr corpus (http://hdl.handle.net/11356/1793), using the hrLex inflectional lexicon (http://hdl.handle.net/11356/1232). These corpora were additionally augmented for handling missing diacritics by repeating parts of the corpora with diacritics removed. The estimated F1 of the lemma annotations is ~94.23. The difference to the previous version of the model is that this version is trained on a combination of two corpora (hr500k, ReLDI-NormTagNER-hr)

The CLASSLA-Stanza model for UD dependency parsing of standard Bulgarian 2.1

Author: Terčon Luka
Ljubešić Nikola
Osenova Petya
Simov Kiril
Publication venue: IICT-BAS
Publication date: 27/06/2023
Field of study

The model for UD dependency parsing of standard Bulgarian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the UD-parsed portion of the BulTreeBank training corpus (https://clarino.uib.no/korpuskel/corpora) and using the CLARIN.SI-embed.bg word embeddings (http://hdl.handle.net/11356/1796). The estimated LAS of the parser is ~91.18. The difference to the previous version of the parser is that this version was trained using the new version of the Bulgarian word embeddings

Icelandic web corpus MaCoCu-is 2.0

Author: Bañón Marta
Chichirau Malina
Esplà-Gomis Miquel
Forcada Mikel L.
Galiano-Jiménez Aarón
García-Romero Cristian
Kuzman Taja
Ljubešić Nikola
van Noord Rik
Pla Sempere Leopoldo
Ramírez-Sánchez Gema
Rupnik Peter
Suchomel Vít
Toral Antonio
Zaragoza-Bernabeu Jaume
Publication venue: Universitat d'Alacant
Publication date: 19/05/2023
Field of study

The Icelandic web corpus MaCoCu-is 2.0 was built by crawling the ".is" internet top-level domain in 2021 and 2023, extending the crawl dynamically to other domains as well. The crawler is available at https://github.com/macocu/MaCoCu-crawler. Considerable effort was devoted into cleaning the extracted text to provide a high-quality web corpus. This was achieved by removing boilerplate (https://corpus.tools/wiki/Justext) and near-duplicated paragraphs (https://corpus.tools/wiki/Onion), discarding very short texts as well as texts that are not in the target language. The dataset is characterized by extensive metadata which allows filtering the dataset based on text quality and other criteria (https://github.com/bitextor/monotextor), making the corpus highly useful for corpus linguistics studies, as well as for training language models and other language technologies. In XML format, each document is accompanied by the following metadata: title, crawl date, url, domain, file type of the original document, distribution of languages inside the document, and a fluency score based on a language model. The text of each document is divided into paragraphs that are accompanied by metadata on the information whether a paragraph is a heading or not, metadata on the paragraph quality (labels, such as “short” or “good”, assigned based on paragraph length, URL and stopword density via the jusText tool - https://corpus.tools/wiki/Justext) and fluency (score between 0 and 1, assigned with the Monocleaner tool - https://github.com/bitextor/monocleaner), the automatically identified language of the text in the paragraph, and information whether the paragraph contains sensitive information (identified via the Biroamer tool - https://github.com/bitextor/biroamer). As opposed to the previous version, this version has more accurate metadata on languages of the texts, which was achieved by using Google's Compact Language Detector 2 (CLD2) (https://github.com/CLD2Owners/cld2), a high-performance language detector supporting many languages. Other tools, used for web corpora creation and curation, have been updated as well, resulting in an even cleaner, as well as larger corpus. The corpus can be easily read with the prevert parser (https://pypi.org/project/prevert/). Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please: (1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted. (2) Clearly identify the copyrighted work claimed to be infringed. (3) Clearly identify the material that is claimed to be infringing and information reasonably sufficient in order to allow us to locate the material. (4) Please write to the contact person for this resource whose email is available in the full item record. We will comply with legitimate requests by removing the affected sources from the next release of the corpus. This action has received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341. This communication reflects only the author’s view. The Agency is not responsible for any use that may be made of the information it contains

3

full texts

699

metadata records

Updated in last 30 days.

Common Language Resources and Technology Infrastructure - Slovenia is based in Slovenia

Access Repository Dashboard

Do you manage Open Research Online? Become a CORE Member to access insider analytics, issue reports and manage access to outputs from your repository in the CORE Repository Dashboard! 👇