7 research outputs found

    Korpus šolskih besedil slovenskega jezika: zasnova in gradnja (Corpus of Slovenian School Texts: Design and Construction)

    This article presents the Corpus of Slovenian School Texts, a specialized corpus of written Slovenian containing around 1.8 million tokens. It was designed within the scope of the project Franček, Language Advising Service for Teachers of Slovenian and the Slovenian School Dictionary, and it was intended to provide language material for the compilation of Šolski slovar slovenskega jezika (Slovenian School Dictionary), the first research-based school dictionary of Slovenian. The article discusses the text-type composition and size of the corpus, sheds light on the technical procedures of text preprocessing and corpus annotation, and presents the set of corpus metadata. It also explains in which formats and under what licenses the Corpus of Slovenian School Texts has been made available, and draws attention to the legal aspects of obtaining texts.

    Corpus of scientific texts of contemporary Slovenian KZB 1.0

    The Corpus of scientific texts of contemporary Slovenian consists of 25 million words from scientific monographs and scientific papers written mainly between 2000 and 2023. It was designed as one of the resources of the project eSSKJ and corpus - towards state-of-the-art language data. The corpus is linguistically annotated with the CLASSLA pipeline (https://github.com/clarinsi/classla/) at the levels of tokenization, sentence segmentation, lemmatization, MULTEXT-East v6 MSD-tags (https://nl.ijs.si/ME/V6/msd/html/msd-sl.html), JOS dependency syntax (https://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf), and named entities (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf). The corpus is available in the CoNLL-U format, as well as vertical files for use with Sketch Engine type concordancers
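For readers who want to work with the CoNLL-U release, the following is a minimal sketch of a reader that relies only on the ten tab-separated columns defined by the CoNLL-U format. The sample sentence is made up for illustration and is not taken from KZB 1.0; the function names `read_conllu` and `lemma_frequencies` are hypothetical helpers, not part of CLASSLA or of the corpus distribution.

```python
from collections import Counter
from typing import Dict, Iterator, List

# The ten CoNLL-U columns, in the order fixed by the format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts keyed by column name."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence-level metadata; skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}")
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence


def lemma_frequencies(text: str) -> Counter:
    """Count lemmas over all tokens except punctuation."""
    counts: Counter = Counter()
    for sent in read_conllu(text):
        counts.update(tok["lemma"] for tok in sent if tok["upos"] != "PUNCT")
    return counts


# Made-up sample; unused columns are left as "_" rather than guessed.
SAMPLE = "\n".join([
    "# text = To je korpus.",
    "\t".join(["1", "To", "ta", "PRON", "_", "_", "_", "_", "_", "_"]),
    "\t".join(["2", "je", "biti", "AUX", "_", "_", "_", "_", "_", "_"]),
    "\t".join(["3", "korpus", "korpus", "NOUN", "_", "_", "_", "_", "_", "_"]),
    "\t".join(["4", ".", ".", "PUNCT", "_", "_", "_", "_", "_", "_"]),
]) + "\n\n"

sentences = list(read_conllu(SAMPLE))
frequencies = lemma_frequencies(SAMPLE)
```

For a corpus of this size, the text would normally be streamed from disk line by line rather than held in one string; the in-memory version is used here only to keep the sketch short.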

    Corpus of Slovenian school texts

    The Corpus of Slovenian school texts is a lemmatized and POS-tagged specialized corpus that includes 428 short school texts written between 2017 and 2020, primarily by primary-school students in grades 1 to 5. The corpus consists of approximately 95,000 tokens and was designed as one of the resources for the compilation of The School Dictionary of the Slovenian Language, which is being created as part of the project Franček Web Portal, Language Counselling for Slovene Teachers and School Dictionary of the Slovene Language. The corpus was lemmatized and POS-tagged with the Obeliks tagger (http://oznacevalnik.slovenscina.eu/Vsebine/Sl/ProgramskaOprema/Navodila.aspx) using JOS morphosyntactic descriptions. The corpus is encoded in XML and complies with the TEI specifications as given in the CLARIN.SI customisation (https://github.com/clarinsi/TEI-schema).

    Spoken corpus Gos 2.0 (transcriptions)

    The spoken corpus Gos 2.0 is the reference speech corpus of the Slovenian language. This second edition contains about 300 hours of speech, or 2.4 million words, 127 thousand utterances and 1,500 texts. Gos 2.0 is composed of three different sources:
    (1) Spoken corpus Gos 1.1 (http://hdl.handle.net/11356/1438), 112 hours, 1 million words
    (2) Spoken corpus Gos VideoLectures 4.2 (http://hdl.handle.net/11356/1444), 22 hours, 179,000 words
    (3) A selection from the ASR database ARTUR 1.0 (http://hdl.handle.net/11356/1772), 185 hours, 1.2 million words, including:
    (3a) Artur-J-Splosni, 62 hours, 422,000 words: transcriptions of media recordings, online recordings of conferences, workshops, education videos, etc.
    (3b) Artur-N-Prosti, 61 hours, 324,000 words: transcriptions of monologues and dialogues between two persons, recorded for the purposes of the Artur database. Speakers were asked to converse freely or to explain casual topics freely.
    (3c) Artur-P-SejeDZ, 62 hours, 450,000 words: a selection of transcriptions of speech from the Slovene National Assembly. The maximum length of single-speaker speech is 4,000 words.
    Note that various encoding changes have been made to the original Gos and Gos VideoLectures corpora so that the encoding of Gos 2.0 is uniform across the three sources. All transcriptions are manual and made in two modes:
    - pronunciation-based or citation-phonemic transcriptions (containing the output phoneme string derived from the orthographic form by letter-to-sound rules)
    - standardised or expanded orthographic transcriptions (the standard Slovene spelling is used to indicate the spoken words, but there are additional rules and word-lists for non-standard lexis).
    Part-of-speech tagging with MULTEXT-East morphosyntactic descriptions and lemmatisation were performed automatically with CLASSLA (https://github.com/clarinsi/classla). The corpus is distributed in TEI (XML) format and in vertical file format, the latter used by the CQP family of concordancers, such as (no)Sketch Engine.
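The composition figures quoted in this record can be cross-checked with a short tally. The sketch below uses only the hours and word counts stated in the abstract; the variable and function names are illustrative. It suggests that the three ARTUR subsets add up exactly to the 185 hours given for the ARTUR selection, and that the headline totals ("about 300 hours", "2.4 million words") are rounded values of the per-source figures.

```python
from typing import Dict, Tuple

# (hours, words) per source, copied from the record description.
GOS_SOURCES: Dict[str, Tuple[int, int]] = {
    "Gos 1.1": (112, 1_000_000),
    "Gos VideoLectures 4.2": (22, 179_000),
    "ARTUR 1.0 selection": (185, 1_200_000),
}

# (hours, words) for the three parts of the ARTUR 1.0 selection.
ARTUR_PARTS: Dict[str, Tuple[int, int]] = {
    "Artur-J-Splosni": (62, 422_000),
    "Artur-N-Prosti": (61, 324_000),
    "Artur-P-SejeDZ": (62, 450_000),
}


def totals(parts: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Return (total hours, total words) over all listed parts."""
    hours = sum(h for h, _ in parts.values())
    words = sum(w for _, w in parts.values())
    return hours, words


gos_hours, gos_words = totals(GOS_SOURCES)        # 319 hours, 2,379,000 words
artur_hours, artur_words = totals(ARTUR_PARTS)    # 185 hours, 1,196,000 words
```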

    ASR database ARTUR 1.0 (audio)

    Artur 1.0 is a speech database designed for the needs of automatic speech recognition for the Slovenian language. The database includes 1,067 hours of speech: 884 hours are transcribed, while the remaining 183 hours are recordings only. This repository entry includes audio files only; the transcriptions are available at http://hdl.handle.net/11356/1772. The data are structured as follows:
    (1) Artur-B, read speech, 573 hours in total. It includes:
    (1a) Artur-B-Brani, 485 hours: Readings of sentences which were pre-selected from a 10% increment in the Gigafida 2.0 corpus. The sentences were chosen so that they reflect the natural (actual) distribution of triphones in the words. They were distributed among 1,000 speakers, so that approx. 30 minutes of read speech were recorded from each speaker. The speakers were balanced according to gender, age and region, and a small proportion of them were non-native speakers of Slovene. Each sentence is its own audio file and has a corresponding transcription file.
    (1b) Artur-B-Crkovani, 10 hours: Spellings. Speakers were asked to spell abbreviations, personal names and surnames, chosen so that all Slovene letters were covered, plus the most common foreign letters.
    (1c) Artur-B-Studio, 51 hours: Designed for the development of speech synthesis. The sentences were read in a studio by a single speaker. Each sentence is its own audio file and has a corresponding transcription file.
    (1d) Artur-B-Izloceno, 27 hours: The recordings include different types of errors, typically incorrect reading of sentences or a noisy environment.
    (2) Artur-J, public speech, 62 hours in total. It includes:
    (2a) Artur-J-Splosni, 62 hours: media recordings, online recordings of conferences, workshops, education videos, etc.
    (3) Artur-N, private speech, 74 hours in total. It includes:
    (3a) Artur-N-Obrazi, 6 hours: Speakers were asked to describe faces in pictures. Designed for domain-specific speech recognition of face descriptions.
    (3b) Artur-N-PDom, 7 hours: Speakers were asked to read pre-written sentences, as well as to express instructions for a potential smart-home system freely. Designed for domain-specific speech recognition in the smart-home domain.
    (3c) Artur-N-Prosti, 61 hours: Monologues and dialogues between two persons, recorded for the purposes of the Artur database creation. Speakers were asked to converse or to explain casual topics freely.
    (4) Artur-P, parliamentary speech, 201 hours in total. It includes:
    (4a) Artur-P-SejeDZ, 201 hours: Speech from the Slovene National Assembly.
    Further information on the database is available in the Artur-DOC file, which is part of this repository entry.

    ASR database ARTUR 0.1 (audio)

    ARTUR is a speech database designed for the needs of automatic speech recognition for the Slovenian language. The database includes 1,035 hours of speech, of which 840 hours are transcribed, while the remaining 195 hours are without transcription. The data are divided into 4 parts: (1) approx. 520 hours of read speech, which includes the reading of pre-defined sentences, selected from the Gigafida 2.0 corpus (http://hdl.handle.net/11356/1320); each sentence is contained in one file; speakers are demographically balanced; spelling is included in special files; all with manual transcriptions; (2) approx. 204 hours of public speech, which includes media recordings, online recordings of conferences, workshops, education videos, etc.; 56 hours are manually transcribed; (3) approx. 110 hours of private speech, which includes monologues and dialogues between two persons, recorded for the purposes of the speech database; the speakers are demographically balanced; two subsets for domain-specific ASR (i.e., smart-home and face-description) are included; 63 hours are manually transcribed; (4) approx. 201 hours of parliamentary speech, which includes recordings from the Slovene National Assembly, all with manual transcriptions. Audio files are WAV, 44.1 kHz, PCM, 16-bit, mono. This entry includes the recordings only; transcriptions are available at http://hdl.handle.net/11356/1718.
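The stated audio format (44.1 kHz, 16-bit PCM, mono) is enough to estimate the raw size of the recordings and to check whether a given file matches it. The following is a rough sketch using Python's standard `wave` module; the helper names are hypothetical, the estimate ignores WAV headers and any archive compression, and the silent test clip is generated in memory rather than taken from the database.

```python
import io
import wave

SAMPLE_RATE_HZ = 44_100   # 44.1 kHz, as stated in the record
SAMPLE_WIDTH_BYTES = 2    # 16-bit samples
CHANNELS = 1              # mono


def pcm_bytes(hours: float) -> int:
    """Raw PCM payload size in bytes for the given duration (headers excluded)."""
    return int(hours * 3600 * SAMPLE_RATE_HZ * SAMPLE_WIDTH_BYTES * CHANNELS)


def matches_stated_format(wav_file) -> bool:
    """Check a WAV file (path or file object) against the stated format."""
    with wave.open(wav_file, "rb") as w:
        return (
            w.getframerate() == SAMPLE_RATE_HZ
            and w.getsampwidth() == SAMPLE_WIDTH_BYTES
            and w.getnchannels() == CHANNELS
            and w.getcomptype() == "NONE"   # "NONE" means uncompressed PCM
        )


def make_silent_wav(seconds: float) -> io.BytesIO:
    """Build a silent in-memory WAV clip in the stated format (for testing only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(CHANNELS)
        w.setsampwidth(SAMPLE_WIDTH_BYTES)
        w.setframerate(SAMPLE_RATE_HZ)
        n_frames = int(seconds * SAMPLE_RATE_HZ)
        w.writeframes(b"\x00" * (n_frames * SAMPLE_WIDTH_BYTES * CHANNELS))
    buf.seek(0)
    return buf


bytes_per_hour = pcm_bytes(1)        # 317,520,000 bytes per hour of audio
total_bytes = pcm_bytes(1035)        # payload for the full 1,035 hours
clip_ok = matches_stated_format(make_silent_wav(0.1))
```

Under these assumptions, the full 1,035 hours come to roughly 329 GB of uncompressed audio, which is worth knowing before downloading the entry.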