22 research outputs found

    Slovenščina Janes: pogovorna, nestandardna, spletna ali spretna? [Janes Slovene: colloquial, non-standard, web-based, or skilful?]

    As part of the conference Slovenščina na spletu in v novih medijih (Slovene on the Web and in New Media), a round table entitled Slovenščina Janes: pogovorna, nestandardna, spletna ali spretna? was held on 27 November 2015 in the hall of the Geographical Museum GIAM ZRC SAZU. Five experts in Slovene linguistics were invited to the discussion: Dr Helena Dobrovoljc (Fran Ramovš Institute of the Slovenian Language ZRC SAZU and the School of Humanities, UNG), Dr Polona Gantar (Faculty of Arts, UL), Dr Simon Krek (Jožef Stefan Institute, Faculty of Arts, UL, and Faculty of Social Sciences, UL), Dr Damjan Popič (Faculty of Arts, UL) and Dr Marko Stabej (Faculty of Arts, UL). I, Dr Špela Arhar Holdt (Trojina, Institute for Applied Slovene Studies, and Faculty of Arts, UL), moderated the discussion. The round table was prompted by terminological difficulties encountered when trying to name the language in the Janes corpus,[1] but these quickly revealed a broad range of complex underlying causes. The question of how to define "Janes Slovene" thus arises from changes in the way people communicate, changes under which the definitions and concepts of the existing Slovene (and not only Slovene) theory of language varieties are losing their practical value. After the emergence of the web and the development of various genres of computer-mediated communication, is it still possible to speak of public and private, formal and informal, standard and colloquial? Moreover, have these categories ever actually worked in practice, in school or outside it? The debate touched on how linguistics should respond to changes in language use: do we primarily need a new theory of language varieties, or is a change also needed in our attitude to language users, to the methodology of Slovene studies, to the products and services that the language community expects from us, and to language itself? And what role does the Janes material play in contemporary linguistic research and projects, where do its main possibilities lie, and what are its limitations?
At the start of the debate, each panellist had a few minutes to present an initial position; replies followed, and finally questions and comments from the audience. The record of the statements was prepared from an audio recording. For easier reading, the statements were syntactically adapted to the characteristics of written language, and the authors then supplied some additional clarifications of their contributions. The record begins with the first panellist's presentation. [1] This is a corpus of computer-mediated communication comprising the texts of tweets, blogs, user comments and forums. The corpus is presented in (Erjavec et al. 2015); the project page is: http://nl.ijs.si/janes/

    Nova slovnica: kje smo in kam gremo [New grammar: where we are and where we are going]

    On 6 June 2018, an event was held at the Jožef Stefan Institute at which the goals and first results of the project Nova slovnica sodobne standardne slovenščine: viri in metode (New Grammar of Contemporary Standard Slovene: Sources and Methods; ARRS J6-8256) were presented to the public. The aim of the project is to develop a linguistic methodology for computer-assisted analysis of contemporary Slovene as captured in the reference text corpora of the Slovene language. The new methodology will be used to prepare linguistic databases that, once the project ends, will be openly available to the community for research, for building language reference works and teaching materials, for developing language technology tools, and so on. The project funding does not cover the production of a new grammar, but even preparing the databases requires reflection on the current priorities of the Slovene context. A contemporary grammatical description is undoubtedly among the goals for the future, but there is not yet a consensus on how it should be designed to meet the (varied) needs of contemporary society. To open the discussion, we organised an expert consultation at the project event, framed by the following questions: who are the stakeholders who could use the project results; what must we pay attention to during preparation so that the data are optimally usable; what kind of grammar, or which grammar, do we need first; what are the methodological and logistical premises of its preparation; where does Slovene grammar writing currently stand and what development can we expect; what are the needs for grammatical data among different user groups; and what would best address the current gaps.

    D3.8 Lexical-semantic analytics for NLP

    UIDB/03213/2020 UIDP/03213/2020. The present document illustrates the work carried out in task 3.3 (work package 3) of the ELEXIS project, focused on lexical-semantic analytics for Natural Language Processing (NLP). This task aims at computing analytics for lexical-semantic information such as words, senses and domains in the available resources, investigating their role in NLP applications. Specifically, this task concentrates on three research directions, namely i) sense clustering, in which grouping senses based on their semantic similarity improves the performance of NLP tasks such as Word Sense Disambiguation (WSD), ii) domain labeling of text, in which the lexicographic resources made available by the ELEXIS project for research purposes allow better performance to be achieved, and finally iii) analysing the diachronic distribution of senses, for which a software package is made available.

    Sports unit of the Slovenian Armed Forces: sport climbing

    The roots of the Slovenian Armed Forces, like those of the sports unit within it, reach back before the emergence of independent Slovenia. Sport is a major component of armed forces around the world, not only in terms of soldiers' physical readiness but also through the links between top athletes and the Slovenian Armed Forces. Because the Republic of Slovenia is a relatively hilly country, climbing is an especially important sports discipline for us. In addition, sport climbing has grown markedly in recent years and has been included in the upcoming Olympic Games in 2020. The Sports Unit of the Slovenian Armed Forces, which belongs to the support unit, also has the tasks of employing athletes and coaches, promoting the Slovenian Armed Forces, involving athletes in the training process of the Slovenian Armed Forces, assisting in the testing of physical abilities, and preparing and organising the meeting of athletes employed in the Sports Unit.

    Frequency lists of collocations from the Gigafida 2.1 corpus

    Frequency lists of collocations were extracted from the Gigafida 2.1 Corpus of Written Standard Slovene (https://www.clarin.si/noske/run.cgi/corp_info?corpname=gfida21) using specialised scripts for extracting data from syntactically parsed corpora. The lists contain collocations with an absolute frequency of 10 or above, split into files corresponding to 81 predefined syntactic structures. A formal description of the syntactic structures, with information on the restrictions and representations applied to POS and dependency parsing annotations, is included in the dataset. The lists are sorted by the absolute frequency of the collocations and include frequency information on individual lemmas, together with the most frequent representative forms of the combined lemmas. The lists also include the logDice score for each collocation and the number of distinct forms of lemmas appearing in corpus hits for a particular collocation.
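To illustrate the logDice score mentioned in this entry, here is a minimal sketch. It assumes the commonly cited definition, logDice = 14 + log2(2·f_xy / (f_x + f_y)), where f_xy is the frequency of the collocation and f_x, f_y are the frequencies of its two lemmas. The entry does not show the dataset's own extraction scripts, so the function name `log_dice` is hypothetical and the actual computation may differ in detail.

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """Return logDice = 14 + log2(2 * f_xy / (f_x + f_y)).

    f_xy: frequency of the collocation (both lemmas in the given relation)
    f_x, f_y: frequencies of the two individual lemmas
    All names are illustrative, not taken from the dataset.
    """
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("all frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))
```

Under this definition the score reaches its theoretical maximum of 14 when the two lemmas occur only together, and it drops by one point each time the Dice coefficient halves.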

    Training corpus ssj500k 2.0

    The ssj500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, and lemmatisation. About half of the corpus is also manually annotated with syntactic dependencies, named entities, and verbal multiword expressions. The annotations of the ssj500k corpus follow (1) the MULTEXT-East V5 morphosyntactic specifications for Slovene, http://nl.ijs.si/ME/V5/msd/, (2) the JOS dependency schema, http://nl.ijs.si/jos/bib/jos-skladnja-navodila.pdf, (3) the Janes Annotation guidelines for Slovenian named entities, http://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf, and (4) the Guidelines of the PARSEME shared task on verbal multiword expressions, http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.0/. The vocabulary of (1) and (2) is provided in the back element, and that of (3) and (4) in the teiHeader of the TEI-encoded corpus.

    Collocations Dictionary of Modern Slovene KSSS 1.0

    The database of the Collocations Dictionary of Modern Slovene 1.0 contains entries for 35,862 headwords (18,043 nouns, 5,148 verbs, 10,259 adjectives and 2,412 adverbs) and 7,310,983 collocations that were automatically extracted from the Gigafida 1.0 corpus. For the automatic extraction via the Sketch Engine API we used a specially adapted Sketch grammar for Slovene and, based on manual evaluation, a set of parameters that determined: maximum number of collocates per grammatical relation, minimum frequency of a collocate, minimum frequency of a grammatical relation, minimum salience (logDice) score of a collocate, and minimum salience of a grammatical relation. The procedure of automatic extraction, which produced a list of collocates (lemmas) in a particular relation, was followed by a set of post-processing steps:
    - removal of collocations that were represented by repetitions of the same sentence
    - preparation of full collocations by the addition of the headword and, if needed, the third element in the grammatical relation (such as a preposition); the headwords/collocates were also put in the correct case, depending on the grammatical relation
    - addition of IDs from the Slovenian morphological lexicon Sloleks (http://hdl.handle.net/11356/1230) to every element in the collocation

    Thesaurus of Modern Slovene 1.0

    This is an automatically created Slovene thesaurus from Slovene data available in a comprehensive English–Slovenian dictionary, a monolingual dictionary, and a corpus. A network analysis on the bilingual dictionary word co-occurrence graph was used, together with additional information from the distributional thesaurus data available as part of the Sketch Engine tool and extracted from the 1.2 billion word Gigafida corpus and the monolingual dictionary

    Thesaurus of Modern Slovene 1.0 (ELEXIS)

    Slovar sopomenk sodobne slovenščine 1.0. This is an automatically created Slovene thesaurus from Slovene data available in a comprehensive English–Slovenian dictionary, a monolingual dictionary, and a corpus. A network analysis on the bilingual dictionary word co-occurrence graph was used, together with additional information from the distributional thesaurus data available as part of the Sketch Engine tool and extracted from the 1.2 billion word Gigafida corpus and the monolingual dictionary. See also: http://hdl.handle.net/11356/116

    The Orange workflow for observing collocation trends ColTrend 1.0

    ColTrend is a workflow (.OWS file) for Orange Data Mining (an open-source machine learning and data visualization software: https://orangedatamining.com/) that allows the user to observe temporal collocation trends in corpora. The workflow consists of a series of Python scripts, data filters, and visualizers. As input, the workflow takes a .CSV file with data on collocations and their relative frequencies by year of publication extracted from a corpus. As output, it provides a .TSV file containing the same data (or a filtered selection thereof) enriched with four measures that indicate the collocation's temporal trend in the corpus: (1) the slope (k) of a linear regression model fitted to the frequency data, which indicates whether the frequency of use of the collocation is increasing or declining; (2) the coefficient of determination (R²) of the linear regression model, indicating how linear the change in the collocation's use is; (3) the ratio (m) of maximum relative frequency and average relative frequency, which indicates peaks in collocation usage; and (4) the coefficient of recent growth (t), which indicates an increased usage of the collocation in the last three years of the observed corpus data. The entry also contains three .CSV files that can be used to test the workflow. The files contain collocation candidates (along with their relative frequencies per year of publication) extracted from the Gigafida 2.0 Corpus of Written Slovene (https://viri.cjvt.si/gigafida/) with three different syntactic structures (as defined in http://hdl.handle.net/11356/1415): 1) p0-s0 (adjective + noun, e.g. rezervni sklad), 2) s0-s2 (noun + noun in the genitive case, e.g. ukinitev lastnine), and 3) gg-s4 (verb + noun in the accusative case, e.g. pripraviti besedilo). It should be noted that only collocation candidates with absolute frequency of 15 and above were extracted.
Please note that the ColTrend workflow requires the installation of the Text Mining add-on for Orange. For installation instructions as well as a more detailed description of the different phases of the workflow and the measures used to observe the collocation trends, please consult the README file
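The first three trend measures described in this entry can be sketched in plain Python. This is only an illustration under stated assumptions, not the workflow's own code: the function name `trend_measures` and its input layout (parallel lists of years and relative frequencies) are hypothetical, the regression is ordinary least squares, and the coefficient of recent growth (t) is left out because the entry does not give its formula.

```python
from statistics import mean


def trend_measures(years, rel_freqs):
    """Compute k (slope), r2 (coefficient of determination) and
    m (max / mean relative frequency) for one collocation.

    years, rel_freqs: parallel sequences, one value per year.
    Hypothetical helper; the real ColTrend scripts may differ.
    """
    if len(years) != len(rel_freqs) or len(years) < 2:
        raise ValueError("need at least two (year, frequency) pairs")

    x_bar, y_bar = mean(years), mean(rel_freqs)
    sxx = sum((x - x_bar) ** 2 for x in years)
    syy = sum((y - y_bar) ** 2 for y in rel_freqs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, rel_freqs))
    if sxx == 0:
        raise ValueError("years must not all be identical")

    k = sxy / sxx                                   # least-squares slope
    # For simple linear regression, R^2 equals the squared correlation.
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    m = max(rel_freqs) / y_bar if y_bar > 0 else 0.0  # peak vs. average
    return {"k": k, "r2": r2, "m": m}
```

A positive `k` with `r2` close to 1 suggests a steady rise in use, while a large `m` with a low `r2` points to a short-lived peak.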