80 research outputs found

    JANES Ekspres workshops for promoting corpus and online resources for Slovene

    In November and December 2015, the Faculty of Arts of the University of Ljubljana, the Slovenian research infrastructure for language resources and technologies CLARIN.SI, and the regional linguistic data initiative ReLDI organised the JANES Ekspres event, which was co-funded by the Slovenian Research Agency (ARRS) under its call for the promotion of Slovenian science abroad. The aim of the project was to use lectures and workshops to present to researchers and students in Slovenia, Croatia and Serbia the existing corpus and online resources for Slovene, as well as the construction, annotation and analysis of the JANES corpus of internet Slovene, which is being developed within the basic research project JANES.

    Morphological patterns in the Sloleks lexicon: an initial set for nouns

    This paper presents the first step towards supplementing the Sloleks lexicon with morphological patterns, taking nouns as the test case. The patterns are first extracted automatically from the lexicon itself on the basis of selected distinguishing features (morphosyntactic tags and the variable parts of word forms). This is followed by manual classification, in which we (a) separate patterns that are systemic and attested in usage from cases that arise from noise in the automatic extraction and from inconsistencies in Sloleks; (b) arrange the groups according to inclusion and relatedness; (c) identify and define variation more precisely, in both standard and non-standard forms; (d) outline steps for the further development of the program and for lexicon upgrades. The result is an initial set of formalised morphological patterns for (common and proper) nouns, comprising 10 groups (64 patterns) for the masculine gender, 9 groups (29 patterns) for the feminine gender and 8 groups (20 patterns) for the neuter gender. Preparing the set of patterns revealed numerous opportunities for improving the lexicon, while the machine-oriented view of inflection revealed opportunities for supplementing the grammatical description of Slovene. In future work, patterns will also be prepared for the remaining parts of speech and supplemented with corpus data. The final nomenclature will be entered into the Sloleks lexicon database and will also be published in the form of machine-readable patterns in the Clarin.si repository.
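
    The automatic extraction step described in this abstract can be illustrated with a minimal sketch. It is not the authors' program: it assumes that the "variable part" of a word form is whatever follows the longest prefix shared by all forms of a lemma, and the tag names and the toy lexicon are hypothetical, not the Sloleks tagset or data.

```python
from collections import defaultdict
from os.path import commonprefix


def extract_pattern(forms):
    """Return (stem, pattern) for one paradigm.

    forms: dict mapping a morphosyntactic tag to a word form.
    The pattern is a sorted tuple of (tag, ending) pairs, where the ending
    is what remains after the longest prefix shared by all forms.
    """
    stem = commonprefix(list(forms.values()))
    pattern = tuple(sorted((tag, form[len(stem):]) for tag, form in forms.items()))
    return stem, pattern


def group_by_pattern(lexicon):
    """Group lemmas whose paradigms yield an identical (tag, ending) pattern."""
    groups = defaultdict(list)
    for lemma, forms in lexicon.items():
        _, pattern = extract_pattern(forms)
        groups[pattern].append(lemma)
    return dict(groups)


# Toy entries with hypothetical tag names (an assumption, not Sloleks data).
toy_lexicon = {
    "korak": {"nom.sg": "korak", "gen.sg": "koraka", "nom.pl": "koraki"},
    "zakon": {"nom.sg": "zakon", "gen.sg": "zakona", "nom.pl": "zakoni"},
    "lipa": {"nom.sg": "lipa", "gen.sg": "lipe", "nom.pl": "lipe"},
}

patterns = group_by_pattern(toy_lexicon)
for pattern, lemmas in patterns.items():
    print(lemmas, pattern)
```

    A purely prefix-based split like this one would also produce spurious patterns for stems that alternate, which is the kind of noise the manual classification step in the abstract is meant to filter out.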

    28th European Summer School in Logic, Language and Information, ESSLLI 2016

    28th European Summer School in Logic, Language and Information, ESSLLI 2016

    "I don't want to be dependent": Do public translation and interpreting services really have a negative effect on the active inclusion of migrants in the host country?

    By challenging some of the existing political claims regarding translation and interpreting provision to migrants, the article argues for new approaches in language policies related to translation and interpreting services. The research responds to the claims that translation and interpreting impede the integration of recent immigrants by conducting quantitative and qualitative research among a group of asylum seekers settled in a detention centre in Ljubljana, Slovenia. First, we gathered data on the structure and language profiles of all the residents in the detention centre in August 2014 (56 residents from 19 different countries); then a group of 18 asylum seekers, representative in terms of their first language, was selected and divided into 2 groups based on their length of stay in Slovenia at the time of their interview (shorter vs. longer periods). A questionnaire was used to gather quantitative data on the language profiles, while the qualitative data was obtained through semi-structured interviews in 2014 and two repeat interviews in 2015. A narrative analysis of the transcriptions of all recorded interviews was made, focusing on different languages and communication solutions in different stages of a migrant's life in the host country. The results show that basic trade-offs are possible: translation and interpreting are complementary steps to independence, which assist rather than impede acquisition of the dominant, i.e. national, language of the host country.

    The attitude of dictionary users towards automatically extracted collocation data: A user study

    The paper is based on a survey conducted within the framework of the basic research project Collocations as a Basis for Language Description: Semantic and Temporal Perspectives (KOLOS; J6-8255). It presents a qualitative analysis of a user evaluation of the interface of the Collocations Dictionary of Modern Slovene (CDS). It discusses an alternative perspective—the user's point of view—on problematic aspects of individual dictionary features, which require further lexicographic analysis and discussion. The collocations user study presents a model of the process of user evaluation; its findings are significant primarily for determining problems encountered by users. They also serve as a useful basis for methodology improvements in future, comparable lexicographic user studies and analyses

    D3.8 Lexical-semantic analytics for NLP

    UIDB/03213/2020 UIDP/03213/2020. The present document illustrates the work carried out in task 3.3 (work package 3) of the ELEXIS project, focused on lexical-semantic analytics for Natural Language Processing (NLP). This task aims at computing analytics for lexical-semantic information such as words, senses and domains in the available resources, and at investigating their role in NLP applications. Specifically, the task concentrates on three research directions: i) sense clustering, in which grouping senses based on their semantic similarity improves the performance of NLP tasks such as Word Sense Disambiguation (WSD); ii) domain labeling of text, in which the lexicographic resources made available by the ELEXIS project for research purposes allow better performance to be achieved; and iii) analysis of the diachronic distribution of senses, for which a software package is made available.

    JANES Expres workshop for promoting corpora and online resources for Slovene

    In November and December 2015, the Faculty of Arts of the University of Ljubljana, the Slovenian research infrastructure for language resources and technologies CLARIN.SI, and the regional linguistic data initiative ReLDI organised the JANES Ekspres event, which was co-funded by the Slovenian Research Agency (ARRS) under its call for the promotion of Slovenian science abroad. The aim of the project was to use lectures and workshops to present to researchers and students in Slovenia, Croatia and Serbia the existing corpus and online resources for Slovene, as well as the construction, annotation and analysis of the JANES corpus of internet Slovene, which is being developed within the basic research project JANES.

    Frequency list of words by source from the Trendi corpus 2022-07

    The frequency list of words by source was prepared as follows: words (i.e. lemmas with their lexical features) were extracted from the 15 most frequent sources in the Trendi Monitor Corpus of Slovene (http://hdl.handle.net/11356/1590), covering the period between 1 January 2019 and 31 July 2022. The extracted sources are:
    - STA (sta.si)
    - RTV (rtvslo.si)
    - Delo (delo.si)
    - Siol (siol.net)
    - Vestnik (vestnik.si)
    - Večer (vecer.com)
    - Svet24 – Novice (novice.svet24.si)
    - 24ur (24ur.com)
    - Dnevnik (dnevnik.si)
    - Žurnal24 (zurnal24.si)
    - Demokracija (demokracija.si)
    - Nova24TV (nova24tv.si)
    - Slovenske novice (slovenskenovice.si)
    - Gorenjski glas (gorenjskiglas.si)
    - Svet 24 – Ekipa (ekipa.svet24.si)
    The frequency lists obtained from Trendi were then compared to the frequency list of words from Gigafida 2.0 (http://hdl.handle.net/11356/1320; covering 1991–2018). The final frequency list contains lemmas, their lexical features and, for each source (including Gigafida 2.0), their absolute and relative frequencies from the first (1991–2018) and second (2019 to 2022-07) periods, as well as the simple maths value indicating whether the word is more frequent in 2019 to 2022-07 (simple maths > 1.00) or in 1991–2018 (simple maths < 1.00). Because the entire frequency list is quite large, a shorter version with the first 150,000 entries is also provided for easier use in data processing software (such as MS Excel). The lists are sorted by their total absolute frequencies. Note that words with a total frequency of 1 (hapax legomena, counted by adding the absolute frequencies from both compared corpora) were removed.
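
    The entry does not give the formula behind the simple maths value. The sketch below assumes the smoothed ratio of per-million frequencies that usually goes by that name, with the newer period in the numerator so that values above 1.00 mean "more frequent in 2019 to 2022-07". The smoothing constant of 1 and all counts in the example are assumptions made for illustration, not figures from the dataset.

```python
def per_million(absolute_freq, corpus_size):
    """Relative frequency expressed per million tokens."""
    return absolute_freq * 1_000_000 / corpus_size


def simple_maths(focus_abs, focus_size, ref_abs, ref_size, smoothing=1.0):
    """Smoothed ratio of per-million frequencies (focus over reference).

    The smoothing constant (assumed to be 1.0 here) keeps the ratio defined
    when a word is missing from the reference corpus and damps the effect
    of very rare words.
    """
    focus_rel = per_million(focus_abs, focus_size)
    ref_rel = per_million(ref_abs, ref_size)
    return (focus_rel + smoothing) / (ref_rel + smoothing)


# Invented counts: 300 hits in a 100-million-token newer corpus,
# 100 hits in a 100-million-token older corpus.
score = simple_maths(300, 100_000_000, 100, 100_000_000)
print(score)  # (3.0 + 1.0) / (1.0 + 1.0) = 2.0
```

    A larger smoothing constant would shift the ranking towards more frequent words; which value the dataset authors used is not stated in the entry.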

    Summer school in corpus linguistics in Lancaster

    Summer school in corpus linguistics in Lancaster

    Frequency lists of word-level n-grams from the Trendi corpus 2021

    Frequency lists of word-level n-grams (or word sets) were extracted from the Trendi Monitor Corpus of Slovene (version 2022-05: http://hdl.handle.net/11356/1590) using the LIST corpus extraction tool (http://hdl.handle.net/11356/1227). The lists contain all word-level 2-, 3-, 4- and 5-grams with a minimum relative frequency of 2 per million that occur in corpus texts published in 2021, along with their absolute and relative frequencies and percentages. The n-grams were extracted from lower-case word forms along with lemmas and morphosyntactic tags. For frequency lists of n-grams extracted from texts from previous years (e.g. 2019 and 2020), please refer to earlier versions of this entry.
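
    The kind of list described here can be sketched in a few lines. This is not the LIST tool: it counts n-grams over lower-cased word forms only (no lemmas or tags), and it assumes that relative frequency is normalised by the total number of tokens, which the entry does not specify. The function name and the sample sentence are illustrative.

```python
from collections import Counter


def ngram_frequencies(tokens, n, min_per_million=2.0):
    """Count word-level n-grams over lower-cased tokens.

    Returns a dict mapping each n-gram (a tuple of words) to a pair
    (absolute frequency, relative frequency per million tokens), keeping
    only n-grams at or above the relative-frequency threshold.
    """
    words = [t.lower() for t in tokens]
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    total = len(words)
    result = {}
    for gram, absolute in counts.items():
        relative = absolute * 1_000_000 / total
        if relative >= min_per_million:
            result[gram] = (absolute, relative)
    return result


# Tiny invented sample; in a corpus this small every n-gram clears
# the default threshold of 2 per million.
sample = "Danes je lep dan in jutri je lep dan".split()
bigrams = ngram_frequencies(sample, 2)
for gram, (absolute, relative) in sorted(bigrams.items(), key=lambda kv: -kv[1][0]):
    print(" ".join(gram), absolute, round(relative, 1))
```

    On a real corpus the threshold matters: at 2 per million, an n-gram in a 100-million-token corpus would need at least 200 occurrences to be listed.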