    Conference "Language Technologies and Digital Humanities 2020"

    The conference Language Technologies and Digital Humanities 2020 was organised by the Slovenian Language Technologies Society (SDJT) together with the Institute of Contemporary History, the Centre for Language Resources and Technologies of the University of Ljubljana (CJVT), and the research infrastructures CLARIN.SI and DARIAH-SI. It took place on 24 and 25 September 2020, and this third multidisciplinary edition of the conference was supported by CLARIN ERIC. The conference, which can boast a tradition of more than 20 years, added the field of digital humanities to its programme in 2016 and thereby became an important link between the two disciplines.

    Upgrade of the Historians' Citation Index (Zgodovinarski indeks citiranosti)

    The beginnings of the Historians' Citation Index (Zgodovinarski indeks citiranja) date back to 2003, when researchers at the Institute of Contemporary History began to track and systematically record citations for project and programme applications to ARRS. The citation index has since undergone several upgrades, attempts at data harmonisation and clean-ups of its relational databases, but in recent years it became clear that the system no longer met the needs of indexers and users. Before the upgrade we carried out a data analysis that identified the most serious problems. The upgrade was carried out in two parts: in the first we upgraded the administrative part, and in the second the web application. During the upgrade the Historians' Citation Index was technically modernised and designed to be intuitive for both indexers and users.

    Design and implementation of an application profile for digital critical editions in the repository SI-DIH

    No metadata schema can address all the needs of a heterogeneous community of users and their specific information resources. For the prototype of the SI-DIH repository, which is maintained by the Institute of Contemporary History under the auspices of DARIAH-SI, the Slovenian branch of the DARIAH infrastructure, we developed an application profile for digital critical editions, which are a common form of publishing scientific achievements in the field of digital humanities. In the empirical part of the thesis, we present the process of creating an application profile following Zeng and Qin (2008) and describe its implementation in the SI-DIH repository and its administration system. We successfully implemented the application profile in the SI-DIH repository, using the test digital edition "Casting of Death" (Odlivanje smrti) as a sample case. To check the quality and evaluate the performance of the application profile, we designed three ad hoc user personas and three evaluation scenarios, on which we tested the usefulness of the newly implemented profile. The results showed that in most cases the application profile can address the information needs of the two main user groups of the SI-DIH repository, although without formal user testing we cannot claim this with certainty.

    The multilingual sentiment dataset of parliamentary debates ParlaSent 1.0

    The dataset consists of mid-length sentences from the parliamentary proceedings of Bosnia and Herzegovina, Croatia, Czechia, Serbia, Slovakia, Slovenia, and the United Kingdom, annotated with a 6-level sentiment schema (defined below). The data from the parliaments of Bosnia and Herzegovina, Croatia and Serbia are organised as a single parliament group, named "BCS", due to the similarity of the official languages in these countries. For each of the six parliaments / parliament groups, 2,600 training instances were annotated by two annotators, with one additional conflict resolution step. While these training instances were sampled via sentiment lexicons to contain more sentiment-loaded sentences, two test sets were randomly sampled from selected parliaments, one from the BCS parliament group and the other from the parliament of the United Kingdom. Each test set consists of 2,600 sentences, annotated by one highly trained annotator. The training datasets were internally split into "train", "dev" and "test" portions for performing language-specific experiments. The 6-level annotation schema is the following:
    - Positive for sentences that are entirely or predominantly positive
    - Negative for sentences that are entirely or predominantly negative
    - M_Positive for sentences that convey an ambiguous sentiment or a mixture of sentiments, but lean more towards the positive sentiment
    - M_Negative for sentences that convey an ambiguous sentiment or a mixture of sentiments, but lean more towards the negative sentiment
    - P_Neutral for sentences that only contain non-sentiment-related statements, but still lean more towards the positive sentiment
    - N_Neutral for sentences that only contain non-sentiment-related statements, but still lean more towards the negative sentiment
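
    The six labels listed above can be illustrated with a minimal Python sketch. The label names come from the description; the 3-way reduction (`COARSE`) and the helper functions `coarse_label` and `coarse_distribution` are hypothetical, introduced here only for illustration, and are not part of the dataset or its tooling.

```python
from collections import Counter

# The six labels of the annotation schema, as named in the description.
LABELS = ("Positive", "M_Positive", "P_Neutral",
          "N_Neutral", "M_Negative", "Negative")

# ASSUMPTION: one possible 3-way reduction, chosen for illustration only.
COARSE = {
    "Positive": "positive",
    "M_Positive": "positive",
    "P_Neutral": "neutral",
    "N_Neutral": "neutral",
    "M_Negative": "negative",
    "Negative": "negative",
}


def coarse_label(label):
    """Map a 6-level label to the hypothetical 3-level polarity."""
    if label not in COARSE:
        raise ValueError(f"unknown label: {label!r}")
    return COARSE[label]


def coarse_distribution(labels):
    """Count how often each coarse polarity occurs in a list of labels."""
    return Counter(coarse_label(label) for label in labels)
```

    Collapsing the mixed labels into their leaning polarity is only one possible choice; other reductions, such as treating M_Positive and M_Negative as a separate "mixed" class, would be equally easy to express by changing the mapping.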

    Slovenian parliamentary corpus (1990-2022) siParl 3.0

    The siParl corpus contains minutes of the Assembly of the Republic of Slovenia for the 11th legislative period 1990-1992, minutes of the National Assembly of the Republic of Slovenia from the 1st to the 8th legislative period 1992-2022, minutes of the working bodies of the National Assembly of the Republic of Slovenia from the 2nd to the 7th legislative period 1996-2018, and minutes of the Council of the President of the National Assembly from the 2nd to the 7th legislative period 1996-2018. The corpus comprises over 11 thousand sessions, one million speeches and 200 million words. The corpus is encoded according to the Parla-CLARIN schema (https://github.com/clarin-eric/parla-clarin). Each mandate is in one directory, and each session in one file. Compared with the previous version 2.0, this version adds new data (minutes of the National Assembly of the Republic of Slovenia of the 8th legislative period) and corrects many errors.
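
    The layout described above (one directory per mandate, one file per session) could be traversed roughly as in the following Python sketch. The function name `sessions_by_mandate` is hypothetical, and the `.xml` file extension is an assumption; the description only states that the corpus follows the Parla-CLARIN schema.

```python
from collections import defaultdict
from pathlib import Path


def sessions_by_mandate(corpus_root):
    """Map each mandate directory name to the sorted names of its session files.

    ASSUMPTION: session files carry an ".xml" extension; adjust the glob
    pattern if the actual distribution differs.
    """
    root = Path(corpus_root)
    result = defaultdict(list)
    # Each sub-directory of the corpus root is treated as one mandate.
    for mandate_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Each matching file inside it is treated as one session.
        for session_file in sorted(mandate_dir.glob("*.xml")):
            result[mandate_dir.name].append(session_file.name)
    return dict(result)
```

    Calling `sessions_by_mandate` on the unpacked corpus directory would return a dictionary whose key count gives the number of mandates with at least one session file, and whose value lengths give the number of sessions per mandate.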

    Genetic predisposition to hypertension is associated with preeclampsia in European and Central Asian women

    Preeclampsia is a serious complication of pregnancy, affecting both maternal and fetal health. In a genome-wide association meta-analysis of European and Central Asian mothers, we identify sequence variants that associate with preeclampsia in the maternal genome at ZNF831/20q13 and FTO/16q12. These are previously established variants for blood pressure (BP), and the FTO variant has also been associated with body mass index (BMI). Further analysis of BP variants establishes that variants at MECOM/3q26, FGF5/4q21 and SH2B3/12q24 also associate with preeclampsia through the maternal genome. We further show that a polygenic risk score for hypertension associates with preeclampsia. However, comparison with gestational hypertension indicates that additional factors modify the risk of preeclampsia. Studies to identify maternal variants associated with preeclampsia have been limited by sample size. Here, the authors meta-analyze eight GWAS of 9,515 preeclamptic women, identifying five variants associated with preeclampsia and showing that genetic predisposition to hypertension is a major risk factor for preeclampsia.

    Multilingual comparable corpora of parliamentary debates ParlaMint 3.0

    ParlaMint 3.0 is a multilingual set of 26 comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2022, with the individual corpora being between 9 and 125 million words in size. The corpora have extensive metadata, including aspects of the parliament and of the speakers (name, gender, MP status, party affiliation, party coalition/opposition). They are structured into time-stamped terms, sessions and meetings, with speeches marked with the speaker and their role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. Note that some corpora have further information, e.g. the year of birth of the speakers, links to their Wikipedia articles, their membership in various committees, etc. The corpora are also marked with the subcorpus they belong to ("reference", until 2020-01-30; "covid", from 2020-01-31; and "war", from 2022-02-24). The corpora are encoded according to the Parla-CLARIN TEI recommendation (https://clarin-eric.github.io/parla-clarin/), but have been validated against the compatible, but much stricter, ParlaMint encoding guidelines (https://clarin-eric.github.io/ParlaMint/) and schemas (included in this distribution). This entry contains the ParlaMint TEI-encoded corpora with the derived plain text versions of the corpora along with TSV metadata of the speeches. Also included is the 3.0 release of the data and scripts available at the GitHub repository of the ParlaMint project. Note that there also exists a linguistically marked-up version of the corpus, which is available at http://hdl.handle.net/11356/1488. Compared with the previous version 2.1, this version extends the corpus dates to (at least) mid-2022, does not contain the corpora for ES (Spain) and LT (Lithuania), and adds corpora for AT (Austria), BA (Bosnia and Herzegovina), ES-CT (Catalonia), ES-GA (Galicia), GR (Greece), NO (Norway), PT (Portugal), RS (Serbia), SE (Sweden), and UA (Ukraine). The TEI encoding of some details has also changed.
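
    The subcorpus boundaries given in the description ("reference" until 2020-01-30, "covid" from 2020-01-31, "war" from 2022-02-24) can be expressed as a small date lookup. The Python sketch below is illustrative only: the function name `subcorpora` is hypothetical, and it assumes that "covid" has no end date, so that dates from 2022-02-24 onwards fall into both "covid" and "war". The description does not state whether this overlap is intended.

```python
from datetime import date

# Boundaries exactly as stated in the corpus description.
REFERENCE_END = date(2020, 1, 30)
SUBCORPUS_START = {
    "covid": date(2020, 1, 31),
    "war": date(2022, 2, 24),
}


def subcorpora(day):
    """Return the subcorpus labels whose date range contains `day`.

    ASSUMPTION: "covid" and "war" are open-ended, so they may overlap.
    """
    if day <= REFERENCE_END:
        return ["reference"]
    return [name for name, start in SUBCORPUS_START.items() if day >= start]
```

    For example, under these assumptions `subcorpora(date(2021, 6, 1))` yields only "covid", while a date in March 2022 yields both "covid" and "war".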