18 research outputs found

    Opis oderuhov v 19. stoletju na Slovenskem [A Description of Usurers in the 19th Century in the Slovene Lands]

    The Parla-CLARIN Recommendations for Encoding Corpora of Parliamentary Proceedings

    Parliamentary proceedings are a rich source of data that can be used by scholars in various humanities and social sciences disciplines. Unlike the sources of most other language corpora, parliamentary proceedings are not subject to copyright or personal privacy protections, and are typically available online, thus making them ideal for compilation into corpora and for open distribution. For these reasons many countries have already produced corpora of parliamentary proceedings, but each typically in its own encoding, limiting their comparability and utilization in a multilingual setting. In this paper we propose an encoding schema which could serve as an interchange format for parliamentary corpora compiled for the purposes of scholarly investigations. The schema, called Parla-CLARIN, was developed within the CLARIN research infrastructure, and is written as a TEI ODD which includes a TEI customization and prose guidelines with examples of use. We discuss the coverage and choices made in designing the recommendations, and give an overview of the guidelines. We also discuss two other standard schemas for encoding parliamentary data, Akoma Ntoso and RDF, and their relation to Parla-CLARIN. We conclude by presenting corpora already encoded in Parla-CLARIN and discussing further work, especially the provision of a set of example documents and of transformation scripts that would make the proposed encoding more usable.

    The ParlaMint corpora of parliamentary proceedings

    This paper presents the ParlaMint corpora containing transcriptions of the sessions of 17 European national parliaments, totalling half a billion words. The corpora are uniformly encoded, contain rich meta-data about 11 thousand speakers, and are linguistically annotated following the Universal Dependencies formalism and with named entities. Samples of the corpora and conversion scripts are available from the project’s GitHub repository, and the complete corpora are openly available via the CLARIN.SI repository for download, as well as through the NoSketch Engine and KonText concordancers and the Parlameter interface for on-line exploration and analysis.

    Slovenian parliamentary corpus SlovParl 1.0

    The SlovParl corpus contains minutes of the Chamber of Associated Labour of the Assembly of the Republic of Slovenia for the legislative period 1990-1992, i.e. it covers the period before, during, and after Slovenia became an independent country in 1991. The corpus comprises 54 sessions, 13,894 speeches and almost 2.7 million words. The corpus contains extensive meta-data about the speakers, a typology of sessions etc. and structural and editorial annotations. This item comprises three datasets: - the corpus in TEI P5 (module Transcriptions of speech); - the corpus in TEI P5 with added automatic linguistic annotation: tokenisation, MSD tagging and lemmatisation; - the corpus in vertical format used by various concordancers, e.g. CWB and Sketch Engine; this format is simpler and smaller but does not contain all the information from the source TEI. The SlovParl data originally come from https://github.com/SIstory/SlovParl, but have been converted to use TEI elements for speech. This version of the corpus corresponds to commit https://github.com/DARIAH-SI/CLARIN.SI/tree/5984661e7b19e054b3fb650f4d2d5d409b3d7e3d The resource is presented in the paper: Pančur, Andrej. "Označevanje zbirke zapisnikov sej slovenskega parlamenta s smernicami TEI." In the Proceedings of the Conference on Language Technologies & Digital Humanities (Tomaž Erjavec and Darja Fišer, eds.) 142-148. Ljubljana: Znanstvena založba Filozofske fakultete v Ljubljani, 2016. http://www.sdjt.si/wp/wp-content/uploads/2016/09/JTDH-2016_Pancur_Oznacevanje-zbirke-zapisnikov-sej-slovenskega-parlamenta.pd
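
As a rough illustration of the TEI encoding mentioned in this abstract, the sketch below reads speeches from TEI `<u>` (utterance) elements, which the TEI module for transcriptions of speech provides. The sample document, the speaker identifiers, and the use of `<seg>` inside `<u>` are assumptions made for the example; they are not taken from the SlovParl files.

```python
import xml.etree.ElementTree as ET

# TEI P5 namespace, needed to address elements with ElementTree.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample document (not an excerpt from the corpus).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div>
    <u who="#SpeakerA" xml:id="u1"><seg>First sentence.</seg> <seg>Second sentence.</seg></u>
    <u who="#SpeakerB" xml:id="u2"><seg>A reply.</seg></u>
  </div></body></text>
</TEI>"""


def speeches(xml_text):
    """Return a list of (speaker reference, speech text) pairs."""
    root = ET.fromstring(xml_text)
    result = []
    for u in root.iterfind(".//tei:u", TEI_NS):
        # Concatenate all text inside the utterance and normalise whitespace.
        text = " ".join("".join(u.itertext()).split())
        result.append((u.get("who"), text))
    return result


for who, text in speeches(SAMPLE):
    print(who, text)
```

The `who` attribute is a pointer to a speaker description elsewhere in the document, so a fuller script would resolve it against the speaker meta-data.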

    Slovenian parliamentary corpus siParl 1.0 (1990-2018)

    The siParl corpus contains minutes of the Assembly of the Republic of Slovenia for the 11th legislative period 1990-1992, minutes of the National Assembly of the Republic of Slovenia from the 1st to the 7th legislative period 1992-2018, minutes of the working bodies of the National Assembly of the Republic of Slovenia from the 2nd to the 7th legislative period 1996-2018, and minutes of the Council of the President of the National Assembly from the 2nd to the 7th legislative period 1996-2018. The corpus comprises over a million speeches or 195 million words. The corpus contains basic meta-data about the speakers, a typology of sessions etc. and structural and editorial annotations. This item comprises three datasets: - the corpus in TEI (module Transcriptions of speech); - the corpus in TEI with added automatic linguistic annotation: tokenisation, MSD tagging and lemmatisation; - the linguistically annotated corpus in vertical format used by various concordancers, e.g. CWB and Sketch Engine; this format is simpler and smaller but does not contain all the information from the source TEI. A preliminary version of this resource is presented in the paper: Pančur, Andrej, Mojca Šorn and Tomaž Erjavec (2018). "SlovParl 2.0: The Collection of Slovene Parliamentary Debates from the Period of Secession." Darja Fišer and Maria Eskevich and Franciska de Jong (eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 2018. http://lrec-conf.org/workshops/lrec2018/W2/summaries/4_W2.htm

    Slovenian parliamentary corpus (1990-2018) siParl 2.0

    The siParl corpus contains minutes of the Assembly of the Republic of Slovenia for the 11th legislative period 1990-1992, minutes of the National Assembly of the Republic of Slovenia from the 1st to the 7th legislative period 1992-2018, minutes of the working bodies of the National Assembly of the Republic of Slovenia from the 2nd to the 7th legislative period 1996-2018, and minutes of the Council of the President of the National Assembly from the 2nd to the 7th legislative period 1996-2018. The corpus comprises over 10 thousand sessions, one million speeches or 200 million words. The corpus contains meta-data about the speakers, a typology of sessions etc. and structural, editorial and linguistic annotations. The corpus is encoded according to the Parla-CLARIN schema (https://github.com/clarin-eric/parla-clarin). Each mandate is in one directory, and each session in one file. This item comprises the following datasets: 1. source DARIAH-SI Parla-CLARIN encoded corpus; 2. linguistically annotated Parla-CLARIN encoded corpus: tokenisation, MSD tagging, lemmatisation, Universal Dependencies features and syntactic parses, named entities; 3. linguistically annotated corpus in vertical format used by CWB and Sketch Engine concordancers; this format is simpler and smaller but does not contain all the information from the source TEI; 4. linguistically annotated corpus in CoNLL-U format as used by Universal Dependencies; 5. plain text of the corpus. Note that each dataset also includes TSV meta-data files on sessions (files) and speakers. As opposed to the previous version 1.0, this version corrects many errors, has substantially better meta-data and the linguistic processing has more levels and fewer errors.
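
For readers unfamiliar with the CoNLL-U format listed among the datasets, here is a minimal sketch of reading it. CoNLL-U stores one token per line in ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), with `#` comment lines and a blank line after each sentence. The sample sentence and its annotations are hypothetical and only illustrate the layout; they are not taken from siParl.

```python
def read_conllu(text):
    """Parse CoNLL-U text into a list of sentences (lists of token dicts)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}")
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        current.append({
            "id": int(cols[0]),
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "head": int(cols[6]),
            "deprel": cols[7],
        })
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample rows; "_" marks an unspecified column value.
ROWS = [
    ["1", "Hvala", "hvala", "NOUN", "_", "_", "0", "root", "_", "_"],
    ["2", "lepa", "lep", "ADJ", "_", "_", "1", "amod", "_", "_"],
    ["3", ".", ".", "PUNCT", "_", "_", "1", "punct", "_", "_"],
]
SAMPLE = "# text = Hvala lepa.\n" + "\n".join("\t".join(r) for r in ROWS) + "\n\n"

for sentence in read_conllu(SAMPLE):
    print([(tok["form"], tok["lemma"], tok["upos"]) for tok in sentence])
```

For real work, an established CoNLL-U reader is likely a better choice than this hand-written parser, which ignores features, enhanced dependencies, and the MISC column.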

    Digital Database of WWI Victims from Slovenia (ZV1): Project Cooperation Between the Digital Humanities and Cultural Heritage

    Abstract and poster of paper 0667 presented at the Digital Humanities Conference 2019 (DH2019), Utrecht, the Netherlands, 9-12 July 2019.

    Slovenian parliamentary corpus (1990-2022) siParl 3.0

    The siParl corpus contains minutes of the Assembly of the Republic of Slovenia for the 11th legislative period 1990-1992, minutes of the National Assembly of the Republic of Slovenia from the 1st to the 8th legislative period 1992-2022, minutes of the working bodies of the National Assembly of the Republic of Slovenia from the 2nd to the 7th legislative period 1996-2018, and minutes of the Council of the President of the National Assembly from the 2nd to the 7th legislative period 1996-2018. The corpus comprises over 11 thousand sessions, one million speeches and 200 million words. The corpus is encoded according to the Parla-CLARIN schema (https://github.com/clarin-eric/parla-clarin). Each mandate is in one directory, and each session in one file. As opposed to the previous version 2.0, this version adds new data (minutes of the National Assembly of the Republic of Slovenia of the 8th legislative period) and corrects many errors.