Slovenian parliamentary corpus SlovParl 1.0

Abstract

The SlovParl corpus contains minutes of the Chamber of Associated Labour of the Assembly of the Republic of Slovenia for the legislative period 1990-1992, i.e. it covers the period before, during, and after Slovenia became an independent country in 1991. The corpus comprises 54 sessions, 13,894 speeches and almost 2.7 million words. It contains extensive metadata (about the speakers, a typology of sessions, etc.) as well as structural and editorial annotations.

This item comprises three datasets:
- the corpus in TEI P5 (module Transcriptions of speech);
- the corpus in TEI P5 with added automatic linguistic annotation: tokenisation, MSD tagging and lemmatisation;
- the corpus in vertical format used by various concordancers, e.g. CWB and Sketch Engine; this format is simpler and smaller but does not contain all the information from the source TEI.

The SlovParl data originally come from https://github.com/SIstory/SlovParl, but have been converted to use TEI elements for speech. This version of the corpus corresponds to commit https://github.com/DARIAH-SI/CLARIN.SI/tree/5984661e7b19e054b3fb650f4d2d5d409b3d7e3d

The resource is presented in the paper: Pančur, Andrej. "Označevanje zbirke zapisnikov sej slovenskega parlamenta s smernicami TEI." In Proceedings of the Conference on Language Technologies & Digital Humanities (Tomaž Erjavec and Darja Fišer, eds.), 142-148. Ljubljana: Znanstvena založba Filozofske fakultete v Ljubljani, 2016. http://www.sdjt.si/wp/wp-content/uploads/2016/09/JTDH-2016_Pancur_Oznacevanje-zbirke-zapisnikov-sej-slovenskega-parlamenta.pd
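As a rough illustration of how the TEI version might be processed, the sketch below counts speeches per speaker. It relies only on two general features of the TEI module for transcriptions of speech: an utterance is encoded as a `u` element, and its speaker is referenced through the `who` attribute. The sample fragment, the speaker identifiers and the function name `count_speeches` are invented for illustration and are not taken from the corpus; the actual SlovParl files may nest utterances differently and carry more markup.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace used by TEI P5 documents.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical, stripped-down fragment: speaker ids and text are invented,
# and a real corpus file would also contain a teiHeader and richer markup.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div>
        <u who="#speakerA">First invented utterance.</u>
        <u who="#speakerB">Second invented utterance.</u>
        <u who="#speakerA">Third invented utterance.</u>
      </div>
    </body>
  </text>
</TEI>"""


def count_speeches(xml_text):
    """Return a Counter mapping each speaker reference to its number of <u> elements."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    # iter() walks the whole tree, so the nesting depth of <u> does not matter.
    for utterance in root.iter(f"{{{TEI_NS}}}u"):
        counts[utterance.get("who", "unknown")] += 1
    return counts


counts = count_speeches(SAMPLE)
for speaker, n in counts.most_common():
    print(speaker, n)
```

For a file on disk, `ET.parse(path).getroot()` could replace `ET.fromstring`; the counting logic would stay the same.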
