8 research outputs found

    PTPARL-D: Annotated Corpus of 44 years of Portuguese Parliament debates

    In a representative democracy, some decide in the name of the rest, and these elected officials are commonly gathered in public assemblies, such as parliaments, where they discuss policies, legislate, and vote on fundamental initiatives. A core aspect of such democratic processes is the plenary debate, where important public discussions take place. Many parliaments around the world increasingly keep the transcripts of such debates, and other parliamentary data, in digital formats accessible to the public, increasing transparency and accountability. Furthermore, some parliaments are converting old paper transcripts to semi-structured digital formats. However, these records are often provided only as raw text or even as images, with little to no annotation and in inconsistent formats, making them difficult to analyze and study and reducing both transparency and public reach. Here, we present PTPARL-D, an annotated corpus of debates in the Portuguese Parliament, from 1976 to 2019, covering the entire period of Portuguese democracy

    Data sparsity in highly inflected languages: the case of morphosyntactic tagging in Polish

    In morphologically complex languages, many high-level tasks in natural language processing rely on accurate morphosyntactic analyses of the input. However, in light of the risk of error propagation in present-day pipeline architectures for basic linguistic pre-processing, the state of the art for morphosyntactic tagging is still not satisfactory. The main obstacle here is data sparsity inherent to natural language in general and highly inflected languages in particular. In this work, we investigate whether semi-supervised systems may alleviate the data sparsity problem. Our approach uses word clusters obtained from large amounts of unlabelled text in an unsupervised manner in order to provide a supervised probabilistic tagger with morphologically informed features. Our evaluations on a number of datasets for the Polish language suggest that this simple technique improves tagging accuracy, especially with regard to out-of-vocabulary words. This may prove useful to increase cross-domain performance of taggers, and to alleviate the dependency on large amounts of supervised training data, which is especially important from the perspective of less-resourced languages

    On the Correlation of Context-Aware Language Models With the Intelligibility of Polish Target Words to Czech Readers

    This contribution seeks to provide a rational probabilistic explanation for the intelligibility of words in a genetically related language that is unknown to the reader, a phenomenon referred to as intercomprehension. In this research domain, linguistic distance, among other factors, has been shown to correlate well with the mutual intelligibility of individual words. However, the role of context for the intelligibility of target words in sentences has been the subject of very few studies. To address this, we analyze data from web-based experiments in which Czech (CS) respondents were asked to translate highly predictable target words at the final position of Polish sentences. We compare correlations of target word intelligibility with data from 3-gram language models (LMs) to their correlations with data obtained from context-aware LMs. More specifically, we evaluate two context-aware LM architectures: Long Short-Term Memory networks (LSTMs), which can, theoretically, take infinitely long-distance dependencies into account, and Transformer-based LMs, which can access the whole input sequence at the same time. We investigate how their use of context affects surprisal and its correlation with intelligibility

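    O projekcie słownika polskiego parlamentaryzmu XX WIEKU (lata 1918-2018): Etap wstępny – korpus parlamentarny [On the project of a dictionary of 20th-century Polish parliamentarism (1918–2018): preliminary stage – the parliamentary corpus]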

    The article discusses the preliminary results of research into a language corpus of Polish parliamentary transcripts. Emphasis has been placed on studying the emotional level of randomly selected transcripts. Two important lexemes (concepts), niezawisłość and niepodległość, have been studied in terms of their use in the transcripts. The frequency of these words has been estimated, and the semantic changes they undergo depending on their use in this discourse have also been examined

    "Cyberprzestrzeń" w polskim dyskursie parlamentarnym ["Cyberspace" in Polish parliamentary discourse]

    The study examines the conceptualization of cyberspace in Polish parliamentary discourse. It was considered important to identify how representatives of the legislature perceived cyberspace in the years 2001–2018. The analyses drew on material excerpted from the Korpus Dyskursu Parlamentarnego (Parliamentary Discourse Corpus). The research applied an approach characteristic of the ethnolinguistic-cognitive profiling of concepts (the linguistic and discursive worldview), together with a historical-linguistic approach, in order to establish the constant and variable elements in the conceptualization of cyberspace. The stable components were found to be the perception of cyberspace in terms of threat (legal and military contexts) and its definitional elusiveness. The main change, in turn, can be seen in the shift from presenting the phenomenon as a novelty to its subjectification

    The ParlaMint corpora of parliamentary proceedings

    This paper presents the ParlaMint corpora, which contain transcriptions of the sessions of 17 European national parliaments and total half a billion words. The corpora are uniformly encoded, contain rich metadata about 11 thousand speakers, and are linguistically annotated following the Universal Dependencies formalism and with named entities. Samples of the corpora and conversion scripts are available from the project’s GitHub repository, and the complete corpora are openly available via the CLARIN.SI repository for download, as well as through the NoSketch Engine and KonText concordancers and the Parlameter interface for on-line exploration and analysis

    On the concepts of history (historia) and society (społeczeństwo) in the Polish parliamentary discourse (based on the corpus of parliamentary transcripts from 1918–2018)

    The work conducted for over ten months at IPI PAN has resulted in the creation of a working (and still developing) corpus of 20th-century Polish parliamentarism. The corpus was created from parliamentary transcripts from the years 1918–2018 and currently contains nearly 200 million segments. On this basis, preliminary analytical work on the language of Polish parliamentarism in the twentieth century is being conducted. One of the first tasks is a preliminary lexicographical and lexicological analysis of the assembled corpus. To show the extent of the corpus and its chronological complexity, lexemes such as historia (history) and społeczeństwo (society) are subjected to lexical and semantic analysis. The analysis of the usage of these items in Polish parliamentary discourse has shown that they are high-frequency words, and that their meanings are subject to “specific pressures of parliamentarism” and differ slightly (depending on the particular period in history) from the meanings traditionally assigned to them in Polish