PTPARL-D: Annotated Corpus of 44 years of Portuguese Parliament debates
In a representative democracy, some decide in the name of the rest, and these
elected officials are commonly gathered in public assemblies, such as
parliaments, where they discuss policies, legislate, and vote on fundamental
initiatives. A core aspect of such democratic processes is the plenary
debates, where important public discussions take place. Many parliaments around
the world are increasingly keeping the transcripts of such debates, and other
parliamentary data, in digital formats accessible to the public, increasing
transparency and accountability. Furthermore, some parliaments are bringing old
paper transcripts to semi-structured digital formats. However, these records
are often only provided as raw text or even as images, with little to no
annotation, and inconsistent formats, making them difficult to analyze and
study, reducing both transparency and public reach. Here, we present PTPARL-D,
an annotated corpus of debates in the Portuguese Parliament, from 1976 to 2019,
covering the entire period of Portuguese democracy.
Data sparsity in highly inflected languages: the case of morphosyntactic tagging in Polish
In morphologically complex languages, many high-level tasks in natural language
processing rely on accurate morphosyntactic analyses of the input. However, in
light of the risk of error propagation in present-day pipeline architectures for basic
linguistic pre-processing, the state of the art for morphosyntactic tagging is still
not satisfactory. The main obstacle here is data sparsity inherent to natural
language in general and highly inflected languages in particular.
In this work, we investigate whether semi-supervised systems may alleviate the
data sparsity problem. Our approach uses word clusters obtained from large
amounts of unlabelled text in an unsupervised manner in order to provide a
supervised probabilistic tagger with morphologically informed features. Our
evaluations on a number of datasets for the Polish language suggest that this simple
technique improves tagging accuracy, especially with regard to out-of-vocabulary
words. This may prove useful to increase cross-domain performance of taggers,
and to alleviate the dependency on large amounts of supervised training data,
which is especially important from the perspective of less-resourced languages.
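The idea of feeding word clusters to a supervised tagger can be sketched roughly as follows. This is a minimal illustration, not the system described in the abstract: the `token_features` function, the feature names, and the cluster IDs in `clusters` are all hypothetical, and the clustering step itself (run on unlabelled text) is assumed to have happened elsewhere.

```python
def token_features(tokens, i, clusters):
    """Build a feature dict for token i.

    `clusters` is a hypothetical mapping from lowercased word forms to
    cluster IDs induced from unlabelled text (assumption for illustration).
    """
    word = tokens[i]
    feats = {
        "lower": word.lower(),
        "suffix3": word[-3:].lower(),
        "is_capitalized": word[:1].isupper(),
        # Words missing from the labelled data can still share a cluster
        # with known words, which is what helps with out-of-vocabulary items.
        "cluster": clusters.get(word.lower(), "<unk>"),
    }
    if i > 0:
        feats["prev_cluster"] = clusters.get(tokens[i - 1].lower(), "<unk>")
    return feats


# Toy cluster map with made-up IDs.
clusters = {"kot": "c17", "pies": "c17", "śpi": "c42"}
feats_first = token_features(["Pies", "śpi"], 0, clusters)
feats_second = token_features(["Pies", "śpi"], 1, clusters)
```

A feature dict like this could be passed to any probabilistic sequence tagger that accepts sparse features. The cluster feature gives the model evidence about word forms it never saw with a gold tag.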
On the Correlation of Context-Aware Language Models With the Intelligibility of Polish Target Words to Czech Readers
This contribution seeks to provide a rational probabilistic explanation for the intelligibility
of words in a genetically related language that is unknown to the reader, a phenomenon
referred to as intercomprehension. In this research domain, linguistic distance, among
other factors, has been shown to correlate well with the mutual intelligibility of individual words.
However, the role of context in the intelligibility of target words in sentences has been the subject
of very few studies. To address this, we analyze data from web-based experiments in
which Czech (CS) respondents were asked to translate highly predictable target words at
the final position of Polish sentences. We compare correlations of target word intelligibility
with data from 3-gram language models (LMs) to their correlations with data obtained from
context-aware LMs. More specifically, we evaluate two context-aware LM architectures:
Long Short-Term Memory networks (LSTMs) that can, theoretically, take infinitely long-distance
dependencies into account and Transformer-based LMs which can access the whole
input sequence at the same time. We investigate how their use of context affects surprisal
and its correlation with intelligibility.
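For readers unfamiliar with surprisal, the sketch below computes it for a sentence-final word under a toy trigram model, as -log2 P(word | two preceding tokens). The toy corpus, the function names, and the add-one smoothing are assumptions made for illustration; they are not the models or data used in the study.

```python
import math
from collections import Counter


def train_trigram(sentences):
    """Count trigrams and their two-token contexts over tokenized sentences."""
    tri, bi, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        padded = ["<s>", "<s>"] + tokens
        vocab.update(tokens)
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            bi[tuple(padded[i - 2:i])] += 1
    return tri, bi, vocab


def surprisal(word, context, tri, bi, vocab):
    """Surprisal in bits of `word` given the last two tokens of `context`.

    Uses add-one smoothing (an illustrative choice, not the study's).
    """
    ctx = tuple((["<s>", "<s>"] + context)[-2:])
    p = (tri[ctx + (word,)] + 1) / (bi[ctx] + len(vocab))
    return -math.log2(p)


# Toy Polish corpus, invented for this example.
toy = [["ala", "ma", "kota"], ["ala", "ma", "psa"], ["ola", "ma", "kota"]]
tri, bi, vocab = train_trigram(toy)

s_seen = surprisal("kota", ["ala", "ma"], tri, bi, vocab)    # attested continuation
s_unseen = surprisal("ola", ["ala", "ma"], tri, bi, vocab)   # unattested continuation
```

A more predictable target word receives lower surprisal. A trigram model sees only the two preceding tokens, whereas the context-aware LMs in the abstract can condition on the whole preceding sentence.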
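On the project of a dictionary of 20th-century Polish parliamentarism (1918-2018): Preliminary stage, the parliamentary corpus [O projekcie słownika polskiego parlamentaryzmu XX wieku (lata 1918-2018): Etap wstępny – korpus parlamentarny]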
The article discusses preliminary results of research on a corpus of Polish parliamentary transcripts. Emphasis is placed on the emotional level of randomly selected transcripts. The use of two important lexemes (concepts), niezawisłość and niepodległość, in the transcripts has been studied, and their frequency has been estimated. The semantic shifts these lexemes undergo depending on how they are used in this discourse have also been examined.
"Cyberspace" in Polish parliamentary discourse ["Cyberprzestrzeń" w polskim dyskursie parlamentarnym]
The subject of this study is the conceptualization of cyberspace in Polish parliamentary discourse.
It was considered important to establish how representatives of the legislative branch perceived
cyberspace in the years 2001–2018. The analyses draw on material excerpted from the Korpus
Dyskursu Parlamentarnego (Parliamentary Discourse Corpus). The research combined the approach
characteristic of the ethnolinguistic-cognitive profiling of concepts (the linguistic and discursive
picture of the world) with a historical-linguistic approach, in order to identify the constant and
variable elements in the conceptualization of cyberspace. The stable components identified were
the perception of cyberspace in terms of threat (legal and military contexts) and its definitional
elusiveness. The main change, in turn, can be seen in the shift from presenting the phenomenon
as a novelty to treating it as a subject in its own right.
The ParlaMint corpora of parliamentary proceedings
This paper presents the ParlaMint corpora, which contain transcriptions of the sessions of 17 European national parliaments, totalling half a billion words. The corpora are uniformly encoded, contain rich metadata about 11 thousand speakers, and are linguistically annotated following the Universal Dependencies formalism and with named entities. Samples of the corpora and conversion scripts are available from the project’s GitHub repository, and the complete corpora are openly available via the CLARIN.SI repository for download, as well as through the NoSketch Engine and KonText concordancers and the Parlameter interface for online exploration and analysis.
On the concepts of history (historia) and society (społeczeństwo) in the Polish parliamentary discourse (based on the corpus of parliamentary transcripts from 1918–2018)
The work conducted for over ten months at IPI PAN has resulted in a working (and still developing) corpus of 20th-century Polish parliamentarism. The corpus was built from parliamentary transcripts from the years 1918–2018 and currently contains nearly 200 million segments. On this basis, preliminary analytical work on the language of 20th-century Polish parliamentarism is being conducted. One of the first tasks is a preliminary lexicographical and lexicological analysis of the assembled corpus. To show the extent of the corpus and its chronological complexity, lexemes such as historia (history) and społeczeństwo (society) are subjected to lexical and semantic analysis. The analysis of the usage of these items in Polish parliamentary discourse has shown that they are high-frequency words, and that their meanings are subject to “specific pressures of parliamentarism” and differ slightly (depending on the particular period in history) from the meanings traditionally assigned to them in Polish.