8 research outputs found

    Probabilistic word embeddings with Laplacian priors (Probabilistiset sanaupotteet Laplace-priorijakaumilla)

    In the 21st century, textual data has become abundant, large and easily available, which makes quantifying text and words interesting. This interest has been met with powerful word embedding methods such as Word2Vec, which turn large, unannotated data into real vector spaces with interesting properties. In recent years, several studies have proposed methods that build on Word2Vec. By utilizing side-information, such as dictionary definitions, they enhance the performance of the embeddings in different tasks. Word embeddings have also been framed as probabilistic language models. This study presents and examines a probabilistic word embedding model, Probabilistic Word Embeddings with Laplacian Priors (PELP). PELP is based on a previous probabilistic framing of the Word2Vec approach, where word and context vectors follow a spherical Gaussian prior distribution. PELP uses a Gaussian prior with the Laplacian matrix of a side-information graph as the precision matrix. This study examines PELP with side-information graphs consisting of dictionary information, translation pairs and antonym pairs. With dictionary side-information, PELP outperforms the base models in word similarity tasks. The improvement is substantial, and roughly matches that of the previous Word2Vec-based models in magnitude. Additionally, PELP is found to be somewhat applicable to aspect isolation tasks. However, the performance in these tasks is somewhat inconsistent between runs. Finally, PELP's applicability to cross-lingual embeddings is examined, though issues in the sampling process render the results inconclusive. The results of this study give preliminary support for the use of Laplacian priors in probabilistic word embeddings, and warrant further research into the topic.

    In the 2000s, textual data has become massive in volume and easily available, which makes quantifying text and words interesting. This interest has been met by powerful word embeddings, Word2Vec first among them. They turn large, unstructured text data into real-valued vectors with interesting properties. In recent years, several improved versions of the Word2Vec method have been presented that make use of structured side-information, such as dictionary definitions. These methods improve the performance of word embeddings in various applications. Word embeddings have also been built on the basis of probabilistic language models. This master's thesis presents a probabilistic word embedding model, Probabilistic Word Embeddings with Laplacian Priors (PELP). PELP is based on an earlier probabilistic version of the Word2Vec method in which the word and context vectors are normally distributed a priori. PELP uses the Laplacian matrix of a side-information graph as the precision matrix of the normal distribution. The thesis examines PELP with different side-information graphs based on dictionary definitions, translation pairs and antonym pairs. With dictionary side-information, PELP achieves better results than the base models in determining word similarity. The improvement is substantial and largely matches that of the improved Word2Vec-based models. In addition, the thesis examines the use of PELP for isolating different aspects of meaning. However, performance in these applications is uneven. The applicability of PELP to multilingual word embeddings is also examined, but the evidence of its applicability there remains inconclusive. The results of this study give preliminary support for the use of Laplacian priors in probabilistic word embeddings and give reason to continue studying them in different contexts.
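
The role of the Laplacian matrix as a precision matrix can be illustrated numerically. The following is a minimal sketch, not the thesis's implementation: the toy vocabulary, the single side-information edge, the scale `lam` and both function names are assumptions made for illustration. For an undirected, unweighted graph with Laplacian L = D - A, the unnormalised log prior of an embedding matrix W is -(lam/2) tr(W^T L W), which equals -(lam/2) times the sum of squared distances between the vectors of linked words. The sketch evaluates only this unnormalised quantity, because a graph Laplacian is singular (its rows sum to zero).

```python
def graph_laplacian(n_words, edges):
    """Return L = D - A for an undirected, unweighted graph as nested lists."""
    L = [[0.0] * n_words for _ in range(n_words)]
    for i, j in edges:
        L[i][i] += 1.0   # degree of i
        L[j][j] += 1.0   # degree of j
        L[i][j] -= 1.0   # adjacency, both directions
        L[j][i] -= 1.0
    return L


def log_prior_unnormalised(W, L, lam=1.0):
    """Return -(lam / 2) * tr(W^T L W), using tr(W^T L W) = sum_ij L_ij <w_i, w_j>."""
    quad = 0.0
    for i, row in enumerate(L):
        for j, l_ij in enumerate(row):
            if l_ij != 0.0:
                quad += l_ij * sum(a * b for a, b in zip(W[i], W[j]))
    return -0.5 * lam * quad


# Hypothetical toy data: "big" and "large" are linked, "small" is not.
vocab = ["big", "large", "small"]
edges = [(0, 1)]
W = [[1.0, 0.0], [0.8, 0.1], [-1.0, 0.0]]

L = graph_laplacian(len(vocab), edges)
penalty = log_prior_unnormalised(W, L)

# The same quantity written as a sum over edges of squared distances.
edge_sum = -0.5 * sum(
    sum((a - b) ** 2 for a, b in zip(W[i], W[j])) for i, j in edges
)
```

In this toy case only the linked pair contributes to the penalty, so moving "big" and "large" closer together raises the prior density while the position of "small" has no effect on it.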

    The Swedish parliament corpus 1867–2022

    The Swedish parliamentary records are an important source material for social science and humanities researchers. We introduce a new research corpus, the Swedish Parliament Corpus, which is larger and more developed than previously available research corpora for the Swedish parliament. The corpus contains annotated and structured parliamentary records over more than 150 years, through the bicameral parliament (1867–1970) and the unicameral parliament (1971–). In addition to the records, which contain all speeches in the parliament, we also provide a database of all members of parliament over the same period. Along with the corpus, we describe procedures to ensure data quality. The corpus facilitates detailed analysis of parliamentary speeches in several research fields. Also part of series: LREC proceedings, ISBN: 2522-2686

    Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 3.0

    ParlaMint 3.0 is a multilingual set of 26 comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2022, with the individual corpora being between 9 and 125 million words in size. The corpora have extensive metadata, including aspects of the parliament and of the speakers (name, gender, MP status, party affiliation, party coalition/opposition). They are structured into time-stamped terms, sessions and meetings, with each speech marked by its speaker and the speaker's role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. Note that some corpora have further information, e.g. the year of birth of the speakers, links to their Wikipedia articles, their membership in various committees, etc. The corpora are also marked with the subcorpus they belong to ("reference", until 2020-01-30, "covid", from 2020-01-31, and "war", from 2022-02-24). This entry contains the linguistically marked-up version of the corpus, while the text version is available at http://hdl.handle.net/11356/1486. The ParlaMint.ana linguistic annotation includes tokenization, sentence segmentation, lemmatisation, Universal Dependencies part-of-speech, morphological features, and syntactic dependencies, and the 4-class CoNLL-2003 named entities. Some corpora also have further linguistic annotations, such as PoS tagging or named entities according to language-specific schemes, with their corpus TEI headers giving further details on the annotation vocabularies and tools. The compressed files include the ParlaMint.ana XML TEI-encoded linguistically annotated corpus; the derived corpus in CoNLL-U with TSV speech metadata; and the vertical files (with registry file), suitable for use with CQP-based concordancers, such as CWB, noSketch Engine or KonText. Also included is the 3.0 release of the data and scripts available at the GitHub repository of the ParlaMint project.
    As opposed to the previous version 2.1, this version corrects some errors in various corpora and adds the information on upper / lower house for bicameral parliaments. The vertical files have also been changed to make them easier to use in the concordancers.
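
The derived CoNLL-U files mentioned above follow the standard ten-column, tab-separated layout of Universal Dependencies. The sketch below shows one way to read such data with the Python standard library only. The sample sentence is invented for illustration and is not taken from ParlaMint, and the function name `read_conllu` is an assumption. Comment lines starting with `#` are skipped here, although in real files they may carry sentence identifiers worth keeping.

```python
# Column names of the CoNLL-U format, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # a blank line closes a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):      # sentence-level comment, skipped here
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        current.append(dict(zip(FIELDS, cols)))
    if current:                       # file may lack a final blank line
        sentences.append(current)
    return sentences


# Invented sample sentence, not taken from any corpus.
SAMPLE = (
    "# sent_id = demo.1\n"
    "# text = We agree.\n"
    "1\tWe\twe\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tagree\tagree\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

sentences = read_conllu(SAMPLE)
lemmas = [token["lemma"] for token in sentences[0]]
```

Keeping every column as a string avoids guessing at identifier types, since CoNLL-U allows range identifiers such as `1-2` for multiword tokens.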

    Multilingual comparable corpora of parliamentary debates ParlaMint 3.0

    ParlaMint 3.0 is a multilingual set of 26 comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2022, with the individual corpora being between 9 and 125 million words in size. The corpora have extensive metadata, including aspects of the parliament and of the speakers (name, gender, MP status, party affiliation, party coalition/opposition). They are structured into time-stamped terms, sessions and meetings, with each speech marked by its speaker and the speaker's role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. Note that some corpora have further information, e.g. the year of birth of the speakers, links to their Wikipedia articles, their membership in various committees, etc. The corpora are also marked with the subcorpus they belong to ("reference", until 2020-01-30, "covid", from 2020-01-31, and "war", from 2022-02-24). The corpora are encoded according to the Parla-CLARIN TEI recommendation (https://clarin-eric.github.io/parla-clarin/), but have been encoded against the compatible, but much stricter ParlaMint encoding guidelines (https://clarin-eric.github.io/ParlaMint/) and schemas (included in this distribution). This entry contains the ParlaMint TEI-encoded corpora with the derived plain text versions of the corpora along with TSV metadata of the speeches. Also included is the 3.0 release of the data and scripts available at the GitHub repository of the ParlaMint project. Note that there also exists the linguistically marked-up version of the corpus, which is available at http://hdl.handle.net/11356/1488.
    As opposed to the previous version 2.1, this version extends the corpus dates to (at least) mid-2022, does not contain the corpora for Spain (ES) and Lithuania (LT), and adds corpora for Austria (AT), Bosnia and Herzegovina (BA), Catalonia (ES-CT), Galicia (ES-GA), Greece (GR), Norway (NO), Portugal (PT), Serbia (RS), Sweden (SE), and Ukraine (UA). The TEI encoding of some details has also changed.
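
The subcorpus boundaries quoted in the description can be expressed as a small date rule. This is a sketch under one stated assumption: the description gives only start dates, so the code treats the three labels as non-overlapping and lets the latest applicable label win. Whether the corpora themselves mark "war" material with the "covid" label as well is not stated in the description. The function name `subcorpus_for` is hypothetical.

```python
from datetime import date

# Boundary dates as given in the corpus description.
COVID_START = date(2020, 1, 31)
WAR_START = date(2022, 2, 24)


def subcorpus_for(day):
    """Map a sitting date to one subcorpus label.

    Assumption: labels do not overlap, and the latest one applies.
    """
    if day >= WAR_START:
        return "war"
    if day >= COVID_START:
        return "covid"
    return "reference"


labels = [subcorpus_for(d) for d in (
    date(2019, 6, 1), date(2020, 1, 31), date(2022, 2, 24),
)]
```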

    Multilingual comparable corpora of parliamentary debates ParlaMint 4.0

    ParlaMint 4.0 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and extending to mid-2022. The individual corpora comprise between 9 and 126 million words and the complete set contains over 1.1 billion words. The transcriptions are divided by days with information on the term, session and meeting, and contain speeches marked by the speaker and their role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. The corpora have extensive metadata, most importantly on speakers (name, gender, MP and minister status, party affiliation), the political parties and parliamentary groups (name, coalition/opposition status, Wikipedia-sourced left-to-right political orientation, and CHES variables, https://www.chesdata.eu/). Note that some corpora have further metadata, e.g. the year of birth of the speakers, links to their Wikipedia articles, their membership in various committees, etc. The transcriptions are also marked with the subcorpus they belong to ("reference", until 2020-01-30, "covid", from 2020-01-31, and "war", from 2022-02-24). The corpora are encoded according to the Parla-CLARIN TEI recommendation (https://clarin-eric.github.io/parla-clarin/), but have been encoded against the compatible, but much stricter ParlaMint encoding guidelines (https://clarin-eric.github.io/ParlaMint/) and schemas (included in the distribution). This entry contains the ParlaMint TEI-encoded corpora and their derived plain text versions along with TSV metadata of the speeches. Also included is the 4.0 release of the sample data and scripts available at the GitHub repository of the ParlaMint project at https://github.com/clarin-eric/ParlaMint. 
    Note that there also exists the linguistically marked-up version of the 4.0 ParlaMint corpus, also linked with concordancers, which is available at http://hdl.handle.net/11356/1860. As opposed to the previous version 3.0, this version adds corpora for Spain (ES), Finland (FI) and the Basque Country (ES-PV); extends the corpora for Austria (AT), Czechia (CZ), Hungary (HU), and Ukraine (UA) with more recent data; adds metadata to political parties and parliamentary groups on left-to-right political orientation from Wikipedia as well as CHES variables; and adds the information on whether a speaker was a minister and when for the corpora that previously lacked this information. The TEI encoding of some details has also changed, and many errors found in 3.0 corpora have been corrected.
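
Speeches marked by speaker and role can be pulled out of TEI-encoded text with a standard XML parser. The sketch below assumes the general TEI conventions for transcribed speech: utterances as `u` elements with `who` and `ana` attributes, and text in `seg` children. These names and the invented sample document are assumptions and should be checked against the ParlaMint schemas included in the distribution. The function name `speeches` is hypothetical.

```python
import xml.etree.ElementTree as ET

# TEI namespace, written in ElementTree's {uri}tag notation.
TEI = "{http://www.tei-c.org/ns/1.0}"

# Invented sample; element and attribute names are assumptions.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div>
    <u who="#SpeakerA" ana="#chair"><seg>The sitting is open.</seg></u>
    <u who="#SpeakerB" ana="#regular"><seg>Thank you.</seg><seg>I have a question.</seg></u>
  </div></body></text>
</TEI>"""


def speeches(xml_text):
    """Return one dict per utterance with speaker, role and joined text."""
    root = ET.fromstring(xml_text)
    result = []
    for u in root.iter(TEI + "u"):
        # Collapse whitespace inside each segment, then join the segments.
        segments = [
            " ".join("".join(seg.itertext()).split())
            for seg in u.iter(TEI + "seg")
        ]
        result.append({
            "who": u.get("who"),
            "role": u.get("ana"),
            "text": " ".join(segments),
        })
    return result


parsed = speeches(SAMPLE)
```

Using `itertext()` keeps the text of any nested markup inside a segment, so inline annotations do not cause words to be dropped.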

    Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 4.0

    ParlaMint 4.0 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and extending to mid-2022. The individual corpora comprise between 9 and 126 million words and the complete set contains over 1.1 billion words. The transcriptions are divided by days with information on the term, session and meeting, and contain speeches marked by the speaker and their role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. The corpora have extensive metadata, most importantly on speakers (name, gender, MP and minister status, party affiliation), the political parties and parliamentary groups (name, coalition/opposition status, Wikipedia-sourced left-to-right political orientation, and CHES variables, https://www.chesdata.eu/). Note that some corpora have further metadata, e.g. the year of birth of the speakers, links to their Wikipedia articles, their membership in various committees, etc. The transcriptions are also marked with the subcorpus they belong to ("reference", until 2020-01-30, "covid", from 2020-01-31, and "war", from 2022-02-24). The corpora are encoded according to the Parla-CLARIN TEI recommendation (https://clarin-eric.github.io/parla-clarin/), but have been encoded against the compatible, but much stricter ParlaMint encoding guidelines (https://clarin-eric.github.io/ParlaMint/) and schemas (included in the distribution). The ParlaMint.ana linguistic annotation includes tokenization; sentence segmentation; lemmatisation; Universal Dependencies part-of-speech, morphological features, and syntactic dependencies; and the 4-class CoNLL-2003 named entities. 
    Some corpora also have further linguistic annotations, in particular PoS tagging according to a language-specific scheme, with their corpus TEI headers giving further details on the annotation vocabularies and tools used. This entry contains the ParlaMint.ana TEI-encoded linguistically annotated corpora; the derived CoNLL-U files along with TSV metadata of the speeches; and the derived vertical files (with their registry file), suitable for use with CQP-based concordancers, such as CWB, noSketch Engine or KonText. Also included is the 4.0 release of the sample data and scripts available at the GitHub repository of the ParlaMint project at https://github.com/clarin-eric/ParlaMint and the log files produced in the process of building the corpora for this release. The log files show e.g. known errors in the corpora, while more information about known problems is available in the (open) issues at the GitHub repository of the project. This entry contains the linguistically marked-up version of the corpus, while the text version, i.e. without the linguistic annotation, is available at http://hdl.handle.net/11356/1859. As opposed to the previous version 3.0, this version adds corpora for Spain (ES), Finland (FI) and the Basque Country (ES-PV); extends the corpora for Austria (AT), Czechia (CZ), Hungary (HU), and Ukraine (UA) with more recent data; adds metadata to political parties and parliamentary groups on left-to-right political orientation from Wikipedia, as well as CHES variables; and adds the information on whether a speaker was a minister and when for the corpora that previously lacked this information. The TEI encoding of some details has also changed, and many errors found in 3.0 corpora have been corrected. Furthermore, the vertical files (and hence the individual corpora available on the concordancers) have their metadata in the local language of the corpus, and not English.