14 research outputs found

    Agenda-Setting Dynamics during COVID-19: Who Leads and Who Follows?

    The outbreak of the coronavirus (COVID-19) has altered the way news media and social media set their agendas. The growth of social media raises questions about its potential power to set the media agenda. We gathered social media posts and online news site articles to examine agenda-setting dynamics, aiming to explore the causal relationship between news media and social media. We used computer-assisted text analysis to discover the main topics of discussion during the first wave of the pandemic in Latvia. The results revealed that (1) statistics about the pandemic, as well as prevention and control measures, were the main topics on social media and on online news sites, and that (2) vector autoregression models provide more empirical support for the influence of online news sites on social media than for the reverse. (Publisher's version; peer reviewed.)

    LaVA - Latvian Language Learner corpus

    Funding Information: The work reported in this paper is a part of the project Development of Learner Corpus of Latvian: methods, tools and applications (Project No. lzp-2018/1-0527) that has been implemented at the Institute of Mathematics and Computer Science, University of Latvia (IMCS UL) since September 2018. The project is financed by the Latvian Council of Science. This work is also a part of the Latvian State Research Programme Letonika - Fostering a Latvian and European Society project Research on Modern Latvian Language and Development of Language Technology (No. VPP-LETONIKA-2021/1-0006) and has received financial support from the Latvian Language Agency through the grant agreement No. 4.6/2019-029. Publisher Copyright: © European Language Resources Association (ELRA), licensed under CC-BY-NC-4.0.
    This paper presents the Latvian Language Learner Corpus (LaVA) developed at the Institute of Mathematics and Computer Science, University of Latvia. The LaVA corpus contains 1015 essays (190k tokens and 790k characters excluding whitespace) written by international students at Latvian higher education institutions who are learning Latvian as a foreign language in their first or second semester and reaching the A1 (possibly A2) Latvian language proficiency level. The corpus has morphological and error annotations. Error analysis and the statistics of the LaVA corpus are also provided in the paper. The corpus is publicly available at: http://www.korpuss.lv/id/LaVA. (Publisher's version; peer reviewed.)

    Corpus Based Self-Assessment Platform for Latvian Language Learners

    Funding Information: This work is also a part of the National Research Programme Digital Resources of the Humanities project Digital Resources for Humanities: Integration and Development (No. VPP-IZM-DH-2020/1-0001) and has received financial support from the Latvian Language Agency through the grant agreement No. 4.6/2019-029. Publisher Copyright: Copyright © 2022 the American Physiological Society.
    This paper presents a self-assessment platform for Latvian language learners at the Breakthrough (A1) and Waystage (A2) levels. The self-assessment platform contains three types of exercises (typing, inflection and gap filling) based on error analysis of the Latvian Language Learner corpus (LaVA). All exercises are automatically generated from data in multiple corpora. The automatically generated exercises are useful not only for learners outside the classroom, or even outside any formal education setting, but also for educators and authors of learning aids. Currently the self-assessment platform is tailored to language learners at the beginner level, but it can be easily extended to more advanced levels. The self-assessment platform is freely available online (http://uzdevumi.riks.korpuss.lv/en/) and the interface is available in two languages, Latvian and English. (Publisher's version; peer reviewed.)

    Clustering algorithms for large scale data sets

    With the rapid growth of information available on the internet, a current topic in language technology is grouping (clustering) information by common principles, to make it easier to take in and to reduce the amount of scattered information. The theoretical part of the thesis "Clustering algorithms for large scale data sets" surveys methods for document clustering, with the goal of finding the most suitable method or set of methods for clustering multilingual news streams. The thesis also examines and compares various metrics for evaluating clustering results. In the practical part, a system for clustering multilingual news streams is implemented and evaluated. The results and directions for future research are summarized at the end of the thesis.

    The Analysis and Development of the Card Game Klondike

    Computer games become more popular among children every day, but nobody likes to play games that cannot be won. This qualification paper describes an analysis system for the card game Klondike Solitaire that finds solvable deals, so that users can be offered only winnable games. The system also produces broader data on the analysis results, which could be processed to find criteria for determining the difficulty of a deal. The author of the work is both its customer and its developer. The program was developed in VB.NET using the Microsoft .NET 4.5 framework. Keywords: VB.NET, Klondike Solitaire, card game, analysis

    The development of phonetic dictionary and language model for Latvian speech processing

    The study investigated two of the most important components of speech processing: the language model and the phonetic dictionary. Four different text corpora and their combinations were compared to estimate the impact of the language model on the quality of Latvian speech recognition. Three different methods for obtaining phonetic pronunciations were tested. The result is language dependent, but the methods used are language independent. The baseline continuous speech recognition system has an accuracy of 36.17%. After the improvements, accuracy increased by 6.45 percentage points, from 36.17% to 42.62%. Although the best results were achieved with the baseline methods, the knowledge gained during the work will help to improve the quality of the methods used in the baseline. Keywords: speech recognition systems, speech recognition, language modeling, grapheme-to-phoneme modeling

    Dictionary and Thesaurus of Latvian - Tezaurs.lv (ELEXIS)

    Tēzaurs.lv: An extensive dictionary and thesaurus of Latvian, comprising more than 320,000 lexical entries, including multi-word units. Compiled and edited based on more than 300 sources. Provides detailed morphological information; being extended into a Latvian WordNet.

    Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 2.0

    ParlaMint is a multilingual set of comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2020, with each corpus being about 20 million words in size. The sessions in the corpora are marked as belonging to the COVID-19 period (after October 2019), or being "reference" (before that date). The corpora have extensive metadata, including aspects of the parliament; the speakers (name, gender, MP status, party affiliation, party coalition/opposition); are structured into time-stamped terms, sessions and meetings; with speeches being marked by the speaker and their role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. Note that some corpora have further information, e.g. the year of birth of the speakers, links to their Wikipedia articles, their membership in various committees, etc. The corpora are encoded according to the Parla-CLARIN TEI recommendation (https://clarin-eric.github.io/parla-clarin/), but have been validated against the compatible, but much stricter ParlaMint schemas. This entry contains the linguistically marked-up version of the corpus, while the text version is available at http://hdl.handle.net/11356/1388. The ParlaMint.ana linguistic annotation includes tokenization, sentence segmentation, lemmatisation, Universal Dependencies part-of-speech, morphological features, and syntactic dependencies, and the 4-class CoNLL-2003 named entities. Some corpora also have further linguistic annotations, such as PoS tagging or named entities according to language-specific schemes, with their corpus TEI headers giving further details on the annotation vocabularies and tools. 
    The compressed files include the ParlaMint.ana XML TEI-encoded linguistically annotated corpus; the derived corpus in CoNLL-U with TSV speech metadata; and the vertical files (with registry file), suitable for use with CQP-based concordancers, such as CWB, noSketch Engine or KonText. Also included is the 2.0 release of the data and scripts available at the GitHub repository of the ParlaMint project.

    Multilingual comparable corpora of parliamentary debates ParlaMint 2.0

    ParlaMint is a multilingual set of comparable corpora containing parliamentary debates mostly starting in 2015 and extending to mid-2020, with each corpus being about 20 million words in size. The sessions in the corpora are marked as belonging to the COVID-19 period (after October 2019), or being "reference" (before that date). The corpora have extensive metadata, including aspects of the parliament; the speakers (name, gender, MP status, party affiliation, party coalition/opposition); are structured into time-stamped terms, sessions and meetings; with speeches being marked by the speaker and their role (e.g. chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. Note that some corpora have further information, e.g. the year of birth of the speakers, links to their Wikipedia articles, their membership in various committees, etc. The corpora are encoded according to the Parla-CLARIN TEI recommendation (https://clarin-eric.github.io/parla-clarin/), but have been validated against the compatible, but much stricter ParlaMint schemas. This entry contains the ParlaMint TEI-encoded corpora with the derived plain text version of the corpus along with TSV metadata on the speeches. Also included is the 2.0 release of the data and scripts available at the GitHub repository of the ParlaMint project. Note that there also exists the linguistically marked-up version of the corpus, which is available at http://hdl.handle.net/11356/1405