34 research outputs found

    Mapping Topic Evolution Across Poetic Traditions

    Poetic traditions across languages evolved differently, but we find that certain semantic topics occur in several of them, albeit sometimes with temporal delay or with diverging trajectories over time. We apply Latent Dirichlet Allocation (LDA) to poetry corpora in four languages: German (52k poems), English (85k poems), Russian (18k poems), and Czech (80k poems). We align and interpret salient topics and their trends over time (1600–1925 A.D.), showing similarities and disparities across poetic traditions for a few select topics, and use their trajectories to pinpoint specific literary epochs.

    Yet Another Format of Universal Dependencies for Korean

    In this study, we propose a morpheme-based scheme for Korean dependency parsing and apply the proposed scheme to Universal Dependencies. We present the linguistic rationale that motivates the morpheme-based format and shows why it is necessary, and we develop scripts that convert automatically between the original format used by Universal Dependencies and the proposed morpheme-based format. The effectiveness of the proposed format for Korean dependency parsing is then verified with both statistical and neural models, including UDPipe and Stanza, using our carefully constructed morpheme-based word embedding for Korean. morphUD yields better parsing results on all Korean UD treebanks, and we also present detailed error analyses. Comment: COLING2022, Poste

    Koditex: A Corpus of Diversified Texts


    Metre and Semantics in the Poetry of Czech Post-Symbolists Accessed via LDA Topic Modelling

    The article deals with the relationship between semantics and poetic metre in the works of Czech post-symbolist poets and their predecessors. We approach the phenomenon by means of machine-driven metre recognition on the one hand and LDA topic modelling on the other. We first show how the poetic groups differ in their general preferences for particular topics. Next, we analyze the topic distributions in the two dominant metres (i.e. iamb and trochee) across the poetic groups.

    SMT and Hybrid systems of the QTLeap project in the WMT16 IT-task

    This paper describes 12 systems submitted to the WMT16 IT-task, covering six languages: Basque, Bulgarian, Dutch, Czech, Portuguese and Spanish. All of these systems were developed within the QTLeap project and share a common strategy. For each language, two different systems were submitted: a phrase-based MT system built using Moses, and a system exploiting deep language engineering approaches, which was implemented using TectoMT for all languages except Bulgarian. For 4 of the 6 languages, the TectoMT-based system performs better than the Moses-based one.

    ParaBank: Monolingual Bitext Generation and Sentential Paraphrasing via Lexically-constrained Neural Machine Translation

    We present ParaBank, a large-scale English paraphrase dataset that surpasses prior work in both quantity and quality. Following the approach of ParaNMT, we train a Czech-English neural machine translation (NMT) system to generate novel paraphrases of English reference sentences. By adding lexical constraints to the NMT decoding procedure, however, we are able to produce multiple high-quality sentential paraphrases per source sentence, yielding an English paraphrase resource with more than 4 billion generated tokens and exhibiting greater lexical diversity. Using human judgments, we also demonstrate that ParaBank's paraphrases improve over ParaNMT on both semantic similarity and fluency. Finally, we use ParaBank to train a monolingual NMT model with the same support for lexically-constrained decoding for sentence rewriting tasks. Comment: To be presented at AAAI 2019. 8 page

    A Latent Morphology Model for Open-Vocabulary Neural Machine Translation

    Translation into morphologically-rich languages challenges neural machine translation (NMT) models with extremely sparse vocabularies where atomic treatment of surface forms is unrealistic. This problem is typically addressed by either pre-processing words into subword units or performing translation directly at the level of characters. The former is based on word segmentation algorithms optimized using corpus-level statistics with no regard to the translation task. The latter learns directly from translation data but requires rather deep architectures. In this paper, we propose to translate words by modeling word formation through a hierarchical latent variable model which mimics the process of morphological inflection. Our model generates words one character at a time by composing two latent representations: a continuous one, aimed at capturing the lexical semantics, and a set of (approximately) discrete features, aimed at capturing the morphosyntactic function, which are shared among different surface forms. Our model achieves better accuracy in translation into three morphologically-rich languages than conventional open-vocabulary NMT methods, while also demonstrating a better generalization capacity under low to mid-resource settings. Comment: Published at ICLR 202

    System for Interlinking Texts of State Exam Topics, Learning Support- and Other Supplementary Materials

    The main goal of this thesis is to survey methods for extracting keywords and finding definitions of technical terms across articles and text documents. The next step is to design and build a system able to interlink texts of state exam topics, learning supports, and other supplementary materials. Finally, the system is evaluated on materials from VUT FIT in Brno, and the results are assessed for their usefulness in preparing students for final exams.