34 research outputs found
Mapping Topic Evolution Across Poetic Traditions
Poetic traditions across languages evolved differently, but we find that
certain semantic topics occur in several of them, albeit sometimes with
temporal delay, or with diverging trajectories over time. We apply Latent
Dirichlet Allocation (LDA) to poetry corpora of four languages, i.e. German
(52k poems), English (85k poems), Russian (18k poems), and Czech (80k poems).
We align and interpret salient topics and their trends over time (1600-1925
A.D.), showing similarities and disparities across poetic traditions for a few
select topics, and use their trajectories over time to pinpoint specific
literary epochs.
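The trend-over-time step described in this abstract can be illustrated with a minimal sketch. It assumes per-poem topic distributions have already been inferred by some LDA implementation, and it averages them within fixed-width time bins. The function name `topic_trends`, the 25-year bin width, and the toy numbers are hypothetical illustrations, not the authors' code or data.

```python
from collections import defaultdict


def topic_trends(docs, bin_size=25):
    """Average per-document topic probabilities within fixed-width time bins.

    docs: iterable of (year, topic_probs) pairs, where topic_probs is a list
    of probabilities over the same K topics for every document.
    Returns {bin_start_year: [mean probability of topic 0, ..., topic K-1]}.
    """
    sums = {}
    counts = defaultdict(int)
    for year, probs in docs:
        # Map the year to the first year of its bin, e.g. 1610 -> 1600.
        start = (year // bin_size) * bin_size
        if start not in sums:
            sums[start] = [0.0] * len(probs)
        for k, p in enumerate(probs):
            sums[start][k] += p
        counts[start] += 1
    return {
        start: [s / counts[start] for s in sums[start]]
        for start in sorted(sums)
    }


# Toy, made-up document-topic distributions (assumption: not from the paper).
toy_docs = [
    (1610, [0.8, 0.2]),
    (1620, [0.6, 0.4]),
    (1805, [0.1, 0.9]),
]
trends = topic_trends(toy_docs, bin_size=25)
```

Comparing the resulting per-bin averages for aligned topics across languages is one simple way to look for delayed or diverging trajectories of the kind the abstract mentions.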
Yet Another Format of Universal Dependencies for Korean
In this study, we propose a morpheme-based scheme for Korean dependency
parsing and adopt the proposed scheme to Universal Dependencies. We present the
linguistic rationale that illustrates the motivation and the necessity of
adopting the morpheme-based format, and develop scripts that convert between
the original format used by Universal Dependencies and the proposed
morpheme-based format automatically. The effectiveness of the proposed format
for Korean dependency parsing is then verified with both statistical and neural
models, including UDPipe and Stanza, using our carefully constructed
morpheme-based word embedding for Korean. morphUD improves parsing results
for all Korean UD treebanks, and we also present detailed error analyses. Comment: COLING2022, Poste
Metre and Semantics in the Poetry of Czech Post-Symbolists Accessed via LDA Topic Modelling
The article deals with the relationship between semantics and poetic metre in the works of Czech post-symbolist poets and their predecessors. We approach the phenomena by means of machine-driven metre recognition on the one hand and LDA topic modelling on the other. We first show how the poetic groups differ in their general preferences for particular topics. Next, we analyze the topic distributions in the two dominant metres (i.e. iamb and trochee) across the poetic groups.
SMT and Hybrid systems of the QTLeap project in the WMT16 IT-task
This paper presents the description of 12 systems submitted to the WMT16
IT-task, covering six different languages, namely Basque, Bulgarian, Dutch,
Czech, Portuguese and Spanish. All these systems were developed under the scope
of the QTLeap project, presenting a common strategy. For each language two
different systems were submitted, namely a phrase-based MT system built using
Moses, and a system exploiting deep language engineering approaches, that in
all the languages but Bulgarian was implemented using TectoMT. For 4 of the 6
languages, the TectoMT-based system performs better than the Moses-based one.
ParaBank: Monolingual Bitext Generation and Sentential Paraphrasing via Lexically-constrained Neural Machine Translation
We present ParaBank, a large-scale English paraphrase dataset that surpasses
prior work in both quantity and quality. Following the approach of ParaNMT, we
train a Czech-English neural machine translation (NMT) system to generate novel
paraphrases of English reference sentences. By adding lexical constraints to
the NMT decoding procedure, however, we are able to produce multiple
high-quality sentential paraphrases per source sentence, yielding an English
paraphrase resource with more than 4 billion generated tokens and exhibiting
greater lexical diversity. Using human judgments, we also demonstrate that
ParaBank's paraphrases improve over ParaNMT on both semantic similarity and
fluency. Finally, we use ParaBank to train a monolingual NMT model with the
same support for lexically-constrained decoding for sentence rewriting tasks. Comment: To be presented at AAAI 2019. 8 page
A Latent Morphology Model for Open-Vocabulary Neural Machine Translation
Translation into morphologically-rich languages challenges neural machine
translation (NMT) models with extremely sparse vocabularies where atomic
treatment of surface forms is unrealistic. This problem is typically addressed
by either pre-processing words into subword units or performing translation
directly at the level of characters. The former is based on word segmentation
algorithms optimized using corpus-level statistics with no regard to the
translation task. The latter learns directly from translation data but requires
rather deep architectures. In this paper, we propose to translate words by
modeling word formation through a hierarchical latent variable model which
mimics the process of morphological inflection. Our model generates words one
character at a time by composing two latent representations: a continuous one,
aimed at capturing the lexical semantics, and a set of (approximately) discrete
features, aimed at capturing the morphosyntactic function, which are shared
among different surface forms. Our model achieves better accuracy in
translation into three morphologically-rich languages than conventional
open-vocabulary NMT methods, while also demonstrating a better generalization
capacity under low to mid-resource settings. Comment: Published at ICLR 202
System for Interlinking Texts of State Exam Topics, Learning Support- and Other Supplementary Materials
The main goal of this thesis is to survey methods for locating definitions of technical terms and extracting keywords across articles and text documents. The next step is to design and build a system that can interlink texts of state exam topics, learning support materials, and other supplementary materials. Finally, the system is evaluated on materials from VUT FIT in Brno, and the results are assessed for their usefulness in preparing students for final exams.
The WMT'18 Morpheval test suites for English-Czech, English-German, English-Finnish and Turkish-English
Peer reviewe