Trivial Transfer Learning for Low-Resource Neural Machine Translation
Transfer learning has been proven as an effective technique for neural
machine translation under low-resource conditions. Existing methods require a
common target language, language relatedness, or specific training tricks and
regimes. We present a simple transfer learning method, where we first train a
"parent" model for a high-resource language pair and then continue the training
on a low-resource pair only by replacing the training corpus. This "child" model
performs significantly better than the baseline trained for the low-resource pair
only. We are the first to show this for targeting different languages, and we
observe the improvements even for unrelated languages with different alphabets.
Comment: Accepted as a WMT18 research paper, Proceedings of the 3rd Conference on Machine Translation 2018
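The parent–child procedure described above can be sketched in a few lines. The toy "model" and "training" below are hypothetical stand-ins for an NMT system and its optimizer; only the workflow mirrors the abstract: train on the high-resource corpus, then continue training the same model state with the corpus swapped.

```python
def train(model, corpus, steps):
    """Toy 'training': accumulate source-target token co-occurrence counts.
    A stand-in for gradient updates of a real NMT model."""
    for _ in range(steps):
        for src, tgt in corpus:
            for s in src.split():
                for t in tgt.split():
                    model[(s, t)] = model.get((s, t), 0) + 1
    return model

# Parent stage: high-resource language pair (hypothetical toy data).
parent_corpus = [("the cat", "die katze"), ("the dog", "der hund")]
model = train({}, parent_corpus, steps=2)

# Child stage: continue training the SAME model; only the corpus changes.
child_corpus = [("the cat", "la chatte")]
model = train(model, child_corpus, steps=2)
```

The child model retains the parent's statistics while picking up the low-resource pair, which is the whole trick: no shared target language or special regime is needed.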
Analyzing Error Types in English-Czech Machine Translation
This paper examines two techniques of manual evaluation that can be used to identify error
types of individual machine translation systems. The first technique, “blind post-editing”, has
been used in WMT evaluation campaigns since 2009, and manually constructed data of this
type are available for various language pairs. The second technique of explicit marking of errors
has been used in the past as well.
We propose a method for interpreting blind post-editing data at a finer level and compare
the results with explicit marking of errors. While human annotation with either technique
is not exactly reproducible (inter-annotator agreement is relatively low), both techniques lead to similar
observations of the differences between the systems. Specifically, we are able to suggest which errors in
MT output are easy and which are hard to correct with no access to the source, a situation experienced by
users who do not understand the source language.
Are BLEU and Meaning Representation in Opposition?
One possible way of obtaining continuous-space sentence representations
is by training neural machine translation (NMT) systems. The recent attention
mechanism, however, removes the single point in the neural network from which the
source sentence representation can be extracted. We propose several variations
of the attentive NMT architecture bringing this meeting point back. Empirical
evaluation suggests that the better the translation quality, the worse the
learned sentence representations serve in a wide range of classification and
similarity tasks.
Comment: ACL 2018; 10 pages + 2-page supplementary material
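The "meeting point" the abstract refers to is a single place where the whole source sentence is summarized as one vector. As an illustration only, mean-pooling the per-token encoder states is one simple way to obtain such a vector; it is not necessarily the architectural variant the paper proposes.

```python
def mean_pool(encoder_states):
    """Collapse per-token encoder states (lists of floats) into a single
    sentence vector by averaging each dimension."""
    dim = len(encoder_states[0])
    n = len(encoder_states)
    return [sum(state[d] for state in encoder_states) / n for d in range(dim)]

# Hypothetical 2-dimensional encoder states for a 2-token sentence.
states = [[1.0, 2.0], [3.0, 4.0]]
sentence_vec = mean_pool(states)  # [2.0, 3.0]
```

The resulting fixed-size vector is what can then be probed in downstream classification and similarity tasks.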
Giving a Sense: A Pilot Study in Concept Annotation from Multiple Resources
We present a pilot study of a web-based annotation of words with senses. The annotated senses come from several knowledge bases and sense inventories. The study is the first step in a planned larger annotation of grounding and should allow us to select a subset of the sense sources that cover any given text reasonably well and show an acceptable level of inter-annotator agreement.
Minuteman: Machine and Human Joining Forces in Meeting Summarization
Many meetings require creating a meeting summary to keep everyone up to date.
Creating minutes of sufficient quality is however very cognitively demanding.
Although we currently possess capable models for both automatic speech recognition
(ASR) and summarization, their fully automatic use is still problematic. ASR
models frequently commit errors when transcribing named entities while the
summarization models tend to hallucinate and misinterpret the transcript. We
propose a novel tool -- Minuteman -- to enable efficient semi-automatic meeting
minuting. The tool provides a live transcript and a live meeting summary to the
users, who can edit them in a collaborative manner, enabling correction of ASR
errors and imperfect summary points in real time. The resulting application
eases the cognitive load of the notetakers and allows them to easily catch up
if they missed a part of the meeting due to absence or a lack of focus. We
conduct several tests of the application in varied settings, exploring the
viability of the concept and possible user strategies.
Comment: 6 pages, 3 figures
Boosting Unsupervised Machine Translation with Pseudo-Parallel Data
Even with the latest developments in deep learning and large-scale language
modeling, the task of machine translation (MT) of low-resource languages
remains a challenge. Neural MT systems can be trained in an unsupervised way
without any translation resources but the quality lags behind, especially in
truly low-resource conditions. We propose a training strategy that relies on
pseudo-parallel sentence pairs mined from monolingual corpora in addition to
synthetic sentence pairs back-translated from monolingual corpora. We
experiment with different training schedules and reach an improvement of up to
14.5 BLEU points (English to Ukrainian) over a baseline trained on
back-translated data only.
Comment: MT Summit 202
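As an illustration of the data side of such a strategy (not the paper's actual mining method or schedule), one can filter mined candidate pairs by a similarity score and then interleave them with back-translated pairs. All names, scores, and the alternating schedule below are assumptions for the sketch.

```python
def mine_pseudo_parallel(candidates, threshold=0.7):
    """Keep candidate (source, target, score) triples whose similarity
    score passes a threshold; the scores are hypothetical placeholders."""
    return [(s, t) for s, t, score in candidates if score >= threshold]

def mix_schedule(mined_pairs, back_translated_pairs):
    """A naive training schedule: alternate mined and synthetic pairs."""
    mixed = []
    for mined, synthetic in zip(mined_pairs, back_translated_pairs):
        mixed.append(mined)
        mixed.append(synthetic)
    return mixed

# Hypothetical English-Ukrainian candidates mined from monolingual corpora.
candidates = [("hello", "pryvit", 0.9), ("noise", "shum?", 0.3)]
mined = mine_pseudo_parallel(candidates)         # keeps only the 0.9 pair
synthetic = [("good morning", "dobroho ranku")]  # back-translated pair
train_pairs = mix_schedule(mined, synthetic)
```

The paper's contribution lies in how such mined and back-translated pools are scheduled over training, which this sketch only gestures at.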
The Design of Eman, an Experiment Manager
We present eman, a tool for managing large numbers of computational experiments. Over
the years of our research in machine translation (MT), we have collected a number of ideas for
efficient experimenting. We believe these ideas are generally applicable in (computational)
research of any field. We incorporated them into eman in order to make them available in a
command-line Unix environment.
The aim of this article is to highlight the core of these ideas. We hope the text can serve
as a collection of experiment management tips and tricks for anyone, regardless of their field of
study or the computer platform they use. The specific examples we provide in eman’s current syntax
are less important, but they allow us to use concrete terms. The article thus also fills a gap in eman’s documentation by providing a high-level overview.
- …