Evaluation of Automatic Video Captioning Using Direct Assessment
We present Direct Assessment, a method for manually assessing the quality of
automatically-generated captions for video. Evaluating the accuracy of video
captions is particularly difficult because for any given video clip there is no
definitive ground truth or correct answer against which to measure. Automatic
metrics for comparing automatic video captions against a manual caption such as
BLEU and METEOR, drawn from techniques used in evaluating machine translation,
were used in the TRECVid video captioning task in 2016 but these are shown to
have weaknesses. The work presented here brings human assessment into the
evaluation by crowdsourcing how well a caption describes a video. We
automatically degrade the quality of some sample captions, which are then assessed
manually; from this we rate the quality of the human assessors,
a factor we take into account in the evaluation. Using data from the TRECVid
video-to-text task in 2016, we show how our direct assessment method is
replicable and robust, and should scale to settings where many caption-generation
techniques are to be evaluated.
Comment: 26 pages, 8 figures
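The assessor-quality check described above — comparing an assessor's ratings of original captions against deliberately degraded versions of the same captions — can be sketched as follows. The function name, the 0–100 rating scale, and the thresholds are illustrative assumptions for this sketch, not the authors' exact procedure:

```python
from statistics import mean, stdev
from math import sqrt

def assessor_is_reliable(orig_scores, degr_scores, min_diff=5.0, min_t=2.0):
    """Rough reliability check: does this assessor score original
    captions consistently higher than their degraded counterparts?

    orig_scores / degr_scores: paired 0-100 ratings for the same
    captions in original and degraded form. Uses a paired t statistic
    as a crude filter; thresholds are illustrative, not from the paper.
    """
    diffs = [o - d for o, d in zip(orig_scores, degr_scores)]
    if len(diffs) < 2:
        return False
    d_mean = mean(diffs)
    d_sd = stdev(diffs)
    if d_sd == 0:
        # Perfectly consistent gap: judge on its size alone.
        return d_mean >= min_diff
    t = d_mean / (d_sd / sqrt(len(diffs)))
    return d_mean >= min_diff and t >= min_t

# A careful assessor rates originals clearly above degraded versions...
good = assessor_is_reliable([80, 75, 90, 85], [40, 35, 50, 45])
# ...while a random clicker shows no consistent gap.
bad = assessor_is_reliable([50, 60, 40, 55], [55, 45, 60, 50])
print(good, bad)  # → True False
```

Assessors failing such a check can then be down-weighted or excluded, which is the "factor we take into account" mentioned in the abstract.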
Machine translation: can it assist in professional translation of contracts?
The aim of this research project is to verify whether machine translation (MT)
technology can be utilized in the process of professional translation. The genre to be tested in this
study is a legal contract. It is a non-literary text, with a high rate of repeatable phrases, predictable
lexis, culture-bound terms and syntactically complex sentences (Šarčević 2000, Berezowski 2008).
The subject of this study is MT software available on the market that supports the English-Polish
language pair: Google MT and Microsoft MT. During the experiment, the process of post-editing
of MT raw output was recorded and then analysed in order to retrieve the following data:
(i) number of errors in MT raw output,
(ii) types of errors (syntactic, grammatical, lexical) and their frequency,
(iii) degree of fidelity to the original text (frequency of meaning omissions and meaning distortions),
(iv) time devoted to the editing process of the MT raw output.
The research results should help translators make an informed decision as to whether they
would like to incorporate MT into their work environment.
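The four measurements listed above amount to tallying annotated post-editing events. A minimal sketch of that tally follows; the record format and category labels are assumptions for illustration, not the study's actual annotation scheme:

```python
from collections import Counter

# Each record marks one issue found while post-editing MT raw output.
# The categories mirror points (i)-(iii) of the study design; the
# record format itself is invented for this sketch.
annotations = [
    {"type": "lexical"},
    {"type": "syntactic"},
    {"type": "lexical"},
    {"type": "grammatical"},
    {"type": "omission"},    # meaning omitted relative to the source
    {"type": "distortion"},  # meaning distorted relative to the source
]

error_types = {"syntactic", "grammatical", "lexical"}
counts = Counter(a["type"] for a in annotations)

total_errors = sum(counts[t] for t in error_types)    # (i) error count
frequencies = {t: counts[t] for t in error_types}     # (ii) per-type frequency
fidelity_issues = counts["omission"] + counts["distortion"]  # (iii) fidelity

print(total_errors, frequencies, fidelity_issues)
```

Point (iv), time spent post-editing, would come from timestamps in the recorded editing session rather than from the annotations themselves.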
GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation
Leaderboards have eased model development for many NLP datasets by
standardizing their evaluation and delegating it to an independent external
repository. Their adoption, however, is so far limited to tasks that can be
reliably evaluated in an automatic manner. This work introduces GENIE, an
extensible human evaluation leaderboard, which brings the ease of leaderboards
to text generation tasks. GENIE automatically posts leaderboard submissions to
crowdsourcing platforms asking human annotators to evaluate them on various
axes (e.g., correctness, conciseness, fluency) and compares their answers to
various automatic metrics. We introduce several datasets in English to GENIE,
representing four core challenges in text generation: machine translation,
summarization, commonsense reasoning, and machine comprehension. We provide
formal granular evaluation metrics and identify areas for future research. We
make GENIE publicly available and hope that it will spur progress in language
generation models as well as their automatic and manual evaluation.
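Comparing human answers with automatic metrics, as GENIE does, often reduces to a rank correlation between the two score lists over the same set of submissions. A stdlib-only Spearman sketch (assuming no tied scores; the sample numbers are invented):

```python
def spearman(xs, ys):
    """Spearman rank correlation, computed as Pearson on ranks.
    Assumes no ties, which keeps the sketch short and stdlib-only."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Human fluency ratings vs. an automatic metric for five submissions.
human = [4.5, 3.0, 2.0, 4.0, 1.0]
metric = [0.8, 0.5, 0.3, 0.7, 0.1]
print(round(spearman(human, metric), 3))  # → 1.0 (identical rankings)
```

A correlation near 1 suggests the automatic metric ranks systems the same way human annotators do; values near 0 indicate the metric is uninformative for that axis.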