An Analysis of Source-Side Grammatical Errors in NMT
The quality of Neural Machine Translation (NMT) has been shown to
significantly degrade when confronted with source-side noise. We present the
first large-scale study of state-of-the-art English-to-German NMT on real
grammatical noise, by evaluating on several Grammar Correction corpora. We
present methods for evaluating NMT robustness without true references, and we
use them for extensive analysis of the effects that different grammatical
errors have on the NMT output. We also introduce a technique for visualizing
the divergence distribution caused by a source-side error, which allows for
additional insights.
Comment: Accepted and to be presented at BlackboxNLP 201
Basque-to-Spanish and Spanish-to-Basque machine translation for the health domain
This Master's thesis presents the initial steps towards the objective of
developing a Machine Translation system for the health domain between
Basque and Spanish. In the absence of a sufficiently large bilingual corpus,
several experiments were carried out to test different Neural
Machine Translation parameters on an out-of-domain corpus, while
performance on the health domain was evaluated with a manually
translated corpus on different systems trained with an increasing presence
of health domain corpora. The results obtained represent a first step
towards the described objective.
Coherent Multi-Sentence Video Description with Variable Level of Detail
Humans can easily describe what they see in a coherent way and at varying
level of detail. However, existing approaches for automatic video description
are mainly focused on single sentence generation and produce descriptions at a
fixed level of detail. In this paper, we address both of these limitations: for
a variable level of detail we produce coherent multi-sentence descriptions of
complex videos. We follow a two-step approach where we first learn to predict a
semantic representation (SR) from video and then generate natural language
descriptions from the SR. To produce consistent multi-sentence descriptions, we
model across-sentence consistency at the level of the SR by enforcing a
consistent topic. We also contribute to the visual recognition of objects, by
proposing a hand-centric approach, and to the robust generation of
sentences using a word lattice. Human judges rate our multi-sentence
descriptions as more readable, correct, and relevant than related work. To
understand the difference between more detailed and shorter descriptions, we
collect and analyze a video description corpus of three levels of detail.
Comment: 10 page
Augmenting Librispeech with French Translations: A Multimodal Corpus for Direct Speech Translation Evaluation
Recent works in spoken language translation (SLT) have attempted to build
end-to-end speech-to-text translation without using source language
transcription during learning or decoding. However, while large quantities of
parallel texts (such as Europarl, OpenSubtitles) are available for training
machine translation systems, there are no large (100h) and open source parallel
corpora that include speech in a source language aligned to text in a target
language. This paper tries to fill this gap by augmenting an existing
(monolingual) corpus: LibriSpeech. This corpus, used for automatic speech
recognition, is derived from read audiobooks from the LibriVox project, and has
been carefully segmented and aligned. After gathering French e-books
corresponding to the English audio-books from LibriSpeech, we align speech
segments at the sentence level with their respective translations and obtain
236h of usable parallel data. This paper presents the details of the processing
as well as a manual evaluation conducted on a small subset of the corpus. This
evaluation shows that the automatic alignment scores are reasonably correlated
with the human judgments of the bilingual alignment quality. We believe that
this corpus (which is made available online) is useful for replicable
experiments in direct speech translation or more general spoken language
translation experiments.
Comment: LREC 2018, Japa