Bring More Attention to Syntactic Symmetry for Automatic Postediting of High-Quality Machine Translations
Automatic post-editing (APE) is an automated process that refines a given machine
translation (MT). Recent findings show that existing APE systems struggle to
handle high-quality MTs even for a language pair with abundant data resources
such as English-to-German: the better the given MT is, the harder it is to
decide which parts to edit and how to fix the remaining errors. One possible solution to
this problem is to instill deeper knowledge about the target language into the
model. Thus, we propose a linguistically motivated method of regularization
that is expected to enhance APE models' understanding of the target language: a
loss function that encourages symmetric self-attention on the given MT. Our
analysis of experimental results demonstrates that the proposed method helps
improve the state-of-the-art architecture's APE quality on high-quality MTs.
Comment: This paper is presented at ACL 202
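The abstract does not spell out the loss, but one way such a symmetry-encouraging regularizer could look is a penalty on the asymmetric part of the self-attention matrix. The sketch below is illustrative only; the function and variable names are hypothetical and the paper's exact formulation may differ:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def symmetry_loss(attn):
    """Frobenius-norm penalty on the asymmetric part of a self-attention map.
    Zero exactly when the attention matrix equals its transpose."""
    return float(np.sum((attn - attn.T) ** 2))

# toy self-attention over 4 MT tokens (random scores, illustrative only)
scores = np.random.default_rng(0).normal(size=(4, 4))
attn = softmax(scores, axis=-1)
reg = symmetry_loss(attn)  # would be added to the training loss with some weight
```

In training, this term would be weighted and summed with the usual cross-entropy objective, nudging attention weights toward symmetry without hard constraints.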
Towards Semi-Supervised Learning of Automatic Post-Editing: Data-Synthesis by Infilling Mask with Erroneous Tokens
Semi-supervised learning that leverages synthetic training data has been
widely adopted in the field of Automatic post-editing (APE) to overcome the
lack of human-annotated training data. In that context, data-synthesis methods
to create high-quality synthetic data have also received much attention.
Considering that APE takes machine-translation outputs containing translation
errors as input, we propose a noising-based data-synthesis method that uses a
masked language model to create noisy texts by substituting masked tokens
with erroneous ones while following the error-quantity statistics observed in
genuine APE data. In addition, we propose corpus interleaving, which combines
two separate synthetic datasets by keeping only the advantageous samples, to
further enhance the quality of the synthetic data created with our noising
method. Experimental results reveal that the synthetic data created with
our approach yields significant improvements in APE performance over
synthetic data created with other existing data-synthesis methods.
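As a rough sketch of this kind of noising, the toy function below replaces a fixed number of tokens with substitutes, standing in for the masked-LM infilling step (the names, the toy vocabulary, and the random substitution are assumptions; the actual method samples error counts from genuine APE statistics and fills masks with a masked language model):

```python
import random

def synthesize_noisy(tokens, error_count, vocab, rng):
    """Replace `error_count` randomly chosen tokens with erroneous substitutes,
    mimicking mask-infill noising (a real system would use a masked LM here)."""
    noisy = list(tokens)
    positions = rng.sample(range(len(tokens)), k=error_count)
    for i in positions:
        # exclude the original token so every substitution is a genuine error
        choices = [w for w in vocab if w != tokens[i]]
        noisy[i] = rng.choice(choices)
    return noisy

rng = random.Random(0)
src = ["the", "cat", "sat", "on", "the", "mat"]
vocab = ["dog", "ran", "under", "a", "hat", "the", "cat"]
noisy = synthesize_noisy(src, error_count=2, vocab=vocab, rng=rng)
```

Drawing `error_count` per sentence from the empirical error-quantity distribution of genuine APE triples is what keeps the synthetic data close to real MT error patterns.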
Denoising Table-Text Retrieval for Open-Domain Question Answering
In table-text open-domain question answering, a retriever system retrieves
relevant evidence from tables and text to answer questions. Previous studies in
table-text open-domain question answering have two common challenges: firstly,
their retrievers can be affected by false-positive labels in training datasets;
secondly, they may struggle to provide appropriate evidence for questions that
require reasoning across the table. To address these issues, we propose
Denoised Table-Text Retriever (DoTTeR). Our approach utilizes a
denoised training dataset with fewer false-positive labels, obtained by
discarding instances with low question-relevance scores measured by a
false-positive detection model. Subsequently, we integrate table-level ranking
information into the retriever to assist in finding evidence for questions that
demand reasoning across the table. To encode this ranking information, we
fine-tune a rank-aware column encoder to identify minimum and maximum values
within a column. Experimental results demonstrate that DoTTeR significantly
outperforms strong baselines on both retrieval recall and downstream QA tasks.
Our code is available at https://github.com/deokhk/DoTTeR.
Comment: Accepted to LREC-COLING 202
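The rank-aware column encoder itself is model-specific, but the min/max signal it is fine-tuned to recognize can be illustrated with a small helper (the function name and dictionary representation are assumptions for illustration, not DoTTeR's actual interface):

```python
def add_rank_markers(column):
    """Tag each numeric cell with flags for being the column minimum or maximum,
    a simplified stand-in for the rank information a rank-aware encoder learns."""
    lo, hi = min(column), max(column)
    return [{"value": v, "is_min": v == lo, "is_max": v == hi} for v in column]

# toy numeric column from a table
cells = add_rank_markers([120, 480, 75, 310])
```

Exposing which cell holds the extreme value is what lets the retriever surface the right table for questions like "which row has the highest value", which require reasoning across the whole column rather than matching a single cell.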
Exploration of Effective Attention Strategies for Neural Automatic Post-editing with Transformer
Automatic post-editing (APE) is the study of correcting translation errors in the output of an unknown machine translation (MT) system and has been considered a method of improving translation quality without modifying conventional MT systems. Recently, several variants of Transformer that take both the MT output and its corresponding source sentence as input have been proposed for APE, and models that introduce an additional attention layer into the encoder to jointly encode the MT output with its source sentence ranked highly in the WMT19 APE shared task. We examine the effectiveness of this joint-encoding strategy in a controlled environment and compare four types of decoder multi-source attention strategies introduced in previous APE models. The experimental results indicate that the joint-encoding strategy is effective and that taking the final encoded representation of the source sentence is more appropriate than taking that representation from within the same encoder stack. Furthermore, among the multi-source attention strategies combined with joint encoding, the strategy that applies attention to the concatenated input representations and the strategy that sums the individual attention to each input both improve the quality of APE results over using joint encoding only.
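The two multi-source strategies highlighted at the end can be sketched with single-head dot-product attention (a minimal sketch with assumed shapes and names; real APE models use multi-head attention with learned query/key/value projections):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, mem):
    """Single-head dot-product attention of query `q` over memory rows."""
    w = softmax(mem @ q)
    return w @ mem

rng = np.random.default_rng(0)
d = 8
q = rng.normal(size=d)            # decoder query vector
src = rng.normal(size=(5, d))     # encoded source sentence (5 tokens)
mt = rng.normal(size=(4, d))      # encoded MT output (4 tokens)

# strategy 1: attention over the concatenated input representations
concat_out = attend(q, np.vstack([src, mt]))

# strategy 2: separate attention to each input, with the results summed
summed_out = attend(q, src) + attend(q, mt)
```

Concatenation lets one softmax compete across both inputs, while summing keeps a separate attention distribution per input; the abstract reports that both outperform joint encoding alone.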