10 research outputs found

    A Self-Supervised Automatic Post-Editing Data Generation Tool

    Full text link
    Data building for automatic post-editing (APE) requires extensive and expert-level human effort, as it contains an elaborate process that involves identifying errors in sentences and providing suitable revisions. Hence, we develop a self-supervised data generation tool, deployable as a web application, that minimizes human supervision and constructs personalized APE data from a parallel corpus for several language pairs with English as the target language. Data-centric APE research can be conducted using this tool, involving many language pairs that have not been studied thus far owing to the lack of suitable data.Comment: Accepted for DataPerf workshop at ICML 202

    QUAK: A Synthetic Quality Estimation Dataset for Korean-English Neural Machine Translation

    Full text link
    With the recent advance in neural machine translation demonstrating its importance, research on quality estimation (QE) has been steadily progressing. QE aims to automatically predict the quality of machine translation (MT) output without reference sentences. Despite its high utility in the real world, there remain several limitations concerning manual QE data creation: inevitably incurred non-trivial costs due to the need for translation experts, and issues with data scaling and language expansion. To tackle these limitations, we present QUAK, a Korean-English synthetic QE dataset generated in a fully automatic manner. This consists of three sub-QUAK datasets QUAK-M, QUAK-P, and QUAK-H, produced through three strategies that are relatively free from language constraints. Since each strategy requires no human effort, which facilitates scalability, we scale our data up to 1.58M for QUAK-P, H and 6.58M for QUAK-M. As an experiment, we quantitatively analyze word-level QE results in various ways while performing statistical analysis. Moreover, we show that datasets scaled in an efficient way also contribute to performance improvements by observing meaningful performance gains in QUAK-M, P when adding data up to 1.58M

    Comparative Analysis of Current Approaches to Quality Estimation for Neural Machine Translation

    No full text
    Quality estimation (QE) has recently gained increasing interest as it can predict the quality of machine translation results without a reference translation. QE is an annual shared task at the Conference on Machine Translation (WMT), and most recent studies have applied the multilingual pretrained language model (mPLM) to address this task. Recent studies have focused on the performance improvement of this task using data augmentation with finetuning based on a large-scale mPLM. In this study, we eliminate the effects of data augmentation and conduct a pure performance comparison between various mPLMs. Separate from the recent performance-driven QE research involved in competitions addressing a shared task, we utilize the comparison for sub-tasks from WMT20 and identify an optimal mPLM. Moreover, we demonstrate QE using the multilingual BART model, which has not yet been utilized, and conduct comparative experiments and analyses with cross-lingual language models (XLMs), multilingual BERT, and XLM-RoBERTa

    BERTOEIC: Solving TOEIC Problems Using Simple and Efficient Data Augmentation Techniques with Pretrained Transformer Encoders

    No full text
    Recent studies have attempted to understand natural language and infer answers. Machine reading comprehension is one of the representatives, and several related datasets have been opened. However, there are few official open datasets for the Test of English for International Communication (TOEIC), which is widely used for evaluating people’s English proficiency, and research for further advancement is not being actively conducted. We consider that the reason why deep learning research for TOEIC is difficult is due to the data scarcity problem, so we therefore propose two data augmentation methods to improve the model in a low resource environment. Considering the attributes of the semantic and grammar problem type in TOEIC, the proposed methods can augment the data similar to the real TOEIC problem by using POS-tagging and Lemmatizing. In addition, we confirmed the importance of understanding semantics and grammar in TOEIC through experiments on each proposed methodology and experiments according to the amount of data. The proposed methods address the data shortage problem of TOEIC and enable an acceptable human-level performance

    Empirical Analysis of Parallel Corpora and In-Depth Analysis Using LIWC

    No full text
    The machine translation system aims to translate source language into target language. Recent studies on MT systems mainly focus on neural machine translation. One factor that significantly affects the performance of NMT is the availability of high-quality parallel corpora. However, high-quality parallel corpora concerning Korean are relatively scarce compared to those associated with other high-resource languages, such as German or Italian. To address this problem, AI Hub recently released seven types of parallel corpora for Korean. In this study, we conduct an in-depth verification of the quality of corresponding parallel corpora through Linguistic Inquiry and Word Count (LIWC) and several relevant experiments. LIWC is a word-counting software program that can analyze corpora in multiple ways and extract linguistic features as a dictionary base. To the best of our knowledge, this study is the first to use LIWC to analyze parallel corpora in the field of NMT. Our findings suggest the direction of further research toward obtaining the improved quality parallel corpora through our correlation analysis in LIWC and NMT performance

    Uncovering the Risks and Drawbacks Associated With the Use of Synthetic Data for Grammatical Error Correction

    No full text
    In a Data-Centric AI paradigm, the model performance is enhanced without altering the model architecture, as evidenced by real-world and benchmark dataset demonstrations. With the advancements of large language models (LLM), it has become increasingly feasible to generate high-quality synthetic data, while considering the need to construct fully synthetic datasets for real-world data containing numerous personal information. However, in-depth validation of the solely synthetic data setting has yet to be conducted, despite the increased possibility of models trained exclusively on fully synthetic data emerging in the future. Therefore, we examined the question, “Do data quality control techniques (known to positively impact data-centric AI) consistently aid models trained exclusively on synthetic datasets?”. To explore this query, we performed detailed analyses using synthetic datasets generated for speech recognition postprocessing using the BackTranScription (BTS) approach. Our study primarily addressed the potential adverse effects of data quality control measures (e.g., noise injection and balanced data) and training strategies in the context of synthetic-only experiments. As a result of the experiment, we observed the negative effect that the data-centric methodology drops by a maximum of 44.03 points in the fully synthetic data setting

    Empirical Analysis of Parallel Corpora and In-Depth Analysis Using LIWC

    No full text
    The machine translation system aims to translate source language into target language. Recent studies on MT systems mainly focus on neural machine translation. One factor that significantly affects the performance of NMT is the availability of high-quality parallel corpora. However, high-quality parallel corpora concerning Korean are relatively scarce compared to those associated with other high-resource languages, such as German or Italian. To address this problem, AI Hub recently released seven types of parallel corpora for Korean. In this study, we conduct an in-depth verification of the quality of corresponding parallel corpora through Linguistic Inquiry and Word Count (LIWC) and several relevant experiments. LIWC is a word-counting software program that can analyze corpora in multiple ways and extract linguistic features as a dictionary base. To the best of our knowledge, this study is the first to use LIWC to analyze parallel corpora in the field of NMT. Our findings suggest the direction of further research toward obtaining the improved quality parallel corpora through our correlation analysis in LIWC and NMT performance

    A Survey on Evaluation Metrics for Machine Translation

    No full text
    The success of Transformer architecture has seen increased interest in machine translation (MT). The translation quality of neural network-based MT transcends that of translations derived using statistical methods. This growth in MT research has entailed the development of accurate automatic evaluation metrics that allow us to track the performance of MT. However, automatically evaluating and comparing MT systems is a challenging task. Several studies have shown that traditional metrics (e.g., BLEU, TER) show poor performance in capturing semantic similarity between MT outputs and human reference translations. To date, to improve performance, various evaluation metrics have been proposed using the Transformer architecture. However, a systematic and comprehensive literature review on these metrics is still missing. Therefore, it is necessary to survey the existing automatic evaluation metrics of MT to enable both established and new researchers to quickly understand the trend of MT evaluation over the past few years. In this survey, we present the trend of automatic evaluation metrics. To better understand the developments in the field, we provide the taxonomy of the automatic evaluation metrics. Then, we explain the key contributions and shortcomings of the metrics. In addition, we select the representative metrics from the taxonomy, and conduct experiments to analyze related problems. Finally, we discuss the limitation of the current automatic metric studies through the experimentation and our suggestions for further research to improve the automatic evaluation metrics

    Return on Advertising Spend Prediction with Task Decomposition-Based LSTM Model

    No full text
    Return on advertising spend (ROAS) refers to the ratio of revenue generated by advertising projects to its expense. It is used to assess the effectiveness of advertising marketing. Several simulation-based controlled experiments, such as geo experiments, have been proposed recently. This refers to calculating ROAS by dividing a geographic region into a control group and a treatment group and comparing the ROAS generated in each group. However, the data collected through these experiments can only be used to analyze previously constructed data, making it difficult to use in an inductive process that predicts future profits or costs. Furthermore, to obtain ROAS for each advertising group, data must be collected under a new experimental setting each time, suggesting that there is a limitation in using previously collected data. Considering these, we present a method for predicting ROAS that does not require controlled experiments in data acquisition and validates its effectiveness through comparative experiments. Specifically, we propose a task deposition method that divides the end-to-end prediction task into the two-stage process: occurrence prediction and occurred ROAS regression. Through comparative experiments, we reveal that these approaches can effectively deal with the advertising data, in which the label is mainly set to zero-label
    corecore