10 research outputs found

A Self-Supervised Automatic Post-Editing Data Generation Tool

Author: Eo Sugyeong
Lee SeungJun
Lim Heuiseok
Moon Hyeonseok
Park Chanjun
Seo Jaehyung
Publication date: 09/06/2022

Data building for automatic post-editing (APE) requires extensive and expert-level human effort, as it contains an elaborate process that involves identifying errors in sentences and providing suitable revisions. Hence, we develop a self-supervised data generation tool, deployable as a web application, that minimizes human supervision and constructs personalized APE data from a parallel corpus for several language pairs with English as the target language. Data-centric APE research can be conducted using this tool, involving many language pairs that have not been studied thus far owing to the lack of suitable data.

Comment: Accepted for DataPerf workshop at ICML 202

arXiv.org e-Print Archive

QUAK: A Synthetic Quality Estimation Dataset for Korean-English Neural Machine Translation

Author: Eo Sugyeong
Kim Gyeongmin
Lee Jungseob
Lim Heuiseok
Moon Hyeonseok
Park Chanjun
Seo Jaehyung
Publication date: 30/09/2022

With recent advances in neural machine translation demonstrating its importance, research on quality estimation (QE) has been steadily progressing. QE aims to automatically predict the quality of machine translation (MT) output without reference sentences. Despite its high utility in the real world, there remain several limitations concerning manual QE data creation: non-trivial costs that are inevitably incurred because translation experts are needed, and issues with data scaling and language expansion. To tackle these limitations, we present QUAK, a Korean-English synthetic QE dataset generated in a fully automatic manner. It consists of three sub-datasets, QUAK-M, QUAK-P, and QUAK-H, produced through three strategies that are relatively free from language constraints. Since each strategy requires no human effort, which facilitates scalability, we scale our data up to 1.58M for QUAK-P, H and 6.58M for QUAK-M. As an experiment, we quantitatively analyze word-level QE results in various ways while performing statistical analysis. Moreover, we show that datasets scaled in an efficient way also contribute to performance improvements by observing meaningful performance gains in QUAK-M, P when adding data up to 1.58M.

arXiv.org e-Print Archive

Comparative Analysis of Current Approaches to Quality Estimation for Neural Machine Translation

Author: Chanjun Park
Heuiseok Lim
Hyeonseok Moon
Jaehyung Seo
Sugyeong Eo
Publication venue: MDPI AG
Publication date: 01/07/2021

Quality estimation (QE) has recently gained increasing interest as it can predict the quality of machine translation results without a reference translation. QE is an annual shared task at the Conference on Machine Translation (WMT), and most recent studies have applied the multilingual pretrained language model (mPLM) to address this task. Recent studies have focused on the performance improvement of this task using data augmentation with finetuning based on a large-scale mPLM. In this study, we eliminate the effects of data augmentation and conduct a pure performance comparison between various mPLMs. Separate from the recent performance-driven QE research involved in competitions addressing a shared task, we utilize the comparison for sub-tasks from WMT20 and identify an optimal mPLM. Moreover, we demonstrate QE using the multilingual BART model, which has not yet been utilized, and conduct comparative experiments and analyses with cross-lingual language models (XLMs), multilingual BERT, and XLM-RoBERTa

Multidisciplinary Digital Publishing Institute

Directory of Open Access Journals

BERTOEIC: Solving TOEIC Problems Using Simple and Efficient Data Augmentation Techniques with Pretrained Transformer Encoders

Author: Chanjun Park
Heuiseok Lim
Hyeonseok Moon
Jaehyung Seo
Jeongwoo Lee
Sugyeong Eo
Publication venue: MDPI AG
Publication date: 01/07/2022

Recent studies have attempted to understand natural language and infer answers. Machine reading comprehension is a representative task, and several related datasets have been released. However, there are few official open datasets for the Test of English for International Communication (TOEIC), which is widely used for evaluating people’s English proficiency, and research for further advancement is not being actively conducted. We attribute the difficulty of deep learning research for TOEIC to data scarcity, and therefore propose two data augmentation methods to improve the model in a low-resource environment. Considering the attributes of the semantic and grammar problem types in TOEIC, the proposed methods augment the data to resemble real TOEIC problems by using POS tagging and lemmatization. In addition, we confirm the importance of understanding semantics and grammar in TOEIC through experiments on each proposed method and experiments that vary the amount of data. The proposed methods address the data shortage problem of TOEIC and enable an acceptable human-level performance.

Directory of Open Access Journals

Empirical Analysis of Parallel Corpora and In-Depth Analysis Using LIWC

Author: Chanjun Park
Heuiseok Lim
Hyeonseok Moon
Jaehyung Seo
Midan Shim
Seolhwa Lee
Sugyeong Eo
Publication venue: MDPI AG
Publication date: 30/05/2022

A machine translation (MT) system aims to translate a source language into a target language. Recent studies on MT systems mainly focus on neural machine translation (NMT). One factor that significantly affects the performance of NMT is the availability of high-quality parallel corpora. However, high-quality parallel corpora concerning Korean are relatively scarce compared to those associated with other high-resource languages, such as German or Italian. To address this problem, AI Hub recently released seven types of parallel corpora for Korean. In this study, we conduct an in-depth verification of the quality of these parallel corpora through Linguistic Inquiry and Word Count (LIWC) and several relevant experiments. LIWC is a word-counting software program that can analyze corpora in multiple ways and extract linguistic features on a dictionary basis. To the best of our knowledge, this study is the first to use LIWC to analyze parallel corpora in the field of NMT. Our findings, based on a correlation analysis between LIWC and NMT performance, suggest directions for further research toward obtaining higher-quality parallel corpora.
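The dictionary-based word counting mentioned in the abstract above can be illustrated with a minimal sketch. It is not LIWC itself: the category names, the word lists, and the function `category_rates` are hypothetical and chosen only to show the general idea of counting tokens that fall into predefined word categories.

```python
import re
from collections import Counter

# Hypothetical toy dictionary; real LIWC dictionaries are far larger
# and are not reproduced here.
CATEGORIES = {
    "positive_emotion": {"good", "happy", "love"},
    "negation": {"no", "not", "never"},
}


def category_rates(text, categories=CATEGORIES):
    """Return, per category, the share of tokens that belong to it."""
    # Lowercase the text and keep alphabetic tokens (with apostrophes).
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    counts = Counter()
    for token in tokens:
        for name, words in categories.items():
            if token in words:
                counts[name] += 1
    # Avoid division by zero for empty input.
    return {name: (counts[name] / total if total else 0.0) for name in categories}
```

Rates of this kind, computed per corpus, are the sort of feature that could then be correlated with translation quality scores; how the study actually does this is not specified in the abstract.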

Multidisciplinary Digital Publishing Institute

Mimicking Infants' Bilingual Language Acquisition for Domain Specialized Neural Machine Translation

Author: Eo Sugyeong
Go Woo Young
Lee Seolhwa
Lim Heuiseok
Moon Hyeonseok
Park Chanjun
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2022

Copenhagen University Research Information System

Uncovering the Risks and Drawbacks Associated With the Use of Synthetic Data for Grammatical Error Correction

Author: Chanjun Park
Heuiseok Lim
Hyeonseok Moon
Jaehyung Seo
Seolhwa Lee
Seonmin Koo
Sugyeong Eo
Publication venue: IEEE
Publication date: 01/01/2023

In a Data-Centric AI paradigm, the model performance is enhanced without altering the model architecture, as evidenced by real-world and benchmark dataset demonstrations. With the advancement of large language models (LLMs), it has become increasingly feasible to generate high-quality synthetic data, while there is also a need to construct fully synthetic datasets in place of real-world data that contain extensive personal information. However, in-depth validation of the solely synthetic data setting has yet to be conducted, despite the increased possibility of models trained exclusively on fully synthetic data emerging in the future. Therefore, we examined the question, “Do data quality control techniques (known to positively impact data-centric AI) consistently aid models trained exclusively on synthetic datasets?” To explore this question, we performed detailed analyses using synthetic datasets generated for speech recognition postprocessing using the BackTranScription (BTS) approach. Our study primarily addressed the potential adverse effects of data quality control measures (e.g., noise injection and balanced data) and training strategies in the context of synthetic-only experiments. In our experiments, we observed a negative effect: performance under the data-centric methodology drops by up to 44.03 points in the fully synthetic data setting.

Directory of Open Access Journals

Empirical Analysis of Parallel Corpora and In-Depth Analysis Using LIWC

Author: Chanjun Park
Heuiseok Lim
Hyeonseok Moon
Jaehyung Seo
Midan Shim
Seolhwa Lee
Sugyeong Eo
Publication venue: MDPI AG
Publication date: 01/05/2022

Directory of Open Access Journals

A Survey on Evaluation Metrics for Machine Translation

Author: Chanjun Park
Heuiseok Lim
Hyeonseok Moon
Jaehyung Seo
Jungseob Lee
Seonmin Koo
Seungjun Lee
Sugyeong Eo
Publication venue: MDPI AG
Publication date: 01/02/2023

The success of the Transformer architecture has increased interest in machine translation (MT). The translation quality of neural network-based MT surpasses that of translations derived using statistical methods. This growth in MT research has entailed the development of accurate automatic evaluation metrics that allow us to track the performance of MT. However, automatically evaluating and comparing MT systems is a challenging task. Several studies have shown that traditional metrics (e.g., BLEU, TER) perform poorly in capturing semantic similarity between MT outputs and human reference translations. To improve performance, various evaluation metrics based on the Transformer architecture have been proposed. However, a systematic and comprehensive literature review on these metrics is still missing. Therefore, it is necessary to survey the existing automatic evaluation metrics of MT to enable both established and new researchers to quickly understand the trend of MT evaluation over the past few years. In this survey, we present the trend of automatic evaluation metrics. To better understand the developments in the field, we provide a taxonomy of the automatic evaluation metrics. Then, we explain the key contributions and shortcomings of the metrics. In addition, we select representative metrics from the taxonomy and conduct experiments to analyze related problems. Finally, we discuss the limitations of current automatic metric studies revealed by the experiments and offer suggestions for further research to improve automatic evaluation metrics.

Directory of Open Access Journals

Return on Advertising Spend Prediction with Task Decomposition-Based LSTM Model

Author: Aram So
Chanjun Park
Hyeonseok Moon
Imatitikua D. Aiyanyo
Jaehyung Seo
Jeongbae Park
Kinam Park
Kyoungwha Ok
Sugyeong Eo
Taemin Lee
Publication venue: MDPI AG
Publication date: 01/05/2022

Return on advertising spend (ROAS) refers to the ratio of revenue generated by advertising projects to their expense. It is used to assess the effectiveness of advertising marketing. Several simulation-based controlled experiments, such as geo experiments, have been proposed recently. A geo experiment calculates ROAS by dividing a geographic region into a control group and a treatment group and comparing the ROAS generated in each group. However, the data collected through these experiments can only be used to analyze previously constructed data, making it difficult to use in an inductive process that predicts future profits or costs. Furthermore, to obtain ROAS for each advertising group, data must be collected under a new experimental setting each time, which limits the use of previously collected data. Considering these issues, we present a method for predicting ROAS that does not require controlled experiments in data acquisition and validate its effectiveness through comparative experiments. Specifically, we propose a task decomposition method that divides the end-to-end prediction task into a two-stage process: occurrence prediction and occurred ROAS regression. Through comparative experiments, we reveal that these approaches can effectively deal with advertising data in which the label is mostly zero.
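The ROAS ratio and the two-stage decomposition described in the abstract above can be illustrated with a minimal sketch. The function names, the 0.5 decision threshold, and the plain callables standing in for the two models are assumptions for illustration only; the paper's models are LSTM-based and are not reproduced here.

```python
from typing import Callable, Sequence

Features = Sequence[float]


def roas(revenue: float, ad_spend: float) -> float:
    """ROAS as the ratio of advertising revenue to advertising expense."""
    if ad_spend <= 0:
        raise ValueError("ad_spend must be positive")
    return revenue / ad_spend


def predict_roas(
    features: Features,
    occurrence_model: Callable[[Features], float],
    regression_model: Callable[[Features], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> float:
    """Two-stage prediction: occurrence first, then the ROAS value."""
    # Stage 1: estimate the probability that any ROAS occurs at all.
    if occurrence_model(features) < threshold:
        return 0.0  # predicted as a zero-label sample
    # Stage 2: regress the ROAS value only for samples predicted to occur.
    return regression_model(features)
```

Separating the two stages keeps the regressor from being dominated by the many zero-valued labels, which is the motivation the abstract gives for the decomposition.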

Directory of Open Access Journals