49 research outputs found

Data-Driven Approach for Formality-Sensitive Machine Translation: Language-Specific Handling and Synthetic Data Generation

Authors: Lee Seugnjun; Lim Heuiseok; Moon Hyeonseok; Park Chanjun
Publication date: 27/06/2023

In this paper, we introduce a data-driven approach for Formality-Sensitive Machine Translation (FSMT) that caters to the unique linguistic properties of four target languages. Our methodology centers on two core strategies: 1) language-specific data handling, and 2) synthetic data generation using large-scale language models and empirical prompt engineering. This approach demonstrates a considerable improvement over the baseline, highlighting the effectiveness of data-centric techniques. Our prompt engineering strategy further improves performance by producing superior synthetic translation examples. Comment: Accepted for Data-centric Machine Learning Research (DMLR) Workshop at ICML 202

arXiv.org e-Print Archive

A Self-Supervised Automatic Post-Editing Data Generation Tool

Authors: Eo Sugyeong; Lee SeungJun; Lim Heuiseok; Moon Hyeonseok; Park Chanjun; Seo Jaehyung
Publication date: 09/06/2022

Data building for automatic post-editing (APE) requires extensive and expert-level human effort, as it contains an elaborate process that involves identifying errors in sentences and providing suitable revisions. Hence, we develop a self-supervised data generation tool, deployable as a web application, that minimizes human supervision and constructs personalized APE data from a parallel corpus for several language pairs with English as the target language. Data-centric APE research can be conducted using this tool, involving many language pairs that have not been studied thus far owing to the lack of suitable data. Comment: Accepted for DataPerf workshop at ICML 202

arXiv.org e-Print Archive

A Study on the Development of Game-based Mind Wandering Judgment Model in Video Lecture-based Education

Authors: Jo Jaechoon; Lim Heuiseok; Yang Yeongwook
Publication venue: 'Taiwan Association of Engineering and Technology Innovation'
Publication date: 11/10/2018

Although video lectures are very efficient learning materials, they tend to be delivered unilaterally by the lecturer. Learning easily degrades into a one-sided activity, which has long been considered a problem of online education, and it is difficult to judge whether learners are actually learning. Therefore, this paper proposes a minimum learning activity judgment model that automatically determines, through mind wandering judgment, whether learners are actually learning, in order to overcome the limitations of previous learning materials, and reports an experiment verifying its educational effect. The experimental results show that the video lecture class using the minimum learning activity judgment system was effective in improving academic achievement.

Taiwan Association of Engineering and Technology Innovation: E-Journals

QUAK: A Synthetic Quality Estimation Dataset for Korean-English Neural Machine Translation

Authors: Eo Sugyeong; Kim Gyeongmin; Lee Jungseob; Lim Heuiseok; Moon Hyeonseok; Park Chanjun; Seo Jaehyung
Publication date: 30/09/2022

With the recent advances in neural machine translation demonstrating its importance, research on quality estimation (QE) has been steadily progressing. QE aims to automatically predict the quality of machine translation (MT) output without reference sentences. Despite its high utility in the real world, there remain several limitations concerning manual QE data creation: inevitably incurred non-trivial costs due to the need for translation experts, and issues with data scaling and language expansion. To tackle these limitations, we present QUAK, a Korean-English synthetic QE dataset generated in a fully automatic manner. It consists of three sub-QUAK datasets, QUAK-M, QUAK-P, and QUAK-H, produced through three strategies that are relatively free from language constraints. Since each strategy requires no human effort, which facilitates scalability, we scale our data up to 1.58M for QUAK-P, H and 6.58M for QUAK-M. As an experiment, we quantitatively analyze word-level QE results in various ways while performing statistical analysis. Moreover, we show that datasets scaled in an efficient way also contribute to performance improvements by observing meaningful performance gains in QUAK-M, P when adding data up to 1.58M.

arXiv.org e-Print Archive