Data-Driven Approach for Formality-Sensitive Machine Translation: Language-Specific Handling and Synthetic Data Generation
In this paper, we introduce a data-driven approach for Formality-Sensitive
Machine Translation (FSMT) that caters to the unique linguistic properties of
four target languages. Our methodology centers on two core strategies: 1)
language-specific data handling, and 2) synthetic data generation using
large-scale language models and empirical prompt engineering. This approach
demonstrates a considerable improvement over the baseline, highlighting the
effectiveness of data-centric techniques. Our prompt engineering strategy
further improves performance by producing superior synthetic translation
examples.
Comment: Accepted for the Data-centric Machine Learning Research (DMLR) Workshop at ICML 202
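The formality-controlled prompting idea above can be illustrated with a small sketch. The template wording, language list, and example sentence below are illustrative assumptions, not the paper's actual prompts or target languages.

```python
# Hypothetical sketch of formality-controlled prompt construction for
# synthetic FSMT data generation; the template text is an assumption.

FORMALITY_LEVELS = ("formal", "informal")

def build_prompt(source: str, target_lang: str, formality: str) -> str:
    """Compose an instruction prompt asking an LLM for a translation
    in the requested formality register."""
    if formality not in FORMALITY_LEVELS:
        raise ValueError(f"unknown formality: {formality}")
    return (
        f"Translate the following English sentence into {target_lang}, "
        f"using a strictly {formality} register.\n"
        f"Sentence: {source}\n"
        f"Translation:"
    )

# One prompt per (language, formality) pair yields paired synthetic examples.
prompts = [
    build_prompt("Could you send me the file?", lang, level)
    for lang in ("Korean", "German", "Hindi", "Vietnamese")
    for level in FORMALITY_LEVELS
]
```

Feeding such paired prompts to a large language model would yield formality-contrastive synthetic examples for each target language.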
A Self-Supervised Automatic Post-Editing Data Generation Tool
Data building for automatic post-editing (APE) requires extensive and expert-level human effort, as it involves an elaborate process of identifying errors in sentences and providing suitable revisions. Hence, we
develop a self-supervised data generation tool, deployable as a web
application, that minimizes human supervision and constructs personalized APE
data from a parallel corpus for several language pairs with English as the
target language. Data-centric APE research can be conducted using this tool,
involving many language pairs that have not been studied thus far owing to the
lack of suitable data.
Comment: Accepted for the DataPerf workshop at ICML 202
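The core self-supervised construction can be sketched as follows: from a parallel corpus with English targets, machine-translate the source side and treat the human reference as a pseudo post-edit. The `translate()` stub below is a placeholder assumption; the actual tool would call a real MT system.

```python
# Minimal sketch of self-supervised APE triplet construction.
# translate() is a toy stand-in for a real MT system.

def translate(src: str) -> str:
    """Placeholder MT system (lowercasing only), for illustration."""
    return src.lower()

def build_ape_triplets(parallel_corpus):
    """Yield (source, machine translation, pseudo post-edit) triplets."""
    for src, ref in parallel_corpus:
        mt = translate(src)
        if mt != ref:  # keep only examples that need editing
            yield (src, mt, ref)

corpus = [("Guten Morgen", "Good morning"), ("Danke", "Thanks")]
triplets = list(build_ape_triplets(corpus))
```

Because the "post-edit" comes from the existing reference, no human annotation is needed, which is what makes the construction scalable across language pairs.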
QUAK: A Synthetic Quality Estimation Dataset for Korean-English Neural Machine Translation
As recent advances in neural machine translation have demonstrated its importance, research on quality estimation (QE) has been progressing steadily. QE aims to automatically predict the quality of machine translation (MT) output without reference sentences. Despite its high utility in the real world, manual QE data creation suffers from several limitations: non-trivial costs inevitably incurred by the need for translation experts, and issues with data scaling and language expansion. To tackle these limitations, we
present QUAK, a Korean-English synthetic QE dataset generated in a fully
automatic manner. It consists of three sub-QUAK datasets, QUAK-M, QUAK-P, and QUAK-H, produced through three strategies that are relatively free from language constraints. Since each strategy requires no human effort, which facilitates scalability, we scale our data up to 1.58M for QUAK-P and QUAK-H and 6.58M for QUAK-M. As an experiment, we quantitatively analyze word-level QE results in various ways while performing statistical analysis. Moreover, we show that datasets scaled in an efficient way also contribute to performance improvements, observing meaningful performance gains in QUAK-M and QUAK-P when adding data up to 1.58M.
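One way to produce word-level QE tags without human effort, in the spirit of QUAK's automatic strategies (though not its exact procedure), is to align the MT hypothesis against a correct translation and mark matched tokens OK and the rest BAD:

```python
# Sketch of automatic word-level OK/BAD tagging via sequence alignment;
# a simplified illustration, not QUAK's actual generation pipeline.
from difflib import SequenceMatcher

def word_tags(mt: str, correct: str):
    """Return one OK/BAD tag per MT token via token alignment."""
    mt_toks, ref_toks = mt.split(), correct.split()
    tags = ["BAD"] * len(mt_toks)
    sm = SequenceMatcher(a=mt_toks, b=ref_toks)
    for block in sm.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            tags[i] = "OK"
    return tags

tags = word_tags("he go to school", "he goes to school")
# "he", "to", "school" align with the correct translation; "go" does not.
```

Since the tags fall out of the alignment alone, the procedure scales to millions of sentences with no annotator in the loop.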
Effect of KH-BaRoKer-SeongJangTang based on traditional medicine theory on longitudinal bone growth
KH-BaRoKer-SeongJangTang (KBS) is a recently developed formulation composed of traditional drugs, based on the traditional medical theory of Oriental texts such as ShinNongBonChoGyeong and JuRye, and has been used to improve children's growth in Korea. Although KBS is commonly prescribed to children with growth retardation, its pharmacological effects have not been fully characterized in experimental models. The aim of this study was to evaluate the effects of KBS on bone growth. Growth plate thickness and bone parameters such as bone volume/tissue volume (BV/TV), trabecular thickness (Tb.Th), trabecular number (Tb.N), connectivity density (Conn.D), and total porosity were analyzed by means of micro-computed tomography. Serum insulin-like growth factor-I (IGF-I) levels were measured by enzyme-linked immunosorbent assay. Hepatic IGF-I mRNA expression was analyzed by real-time polymerase chain reaction. Phosphorylation of signal transducer and activator of transcription 5 (STAT5) was investigated using Western blot analysis and immunohistochemistry. The thickness of the growth plate was increased by KBS. BV/TV, Tb.Th, Tb.N, Conn.D, and total porosity were improved by KBS. Hepatic IGF-I mRNA and serum IGF-I levels were elevated by KBS. Phosphorylation of STAT5 was increased by administration of KBS. These results suggest that KBS may be helpful to children with growth retardation through the elevation of IGF-I.
The Impact of Korean Medicine Treatment on the Incidence of Parkinson's Disease in Patients with Inflammatory Bowel Disease: A Nationwide Population-Based Cohort Study in South Korea
We aimed to investigate the association between Korean medicine (KM) treatment and the risk of Parkinson's disease (PD) in patients with inflammatory bowel disease (IBD) in South Korea. This study analyzed data from the National Health Insurance Service-Senior cohort in South Korea. The 1816 IBD patients enrolled in the analysis comprised 411 who received only conventional treatment (monotherapy group) and 1405 who received both conventional and KM treatments (integrative therapy group). The risk of PD in patients with IBD was significantly lower in the integrative therapy group than in the monotherapy group after adjusting for confounding variables (adjusted hazard ratio (HR), 0.56; 95% confidence interval (CI), 0.34-0.92). In the mild Charlson Comorbidity Index (CCI) group, the risk of PD in patients with IBD was also lower in the integrative therapy group than in the monotherapy group (adjusted HR, 0.39; 95% CI, 0.20-0.77). However, there was no significant difference in the risk of PD between the integrative therapy and monotherapy groups among individuals with severe CCI (adjusted HR, 0.90; 95% CI, 0.41-1.96). IBD patients who receive integrative therapy are at a decreased risk of PD; KM treatment may help prevent PD in IBD patients.
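The adjusted hazard ratios above can be read as relative reductions in hazard; the small helper below makes that arithmetic explicit (a presentation aid, not part of the study's statistical analysis).

```python
# Convert a hazard ratio < 1 into the implied percent hazard reduction.

def risk_reduction_pct(hazard_ratio: float) -> float:
    """Percent reduction in hazard implied by a hazard ratio."""
    return round((1.0 - hazard_ratio) * 100.0, 1)

overall = risk_reduction_pct(0.56)   # 44% lower hazard overall
mild_cci = risk_reduction_pct(0.39)  # 61% lower hazard in the mild CCI group
```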
Comparative Analysis of Current Approaches to Quality Estimation for Neural Machine Translation
Quality estimation (QE) has recently gained increasing interest, as it can predict the quality of machine translation results without a reference translation. QE is an annual shared task at the Conference on Machine Translation (WMT), and most recent studies have applied multilingual pretrained language models (mPLMs) to address this task. Recent studies have focused on improving performance on this task using data augmentation with fine-tuning based on a large-scale mPLM. In this study, we eliminate the effects of data augmentation and conduct a pure performance comparison between various mPLMs. Separate from the recent performance-driven QE research conducted through shared-task competitions, we carry out this comparison on the WMT20 sub-tasks and identify an optimal mPLM. Moreover, we demonstrate QE using the multilingual BART model, which had not yet been utilized for this task, and conduct comparative experiments and analyses with cross-lingual language models (XLMs), multilingual BERT, and XLM-RoBERTa.
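Sentence-level QE systems of this kind are commonly scored by the Pearson correlation between predicted and human quality scores; a self-contained implementation is sketched below (the score values are made-up illustrative numbers, not results from the study).

```python
# Pearson correlation, the standard sentence-level QE evaluation metric.
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

human = [0.2, 0.5, 0.9, 0.4]        # hypothetical human quality scores
predicted = [0.25, 0.45, 0.85, 0.5]  # hypothetical model predictions
r = pearson_r(human, predicted)
```

A higher r means the model ranks translation quality more like human annotators do, which is what the mPLM comparison ultimately measures.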
BERTOEIC: Solving TOEIC Problems Using Simple and Efficient Data Augmentation Techniques with Pretrained Transformer Encoders
Recent studies have attempted to understand natural language and infer answers. Machine reading comprehension is one representative task, and several related datasets have been released. However, there are few official open datasets for the Test of English for International Communication (TOEIC), which is widely used for evaluating people's English proficiency, and research toward further advancement is not being actively conducted. We consider that deep learning research on TOEIC is difficult because of the data scarcity problem, and we therefore propose two data augmentation methods to improve the model in a low-resource environment. Considering the attributes of the semantic and grammar problem types in TOEIC, the proposed methods can augment data similar to real TOEIC problems by using POS tagging and lemmatization. In addition, we confirm the importance of understanding semantics and grammar in TOEIC through experiments on each proposed method and experiments varying the amount of data. The proposed methods address the data shortage problem of TOEIC and enable acceptable human-level performance.
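Grammar-type augmentation in the spirit described above can be sketched as follows: take a sentence with a blank and build TOEIC-style options by inflecting the answer's lemma. The naive suffix rules below are an assumption standing in for real POS tagging and lemmatization.

```python
# Illustrative grammar-type TOEIC item generation; the inflection rules
# are deliberately naive stand-ins for proper NLP tooling.

def inflect(lemma: str):
    """Naive inflected variants of a verb lemma (illustrative only)."""
    return [lemma, lemma + "s", lemma + "ed", lemma + "ing"]

def make_item(sentence_with_blank: str, lemma: str, answer: str):
    """Build a (question, options, answer) tuple from one blank."""
    options = inflect(lemma)
    assert answer in options
    return (sentence_with_blank, options, answer)

item = make_item("She ___ the report yesterday.", "review", "reviewed")
```

Because all four options share one lemma, the item tests grammatical form rather than word meaning, mirroring the grammar problem type.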
Mimicking Infants' Bilingual Language Acquisition for Domain Specialized Neural Machine Translation
Empirical Analysis of Parallel Corpora and In-Depth Analysis Using LIWC
The machine translation (MT) system aims to translate a source language into a target language. Recent studies on MT systems mainly focus on neural machine translation (NMT). One factor that significantly affects the performance of NMT is the availability of high-quality parallel corpora. However, high-quality parallel corpora concerning Korean are relatively scarce compared to those for other high-resource languages, such as German or Italian. To address this problem, AI Hub recently released seven types of parallel corpora for Korean. In this study, we conduct an in-depth verification of the quality of these parallel corpora through Linguistic Inquiry and Word Count (LIWC) and several relevant experiments. LIWC is a word-counting software program that can analyze corpora in multiple ways and extract linguistic features on a dictionary basis. To the best of our knowledge, this study is the first to use LIWC to analyze parallel corpora in the field of NMT. Our findings suggest directions for further research toward obtaining higher-quality parallel corpora, based on our correlation analysis between LIWC features and NMT performance.
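At its core, LIWC-style analysis is dictionary-based category counting; the tiny category dictionary below is a made-up stand-in for LIWC's proprietary lexicon, shown only to illustrate the mechanism.

```python
# Sketch of dictionary-based linguistic feature extraction, LIWC-style.
# The two-category lexicon here is a toy assumption, not LIWC's dictionary.
from collections import Counter

CATEGORIES = {
    "pronoun": {"i", "you", "we", "they"},
    "negation": {"no", "not", "never"},
}

def liwc_counts(text: str) -> Counter:
    """Count, per category, how many tokens fall in that category's lexicon."""
    counts = Counter()
    for tok in text.lower().split():
        for cat, lexicon in CATEGORIES.items():
            if tok in lexicon:
                counts[cat] += 1
    return counts

counts = liwc_counts("We never said no to you")
```

Per-category counts like these, computed over each side of a parallel corpus, are the kind of features that can then be correlated with NMT performance.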