25 research outputs found

    Data-Driven Approach for Formality-Sensitive Machine Translation: Language-Specific Handling and Synthetic Data Generation

    In this paper, we introduce a data-driven approach for Formality-Sensitive Machine Translation (FSMT) that caters to the unique linguistic properties of four target languages. Our methodology centers on two core strategies: 1) language-specific data handling, and 2) synthetic data generation using large-scale language models and empirical prompt engineering. This approach demonstrates a considerable improvement over the baseline, highlighting the effectiveness of data-centric techniques. Our prompt engineering strategy further improves performance by producing superior synthetic translation examples. Comment: Accepted for Data-centric Machine Learning Research (DMLR) Workshop at ICML 202

    A Self-Supervised Automatic Post-Editing Data Generation Tool

    Data building for automatic post-editing (APE) requires extensive and expert-level human effort, as it contains an elaborate process that involves identifying errors in sentences and providing suitable revisions. Hence, we develop a self-supervised data generation tool, deployable as a web application, that minimizes human supervision and constructs personalized APE data from a parallel corpus for several language pairs with English as the target language. Data-centric APE research can be conducted using this tool, involving many language pairs that have not been studied thus far owing to the lack of suitable data. Comment: Accepted for DataPerf workshop at ICML 202

    QUAK: A Synthetic Quality Estimation Dataset for Korean-English Neural Machine Translation

    With the recent advance in neural machine translation demonstrating its importance, research on quality estimation (QE) has been steadily progressing. QE aims to automatically predict the quality of machine translation (MT) output without reference sentences. Despite its high utility in the real world, there remain several limitations concerning manual QE data creation: inevitably incurred non-trivial costs due to the need for translation experts, and issues with data scaling and language expansion. To tackle these limitations, we present QUAK, a Korean-English synthetic QE dataset generated in a fully automatic manner. This consists of three sub-QUAK datasets, QUAK-M, QUAK-P, and QUAK-H, produced through three strategies that are relatively free from language constraints. Since each strategy requires no human effort, which facilitates scalability, we scale our data up to 1.58M for QUAK-P and QUAK-H and 6.58M for QUAK-M. As an experiment, we quantitatively analyze word-level QE results in various ways while performing statistical analysis. Moreover, we show that datasets scaled in an efficient way also contribute to performance improvements by observing meaningful performance gains in QUAK-M and QUAK-P when adding data up to 1.58M.

    Effect of KH-BaRoKer-SeongJangTang based on traditional medicine theory on longitudinal bone growth

    KH-BaRoKer-SeongJangTang (KBS) is a recently developed formulation of traditional drugs based on the traditional medical theory of Oriental books such as ShinNongBonChoGyeong and JuRye, and it has been used to improve the growth of children in Korea. Although KBS is commonly prescribed to children whose growth is delayed for their age, its pharmacological effects have not been fully understood in experimental models. The aim of this study was to evaluate the effects of KBS on bone growth. Growth plate thickness and bone parameters such as bone volume/tissue volume (BV/TV), trabecular thickness (Tb.Th), trabecular number (Tb.N), connection density (Conn.D), and total porosity were analyzed by means of microcomputed tomography. Serum insulin-like growth factor-I (IGF-I) levels were measured by enzyme-linked immunosorbent assay. Hepatic IGF-I mRNA expression was analyzed by real-time polymerase chain reaction. Phosphorylation of signal transducer and activator of transcription 5 (STAT5) was investigated using Western blot analysis and immunohistochemistry. The thickness of the growth plate was increased by KBS. BV/TV, Tb.Th, Tb.N, Conn.D, and total porosity were improved by KBS. Hepatic IGF-I mRNA and serum IGF-I levels were elevated by KBS. Phosphorylation of STAT5 was increased with administration of KBS. These results suggest that KBS would be helpful to children whose growth is delayed for their age through the elevation of IGF-I.

    The Impact of Korean Medicine Treatment on the Incidence of Parkinson's Disease in Patients with Inflammatory Bowel Disease: A Nationwide Population-Based Cohort Study in South Korea

    We aimed to investigate the association between Korean medicine (KM) treatment and the risk of Parkinson's Disease (PD) in patients with inflammatory bowel disease (IBD) in South Korea. This study analyzed data from the National Health Insurance Service-Senior cohort in South Korea. The 1816 IBD patients enrolled in the analysis comprised 411 who received only conventional treatment (monotherapy group) and 1405 who received both conventional and KM treatments (integrative therapy group). The risk of PD in patients with IBD was significantly lower in the integrative therapy group than in the monotherapy group after adjusting for confounding variables (adjusted hazard ratio (HR), 0.56; 95% confidence interval (CI) = 0.34-0.92). In the mild Charlson Comorbidity Index (CCI) group, the risk of PD in patients with IBD was also lower in the integrative therapy group than in the monotherapy group (adjusted HR, 0.39; 95% CI = 0.20-0.77). However, there was no significant difference in the risk of PD in patients with IBD between the integrative therapy and monotherapy groups among individuals with severe CCI (adjusted HR, 0.90; 95% CI = 0.41-1.96). IBD patients are at a decreased risk of PD when they receive integrative therapy. KM treatment may prevent PD in IBD patients.

    Comparative Analysis of Current Approaches to Quality Estimation for Neural Machine Translation

    Quality estimation (QE) has recently gained increasing interest as it can predict the quality of machine translation results without a reference translation. QE is an annual shared task at the Conference on Machine Translation (WMT), and most recent studies have applied the multilingual pretrained language model (mPLM) to address this task. Recent studies have focused on the performance improvement of this task using data augmentation with finetuning based on a large-scale mPLM. In this study, we eliminate the effects of data augmentation and conduct a pure performance comparison between various mPLMs. Separate from the recent performance-driven QE research involved in competitions addressing a shared task, we utilize the comparison for sub-tasks from WMT20 and identify an optimal mPLM. Moreover, we demonstrate QE using the multilingual BART model, which has not yet been utilized, and conduct comparative experiments and analyses with cross-lingual language models (XLMs), multilingual BERT, and XLM-RoBERTa.

    BERTOEIC: Solving TOEIC Problems Using Simple and Efficient Data Augmentation Techniques with Pretrained Transformer Encoders

    Recent studies have attempted to understand natural language and infer answers. Machine reading comprehension is a representative task, and several related datasets have been released. However, there are few official open datasets for the Test of English for International Communication (TOEIC), which is widely used for evaluating people's English proficiency, and research for further advancement is not being actively conducted. We consider data scarcity to be the reason deep learning research for TOEIC is difficult, and we therefore propose two data augmentation methods to improve the model in a low-resource environment. Considering the attributes of the semantic and grammar problem types in TOEIC, the proposed methods can augment data similar to real TOEIC problems by using POS tagging and lemmatization. In addition, we confirmed the importance of understanding semantics and grammar in TOEIC through experiments on each proposed methodology and experiments according to the amount of data. The proposed methods address the data shortage problem of TOEIC and enable an acceptable human-level performance.

    Empirical Analysis of Parallel Corpora and In-Depth Analysis Using LIWC

    A machine translation (MT) system aims to translate a source language into a target language. Recent studies on MT systems mainly focus on neural machine translation (NMT). One factor that significantly affects the performance of NMT is the availability of high-quality parallel corpora. However, high-quality parallel corpora concerning Korean are relatively scarce compared to those associated with other high-resource languages, such as German or Italian. To address this problem, AI Hub recently released seven types of parallel corpora for Korean. In this study, we conduct an in-depth verification of the quality of corresponding parallel corpora through Linguistic Inquiry and Word Count (LIWC) and several relevant experiments. LIWC is a word-counting software program that can analyze corpora in multiple ways and extract linguistic features as a dictionary base. To the best of our knowledge, this study is the first to use LIWC to analyze parallel corpora in the field of NMT. Our findings, based on a correlation analysis between LIWC features and NMT performance, suggest directions for further research toward obtaining higher-quality parallel corpora.