9 research outputs found

    Korean Language Resources for Everyone

    Get PDF

    Kosp2e: Korean Speech to English Translation Corpus

    Full text link
    Most speech-to-text (S2T) translation studies use English speech as the source, which makes it difficult for non-English speakers to take advantage of S2T technologies. For some languages this problem has been tackled through corpus construction, but the more linguistically distant a language is from English, or the more under-resourced it is, the more significant this deficiency and underrepresentation becomes. In this paper, we introduce kosp2e (read as `kospi'), a corpus that allows Korean speech to be translated into English text in an end-to-end manner. We adopt an open-license speech recognition corpus, a translation corpus, and spoken language corpora to make our dataset freely available to the public, and we check its performance through pipeline- and training-based approaches. Using the pipeline and various end-to-end schemes, we obtain highest BLEU scores of 21.3 and 18.0 respectively on the English hypotheses, validating the feasibility of our data. We plan to supplement annotations for other target languages through community contributions in the future. Comment: Interspeech 2021 camera-ready.
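
    As an aside on the evaluation above, BLEU against English references is typically computed with a tool such as sacrebleu; a minimal sketch follows (the sentences are illustrative placeholders, not kosp2e data):

        # Minimal BLEU-evaluation sketch with the sacrebleu library (pip install sacrebleu).
        # The hypothesis and reference sentences are illustrative placeholders, not kosp2e data.
        import sacrebleu

        hypotheses = [
            "the weather is nice today",
            "i would like to book a table for two",
        ]
        references = [
            "the weather is nice today",
            "i want to reserve a table for two people",
        ]

        # sacrebleu expects a list of reference streams (one stream per reference set).
        bleu = sacrebleu.corpus_bleu(hypotheses, [references])
        print(f"BLEU = {bleu.score:.1f}")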

    A Visual Analytics System for evaluating dataset of Neural Machine Translation

    Get PDF
    Master's thesis -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, February 2023. Advisor: Jinwook Seo.
    The most influential factor in training a Neural Machine Translation (NMT) model is the quality of the training data: parallel corpora composed of sentence pairs in two languages. Improving parallel-corpus quality is therefore essential, and although various refinement methods have been introduced, there is still much room for improvement. This thesis presents a visual analytics system that supports the quality-improvement work on parallel corpora for machine translation training. To discover and triage noise in parallel corpora quickly, the system uses machine learning techniques to extract nine different metrics and provides interactive visual analysis based on them. With the system, users can easily identify noisy pairs, inspect their details, and remove them. To demonstrate the system's effectiveness and usefulness, a qualitative user study was conducted with eight users, including four experts, and points for improvement based on the evaluation results are discussed at the end.
    Contents: Chapter 1 Introduction; Chapter 2 Related Work; Chapter 3 Design Requirements; Chapter 4 Data Preprocessing; Chapter 5 Visualization Design (Distribution View, Ranking View, Text Compare View, Ruleset View); Chapter 6 User Study (Results, Post-study Interviews); Chapter 7 Discussion; Chapter 8 Conclusion; References; Abstract.
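
    For a sense of what such noise metrics can look like, here is a minimal sketch with two generic indicators, length ratio and source-copy overlap; these are illustrative stand-ins, not the nine metrics the thesis extracts:

        # Illustrative noise indicators for ranking parallel sentence pairs.
        # These two are generic stand-ins, not the nine metrics the thesis extracts.

        def length_ratio(src: str, tgt: str) -> float:
            """Source-to-target token-length ratio; extreme values suggest misalignment."""
            return len(src.split()) / max(len(tgt.split()), 1)

        def copy_overlap(src: str, tgt: str) -> float:
            """Fraction of target tokens copied verbatim from the source;
            values near 1.0 suggest an untranslated (copied) target."""
            src_tokens = set(src.split())
            tgt_tokens = tgt.split()
            if not tgt_tokens:
                return 0.0
            return sum(tok in src_tokens for tok in tgt_tokens) / len(tgt_tokens)

        pairs = [
            ("์˜ค๋Š˜ ๋‚ ์”จ๊ฐ€ ์ข‹๋‹ค", "the weather is nice today"),
            ("์˜ค๋Š˜ ๋‚ ์”จ๊ฐ€ ์ข‹๋‹ค", "์˜ค๋Š˜ ๋‚ ์”จ๊ฐ€ ์ข‹๋‹ค"),  # noise: the target is a copy of the source
        ]

        # Surface likely-noisy pairs first, as a ranking view would.
        for src, tgt in sorted(pairs, key=lambda p: copy_overlap(*p), reverse=True):
            print(f"copy={copy_overlap(src, tgt):.2f} len_ratio={length_ratio(src, tgt):.2f} {src} ||| {tgt}")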

    Linguistically-driven Multi-task Pre-training for Low-resource Neural Machine Translation

    Get PDF
    In the present study, we propose novel sequence-to-sequence pre-training objectives for low-resource neural machine translation (NMT): Japanese-specific sequence-to-sequence (JASS) pre-training for language pairs involving Japanese as the source or target language, and English-specific sequence-to-sequence (ENSS) pre-training for language pairs involving English. JASS focuses on masking and reordering Japanese linguistic units known as bunsetsu, whereas ENSS is based on phrase-structure masking and reordering tasks. Experiments on the ASPEC Japanese–English and Japanese–Chinese, Wikipedia Japanese–Chinese, and News English–Korean corpora demonstrate that JASS and ENSS outperform MASS and other existing language-agnostic pre-training methods by up to +2.9 BLEU points on the Japanese–English tasks, up to +7.0 BLEU points on the Japanese–Chinese tasks, and up to +1.3 BLEU points on the English–Korean tasks. An empirical analysis focusing on the relationship between the individual parts of JASS and ENSS reveals the complementary nature of their subtasks. Adequacy evaluation using LASER, human evaluation, and case studies show that our proposed methods significantly outperform pre-training methods without injected linguistic knowledge and have a larger positive impact on adequacy than on fluency.
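
    The masking-and-reordering idea behind JASS and ENSS can be sketched on pre-segmented units; the toy corruption function below is a simplified stand-in for the paper's bunsetsu- and phrase-structure-based objectives:

        # Toy masking-and-reordering corruption over pre-segmented linguistic units,
        # in the spirit of JASS/ENSS. Real JASS operates on automatically segmented
        # bunsetsu; the hand-segmented units below are for illustration only.
        import random

        MASK = "<mask>"

        def make_pretraining_example(units, mask_prob=0.3, seed=0):
            """Return a (corrupted source, original target) pair for seq2seq
            pre-training: some units are masked, then unit order is shuffled."""
            rng = random.Random(seed)
            corrupted = [MASK if rng.random() < mask_prob else u for u in units]
            rng.shuffle(corrupted)  # the reordering sub-task
            return " ".join(corrupted), " ".join(units)

        units = ["็งใฏ", "ๆ˜จๆ—ฅ", "ๅ‹้”ใจ", "ๆ˜ ็”ปใ‚’", "่ฆ‹ใŸ"]  # hand-segmented, bunsetsu-like
        source, target = make_pretraining_example(units)
        print("source:", source)  # masked and shuffled units
        print("target:", target)  # original sentence the model must reconstruct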

    Task Composition with Adapter Module Using Cross Lingual Alignment from English to Korean

    Get PDF
    Master's thesis -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, February 2021. Advisor: Sang-goo Lee.
    Recently, Transformer-based pre-trained language models (PLMs) such as BERT have shown high performance across various natural language processing (NLP) fields. Despite the advent of such high-performing language models, however, there is still much room for improvement on relatively small datasets. Among the methods proposed for this problem, task composition is effective at improving a target task's performance by transferring knowledge learned on several other tasks. Against this background, this study composes Adapter networks trained on high-resource English tasks to improve the performance of Korean tasks, which sit in a relatively low-resource setting. To address the problem caused by the gap between the distributions of English and Korean hidden representation vectors in a multilingual pre-trained language model, a mean difference shift (MDS) and a rotational transform are applied to approximate the English hidden representations to the Korean distribution. With the proposed methodology, we report meaningful performance improvements on Korean datasets such as KorSTS, KorNLI, and NSMC.
    Contents: Abstract; Table of Contents; List of Tables; List of Figures; Chapter 1 Introduction (Background, Scope and Contents, Organization); Chapter 2 Related Work (Pre-trained and Multilingual Language Models, Adapter Networks, Task Composition Methods, Hidden Representation Distribution Gap and Alignment Methods); Chapter 3 Model Description (Task Composition Model Architecture, Applying Hidden Representation Alignment); Chapter 4 Experiments (Datasets, Training Method, Results); Chapter 5 Conclusion (Conclusion and Discussion, Future Work); References; Abstract.
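
    A minimal sketch of the alignment step, assuming paired English/Korean representation matrices are available and using orthogonal Procrustes as the rotational transform (synthetic vectors stand in for real hidden states):

        # Sketch of mean difference shift (MDS) plus a rotational transform, aligning
        # English hidden representations to the Korean distribution. Paired matrices
        # of representations are assumed; synthetic data stands in for real hidden
        # states, and orthogonal Procrustes serves as the rotational transform.
        import numpy as np
        from scipy.linalg import orthogonal_procrustes

        rng = np.random.default_rng(0)
        en = rng.normal(size=(500, 64))                        # stand-in English vectors
        true_rot = np.linalg.qr(rng.normal(size=(64, 64)))[0]  # hidden "true" rotation
        ko = en @ true_rot + 1.0 + rng.normal(scale=0.01, size=en.shape)

        # Mean difference shift: move the English mean onto the Korean mean.
        en_shifted = en - en.mean(axis=0) + ko.mean(axis=0)

        # Rotational transform: orthogonal R minimizing ||en_c @ R - ko_c||_F.
        en_c = en_shifted - ko.mean(axis=0)
        ko_c = ko - ko.mean(axis=0)
        R, _ = orthogonal_procrustes(en_c, ko_c)
        en_aligned = en_c @ R + ko.mean(axis=0)

        print("distance before rotation:", np.linalg.norm(en_shifted - ko))
        print("distance after rotation: ", np.linalg.norm(en_aligned - ko))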

    JHE Korean-English evaluation data

    No full text
    Junior High English evaluation data for Korean-English machine translation (JHE).
    Jungyeul Park, Jeen-Pyo Hong, and Jeong-Won Cha (2016). Korean Language Resources for Everyone. In Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation (PACLIC 30), pages 49--58. Seoul, Korea.
        @inproceedings{park-hong-cha:2016:PACLIC,
          address = {Seoul, Korea},
          author = {Park, Jungyeul and Hong, Jeen-Pyo and Cha, Jeong-Won},
          booktitle = {Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation (PACLIC 30)},
          pages = {49--58},
          title = {{Korean Language Resources for Everyone}},
          year = {2016}
        }

    MaltParser model for Korean: Sejong treebank

    No full text
    MaltParser model for Korean: Sejong treebank.
    Jungyeul Park, Jeen-Pyo Hong, and Jeong-Won Cha (2016). Korean Language Resources for Everyone. In Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation (PACLIC 30), pages 49--58. Seoul, Korea. (BibTeX as in the previous record.)
    The model requires Espresso's POS tagging results as input. Espresso is available at https://zenodo.org/record/884606
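
    A hedged sketch of invoking the model from Python via MaltParser's documented command-line options; every jar, model, and file name here is a placeholder:

        # Hedged sketch of the intended pipeline: parse Espresso-tagged CoNLL input
        # with MaltParser's standard CLI (-c model, -i input, -o output, -m parse).
        # The jar, model, and file names below are placeholders, not from this record.
        import subprocess

        subprocess.run(
            [
                "java", "-jar", "maltparser-1.9.2.jar",
                "-c", "sejong",                  # name of the distributed .mco model (placeholder)
                "-i", "espresso_tagged.conll",   # Espresso POS output converted to CoNLL format
                "-o", "parsed.conll",
                "-m", "parse",
            ],
            check=True,
        )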