9 research outputs found

    Korean Language Resources for Everyone

    Get PDF

    Kosp2e: Korean Speech to English Translation Corpus

    Full text link
    Most speech-to-text (S2T) translation studies use English speech as the source, which makes it difficult for non-English speakers to take advantage of S2T technologies. For some languages this problem has been tackled through corpus construction, but the more linguistically distant a language is from English, or the more under-resourced it is, the more significant this deficiency and underrepresentation becomes. In this paper, we introduce kosp2e (read as `kospi'), a corpus that allows Korean speech to be translated into English text in an end-to-end manner. We adopt an open-license speech recognition corpus, a translation corpus, and spoken language corpora to make our dataset freely available to the public, and we check its performance through pipeline- and training-based approaches. Using the pipeline and various end-to-end schemes, we obtain highest BLEU scores of 21.3 and 18.0 respectively on the English hypotheses, validating the feasibility of our data. We plan to supplement annotations for other target languages through community contributions in the future. Comment: Interspeech 2021 camera-ready.
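
    As an aside on the evaluation above, BLEU against English references is typically computed with a tool such as sacrebleu; a minimal sketch follows (the sentences are illustrative placeholders, not kosp2e data):

        # Minimal BLEU-evaluation sketch with the sacrebleu library (pip install sacrebleu).
        # The hypothesis and reference sentences are illustrative placeholders, not kosp2e data.
        import sacrebleu

        hypotheses = [
            "the weather is nice today",
            "i would like to book a table for two",
        ]
        references = [
            "the weather is nice today",
            "i want to reserve a table for two people",
        ]

        # sacrebleu expects a list of reference streams (one stream per reference set).
        bleu = sacrebleu.corpus_bleu(hypotheses, [references])
        print(f"BLEU = {bleu.score:.1f}")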

    A Visual Analytics System for evaluating dataset of Neural Machine Translation

    Get PDF
    Master's thesis -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, February 2023. Advisor: Jinwook Seo.
    The most influential factor in training a Neural Machine Translation (NMT) model is the quality of the training data: parallel corpora composed of sentence pairs in two languages. Improving parallel-corpus quality is therefore essential, and although various refinement methods have been introduced, there is still much room for improvement. This thesis presents a visual analytics system that supports the quality-improvement work on parallel corpora for machine translation training. To discover and triage noise in parallel corpora quickly, the system uses machine learning techniques to extract nine different metrics and provides interactive visual analysis based on them. With the system, users can easily identify noisy pairs, inspect their details, and remove them. To demonstrate the system's effectiveness and usefulness, a qualitative user study was conducted with eight users, including four experts, and points for improvement based on the evaluation results are discussed at the end.
    Contents: Chapter 1 Introduction; Chapter 2 Related Work; Chapter 3 Design Requirements; Chapter 4 Data Preprocessing; Chapter 5 Visualization Design (Distribution View, Ranking View, Text Compare View, Ruleset View); Chapter 6 User Study (Results, Post-study Interviews); Chapter 7 Discussion; Chapter 8 Conclusion; References; Abstract.
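
    For a sense of what such noise metrics can look like, here is a minimal sketch with two generic indicators, length ratio and source-copy overlap; these are illustrative stand-ins, not the nine metrics the thesis extracts:

        # Illustrative noise indicators for ranking parallel sentence pairs.
        # These two are generic stand-ins, not the nine metrics the thesis extracts.

        def length_ratio(src: str, tgt: str) -> float:
            """Source-to-target token-length ratio; extreme values suggest misalignment."""
            return len(src.split()) / max(len(tgt.split()), 1)

        def copy_overlap(src: str, tgt: str) -> float:
            """Fraction of target tokens copied verbatim from the source;
            values near 1.0 suggest an untranslated (copied) target."""
            src_tokens = set(src.split())
            tgt_tokens = tgt.split()
            if not tgt_tokens:
                return 0.0
            return sum(tok in src_tokens for tok in tgt_tokens) / len(tgt_tokens)

        pairs = [
            ("์˜ค๋Š˜ ๋‚ ์”จ๊ฐ€ ์ข‹๋‹ค", "the weather is nice today"),
            ("์˜ค๋Š˜ ๋‚ ์”จ๊ฐ€ ์ข‹๋‹ค", "์˜ค๋Š˜ ๋‚ ์”จ๊ฐ€ ์ข‹๋‹ค"),  # noise: the target is a copy of the source
        ]

        # Surface likely-noisy pairs first, as a ranking view would.
        for src, tgt in sorted(pairs, key=lambda p: copy_overlap(*p), reverse=True):
            print(f"copy={copy_overlap(src, tgt):.2f} len_ratio={length_ratio(src, tgt):.2f} {src} ||| {tgt}")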

    Linguistically-driven Multi-task Pre-training for Low-resource Neural Machine Translation

    Get PDF
    In the present study, we propose novel sequence-to-sequence pre-training objectives for low-resource neural machine translation (NMT): Japanese-specific sequence-to-sequence (JASS) pre-training for language pairs involving Japanese as the source or target language, and English-specific sequence-to-sequence (ENSS) pre-training for language pairs involving English. JASS focuses on masking and reordering Japanese linguistic units known as bunsetsu, whereas ENSS is based on phrase-structure masking and reordering tasks. Experiments on the ASPEC Japanese–English and Japanese–Chinese, Wikipedia Japanese–Chinese, and News English–Korean corpora demonstrate that JASS and ENSS outperform MASS and other existing language-agnostic pre-training methods by up to +2.9 BLEU points on the Japanese–English tasks, up to +7.0 BLEU points on the Japanese–Chinese tasks, and up to +1.3 BLEU points on the English–Korean tasks. An empirical analysis focusing on the relationship between the individual parts of JASS and ENSS reveals the complementary nature of their subtasks. Adequacy evaluation using LASER, human evaluation, and case studies show that our proposed methods significantly outperform pre-training methods without injected linguistic knowledge and have a larger positive impact on adequacy than on fluency.
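
    The masking-and-reordering idea behind JASS and ENSS can be sketched on pre-segmented units; the toy corruption function below is a simplified stand-in for the paper's bunsetsu- and phrase-structure-based objectives:

        # Toy masking-and-reordering corruption over pre-segmented linguistic units,
        # in the spirit of JASS/ENSS. Real JASS operates on automatically segmented
        # bunsetsu; the hand-segmented units below are for illustration only.
        import random

        MASK = "<mask>"

        def make_pretraining_example(units, mask_prob=0.3, seed=0):
            """Return a (corrupted source, original target) pair for seq2seq
            pre-training: some units are masked, then unit order is shuffled."""
            rng = random.Random(seed)
            corrupted = [MASK if rng.random() < mask_prob else u for u in units]
            rng.shuffle(corrupted)  # the reordering sub-task
            return " ".join(corrupted), " ".join(units)

        units = ["็งใฏ", "ๆ˜จๆ—ฅ", "ๅ‹้”ใจ", "ๆ˜ ็”ปใ‚’", "่ฆ‹ใŸ"]  # hand-segmented, bunsetsu-like
        source, target = make_pretraining_example(units)
        print("source:", source)  # masked and shuffled units
        print("target:", target)  # original sentence the model must reconstruct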

    Task Composition with Adapter Module Using Cross Lingual Alignment from English to Korean

    Get PDF
    Master's thesis -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, February 2021. Advisor: Sang-goo Lee.
    Recently, Transformer-based pre-trained language models (PLMs) such as BERT have shown high performance across various natural language processing (NLP) fields. Despite the advent of such high-performing language models, however, there is still much room for improvement on relatively small datasets. Among the methods proposed for this problem, task composition is effective at improving a target task's performance by transferring knowledge learned on several other tasks. Against this background, this study composes Adapter networks trained on high-resource English tasks to improve the performance of Korean tasks, which sit in a relatively low-resource setting. To address the problem caused by the gap between the distributions of English and Korean hidden representation vectors in a multilingual pre-trained language model, a mean difference shift (MDS) and a rotational transform are applied to approximate the English hidden representations to the Korean distribution. With the proposed methodology, we report meaningful performance improvements on Korean datasets such as KorSTS, KorNLI, and NSMC.
    Contents: Abstract; Table of Contents; List of Tables; List of Figures; Chapter 1 Introduction (Background, Scope and Contents, Organization); Chapter 2 Related Work (Pre-trained and Multilingual Language Models, Adapter Networks, Task Composition Methods, Hidden Representation Distribution Gap and Alignment Methods); Chapter 3 Model Description (Task Composition Model Architecture, Applying Hidden Representation Alignment); Chapter 4 Experiments (Datasets, Training Method, Results); Chapter 5 Conclusion (Conclusion and Discussion, Future Work); References; Abstract.
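
    A minimal sketch of the alignment step, assuming paired English/Korean representation matrices are available and using orthogonal Procrustes as the rotational transform (synthetic vectors stand in for real hidden states):

        # Sketch of mean difference shift (MDS) plus a rotational transform, aligning
        # English hidden representations to the Korean distribution. Paired matrices
        # of representations are assumed; synthetic data stands in for real hidden
        # states, and orthogonal Procrustes serves as the rotational transform.
        import numpy as np
        from scipy.linalg import orthogonal_procrustes

        rng = np.random.default_rng(0)
        en = rng.normal(size=(500, 64))                        # stand-in English vectors
        true_rot = np.linalg.qr(rng.normal(size=(64, 64)))[0]  # hidden "true" rotation
        ko = en @ true_rot + 1.0 + rng.normal(scale=0.01, size=en.shape)

        # Mean difference shift: move the English mean onto the Korean mean.
        en_shifted = en - en.mean(axis=0) + ko.mean(axis=0)

        # Rotational transform: orthogonal R minimizing ||en_c @ R - ko_c||_F.
        en_c = en_shifted - ko.mean(axis=0)
        ko_c = ko - ko.mean(axis=0)
        R, _ = orthogonal_procrustes(en_c, ko_c)
        en_aligned = en_c @ R + ko.mean(axis=0)

        print("distance before rotation:", np.linalg.norm(en_shifted - ko))
        print("distance after rotation: ", np.linalg.norm(en_aligned - ko))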

    JHE Korean-English evaluation data

    No full text
    Junior High English evaluation data for Korean-English machine translation (JHE).
    Jungyeul Park, Jeen-Pyo Hong, and Jeong-Won Cha (2016). Korean Language Resources for Everyone. In Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation (PACLIC 30), pages 49--58. Seoul, Korea.
        @inproceedings{park-hong-cha:2016:PACLIC,
          address = {Seoul, Korea},
          author = {Park, Jungyeul and Hong, Jeen-Pyo and Cha, Jeong-Won},
          booktitle = {Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation (PACLIC 30)},
          pages = {49--58},
          title = {{Korean Language Resources for Everyone}},
          year = {2016}
        }

    MaltParser model for Korean: Sejong treebank

    No full text
    MaltParser model for Korean: Sejong treebank.
    Jungyeul Park, Jeen-Pyo Hong, and Jeong-Won Cha (2016). Korean Language Resources for Everyone. In Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation (PACLIC 30), pages 49--58. Seoul, Korea. (BibTeX as in the previous record.)
    The model requires Espresso's POS tagging results as input. Espresso is available at https://zenodo.org/record/884606
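
    A hedged sketch of invoking the model from Python via MaltParser's documented command-line options; every jar, model, and file name here is a placeholder:

        # Hedged sketch of the intended pipeline: parse Espresso-tagged CoNLL input
        # with MaltParser's standard CLI (-c model, -i input, -o output, -m parse).
        # The jar, model, and file names below are placeholders, not from this record.
        import subprocess

        subprocess.run(
            [
                "java", "-jar", "maltparser-1.9.2.jar",
                "-c", "sejong",                  # name of the distributed .mco model (placeholder)
                "-i", "espresso_tagged.conll",   # Espresso POS output converted to CoNLL format
                "-o", "parsed.conll",
                "-m", "parse",
            ],
            check=True,
        )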