6 research outputs found

    A Visual Analytics System for evaluating dataset of Neural Machine Translation

    Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, 2023. 2. Jinwook Seo. The most important factor in training a Neural Machine Translation model is the quality of its training data, the parallel corpora, which are composed of sentence pairs in different languages. Various refinement tasks have therefore been introduced to improve the quality of parallel corpora, but there is still much room for improvement. This thesis introduces a visual analytics system that supports quality improvement of the parallel corpora used for machine translation training. Our system provides nine different metrics, extracted with machine learning techniques, for quickly discovering and selecting noise in parallel corpora. Based on these metrics and interactive visualization techniques, users can easily find noisy pairs, inspect their details, and remove them. The system's effectiveness and usefulness are demonstrated through a qualitative user study with a total of eight users, including four experts, and the thesis closes with a discussion of points to improve based on the evaluation results.
    Contents: 1. Introduction; 2. Related Work; 3. Design Requirements; 4. Data Preprocessing; 5. Visualization Design (5.1 Distribution View, 5.2 Ranking View, 5.3 Text Compare View, 5.4 Ruleset View); 6. Usability Evaluation (6.1 Results, 6.2 Post-study Interview); 7. Discussion; 8. Conclusion; References; Abstract.
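    The abstract above does not name its nine metrics. As a rough illustration of what a noise metric for parallel corpora can look like, the sketch below uses a token length ratio, a common heuristic that is assumed here and not taken from the thesis. The function names, the example sentences, and the threshold of 2.0 are all hypothetical.

```python
def length_ratio(source: str, target: str) -> float:
    """Ratio of the longer side to the shorter side, in whitespace tokens (>= 1.0)."""
    src_len = len(source.split())
    tgt_len = len(target.split())
    if src_len == 0 or tgt_len == 0:
        # An empty side is always treated as noise.
        return float("inf")
    return max(src_len, tgt_len) / min(src_len, tgt_len)


def flag_noisy_pairs(pairs, max_ratio=2.0):
    """Return the pairs whose length ratio exceeds max_ratio (hypothetical threshold)."""
    return [(s, t) for s, t in pairs if length_ratio(s, t) > max_ratio]


# Toy sentence pairs, invented for this example.
pairs = [
    ("the cat sat on the mat", "le chat est assis sur le tapis"),
    ("hello", "bonjour à tous et bienvenue dans cette longue phrase"),
    ("good morning", ""),
]
noisy = flag_noisy_pairs(pairs)
```

    Whitespace tokenization is a simplification. Languages such as Korean would likely need a proper tokenizer before a ratio like this is meaningful.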

    Machine translation of user-generated content

    The world of social media has evolved enormously over the last few years. With the spread of social media and online forums, individual users all over the world actively participate in generating online content in different languages. Sharing online content has become much easier with the advent of popular websites such as Twitter and Facebook. Such content is referred to as 'User-Generated Content' (UGC). Examples of UGC include user reviews, customer feedback, and tweets. In general, UGC is informal and noisy in terms of linguistic norms. Such noise does not create significant problems for humans trying to understand the content, but it can pose challenges for several natural language processing applications such as parsing, sentiment analysis, and machine translation (MT). An additional challenge for MT is the sparseness of bilingual (translated) parallel UGC corpora. In this research, we explore the general issues in MT of UGC and set research goals based on our findings. One of our main goals is to exploit comparable corpora in order to extract parallel or semantically similar sentences. To accomplish this task, we design a document alignment system that extracts semantically similar bilingual document pairs from bilingual comparable corpora. We then apply strategies to extract parallel or semantically similar sentences from comparable corpora by transforming the document alignment system into a sentence alignment system. We seek to improve the quality of parallel data extraction for UGC translation and to combine the extracted data with existing human-translated resources. Another objective of this research is to demonstrate the usefulness of MT-based sentiment analysis. However, when openly available systems such as Google Translate are used, the translation process may alter the sentiment in the target language. To cope with this phenomenon, we instead build fine-grained sentiment translation models that focus on preserving sentiment in the target language during translation.