3 research outputs found

    Bilingual Lexicon Extraction Using a Modified Perceptron Algorithm

    Get PDF
    In computational linguistics, parallel corpora and bilingual lexicons are important resources for fields such as machine translation and cross-language information retrieval. For example, parallel corpora are used to extract translation probabilities in machine translation systems. Bilingual lexicons directly enable word-to-word translation in cross-language information retrieval, and they also support the translation process in machine translation systems. Furthermore, the larger the parallel corpora and bilingual lexicons available for training, the better a machine translation system performs. However, building bilingual lexicons manually, that is, by human effort, requires considerable cost, time, and labor. For these reasons, bilingual lexicon extraction has drawn the attention of many researchers. This thesis proposes a new and effective methodology for extracting bilingual lexicons. Building on the vector space model most commonly used in bilingual lexicon extraction, it iteratively learns the weights of bilingual lexicon entries with the perceptron algorithm, a type of neural network, and then extracts the final bilingual lexicons using the iteratively learned weights and the perceptron. As a result, the iteratively trained model improved accuracy by 3.5% on average over the initial, untrained result.

    1. Introduction
    2. Literature Review
    2.1 Linguistic resources: The text corpora
    2.2 A vector space model
    2.3 Neural networks: The single layer Perceptron
    2.4 Evaluation metrics
    3. System Architecture of Bilingual Lexicon Extraction System
    3.1 Required linguistic resources
    3.2 System architecture
    4. Building a Seed Dictionary
    4.1 Methodology: Context Based Approach (CBA)
    4.2 Experiments and results
    4.2.1 Experimental setups
    4.2.2 Experimental results
    4.3 Discussions
    5. Extracting Bilingual Lexicons
    5.1 Methodology: Iterative Approach (IA)
    5.2 Experiments and results
    5.2.1 Experimental setups
    5.2.2 Experimental results
    5.3 Discussions
    6. Conclusions and Future Work
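    The abstract above describes scoring translation candidates in a vector space and refining their weights with perceptron updates. Below is a minimal, hypothetical Python/NumPy sketch of that idea; the data, the weighted-similarity scoring, and the exact update rule are illustrative assumptions, not the thesis's actual implementation.

        # Hypothetical sketch: perceptron-style reweighting of context
        # dimensions for bilingual lexicon extraction (not the thesis's code).
        import numpy as np

        rng = np.random.default_rng(0)
        D = 50                               # context dimensions (e.g., seed-dictionary entries)
        src = rng.random((20, D))            # stand-in source-word context vectors
        tgt = rng.random((30, D))            # stand-in target-word context vectors
        gold = rng.integers(0, 30, size=20)  # assumed gold translation per source word
        w = np.ones(D)                       # per-dimension weights, learned iteratively

        def best_candidate(s_vec, weights):
            # Rank target words by a weighted dot-product similarity.
            return int(np.argmax((tgt * weights) @ (s_vec * weights)))

        for epoch in range(20):
            mistakes = 0
            for i, s_vec in enumerate(src):
                pred = best_candidate(s_vec, w)
                if pred != gold[i]:
                    # Perceptron-style update: shift weight toward dimensions the
                    # correct pair shares, away from the wrong prediction's.
                    w = np.clip(w + 0.1 * s_vec * (tgt[gold[i]] - tgt[pred]), 0.0, None)
                    mistakes += 1
            if mistakes == 0:
                break

        acc = np.mean([best_candidate(s, w) == g for s, g in zip(src, gold)])
        print(f"accuracy after iterative reweighting: {acc:.2%}")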

    Gated Convolutional Neural Networks and Deep Layer Fusion for Abstractive Document Summarization

    No full text
    Text summarization is one of the central tasks in Natural Language Processing. Recent advances in deep neural networks and representation learning have substantially improved text summarization technology. There are largely two approaches to text summarization: extractive and abstractive. The extractive approach generates a summary by extracting salient linguistic constituents from the document and assembling them into grammatical sentences. In contrast, the abstractive approach writes summaries using words that may or may not appear in the document, relying on sophisticated techniques such as meaning representation, content organization, and surface realization. In this thesis, we focus on abstractive summarization and propose a model that better represents and recognizes salient content in a document, one of the key abilities for better text summarization. Furthermore, we introduce a large-scale Korean dataset for document summarization. First, we adopt a hierarchical structure to capture various ranges of the representation. Moreover, we propose a gating mechanism to build better intermediate representations, and we utilize POS (Part-of-Speech) tags to exploit morphological and syntactic features. Lastly, we propose a simple and efficient deep layer fusion to extract and merge salient information from the encoder layers. We evaluate our model using ROUGE metrics on three datasets: CNN-DM, NEWSROOM-ABS, and XSUM. Experimental results show that the proposed model outperforms state-of-the-art abstractive models on NEWSROOM-ABS and XSUM and achieves comparable scores on CNN-DM. Such data-driven approaches require a large amount of data for model training. However, large-scale datasets do not exist for less well-known languages such as Korean, and building such a dataset is very labor-intensive and time-consuming. In this thesis, we therefore propose a Korean summarization dataset acquired automatically by leveraging the characteristics of news articles. The dataset consists of 206,822 article-summary pairs whose summaries are written in headline style with multiple sentences. Through analysis of the dataset and experimental results, we show that the proposed dataset is sufficiently large to train an abstractive summarization model, comparable to existing English news datasets, and suitable for developing abstractive summarization models.
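    Since the abstract names two concrete components, a gating mechanism over convolutional features and a deep layer fusion across encoder layers, here is a minimal PyTorch sketch of both; the layer sizes, residual wiring, and 1x1-convolution merge are assumptions for illustration, not the thesis's exact architecture.

        # Hypothetical sketch: GLU-style gated convolutions plus a simple
        # stand-in for "deep layer fusion" across encoder layers.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class GatedConvBlock(nn.Module):
            """1-D convolution whose output is split into content and gate halves."""
            def __init__(self, channels: int, kernel_size: int = 3):
                super().__init__()
                # Produce 2*channels so GLU halves it back: out = a * sigmoid(b)
                self.conv = nn.Conv1d(channels, 2 * channels,
                                      kernel_size, padding=kernel_size // 2)

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                # x: (batch, channels, seq_len)
                return F.glu(self.conv(x), dim=1)

        class FusedEncoder(nn.Module):
            """Stack of gated conv blocks whose per-layer outputs are merged."""
            def __init__(self, channels: int, num_layers: int = 4):
                super().__init__()
                self.layers = nn.ModuleList(GatedConvBlock(channels)
                                            for _ in range(num_layers))
                self.fuse = nn.Conv1d(channels * num_layers, channels, kernel_size=1)

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                outs = []
                for layer in self.layers:
                    x = layer(x) + x      # residual connection between blocks
                    outs.append(x)
                # Concatenate every layer's output, then project back down.
                return self.fuse(torch.cat(outs, dim=1))

        enc = FusedEncoder(channels=64)
        tokens = torch.randn(2, 64, 100)  # (batch, embed_dim, seq_len)
        print(enc(tokens).shape)          # torch.Size([2, 64, 100])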

    Adoption of a Neural Language Model in an Encoder for Encoder-Decoder based Korean Grammatical Error Correction

    No full text
    Grammatical error correction detects the grammatical errors that appear in a given sentence and corrects them, and it can be applied in various areas, such as assisting L2 learners who study a particular language or fixing erroneous input and output of a system. This thesis proposes a technique that exploits a monolingual corpus to compensate for the shortage of the parallel correction data essential for training Korean grammatical error correction. A neural language model trained on the monolingual corpus is adopted in the encoder so that the neural machine translation based correction model can more clearly distinguish correctly used syllables from grammatically misused ones. On this basis, we confirmed that the model copies correctly used syllables more often while preventing the erroneous corrections made by the conventional Encoder-Decoder model.
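    As a rough illustration of adopting a monolingually pretrained language model inside the encoder, the hypothetical PyTorch sketch below concatenates LM hidden states with syllable embeddings before encoding; the module choices, sizes, and concatenation-based fusion are assumptions, not the thesis's model.

        # Hypothetical sketch: fusing a pretrained LM's states into the
        # encoder of an encoder-decoder correction model.
        import torch
        import torch.nn as nn

        class LMFusedEncoder(nn.Module):
            def __init__(self, vocab_size: int, dim: int = 256):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, dim)
                # Stand-in for an LM pretrained on a monolingual corpus; in
                # practice its weights would be loaded and frozen.
                self.lm = nn.LSTM(dim, dim, batch_first=True)
                self.encoder = nn.LSTM(2 * dim, dim, batch_first=True,
                                       bidirectional=True)

            def forward(self, syllables: torch.Tensor) -> torch.Tensor:
                e = self.embed(syllables)    # (batch, len, dim)
                lm_states, _ = self.lm(e)    # LM view of each syllable
                # Fuse LM evidence with the raw embeddings so the encoder can
                # separate correctly used syllables from misused ones.
                fused = torch.cat([e, lm_states], dim=-1)
                states, _ = self.encoder(fused)
                return states                # fed to attention / decoder

        enc = LMFusedEncoder(vocab_size=3000)
        batch = torch.randint(0, 3000, (4, 30))  # hypothetical syllable ids
        print(enc(batch).shape)                  # torch.Size([4, 30, 512])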