
    A study on Image Caption using Double Embedding Technique and Bi-RNN

    ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๋ฌธ์žฅ ํ‘œํ˜„๋ ฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๊ณ  ์ด๋ฏธ์ง€ ํŠน์ง• ๋ฒกํ„ฐ์˜ ์†Œ๋ฉธ์„ ๋ฐฉ์ง€ํ•  ์ˆ˜ ์žˆ๋Š” ์ด์ค‘ Embedding ๊ธฐ๋ฒ•๊ณผ ๋ฌธ๋งฅ์— ๋งž๋Š” ๋ฌธ์žฅ ์ˆœ์„œ๋ฅผ ์ƒ์„ฑํ•˜๋Š” Bidirectional Recurrent Neural Network(Bi-RNN)์„ ์ ์šฉํ•œ ๋””ํ…Œ์ผํ•œ ์ด๋ฏธ์ง€ ์บก์…˜ ๋ชจ๋ธ์„ ์ œ์•ˆํ•œ๋‹ค. ์ด์ค‘ Embedding ๊ธฐ๋ฒ•์—์„œ, Word Embedding ๊ณผ์ •์ธ Embeddingโ… ์€ ์บก์…˜์˜ ํ‘œํ˜„๋ ฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๊ธฐ ์œ„ํ•ด ๋ฐ์ดํ„ฐ์„ธํŠธ์˜ ์บก์…˜ ๋‹จ์–ด๋ฅผ One-hot encoding ๋ฐฉ์‹์„ ํ†ตํ•ด ๋ฒกํ„ฐํ™”ํ•˜๊ณ  Embeddingโ…ก๋Š” ์บก์…˜ ์ƒ์„ฑ ๊ณผ์ •์—์„œ ๋ฐœ์ƒํ•˜๋Š” ์ด๋ฏธ์ง€ ํŠน์ง•์˜ ์†Œ๋ฉธ์„ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด ์ด๋ฏธ์ง€ ํŠน์ง• ๋ฒกํ„ฐ์™€ ๋‹จ์–ด ๋ฒกํ„ฐ๋ฅผ ์œตํ•ฉํ•จ์œผ๋กœ์จ ๋ฌธ์žฅ ๊ตฌ์„ฑ ์š”์†Œ์˜ ๋ˆ„๋ฝ์„ ๋ฐฉ์ง€ํ•œ๋‹ค. ๋˜ํ•œ ๋””์ฝ”๋” ์˜์—ญ์€ ์–ดํœ˜ ๋ฐ ์ด๋ฏธ์ง€ ํŠน์ง•์„ ์–‘๋ฐฉํ–ฅ์œผ๋กœ ํš๋“ํ•˜๋Š” Bi-RNN์œผ๋กœ ๊ตฌ์„ฑํ•˜์—ฌ ๋ฌธ๋งฅ์— ๋งž๋Š” ๋ฌธ์žฅ์˜ ์ˆœ์„œ๋ฅผ ํ•™์Šตํ•œ๋‹ค. ๋งˆ์ง€๋ง‰์œผ๋กœ ์ธ์ฝ”๋”์™€ ๋””์ฝ”๋”๋ฅผ ํ†ตํ•˜์—ฌ ํš๋“๋œ ์ „์ฒด ์ด๋ฏธ์ง€, ๋ฌธ์žฅ ํ‘œํ˜„, ๋ฌธ์žฅ ์ˆœ์„œ ํŠน์ง•๋“ค์„ ํ•˜๋‚˜์˜ ๋ฒกํ„ฐ๊ณต๊ฐ„์ธ Multimodal ๋ ˆ์ด์–ด์— ์œตํ•ฉํ•จ์œผ๋กœ์จ ๋ฌธ์žฅ์˜ ์ˆœ์„œ์™€ ํ‘œํ˜„๋ ฅ์„ ๋ชจ๋‘ ๊ณ ๋ คํ•œ ๋””ํ…Œ์ผํ•œ ์บก์…˜์„ ์ƒ์„ฑํ•œ๋‹ค. ์ œ์•ˆํ•˜๋Š” ๋ชจ๋ธ์€ Flickr 8K ๋ฐ Flickr 30K, MSCOCO์™€ ๊ฐ™์€ ์ด๋ฏธ์ง€ ์บก์…˜ ๋ฐ์ดํ„ฐ์„ธํŠธ๋ฅผ ์ด์šฉํ•˜์—ฌ ํ•™์Šต ๋ฐ ํ‰๊ฐ€๋ฅผ ์ง„ํ–‰ํ•˜์˜€์œผ๋ฉฐ ๊ฐ๊ด€์ ์ธ BLEU์™€ METEOR ์ ์ˆ˜๋ฅผ ํ†ตํ•ด ๋ชจ๋ธ ์„ฑ๋Šฅ์˜ ์šฐ์ˆ˜์„ฑ์„ ์ž…์ฆํ•˜์˜€๋‹ค. ๊ทธ ๊ฒฐ๊ณผ, ์ œ์•ˆํ•œ ๋ชจ๋ธ์€ 3๊ฐœ์˜ ๋‹ค๋ฅธ ์บก์…˜ ๋ชจ๋ธ๋“ค์— ๋น„ํ•ด BLEU ์ ์ˆ˜๋Š” ์ตœ๋Œ€ 20.2์ , METEOR ์ ์ˆ˜๋Š” ์ตœ๋Œ€ 3.65์ ์ด ํ–ฅ์ƒ๋˜์—ˆ๋‹ค.|This thesis proposes a detailed image caption model that applies the double embedding technique to improve sentence expressiveness and to prevent vanishing of image feature vectors. It uses the bidirectional recurrent neural network (Bi-RNN) to generate a sequence of sentences and fit their contexts. 
In the double embedding technique, Embedding I is a word-embedding step that vectorizes the dataset's caption words through one-hot encoding to improve the expressiveness of the captions, while Embedding II prevents missing sentence components by fusing the image feature vector with the word vectors, so that image features do not vanish during caption generation. The decoder, composed of a Bi-RNN that acquires vocabulary and image features in both directions, learns a sentence order that fits the context. Finally, the whole-image, sentence-expression, and sentence-order features obtained through the encoder and decoder are fused in a multimodal layer, a single vector space, to generate a detailed caption that accounts for both sentence order and expressiveness. The proposed model was trained and evaluated on the Flickr 8K, Flickr 30K, and MSCOCO image-caption datasets, and its performance was verified with the objective BLEU and METEOR metrics. Compared with three other caption models, the proposed model improved the BLEU score by up to 20.2 points and the METEOR score by up to 3.65 points.

Contents

- Table of Contents
- List of Figures and Tables
- Abstract
- Chapter 1: Introduction
- Chapter 2: Neural Networks and Evaluation Metrics
  - 2.1 Convolutional Neural Network
  - 2.2 Recurrent Neural Network
  - 2.3 Long Short-Term Memory
  - 2.4 Gated Recurrent Unit
  - 2.5 Bidirectional Recurrent Neural Network
  - 2.6 Bi-Lingual Evaluation Understudy
  - 2.7 Metric for Evaluation of Translation with Explicit ORdering
- Chapter 3: Proposed Image Caption Model
  - 3.1 Caption composition using the double embedding technique and Bi-RNN
  - 3.2 Caption generation using the multimodal layer
- Chapter 4: Experiments and Results
  - 4.1 Datasets and preprocessing
  - 4.2 Analysis of experimental results
- Chapter 5: Conclusion
- References
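The double embedding step described in the abstract can be sketched in a few lines of numpy. This is a minimal illustration, not the thesis implementation: the toy vocabulary, the dimensions, and the projection matrices `W_img` and `W_word` are all assumptions; in the actual model the image feature would come from a CNN encoder and the projections would be learned.

```python
import numpy as np

# Hypothetical toy vocabulary; the thesis draws caption words from
# Flickr 8K / Flickr 30K / MSCOCO.
vocab = ["<start>", "a", "dog", "runs", "<end>"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Embedding I: vectorize a caption word via one-hot encoding."""
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

def fuse(image_feat, word_vec, W_img, W_word):
    """Embedding II: fuse the image feature vector with the word vector,
    re-injecting image information at each step so it does not vanish
    during caption generation."""
    return W_img @ image_feat + W_word @ word_vec

rng = np.random.default_rng(0)
d_img, d_emb = 8, 4                      # toy dimensions (assumed)
image_feat = rng.normal(size=d_img)      # stand-in for a CNN encoder output
W_img = rng.normal(size=(d_emb, d_img))  # learned in the real model
W_word = rng.normal(size=(d_emb, len(vocab)))

fused = fuse(image_feat, one_hot("dog"), W_img, W_word)
print(fused.shape)  # (4,)
```

The fused vector would then be fed to the Bi-RNN decoder at each time step, which is what distinguishes this scheme from feeding the image feature only once at the start of decoding.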
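The BLEU metric used for evaluation is built on modified n-gram precision. A minimal sketch of the unigram (BLEU-1) case, with a made-up candidate/reference pair for illustration:

```python
from collections import Counter

def bleu1_precision(candidate, reference):
    """Modified unigram precision, the core of BLEU-1: each candidate
    word's count is clipped by its count in the reference caption."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    clipped = sum(min(c, ref[w]) for w, c in cand.items())
    return clipped / max(sum(cand.values()), 1)

p = bleu1_precision("a dog runs on the grass", "a dog runs in the grass")
print(round(p, 2))  # 0.83 — 5 of 6 candidate words match the reference
```

Full BLEU additionally combines precisions over higher-order n-grams with a brevity penalty, and METEOR further accounts for stemming, synonymy, and word order.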