
    Learning Narrative Text Generators from Visual Stories via Latent Embedding

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, February 2019. Byoung-Tak Zhang.

The ability to understand stories is part of what distinguishes humans from other primates and other animals. It is also crucial for AI agents that are to live alongside people in everyday life and understand their context. However, most research on story AI focuses on automated story generation in manually designed closed worlds, an approach widely used for computational authoring. Machine learning techniques applied to story corpora face problems similar to those of natural language processing, such as omitted details and missing commonsense knowledge. The remarkable success of deep learning in computer vision has increased interest in research that bridges vision and language, and vision-grounded story data could potentially improve story understanding and narrative text generation. Let us assume that AI agents are placed in an environment in which sensory information arrives through a camera. These agents observe their surroundings, translate them into a story in natural language, and predict the following event, or several events in sequence. This dissertation studies the related problems of learning stories and generating narrative text from image streams or videos.

The first problem is to generate narrative text from a sequence of ordered images. As a solution, we introduce GLAC Net (Global-local Attention Cascading Network). It translates image sequences into narrative paragraphs using an encoder-decoder framework in a sequence-to-sequence setting. It uses convolutional neural networks to extract information from images and recurrent neural networks to generate text. We introduce visual cue encoders built from stacked bidirectional LSTMs, and the outputs of all layers are aggregated into contextualized image vectors to extract visual clues.
The coherence of the generated text is further improved by passing (cascading) the information of the previous sentence on to the next sentence serially in the decoders. We evaluate its performance on the Visual Storytelling (VIST) dataset. It outperforms other state-of-the-art results and achieves the best total score, as well as the best score on all six aspects, in the Visual Storytelling Challenge as evaluated by human judges.

The second problem is to predict the following events or narrative text from the earlier parts of a story. Prediction should be possible at any step and over an arbitrary length. As a solution, we propose recurrent event retrieval models. They train a context accumulation function and two embedding functions that bring the cumulative context at the current time close to the next probable events in a latent space. They update the cumulative context with each new input event using bilinear operations, and the updated cumulative context is then used to find candidates for the next event. Evaluated on the Story Cloze Test, they show competitive performance, and the best performance in the open-ended generation setting. We also demonstrate working examples in an interactive setting.

The third problem concerns composite representation learning of semantics and order for video stories. We embed each episode as a trajectory-like sequence of events in the latent space and propose ViStoryNet to regenerate video stories from these embeddings (the story completion task). We convert event sentences to thought vectors and train functions that embed successive events close to each other, so that episodes form trajectories. Bidirectional LSTMs are trained as sequence models, and GRU-based decoders generate the event sentences. We test the models experimentally on the PororoQA dataset and observe that most episodes take the form of trajectories. We use them to complete the blocked parts of stories, and the results are not perfect but broadly similar to the originals.
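
The recurrent event retrieval idea described above (accumulate a context with a bilinear operation, then look up the nearest next-event candidates in a shared latent space) can be illustrated with a minimal sketch. This is not the thesis implementation: the update rule `c'[k] = tanh(c^T T[k] e + e[k])`, the linear embedding maps `F` and `G`, the dimensions, and the random untrained weights are all assumptions made here for illustration, and training is omitted entirely.

```python
import math
import random


def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def accumulate(context, event, T):
    """Assumed bilinear context update: c'[k] = tanh(c^T T[k] e + e[k])."""
    return [
        math.tanh(
            sum(c * x for c, x in zip(context, matvec(T[k], event))) + event[k]
        )
        for k in range(len(event))
    ]


def retrieve(context, candidates, F, G, top_k=1):
    """Return indices of the candidate events closest to the context.

    F embeds the cumulative context and G embeds events into the same
    latent space (both are hypothetical linear maps in this sketch).
    """
    z_ctx = matvec(F, context)
    order = sorted(
        range(len(candidates)),
        key=lambda i: euclidean(z_ctx, matvec(G, candidates[i])),
    )
    return order[:top_k]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, latent = 4, 3                                   # illustrative sizes
T = [random_matrix(d, d, rng) for _ in range(d)]   # untrained bilinear tensor
F = random_matrix(latent, d, rng)                  # context embedding map
G = random_matrix(latent, d, rng)                  # event embedding map

story = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(3)]
candidates = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(5)]

context = [0.0] * d
for event in story:                 # the context can be updated at any step
    context = accumulate(context, event, T)

ranking = retrieve(context, candidates, F, G, top_k=len(candidates))
```

Because the context is a fixed-size vector updated one event at a time, the same lookup can be repeated after every new event, which is one way to read the requirement that prediction work at any step and for any length. In a trained model the weights would be learned so that the true next event ranks first.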
These results can be applied to AI agents that sense their living environment through cameras, describe situations as stories, infer unobserved parts, and predict how the story will continue.

Korean abstract (translated): The ability to understand stories is an important ability that distinguishes humans not only from animals in general but also from other apes. For artificial intelligence to live with people in everyday life and understand the context of their lives, the ability to understand stories is very important. However, because of the difficulty of language processing, existing research on stories has mainly studied techniques for generating high-quality works under predefined world models. Attempts to handle stories with machine learning techniques generally have to rely on data expressed in natural language, and so they face the same problems encountered in natural language processing. Data linked with visual information can help overcome this. Thanks to the recent remarkable progress of deep learning, research on the relationship between vision and language is increasing. As the vision of this research, consider a situation in which an AI agent is placed in an environment where it receives information about its surroundings through a camera. In this setting, the agent observes its surroundings, generates a story about them in natural language, and, based on the generated story, predicts the story to come, from one step up to several steps ahead. This dissertation covers methods for learning the stories that appear in photos and videos (visual stories), their conversion into narrative text, and the inference of hidden events and next events.

First, we address the problem of generating story text from several given photos (visual storytelling).
To solve this problem, we proposed GLAC Net. It uses a convolutional neural network to extract information from the photos and a recurrent neural network to generate sentences. As the encoder of the sequence-to-sequence structure, a multi-layer bidirectional recurrent neural network is used to represent the overall story structure, and a global-local attention model is proposed so that the information of each individual photo is used as well. In addition, to avoid losing context information and local information while generating multiple sentences, we proposed a mechanism that passes on the information of the preceding sentence. The proposed method was trained on the VIST dataset, and in the first Visual Storytelling Challenge it received the highest score, by human evaluation, both in total and in each of the six categories.

Second, we address the problem of predicting the next sentence when part of a story is given as sentences. Prediction must be possible at any position in a story of any length, and it must work regardless of the number of steps to be predicted. As a method for this, we proposed Recurrent Event Retrieval Models. This method learns a context accumulation function and two embedding functions so that, in the latent space, the context accumulated so far lies close to the likely next event. When a new event is added to the story already given, the existing context is updated through a bilinear operation and the likely next events are retrieved.
This method was trained on the ROCStories dataset and evaluated with the Story Cloze Test, where it showed competitive performance and, in particular, the best performance among techniques that can infer over arbitrary lengths.

Third, we address the problem of recovering parts of the event sequence of a video story when they are hidden. In particular, we aimed to reflect both the semantic information and the order of each event in the representation learning of the model. To this end, we proposed ViStoryNet, a model that embeds each episode as a trajectory in the latent space and regenerates the story from it in order to complete the story. To give each episode the form of a trajectory, event sentences are converted into thought vectors, and successive event order embedding is used so that preceding and following events are embedded close to each other, which trains each episode to take the shape of a trajectory. The results were verified experimentally on the PororoQA dataset. The embedded episodes clearly appeared as trajectories, and the regenerated episodes were similar to the originals overall.

These results correspond to methods for understanding stories from surrounding information received through a camera, inferring parts that were not observed, and predicting the story to come.

Abstract
Chapter 1 Introduction
1.1 Story of Everyday lives in Videos and Story Understanding
1.2 Problems to be addressed
1.3 Approach and Contribution
1.4 Organization of Dissertation
Chapter 2 Background and Related Work
2.1 Why We Study Stories
2.2 Latent Embedding
2.3 Order Embedding and Ordinal Embedding
2.4 Comparison to Story Understanding
2.5 Story Generation
2.5.1 Abstract Event Representations
2.5.2 Seq-to-seq Attentional Models
2.5.3 Story Generation from Images
Chapter 3 Visual Storytelling via Global-local Attention Cascading Networks
3.1 Introduction
3.2 Evaluation for Visual Storytelling
3.3 Global-local Attention Cascading Networks (GLAC Net)
3.3.1 Encoder: Contextualized Image Vector Extractor
3.3.2 Decoder: Story Generator with Attention and Cascading Mechanism
3.4 Experimental Results
3.4.1 VIST Dataset
3.4.2 Experiment Settings
3.4.3 Network Training Details
3.4.4 Qualitative Analysis
3.4.5 Quantitative Analysis
3.5 Summary
Chapter 4 Common Space Learning on Cumulative Contexts and the Next Events: Recurrent Event Retrieval Models
4.1 Overview
4.2 Problems of Context Accumulation
4.3 Recurrent Event Retrieval Models for Next Event Prediction
4.4 Experimental Results
4.4.1 Preliminaries
4.4.2 Story Cloze Test
4.4.3 Open-ended Story Generation
4.5 Summary
Chapter 5 ViStoryNet: Order Embedding of Successive Events and the Networks for Story Regeneration
5.1 Introduction
5.2 Order Embedding with Triple Learning
5.2.1 Embedding Ordered Objects in Sequences
5.3 Problems and Contextual Events
5.3.1 Problem Definition
5.3.2 Contextual Event Vectors from Kids Videos
5.4 Architectures for the Story Regeneration Task
5.4.1 Two Sentence Generators as Decoders
5.4.2 Successive Event Order Embedding (SEOE)
5.4.3 Sequence Models of the Event Space
5.5 Experimental Results
5.5.1 Experimental setup
5.5.2 Quantitative Analysis
5.5.3 Qualitative Analysis
5.6 Summary
Chapter 6 Concluding Remarks
6.1 Summary of Methods and Contributions
6.2 Limitation and Outlook
6.3 Suggestions for Future Research
Abstract (in Korean)
Docto