4,155 research outputs found

    Bilateral Multi-Perspective Matching for Natural Language Sentences

    Full text link
    Natural language sentence matching is a fundamental technology for a variety of tasks. Previous approaches either match sentences from a single direction or only apply single-granularity (word-by-word or sentence-by-sentence) matching. In this work, we propose a bilateral multi-perspective matching (BiMPM) model under the "matching-aggregation" framework. Given two sentences P and Q, our model first encodes them with a BiLSTM encoder. Next, we match the two encoded sentences in two directions, P → Q and P ← Q. In each matching direction, each time step of one sentence is matched against all time steps of the other sentence from multiple perspectives. Then, another BiLSTM layer is used to aggregate the matching results into a fixed-length matching vector. Finally, based on the matching vector, the decision is made through a fully connected layer. We evaluate our model on three tasks: paraphrase identification, natural language inference and answer sentence selection. Experimental results on standard benchmark datasets show that our model achieves state-of-the-art performance on all tasks.
    Comment: To appear in Proceedings of IJCAI 2017
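    The core operation can be illustrated with a short sketch. The following minimal PyTorch snippet is not the authors' code; the function name and tensor shapes are illustrative assumptions. It shows multi-perspective cosine matching, in which each perspective re-weights two time-step vectors with a learned weight vector before taking their cosine similarity:

    import torch
    import torch.nn.functional as F

    def multi_perspective_match(v1, v2, W):
        # v1, v2: (hidden,) time-step vectors from the BiLSTM encoder
        # W: (num_perspectives, hidden) learned perspective weights (illustrative name)
        v1p = W * v1                                   # element-wise re-weighting per perspective
        v2p = W * v2
        return F.cosine_similarity(v1p, v2p, dim=-1)   # (num_perspectives,) matching vector

    # toy usage: 4 perspectives over a 6-dimensional hidden state
    W = torch.randn(4, 6, requires_grad=True)
    m = multi_perspective_match(torch.randn(6), torch.randn(6), W)
    print(m.shape)                                     # torch.Size([4])

    In the full model, such matching vectors are computed at every time step in both directions and then aggregated by the second BiLSTM into the fixed-length matching vector mentioned above.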

    VCD: Visual Causality Discovery for Cross-Modal Question Reasoning

    Full text link
    Existing visual question reasoning methods usually fail to explicitly discover the inherent causal mechanisms and ignore jointly modeling cross-modal event temporality and causality. In this paper, we propose a visual question reasoning framework named Cross-Modal Question Reasoning (CMQR) to discover temporal causal structure and mitigate visual spurious correlations by causal intervention. To explicitly discover visual causal structure, the Visual Causality Discovery (VCD) architecture is proposed to find question-critical scenes temporally and disentangle visual spurious correlations with an attention-based front-door causal intervention module, the Local-Global Causal Attention Module (LGCAM). To align the fine-grained interactions between linguistic semantics and spatial-temporal representations, we build an Interactive Visual-Linguistic Transformer (IVLT) that models the multi-modal co-occurrence interactions between visual and linguistic content. Extensive experiments on four datasets demonstrate the superiority of CMQR for discovering visual causal structures and achieving robust question reasoning.
    Comment: 12 pages, 6 figures. arXiv admin note: substantial text overlap with arXiv:2207.1264
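    The local-global attention idea can be sketched roughly as follows. This is only a rough illustration of question-guided attention over per-frame (local) features conditioned on a mean-pooled (global) video feature; it is not the LGCAM definition from the paper, and all names and shapes are assumptions:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LocalGlobalAttention(nn.Module):
        # Rough sketch of question-guided local-global attention over frame features.
        def __init__(self, dim):
            super().__init__()
            self.score = nn.Linear(3 * dim, 1)   # scores [local; global; question]

        def forward(self, frames, question):
            # frames:   (T, dim) per-frame (local) visual features
            # question: (dim,)   pooled question embedding
            global_ctx = frames.mean(dim=0, keepdim=True).expand_as(frames)  # (T, dim)
            q = question.unsqueeze(0).expand_as(frames)                      # (T, dim)
            attn = F.softmax(self.score(torch.cat([frames, global_ctx, q], dim=-1)), dim=0)
            return (attn * frames).sum(dim=0)    # attended visual summary, (dim,)

    # toy usage with 8 frames and 16-dimensional features
    lga = LocalGlobalAttention(16)
    out = lga(torch.randn(8, 16), torch.randn(16))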

    Automated Question-Answering for Interactive Decision Support in Operations & Maintenance of Wind Turbines

    Get PDF
    Intelligent question-answering (QA) systems have witnessed increased interest in recent years, particularly in their ability to facilitate information access, data interpretation or decision support. Wind energy is one of the most promising sources of renewable energy, yet turbines regularly suffer from failures and operational inconsistencies, leading to downtime and significant maintenance costs. Addressing these issues requires rapid interpretation of complex and dynamic data patterns under time-critical conditions. In this article, we present a novel approach that leverages interactive, natural language-based decision support for operations & maintenance (O&M) of wind turbines. The proposed interactive QA system allows engineers to pose domain-specific questions in natural language, and provides answers (in natural language) based on the automated retrieval of information on turbine sub-components, their properties and interactions, from a bespoke domain-specific knowledge graph (KG). As data for specific faults is often sparse, we propose the use of paraphrase generation as a way to augment the existing dataset. Our QA system leverages encoder-decoder models to generate Cypher queries that obtain domain-specific facts from the KG database in response to user-posed natural language questions. Experiments with an attention-based sequence-to-sequence (Seq2Seq) model and a transformer show that the transformer accurately predicts up to 89.75% of responses to input questions, outperforming the Seq2Seq model marginally by 0.76%, though being 9.46 times more computationally efficient. The proposed QA system can help support engineers and technicians during O&M to reduce turbine downtime and operational costs, thus improving the reliability of wind energy as a source of renewable energy.
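    To make the natural-language-to-Cypher step concrete, here is a minimal sketch. The question, the toy graph schema (Fault, Symptom, HAS_SYMPTOM) and the model.translate() call are hypothetical illustrations, not the paper's actual KG or API; the snippet only assumes a Neo4j-style KG reachable through the standard neo4j Python driver:

    from neo4j import GraphDatabase

    # Illustrative question/query pair; the real KG schema and wording differ.
    question = "What are the symptoms of a pitch system fault?"
    cypher = (
        "MATCH (f:Fault {name: 'pitch system fault'})-[:HAS_SYMPTOM]->(s:Symptom) "
        "RETURN s.name"
    )
    # In the described system this query would instead be produced by the trained
    # encoder-decoder, e.g. cypher = model.translate(question)  (hypothetical API).

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    with driver.session() as session:
        for record in session.run(cypher):
            print(record["s.name"])  # facts that would be verbalised back to the engineer
    driver.close()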

    Tackling Sequence to Sequence Mapping Problems with Neural Networks

    Full text link
    In Natural Language Processing (NLP), it is important to detect the relationship between two sequences or to generate a sequence of tokens given another observed sequence. We refer to this type of problem of modelling sequence pairs as sequence-to-sequence (seq2seq) mapping. A lot of research has been devoted to finding ways of tackling these problems, with traditional approaches relying on a combination of hand-crafted features, alignment models, segmentation heuristics, and external linguistic resources. Although great progress has been made, these traditional approaches suffer from various drawbacks, such as complicated pipelines, laborious feature engineering, and difficulty with domain adaptation. Recently, neural networks have emerged as a promising solution to many problems in NLP, speech recognition, and computer vision. Neural models are powerful because they can be trained end to end, generalise well to unseen examples, and the same framework can be easily adapted to a new domain. The aim of this thesis is to advance the state of the art in seq2seq mapping problems with neural networks. We explore solutions from three major aspects: investigating neural models for representing sequences, modelling interactions between sequences, and using unpaired data to boost the performance of neural models. For each aspect, we propose novel models and evaluate their efficacy on various tasks of seq2seq mapping.
    Comment: PhD thesis
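    As a concrete reference point for the kind of model the thesis builds on, below is a minimal encoder-decoder sketch for seq2seq mapping. It is illustrative only; the thesis explores considerably richer architectures, and all names and sizes here are assumptions:

    import torch
    import torch.nn as nn

    class TinySeq2Seq(nn.Module):
        # Minimal encoder-decoder: encode the source sequence into a single
        # hidden state, then decode target tokens one step at a time.
        def __init__(self, src_vocab, tgt_vocab, dim=64):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, dim)
            self.tgt_emb = nn.Embedding(tgt_vocab, dim)
            self.encoder = nn.GRU(dim, dim, batch_first=True)
            self.decoder = nn.GRU(dim, dim, batch_first=True)
            self.out = nn.Linear(dim, tgt_vocab)

        def forward(self, src_ids, tgt_ids):
            _, h = self.encoder(self.src_emb(src_ids))       # h: (1, B, dim)
            dec, _ = self.decoder(self.tgt_emb(tgt_ids), h)  # teacher forcing
            return self.out(dec)                             # (B, T_tgt, tgt_vocab)

    # toy usage: batch of 2 source sequences (length 5) and target prefixes (length 4)
    model = TinySeq2Seq(src_vocab=100, tgt_vocab=120)
    logits = model(torch.randint(0, 100, (2, 5)), torch.randint(0, 120, (2, 4)))
    print(logits.shape)   # torch.Size([2, 4, 120])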

    ์ž ์žฌ ์ž„๋ฒ ๋”ฉ์„ ํ†ตํ•œ ์‹œ๊ฐ์  ์Šคํ† ๋ฆฌ๋กœ๋ถ€ํ„ฐ์˜ ์„œ์‚ฌ ํ…์ŠคํŠธ ์ƒ์„ฑ๊ธฐ ํ•™์Šต

    Get PDF
    Thesis (Ph.D.) -- Graduate School of Seoul National University, College of Engineering, Department of Electrical and Computer Engineering, February 2019. Advisor: Byoung-Tak Zhang (์žฅ๋ณ‘ํƒ).
    The ability to understand stories is an essential part of what makes humans distinct from other primates as well as from animals. Story understanding is also crucial for AI agents that live with people in everyday life and need to understand their context. However, most research on story AI focuses on automated story generation over manually designed closed worlds, which are widely used for computational authoring. Machine learning techniques applied to story corpora face the same problems as natural language processing in general, such as omitted details and missing commonsense knowledge. Given the remarkable success of deep learning in computer vision and the growing interest in bridging vision and language, vision-grounded story data can potentially improve the performance of story understanding and narrative text generation. Assume that AI agents are placed in an environment in which sensory information arrives through a camera. Such agents observe their surroundings, translate them into a story in natural language, and predict the following event, or several subsequent events, sequentially. This dissertation studies the related problems of learning stories and generating narrative text from image streams or videos. The first problem is to generate a narrative text from a sequence of ordered images. As a solution, we introduce GLAC Net (Global-Local Attention Cascading Network). It translates image sequences into narrative paragraphs using an encoder-decoder framework in a sequence-to-sequence setting, with convolutional neural networks for extracting information from images and recurrent neural networks for text generation. We introduce visual cue encoders built on stacked bidirectional LSTMs, whose per-layer outputs are aggregated into contextualized image vectors that capture visual clues. The coherence of the generated text is further improved by conveying (cascading) the information of the previous sentence to the next sentence serially in the decoders. We evaluate the model on the Visual Storytelling (VIST) dataset; it outperforms other state-of-the-art results and achieves the best scores, in total and on all six aspects, in the visual storytelling challenge under human evaluation. The second problem is to predict the following events or narrative sentences from the earlier parts of a story, where prediction should be possible at any step and for an arbitrary number of steps. We propose recurrent event retrieval models as a solution. They train a context accumulation function and two embedding functions that bring the cumulative context at the current time close to the next probable events in a latent space. The cumulative context is updated with each new event using bilinear operations, and the next event candidates are retrieved with the updated cumulative context (see the sketch after this entry). We evaluate the models on the Story Cloze Test, where they show competitive performance and the best results in the open-ended generation setting; we also demonstrate working examples in an interactive setting. The third problem concerns composite representation learning of semantics and order for video stories. We embed each episode as a trajectory-like sequence of events in the latent space and propose ViStoryNet to regenerate video stories from these embeddings (a story completion task). We convert event sentences into thought vectors and train functions that embed successive events close to each other so that episodes form trajectories. Bidirectional LSTMs are trained as sequence models, and GRU-based decoders generate event sentences. We test the models on the PororoQA dataset and observe that most episodes indeed take the form of trajectories. Using them to complete the blocked parts of stories yields results that are not perfect but broadly similar to the originals. These results can be applied to AI agents that sense their living area with cameras, describe the situation as a story, infer unobserved parts, and predict the future story.
    Abstract (Korean version): The ability to understand stories is an important ability that distinguishes humankind not only from animals but also from other primates. For AI to live with people in everyday life and understand the context of their lives, the ability to understand stories is essential. However, because of the difficulty of language processing, existing research on stories has mainly focused on techniques for generating high-quality works under predefined world models. Attempts to handle stories with machine learning mostly have to rely on data expressed in natural language and therefore face the same problems encountered in natural language processing. Data in which visual information is linked with the text can help overcome this. Thanks to the recent, remarkable progress of deep learning, research on the relationship between vision and language is increasing. As a vision for this research, consider a situation in which an AI agent is placed in an environment where it receives information about its surroundings through a camera. In this setting, the agent observes its surroundings, generates a story about them in natural language, and, based on the generated story, predicts the story that will happen next, from one step up to several steps ahead. This dissertation addresses methods for learning the stories (visual stories) that appear in photographs and videos, converting them into narrative text, and inferring hidden and subsequent events. First, we address the problem of generating story text from a given sequence of photographs (visual storytelling). To solve it, we propose GLAC Net. Convolutional neural networks extract information from the photographs and recurrent neural networks generate sentences. As a sequence-to-sequence encoder, multi-layer bidirectional recurrent networks are arranged to represent the overall story structure, and a global-local attention model is proposed so that the information of each individual photograph can also be used. In addition, to keep contextual and local information from being lost while generating several sentences, a mechanism that passes on the information of the preceding sentence is proposed. We trained the proposed method on the VIST dataset and received the highest scores, overall and on all six criteria of human evaluation, in the first Visual Storytelling Challenge. Second, we address the problem of predicting the next sentence given part of a story as sentences. Prediction must be possible at an arbitrary position in a story of arbitrary length, and must work regardless of the number of steps to be predicted. As a method for this, we propose Recurrent Event Retrieval Models. This method learns a context accumulation function and two embedding functions so that, in a latent space, the distance between the context accumulated so far and the probable next events becomes small. When a new event is added to the story received so far, the existing context is updated through bilinear operations and the probable next events are retrieved. We trained this method on the ROCStories dataset and evaluated it with the Story Cloze Test; it showed competitive performance and, in particular, the best performance among methods that can reason over an arbitrary number of steps. Third, we address the problem of recovering a hidden part of the event sequence in a video story. In particular, we aim to reflect both the semantics and the order of each event in the model's representation learning. To this end, we embed each episode as a trajectory in a latent space and propose ViStoryNet, a model that can regenerate stories from these embeddings and thereby complete them. To give each episode the form of a trajectory, event sentences are converted into thought vectors, and successive-event order embedding places neighbouring events close to each other so that an episode is trained to take the shape of a trajectory. We confirmed the results experimentally on the PororoQA dataset: the embedded episodes clearly took the form of trajectories, and regenerating episodes produced results that were not perfect but broadly similar to the originals. These results correspond to methods for understanding stories from the surrounding information coming in through a camera, inferring unobserved parts, and predicting the future story.
    Table of Contents:
    Abstract
    Chapter 1 Introduction: 1.1 Story of Everyday lives in Videos and Story Understanding; 1.2 Problems to be addressed; 1.3 Approach and Contribution; 1.4 Organization of Dissertation
    Chapter 2 Background and Related Work: 2.1 Why We Study Stories; 2.2 Latent Embedding; 2.3 Order Embedding and Ordinal Embedding; 2.4 Comparison to Story Understanding; 2.5 Story Generation (2.5.1 Abstract Event Representations; 2.5.2 Seq-to-seq Attentional Models; 2.5.3 Story Generation from Images)
    Chapter 3 Visual Storytelling via Global-local Attention Cascading Networks: 3.1 Introduction; 3.2 Evaluation for Visual Storytelling; 3.3 Global-local Attention Cascading Networks (GLAC Net) (3.3.1 Encoder: Contextualized Image Vector Extractor; 3.3.2 Decoder: Story Generator with Attention and Cascading Mechanism); 3.4 Experimental Results (3.4.1 VIST Dataset; 3.4.2 Experiment Settings; 3.4.3 Network Training Details; 3.4.4 Qualitative Analysis; 3.4.5 Quantitative Analysis); 3.5 Summary
    Chapter 4 Common Space Learning on Cumulative Contexts and the Next Events: Recurrent Event Retrieval Models: 4.1 Overview; 4.2 Problems of Context Accumulation; 4.3 Recurrent Event Retrieval Models for Next Event Prediction; 4.4 Experimental Results (4.4.1 Preliminaries; 4.4.2 Story Cloze Test; 4.4.3 Open-ended Story Generation); 4.5 Summary
    Chapter 5 ViStoryNet: Order Embedding of Successive Events and the Networks for Story Regeneration: 5.1 Introduction; 5.2 Order Embedding with Triple Learning (5.2.1 Embedding Ordered Objects in Sequences); 5.3 Problems and Contextual Events (5.3.1 Problem Definition; 5.3.2 Contextual Event Vectors from Kids Videos); 5.4 Architectures for the Story Regeneration Task (5.4.1 Two Sentence Generators as Decoders; 5.4.2 Successive Event Order Embedding (SEOE); 5.4.3 Sequence Models of the Event Space); 5.5 Experimental Results (5.5.1 Experimental Setup; 5.5.2 Quantitative Analysis; 5.5.3 Qualitative Analysis); 5.6 Summary
    Chapter 6 Concluding Remarks: 6.1 Summary of Methods and Contributions; 6.2 Limitation and Outlook; 6.3 Suggestions for Future Research
    ์ดˆ๋ก (Abstract in Korean)
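    A schematic sketch of the recurrent event retrieval idea summarised above (illustrative only; the bilinear update, the projections and all names are assumptions rather than the dissertation's exact formulation): the cumulative context is updated with each incoming event through a bilinear map, and candidate next events are ranked by their distance to the context in the shared latent space.

    import torch
    import torch.nn as nn

    class RecurrentEventRetrieval(nn.Module):
        # Illustrative sketch: bilinear context accumulation + nearest-event retrieval.
        def __init__(self, dim):
            super().__init__()
            self.update = nn.Bilinear(dim, dim, dim)   # context update c_{t+1} = B(c_t, e_t)
            self.ctx_proj = nn.Linear(dim, dim)        # embeds the cumulative context
            self.evt_proj = nn.Linear(dim, dim)        # embeds candidate next events

        def step(self, context, event):
            return torch.tanh(self.update(context, event))

        def rank_candidates(self, context, candidates):
            # Score candidates by closeness to the projected context in the latent space.
            c = self.ctx_proj(context)                        # (1, dim)
            e = self.evt_proj(candidates)                     # (num_candidates, dim)
            return -torch.cdist(c, e).squeeze(0)              # higher score = closer

    # toy usage: accumulate two observed events, then rank 5 candidate next events
    model = RecurrentEventRetrieval(dim=16)
    ctx = torch.zeros(1, 16)
    for evt in torch.randn(2, 1, 16):
        ctx = model.step(ctx, evt)
    scores = model.rank_candidates(ctx, torch.randn(5, 16))
    print(scores.argmax())   # index of the most probable next event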