4,155 research outputs found
Bilateral Multi-Perspective Matching for Natural Language Sentences
Natural language sentence matching is a fundamental technology for a variety
of tasks. Previous approaches either match sentences from a single direction or
only apply single-granularity (word-by-word or sentence-by-sentence) matching. In
this work, we propose a bilateral multi-perspective matching (BiMPM) model
under the "matching-aggregation" framework. Given two sentences P and Q,
our model first encodes them with a BiLSTM encoder. Next, we match the two
encoded sentences in two directions, P→Q and P←Q. In
each matching direction, each time step of one sentence is matched against all
time-steps of the other sentence from multiple perspectives. Then, another
BiLSTM layer is utilized to aggregate the matching results into a fixed-length
matching vector. Finally, based on the matching vector, the decision is made
through a fully connected layer. We evaluate our model on three tasks:
paraphrase identification, natural language inference and answer sentence
selection. Experimental results on standard benchmark datasets show that our
model achieves state-of-the-art performance on all tasks.
Comment: To appear in Proceedings of IJCAI 2017
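The multi-perspective matching operation described above can be sketched in a few lines. This is a toy illustration only: the perspective weight vectors `W` below are invented, whereas in the actual model they are learned parameters.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def multi_perspective_match(v1, v2, W):
    """Compare two hidden vectors under each perspective.

    Each row W[k] element-wise reweights both vectors before taking the
    cosine similarity, yielding one matching value per perspective.
    """
    scores = []
    for w in W:
        a = [wi * x for wi, x in zip(w, v1)]
        b = [wi * y for wi, y in zip(w, v2)]
        scores.append(cosine(a, b))
    return scores

# Toy example: 2 perspectives over 3-dimensional encodings.
W = [[1.0, 0.5, 0.1], [0.2, 1.0, 0.9]]
m = multi_perspective_match([0.3, -0.1, 0.8], [0.4, 0.0, 0.7], W)
print(len(m))  # one matching score per perspective
```

Stacking such scores for every time-step pair, in both directions, yields the multi-perspective matching vectors that the aggregation BiLSTM then consumes.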
VCD: Visual Causality Discovery for Cross-Modal Question Reasoning
Existing visual question reasoning methods usually fail to explicitly
discover the inherent causal mechanism and ignore jointly modeling cross-modal
event temporality and causality. In this paper, we propose a visual question
reasoning framework named Cross-Modal Question Reasoning (CMQR), to discover
temporal causal structure and mitigate visual spurious correlation by causal
intervention. To explicitly discover visual causal structure, the Visual
Causality Discovery (VCD) architecture is proposed to find question-critical
scenes temporally and to disentangle the visual spurious correlations with an
attention-based front-door causal intervention module named Local-Global Causal
Attention Module (LGCAM). To align the fine-grained interactions between
linguistic semantics and spatial-temporal representations, we build an
Interactive Visual-Linguistic Transformer (IVLT) that models the multi-modal
co-occurrence interactions between visual and linguistic content. Extensive
experiments on four datasets demonstrate the superiority of CMQR for
discovering visual causal structures and achieving robust question reasoning.Comment: 12 pages, 6 figures. arXiv admin note: substantial text overlap with
arXiv:2207.1264
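The front-door causal intervention that the attention module approximates can be written out exactly for a toy discrete model. All variables and probabilities below are invented for illustration; the paper realizes this adjustment implicitly through attention, not by explicit summation.

```python
# Front-door adjustment on a toy discrete model: X -> Z -> Y, with a
# hidden confounder acting on X and Y.
#   P(Y=y | do(X=x)) = sum_z P(z|x) * sum_x' P(x') * P(y|x',z)

def front_door(p_z_given_x, p_x, p_y_given_xz, x, y):
    return sum(
        p_z_given_x[x][z]
        * sum(p_x[xp] * p_y_given_xz[xp][z][y] for xp in range(len(p_x)))
        for z in range(len(p_z_given_x[x]))
    )

p_x = [0.6, 0.4]                          # P(X)
p_z_given_x = [[0.8, 0.2], [0.3, 0.7]]    # P(Z | X), rows indexed by x
p_y_given_xz = [                          # P(Y | X, Z)
    [[0.9, 0.1], [0.5, 0.5]],             # x'=0: rows z, cols y
    [[0.6, 0.4], [0.2, 0.8]],             # x'=1
]
p = front_door(p_z_given_x, p_x, p_y_given_xz, x=0, y=1)
print(round(p, 4))
```

Because the mediator Z screens off the hidden confounder, this sum recovers the interventional distribution from purely observational quantities, which is the motivation for the front-door design in the paper.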
Automated Question-Answering for Interactive Decision Support in Operations & Maintenance of Wind Turbines
Intelligent question-answering (QA) systems have witnessed increased interest in recent years, particularly in their ability to facilitate information access, data interpretation and decision support. The wind energy sector is one of the most promising sources of renewable energy, yet turbines regularly suffer from failures and operational inconsistencies, leading to downtime and significant maintenance costs. Addressing these issues requires rapid interpretation of complex and dynamic data patterns under time-critical conditions. In this article, we present a novel approach that leverages interactive, natural language-based decision support for operations & maintenance (O&M) of wind turbines. The proposed interactive QA system allows engineers to pose domain-specific questions in natural language, and provides answers (in natural language) based on the automated retrieval of information on turbine sub-components, their properties and interactions, from a bespoke domain-specific knowledge graph (KG). As data for specific faults is often sparse, we propose the use of paraphrase generation to augment the existing dataset. Our QA system leverages encoder-decoder models to generate Cypher queries that obtain domain-specific facts from the KG database in response to user-posed natural language questions. Experiments with an attention-based sequence-to-sequence (Seq2Seq) model and a transformer show that the transformer accurately predicts up to 89.75% of responses to input questions, outperforming the Seq2Seq model marginally (by 0.76%) while being 9.46 times more computationally efficient. The proposed QA system can help support engineers and technicians during O&M to reduce turbine downtime and operational costs, thus improving the reliability of wind energy as a source of renewable energy.
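The question-to-Cypher step can be illustrated with a toy stand-in for the trained encoder-decoder. The template lookup, node labels, and relationship types below are hypothetical and not the paper's actual schema; a trained model would generate the query token by token rather than look it up.

```python
# Illustrative only: a hypothetical template lookup stands in for the
# trained Seq2Seq/transformer decoder that emits Cypher.
CYPHER_TEMPLATES = {
    "list faults": (
        "MATCH (t:Turbine {id: $tid})-[:HAS_COMPONENT]->(c:Component)"
        "-[:HAS_FAULT]->(f:Fault) RETURN c.name, f.description"
    ),
    "show components": (
        "MATCH (t:Turbine {id: $tid})-[:HAS_COMPONENT]->(c:Component) "
        "RETURN c.name"
    ),
}

def question_to_cypher(question: str) -> str:
    """Map a natural-language question to a Cypher query (toy version)."""
    q = question.lower()
    for key, template in CYPHER_TEMPLATES.items():
        if key in q:
            return template
    raise ValueError(f"no template for: {question!r}")

query = question_to_cypher("Please list faults for turbine T07")
print(query)
```

The returned Cypher string would then be executed against the knowledge-graph database, and the matched facts verbalized back into a natural-language answer.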
Towards Democratizing Data Science with Natural Language Interfaces
Data science has the potential to reshape many sectors of modern society. This potential can be realized to its maximum only when data science becomes democratized, instead of centralized in a small group of expert data scientists. However, with data becoming more massive and heterogeneous, standing in stark contrast to the spreading demand for data science is the growing gap between human users and data: every type of data requires extensive specialized training, either to learn a specific query language or a data analytics software package. Towards the democratization of data science, in this dissertation we systematically investigate a promising research direction, natural language interfaces, to bridge the gap between users and data and to make it easier for users who are less technically proficient to access the data analytics power needed for on-demand problem solving and decision making.
One of the largest obstacles for general users to access data is the proficiency requirement on the formal languages (e.g., SQL) that machines use. By automatically parsing natural language commands from users into formal languages, natural language interfaces can thus play a critical role in democratizing data science. However, a pressing question that has largely been left unanswered so far is: how do we bootstrap a natural language interface for a new domain? The high cost of data collection and the data-hungry nature of mainstream neural network models significantly limit the wide application of natural language interfaces. The main technical contribution of this dissertation is a systematic framework for bootstrapping natural language interfaces for new domains. Specifically, the proposed framework consists of three complementary methods: (1) collecting data at low cost via crowdsourcing, (2) leveraging existing NLI data from other domains via transfer learning, and (3) letting a bootstrapped model interact with real users so that it can refine itself over time.
Combining the three methods forms a closed data loop for bootstrapping and refining natural language interfaces for any domain. The methodologies and frameworks developed in this dissertation hence pave the way for building data science platforms that everyone can use to process, query, and analyze their data without extensive specialized training. With such AI-powered platforms, users can stay focused on high-level thinking and decision making instead of being overwhelmed by low-level implementation and programming details: "Let machines understand human thinking. Don't let humans think like machines."
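The core step of such an interface, parsing a natural language command into a formal language, can be sketched with a rule-based stand-in. A real natural language interface would use a trained semantic parser; the one-rule grammar and the table name below are invented purely for illustration.

```python
import re

# Hypothetical grammar: "average <column> by <group>" -> aggregate SQL.
def parse(question: str, table: str) -> str:
    """Translate one restricted question pattern into a SQL query."""
    m = re.match(r"average (\w+) by (\w+)", question.lower())
    if not m:
        raise ValueError("unsupported question")
    col, group = m.groups()
    return (f"SELECT {group}, AVG({col}) FROM {table} "
            f"GROUP BY {group}")

sql = parse("Average salary by department", "employees")
print(sql)  # SELECT department, AVG(salary) FROM employees GROUP BY department
```

Replacing this single rule with a learned model trained on (question, SQL) pairs is exactly where the data-collection and bootstrapping problems discussed above arise.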
Tackling Sequence to Sequence Mapping Problems with Neural Networks
In Natural Language Processing (NLP), it is important to detect the
relationship between two sequences or to generate a sequence of tokens given
another observed sequence. We refer to this class of problems, which model
sequence pairs, as sequence-to-sequence (seq2seq) mapping problems. A lot of research has
been devoted to finding ways of tackling these problems, with traditional
approaches relying on a combination of hand-crafted features, alignment models,
segmentation heuristics, and external linguistic resources. Although great
progress has been made, these traditional approaches suffer from various
drawbacks, such as complicated pipelines, laborious feature engineering, and
difficulty with domain adaptation. Recently, neural networks emerged as a
promising solution to many problems in NLP, speech recognition, and computer
vision. Neural models are powerful because they can be trained end to end,
generalise well to unseen examples, and the same framework can be easily
adapted to a new domain.
The aim of this thesis is to advance the state-of-the-art in seq2seq mapping
problems with neural networks. We explore solutions from three major aspects:
investigating neural models for representing sequences, modelling interactions
between sequences, and using unpaired data to boost the performance of neural
models. For each aspect, we propose novel models and evaluate their efficacy on
various tasks of seq2seq mapping.
Comment: PhD thesis
Learning Narrative Text Generators from Visual Stories via Latent Embedding
Thesis (Ph.D.) -- Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, February 2019. Advisor: Byoung-Tak Zhang.
The ability to understand stories is an essential part of what distinguishes humans from other primates and animals, and it is crucial for AI agents that are to live with people in everyday life and understand their context. However, most research on story AI focuses on automated story generation based on manually designed closed worlds, which are widely used for computational authoring. Machine learning techniques applied to story corpora face the same problems as natural language processing in general, such as omitted details and missing commonsense knowledge. Given the remarkable success of deep learning in computer vision and the growing interest in research bridging vision and language, vision-grounded story data can potentially improve the performance of story understanding and narrative text generation.
Let us assume that AI agents are placed in an environment where sensing information comes from a camera. Such agents observe their surroundings, translate the observations into a story in natural language, and predict the following event, or several subsequent events, sequentially. This dissertation studies the related problems: learning stories and generating narrative text from image streams or videos.
The first problem is to generate a narrative text from a sequence of ordered images. As a solution, we introduce GLAC Net (Global-Local Attention Cascading Network). It translates image sequences into narrative paragraphs using an encoder-decoder framework in a sequence-to-sequence setting. It uses convolutional neural networks to extract information from images and recurrent neural networks for text generation. We introduce visual cue encoders with stacked bidirectional LSTMs, and the outputs of each layer are aggregated into contextualized image vectors to extract visual clues. The coherence of the generated text is further improved by conveying (cascading) the information of the previous sentence to the next sentence serially in the decoders. We evaluate the model on the Visual Storytelling (VIST) dataset. It outperforms other state-of-the-art results, achieving the best total score and the best score in all six aspects of the visual storytelling challenge under human evaluation.
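The cascading idea, carrying context out of one sentence decoder and into the next, can be sketched as follows. The toy "decoder" below is a stand-in invented for illustration, not GLAC Net itself; it only shows the wiring by which each sentence is conditioned on its predecessor's context.

```python
# Sketch: the context carried out of one sentence decoder seeds the next,
# so later sentences stay coherent with earlier ones.
def decode_sentence(image_vec, carried_context):
    """Pretend decoder: blends the image with the carried context."""
    context = [0.5 * a + 0.5 * b for a, b in zip(image_vec, carried_context)]
    sentence = f"sentence conditioned on context {context}"
    return sentence, context

story = []
context = [0.0, 0.0]  # empty context before the first sentence
for img in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    sent, context = decode_sentence(img, context)  # cascade the context
    story.append(sent)
print(len(story))
```

In the real model the carried state is the decoder's hidden state rather than a hand-mixed vector, but the serial hand-off per sentence is the same.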
The second problem is to predict the following events or narrative texts given the earlier parts of a story. Prediction should be possible at any step and for an arbitrary length. We propose recurrent event retrieval models as a solution. They train a context accumulation function and two embedding functions that bring the cumulative context at the current time close, in a latent space, to the next probable events. The cumulative context is updated with each new event as input using bilinear operations, and the next event candidates can then be retrieved with the updated cumulative context. We evaluate the models on the Story Cloze Test, where they show competitive performance and the best results in the open-ended generation setting. We also demonstrate working examples in an interactive setting.
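A minimal sketch of the retrieval idea follows, assuming a made-up bilinear tensor in place of the trained context accumulation function and toy three-dimensional embeddings in place of the learned latent space.

```python
import math

DIM = 3  # toy latent dimension

def bilinear_update(context, event, W):
    """Accumulate a new event: c'_i = sum_{j,k} W[i][j][k] * c_j * e_k."""
    return [sum(W[i][j][k] * context[j] * event[k]
                for j in range(DIM) for k in range(DIM))
            for i in range(DIM)]

def nearest_event(context, candidates):
    """Retrieve the index of the candidate embedding closest to the context."""
    return min(range(len(candidates)),
               key=lambda i: math.dist(context, candidates[i]))

# Identity-like bilinear tensor so the toy update stays numerically tame.
W = [[[1.0 if i == j and j == k else 0.0 for k in range(DIM)]
      for j in range(DIM)] for i in range(DIM)]

ctx = [0.5, 0.2, 0.1]
ctx = bilinear_update(ctx, [1.0, 1.0, 1.0], W)          # absorb a new event
idx = nearest_event(ctx, [[0.5, 0.2, 0.1], [5.0, 5.0, 5.0]])
print(idx)  # index of the closest next-event candidate
```

Because retrieval only needs the current cumulative context, prediction can be issued at any step of a story of any length, which is the property the dissertation emphasizes.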
The third problem deals with learning a composite representation of semantics and order for video stories. We embed each episode as a trajectory-like sequence of events in the latent space and propose ViStoryNet to regenerate video stories from these embeddings (the story completion task). We convert event sentences into thought vectors and train functions that embed successive events close to each other, so that episodes form trajectories. Bidirectional LSTMs are trained as sequence models, with GRU decoders generating event sentences. In experiments on the PororoQA dataset, we observe that most episodes take the form of trajectories. Using the model to complete blocked-out parts of stories yields results that are not perfect but broadly similar to the originals.
These results can be applied to AI agents that sense their living area with cameras, explain the situation as stories, infer some unobserved parts, and predict the future story.
Abstract
Chapter 1 Introduction
1.1 Story of Everyday Lives in Videos and Story Understanding
1.2 Problems to be Addressed
1.3 Approach and Contribution
1.4 Organization of Dissertation
Chapter 2 Background and Related Work
2.1 Why We Study Stories
2.2 Latent Embedding
2.3 Order Embedding and Ordinal Embedding
2.4 Comparison to Story Understanding
2.5 Story Generation
2.5.1 Abstract Event Representations
2.5.2 Seq-to-seq Attentional Models
2.5.3 Story Generation from Images
Chapter 3 Visual Storytelling via Global-local Attention Cascading Networks
3.1 Introduction
3.2 Evaluation for Visual Storytelling
3.3 Global-local Attention Cascading Networks (GLAC Net)
3.3.1 Encoder: Contextualized Image Vector Extractor
3.3.2 Decoder: Story Generator with Attention and Cascading Mechanism
3.4 Experimental Results
3.4.1 VIST Dataset
3.4.2 Experiment Settings
3.4.3 Network Training Details
3.4.4 Qualitative Analysis
3.4.5 Quantitative Analysis
3.5 Summary
Chapter 4 Common Space Learning on Cumulative Contexts and the Next Events: Recurrent Event Retrieval Models
4.1 Overview
4.2 Problems of Context Accumulation
4.3 Recurrent Event Retrieval Models for Next Event Prediction
4.4 Experimental Results
4.4.1 Preliminaries
4.4.2 Story Cloze Test
4.4.3 Open-ended Story Generation
4.5 Summary
Chapter 5 ViStoryNet: Order Embedding of Successive Events and the Networks for Story Regeneration
5.1 Introduction
5.2 Order Embedding with Triple Learning
5.2.1 Embedding Ordered Objects in Sequences
5.3 Problems and Contextual Events
5.3.1 Problem Definition
5.3.2 Contextual Event Vectors from Kids Videos
5.4 Architectures for the Story Regeneration Task
5.4.1 Two Sentence Generators as Decoders
5.4.2 Successive Event Order Embedding (SEOE)
5.4.3 Sequence Models of the Event Space
5.5 Experimental Results
5.5.1 Experimental Setup
5.5.2 Quantitative Analysis
5.5.3 Qualitative Analysis
5.6 Summary
Chapter 6 Concluding Remarks
6.1 Summary of Methods and Contributions
6.2 Limitation and Outlook
6.3 Suggestions for Future Research
Abstract (Korean)
History Modeling for Conversational Information Retrieval
Conversational search is an embodiment of an iterative and interactive approach to information retrieval (IR) that has been studied for decades. Due to the recent rise of intelligent personal assistants, such as Siri, Alexa, AliMe, Cortana, and Google Assistant, a growing part of the population is moving their information-seeking activities to voice- or text-based conversational interfaces. One of the major challenges of conversational search is to leverage the conversation history to understand and fulfill the users' information needs. In this dissertation work, we investigate history modeling approaches for conversational information retrieval. We start from history modeling for user intent prediction. We analyze information-seeking conversations by user intent distribution, co-occurrence, and flow patterns, followed by a study of user intent prediction in an information-seeking setting with both feature-based methods and deep learning methods. We then move to history modeling for conversational question answering (ConvQA), which can be considered a simplified setting of conversational search. We first propose a positional history answer embedding (PosHAE) method to seamlessly integrate conversation history into a ConvQA model based on BERT. We then build upon this method and design a history attention mechanism (HAM) to conduct a "soft selection" over the conversation history. After this, we extend the previous ConvQA task to an open-retrieval (ORConvQA) setting to emphasize the fundamental role of retrieval in conversational search. In this setting, we learn to retrieve evidence from a large collection before extracting answers. We build an end-to-end system for ORConvQA, featuring a learnable dense retriever. We conduct experiments with both fully-supervised and weakly-supervised approaches to tackle the training challenges of ORConvQA. Finally, we study history modeling for conversational re-ranking.
Given a history of user feedback behaviors, such as issuing a query, clicking a document, and skipping a document, we propose to introduce behavior awareness to a neural ranker. Our experimental results show that the history modeling approaches proposed in this dissertation can effectively improve the performance of different conversation tasks and provide new insights into conversational information retrieval.
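The "soft selection" of conversation history via attention can be sketched as follows. The toy turn vectors below stand in for the BERT-based representations the dissertation actually uses; only the weighting-and-mixing pattern is illustrated.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend_history(current, history):
    """Soft-select conversation history: weight each past turn by its
    dot-product relevance to the current turn, then mix the turns.
    """
    scores = [sum(c * h for c, h in zip(current, turn)) for turn in history]
    weights = softmax(scores)
    mixed = [sum(w * turn[i] for w, turn in zip(weights, history))
             for i in range(len(current))]
    return weights, mixed

weights, mixed = attend_history([1.0, 0.0],
                                [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
print(weights)  # turns most similar to the current one weigh most
```

Unlike a hard cutoff (e.g., keeping only the last k turns), every turn contributes in proportion to its learned relevance, which is what makes the selection "soft."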