Video Question Answering on Screencast Tutorials
This paper presents a new video question answering task on screencast
tutorials. We introduce a dataset including question, answer and context
triples from the tutorial videos for a software product. Unlike other video
question answering works, all the answers in our dataset are grounded to the
domain knowledge base. A one-shot recognition algorithm is designed to extract the
visual cues, which helps enhance the performance of video question answering.
We also propose several baseline neural network architectures based on various
aspects of video contexts from the dataset. The experimental results
demonstrate that our proposed models significantly improve question
answering performance by incorporating multi-modal contexts and domain
knowledge.
Orbital-selective confinement effect of Ru orbitals in SrRuO3 ultrathin film
The electronic structure of SrRuO3 thin films with thicknesses from 50 to 1
unit cell (u.c.) is investigated via the resonant inelastic x-ray scattering
(RIXS) technique at the O K-edge to unravel the intriguing interplay of orbital
and charge degrees of freedom. We found that an orbital-selective quantum
confinement effect (QCE) induces the splitting of Ru orbitals. At the same
time, we observed a clear suppression of the electron-hole continuum across the
metal-to-insulator transition (MIT) occurring in the 4 u.c. sample. From these
two clear observations we conclude that QCE gives rise to a Mott insulating
phase in ultrathin SrRuO3 films. Our interpretation of the RIXS spectra is
supported by the configuration interaction calculations of RuO6 clusters.
Comment: 7 pages, 7 figures
Just Ask: An Interactive Learning Framework for Vision and Language Navigation
In the vision and language navigation task, the agent may encounter ambiguous
situations that are hard to interpret by just relying on visual information and
natural language instructions. We propose an interactive learning framework to
endow the agent with the ability to ask for users' help in such situations. As
part of this framework, we investigate multiple learning approaches for the
agent with different levels of complexity. The simplest model-confusion-based
method lets the agent ask questions based on its confusion, relying on the
predefined confidence threshold of a next action prediction model. To build on
this confusion-based method, the agent is expected to demonstrate more
sophisticated reasoning such that it discovers the timing and locations to
interact with a human. We achieve this goal using reinforcement learning (RL)
with a proposed reward shaping term, which enables the agent to ask questions
only when necessary. The success rate can be boosted by at least 15% with only
one question asked on average during the navigation. Furthermore, we show that
the RL agent is capable of adjusting dynamically to noisy human responses.
Finally, we design a continual learning strategy, which can be viewed as a data
augmentation method, for the agent to improve further by utilizing its interaction
history with a human. We demonstrate the proposed strategy is substantially
more realistic and data-efficient compared to previously proposed
pre-exploration techniques.
Comment: 8 pages, accepted to AAAI 202
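
The confusion-based method described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the softmax over action logits, and the default threshold of 0.5 are all assumptions made for illustration. The only idea taken from the abstract is that the agent asks for help when the confidence of its next-action prediction falls below a predefined threshold.

```python
import math


def softmax(logits):
    """Convert raw action scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def should_ask(action_logits, confidence_threshold=0.5):
    """Return True when the agent is 'confused' and should ask for help.

    Hypothetical rule: confusion means the most probable next action has
    probability below `confidence_threshold` (an assumed value).
    """
    probs = softmax(action_logits)
    return max(probs) < confidence_threshold


# One action clearly dominates, so the agent proceeds without asking.
confident = should_ask([5.0, 0.0, 0.0])
# All actions look equally likely, so the agent asks a question.
confused = should_ask([1.0, 1.0, 1.0])
```

A fixed threshold like this is the simple baseline; the abstract states that the reinforcement learning variant instead learns when and where to ask through a reward shaping term.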