Jointly Modeling Embedding and Translation to Bridge Video and Language
Automatically describing video content with natural language is a fundamental
challenge of multimedia. Recurrent Neural Networks (RNNs), which model sequence
dynamics, have attracted increasing attention in visual interpretation. However,
most existing approaches generate a word locally with given previous words and
the visual content, while the relationship between sentence semantics and
visual content is not holistically exploited. As a result, the generated
sentences may be contextually correct while their semantics (e.g., subjects,
verbs, or objects) are wrong.
This paper presents a novel unified framework, named Long Short-Term Memory
with visual-semantic Embedding (LSTM-E), which jointly learns an LSTM and a
visual-semantic embedding. The former aims to locally
maximize the probability of generating the next word given previous words and
visual content, while the latter creates a visual-semantic embedding space
that enforces the relationship between the semantics of the entire sentence and
visual content. LSTM-E consists of three components: 2-D and/or
3-D deep convolutional neural networks for learning a powerful video
representation, a deep RNN for generating sentences, and a joint embedding
model for exploring the relationships between visual content and sentence
semantics. Experiments on the YouTube2Text dataset show that
LSTM-E achieves the best performance reported to date in generating natural
sentences: 45.3% and 31.0% in terms of BLEU@4 and METEOR, respectively. We also
demonstrate that LSTM-E is superior to several state-of-the-art techniques in
predicting Subject-Verb-Object (SVO) triplets.
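
The joint training idea described above, a local next-word term combined with a sentence-level embedding term, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the squared-distance relevance term, and the `trade_off` weighting are all assumptions made for illustration, and the inputs are toy vectors rather than real CNN or LSTM outputs.

```python
import math

def relevance_loss(video_emb, sentence_emb):
    """Squared Euclidean distance between video and sentence embeddings
    (assumed form of the sentence-level embedding term)."""
    return sum((v - s) ** 2 for v, s in zip(video_emb, sentence_emb))

def coherence_loss(next_word_probs):
    """Negative log-likelihood of the reference words, given the
    probability the decoder assigned to each correct next word."""
    return -sum(math.log(p) for p in next_word_probs)

def joint_loss(video_emb, sentence_emb, next_word_probs, trade_off=0.5):
    """Weighted sum of both terms; `trade_off` is a hypothetical
    hyperparameter balancing embedding relevance and word-level coherence."""
    return ((1.0 - trade_off) * relevance_loss(video_emb, sentence_emb)
            + trade_off * coherence_loss(next_word_probs))
```

With `trade_off=1.0` the sketch reduces to ordinary next-word training; with `trade_off=0.0` only the embedding distance is optimized.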
Learning Multi-Level Information for Dialogue Response Selection by Highway Recurrent Transformer
With growing research interest in dialogue response generation, an emerging
line of work formulates the task as next-sentence selection: given a partial
dialogue context, the goal is to determine the most probable
next sentence. Following the recent success of the Transformer model, this
paper proposes (1) a new variant of the attention mechanism based on multi-head
attention, called highway attention, and (2) a recurrent model based on
the Transformer and the proposed highway attention, called the Highway Recurrent
Transformer. Experiments on the response selection task in the seventh Dialog
System Technology Challenge (DSTC7) show that the proposed model can capture
both utterance-level and dialogue-level information; the
effectiveness of each module is further analyzed as well.
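
The response selection formulation, choosing the most probable next sentence among candidates, can be sketched as a softmax over per-candidate scores followed by an argmax. The sketch below assumes the scores have already been produced by some model; it does not implement highway attention or the Highway Recurrent Transformer, and `select_response` is a hypothetical helper name.

```python
import math

def select_response(scores):
    """Return (best_index, probabilities) for a list of candidate scores.

    `scores` are assumed to come from a context-candidate matching model;
    here they are plain floats.
    """
    top = max(scores)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs
```

For example, `select_response([0.1, 2.0, -1.0])` picks the candidate at index 1, since it has the highest score.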