Jointly Modeling Embedding and Translation to Bridge Video and Language
Automatically describing video content with natural language is a fundamental
challenge of multimedia. Recurrent Neural Networks (RNNs), which model sequence
dynamics, have attracted increasing attention for visual interpretation.
However, most existing approaches generate each word locally, conditioned on
the previous words and the visual content, while the relationship between the
semantics of the entire sentence and the visual content is not holistically
exploited. As a result, the generated sentences may be contextually correct,
yet their semantics (e.g., subjects, verbs or objects) are incorrect.
This paper presents a novel unified framework, named Long Short-Term Memory
with visual-semantic Embedding (LSTM-E), which can simultaneously explore the
learning of LSTM and visual-semantic embedding. The former aims to locally
maximize the probability of generating the next word given previous words and
visual content, while the latter is to create a visual-semantic embedding space
for enforcing the relationship between the semantics of the entire sentence and
visual content. Our proposed LSTM-E consists of three components: a 2-D and/or
3-D deep convolutional neural network for learning a powerful video
representation, a deep RNN for generating sentences, and a joint embedding
model for exploring the relationships between visual content and sentence
semantics. Experiments on the YouTube2Text dataset show that our proposed
LSTM-E achieves the best reported performance to date in generating natural
sentences: 45.3% and 31.0% in terms of BLEU@4 and METEOR, respectively. We also
demonstrate that LSTM-E is superior to several state-of-the-art techniques in
predicting Subject-Verb-Object (SVO) triplets.
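As a rough illustration of the joint objective described above, the following
PyTorch-style sketch combines a word-level generation (coherence) loss from the
LSTM decoder with a relevance loss that pulls the video and full-sentence
embeddings together in a shared visual-semantic space. All names and the
weighting scheme are hypothetical, not the authors' released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMEJointLoss(nn.Module):
    # Hypothetical sketch of an LSTM-E-style objective: coherence (word-level
    # cross-entropy) plus relevance (distance in a shared embedding space).
    def __init__(self, embed_weight=0.5):
        super().__init__()
        self.embed_weight = embed_weight

    def forward(self, word_logits, target_words, video_embed, sentence_embed):
        # Coherence: locally maximize the probability of each next word.
        gen_loss = F.cross_entropy(
            word_logits.reshape(-1, word_logits.size(-1)),
            target_words.reshape(-1),
        )
        # Relevance: enforce agreement between the video embedding and the
        # embedding of the entire sentence in the visual-semantic space.
        embed_loss = F.mse_loss(
            F.normalize(video_embed, dim=-1),
            F.normalize(sentence_embed, dim=-1),
        )
        return (1 - self.embed_weight) * gen_loss + self.embed_weight * embed_loss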
Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning
Video captioning is widely regarded as a fundamental but challenging task in
both the computer vision and artificial intelligence fields. The prevalent
approach is to map an input video to a variable-length output sentence in a
sequence-to-sequence manner via a Recurrent Neural Network (RNN). Nevertheless,
the training of RNNs still suffers to some degree from the vanishing/exploding
gradient problem, making optimization difficult. Moreover, the inherently
recurrent dependency in RNNs prevents parallelization within a sequence during
training and therefore limits computational efficiency. In this paper, we
present a novel design, Temporal Deformable Convolutional Encoder-Decoder
Networks (dubbed TDConvED), that fully employ convolutions in both encoder and decoder
networks for video captioning. Technically, we exploit convolutional block
structures that compute intermediate states of a fixed number of inputs and
stack several blocks to capture long-term relationships. The structure in the
encoder is further equipped with temporal deformable convolution to enable
free-form deformation of temporal sampling. Our model also capitalizes on a
temporal attention mechanism for sentence generation. Extensive experiments are
conducted on both the MSVD and MSR-VTT video captioning datasets, and superior
results are reported compared to conventional RNN-based encoder-decoder
techniques. More remarkably, TDConvED increases CIDEr-D performance from 58.8%
to 67.2% on MSVD.
Comment: AAAI 2019
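To make the temporal deformable convolution concrete, below is a minimal
PyTorch-style sketch in which each kernel tap samples the feature sequence at a
learned fractional offset instead of a fixed grid position. The module name,
offset prediction, and interpolation details are illustrative assumptions, not
the released TDConvED code.

import torch
import torch.nn as nn

class TemporalDeformableConv1d(nn.Module):
    # Illustrative temporal deformable convolution over features of shape
    # (batch, channels, time): offsets are predicted per time step and per
    # kernel tap, and features are sampled with linear interpolation.
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        # One offset per kernel tap at every time step (out_channels = k).
        self.offset_pred = nn.Conv1d(channels, kernel_size, 3, padding=1)
        # Shared projection over the concatenated, deformably sampled taps.
        self.proj = nn.Linear(channels * kernel_size, channels)

    def forward(self, x):                                   # x: (B, C, T)
        B, C, T = x.shape
        offsets = self.offset_pred(x)                       # (B, k, T)
        base = torch.arange(T, device=x.device).view(1, 1, T)
        taps = torch.arange(self.k, device=x.device).view(1, self.k, 1) - self.k // 2
        pos = (base + taps + offsets).clamp(0, T - 1)       # fractional positions
        lo, hi = pos.floor().long(), pos.ceil().long()
        w = (pos - lo.float()).unsqueeze(1)                 # (B, 1, k, T)
        x_exp = x.unsqueeze(2).expand(B, C, self.k, T)      # (B, C, k, T)

        def gather(idx):
            return torch.gather(x_exp, 3, idx.unsqueeze(1).expand(B, C, self.k, T))

        sampled = (1 - w) * gather(lo) + w * gather(hi)     # linear interpolation
        out = self.proj(sampled.permute(0, 3, 1, 2).reshape(B, T, C * self.k))
        return out.transpose(1, 2)                          # (B, C, T)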
Exploring Object Relation in Mean Teacher for Cross-Domain Detection
Rendering synthetic data (e.g., 3D CAD-rendered images) to generate
annotations for learning deep models in vision tasks has attracted increasing
attention in recent years. However, simply applying the models learnt on
synthetic images may lead to high generalization error on real images due to
domain shift. To address this issue, recent progress in cross-domain
recognition has featured the Mean Teacher, which directly simulates
unsupervised domain adaptation as semi-supervised learning. The domain gap is
thus naturally bridged with consistency regularization in a teacher-student
scheme. In this work, we advance this Mean Teacher paradigm to be applicable
to cross-domain detection. Specifically, we present Mean Teacher with Object
Relations (MTOR), which remolds Mean Teacher under the backbone of Faster
R-CNN by integrating object relations into the measure of the consistency cost
between the teacher and student modules. Technically, MTOR first learns
relational graphs that capture similarities between pairs of regions for the
teacher and student, respectively. The whole architecture is then optimized with
three consistency regularizations: 1) region-level consistency to align the
region-level predictions between teacher and student, 2) inter-graph
consistency for matching the graph structures between teacher and student, and
3) intra-graph consistency to enhance the similarity between regions of the
same class within the graph of the student. Extensive experiments are conducted
on the transfers across Cityscapes, Foggy Cityscapes, and SIM10k, and superior
results are reported compared to state-of-the-art approaches. More remarkably,
we obtain a new single-model record of 22.8% mAP on the Syn2Real detection
dataset.
Comment: CVPR 2019; the code and model of our MTOR are publicly available at:
https://github.com/caiqi/mean-teacher-cross-domain-detectio
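A minimal sketch of how the three consistency terms described above could be
combined, assuming per-image region classification probabilities from the
teacher and student branches; the function name, graph construction, and
weighting are illustrative assumptions rather than the released MTOR
implementation.

import torch
import torch.nn.functional as F

def mtor_consistency_losses(teacher_probs, student_probs, lambdas=(1.0, 1.0, 1.0)):
    # teacher_probs, student_probs: (R, num_classes) class distributions for
    # R regions of the same unlabeled target image under two augmented views.
    # Relational graphs: pairwise region similarities from class distributions.
    g_t = teacher_probs @ teacher_probs.t()        # (R, R) teacher graph
    g_s = student_probs @ student_probs.t()        # (R, R) student graph

    # 1) Region-level consistency: align per-region predictions.
    region_loss = F.mse_loss(student_probs, teacher_probs)

    # 2) Inter-graph consistency: match teacher and student graph structures.
    inter_loss = F.mse_loss(g_s, g_t)

    # 3) Intra-graph consistency: for region pairs the teacher assigns to the
    #    same class, push the student's pairwise similarity toward 1.
    same_class = (teacher_probs.argmax(1).unsqueeze(0) ==
                  teacher_probs.argmax(1).unsqueeze(1)).float()
    intra_loss = ((1.0 - g_s) * same_class).sum() / same_class.sum().clamp(min=1.0)

    l1, l2, l3 = lambdas
    return l1 * region_loss + l2 * inter_loss + l3 * intra_loss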