Strategies for Searching Video Content with Text Queries or Video Examples
The large number of user-generated videos uploaded to the Internet
every day has led to many commercial video search engines, which mainly rely on
text metadata for search. However, metadata is often lacking for user-generated
videos, so these videos are unsearchable by current search engines.
Content-based video retrieval (CBVR) tackles this metadata-scarcity
problem by directly analyzing the visual and audio streams of each video. CBVR
encompasses multiple research topics, including low-level feature design,
feature fusion, semantic detector training and video search/reranking. We
present novel strategies in these topics to enhance CBVR in both accuracy and
speed under different query inputs, including pure textual queries and query by
video examples. Our proposed strategies have been incorporated into our
submission for the TRECVID 2014 Multimedia Event Detection evaluation, where
our system outperformed other submissions in both text queries and video
example queries, thus demonstrating the effectiveness of our proposed
approaches.
BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection
Multimodal representation learning is gaining more and more interest within
the deep learning community. While bilinear models provide an interesting
framework to find subtle combinations of modalities, their number of parameters
grows quadratically with the input dimensions, making their practical
implementation within classical deep learning pipelines challenging. In this
paper, we introduce BLOCK, a new multimodal fusion based on the
block-superdiagonal tensor decomposition. It leverages the notion of block-term
ranks, which generalizes both concepts of rank and mode ranks for tensors,
already used for multimodal fusion. It allows defining new ways of optimizing
the tradeoff between the expressiveness and complexity of the fusion model, and
is able to represent very fine interactions between modalities while
maintaining powerful mono-modal representations. We demonstrate the practical
interest of our fusion model by using BLOCK for two challenging tasks: Visual
Question Answering (VQA) and Visual Relationship Detection (VRD), where we
design end-to-end learnable architectures for representing relevant
interactions between modalities. Through extensive experiments, we show that
BLOCK compares favorably with respect to state-of-the-art multimodal fusion
models for both VQA and VRD tasks. Our code is available at
https://github.com/Cadene/block.bootstrap.pytorch
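To make the parameter-growth argument concrete, the following sketch compares the size of a full bilinear tensor with that of a block-superdiagonal (block-term) factorization. It assumes the usual block-term layout: three factor matrices plus a core made of R dense blocks of shape L x M x N. The function names and the example dimensions are hypothetical and are not taken from the paper or its repository.

```python
def full_bilinear_params(dx, dy, do):
    """Parameters in an unconstrained bilinear tensor of shape (dx, dy, do)."""
    return dx * dy * do


def block_term_params(dx, dy, do, R, L, M, N):
    """Parameters in an assumed block-term factorization of that tensor.

    Assumed layout: factor matrices A (dx x L*R), B (dy x M*R), C (do x N*R)
    and a block-superdiagonal core made of R dense blocks of shape (L, M, N).
    """
    factors = dx * L * R + dy * M * R + do * N * R
    core = R * L * M * N
    return factors + core


if __name__ == "__main__":
    # Hypothetical dimensions, chosen only for illustration.
    dx, dy, do = 2048, 2048, 1000
    full = full_bilinear_params(dx, dy, do)
    block = block_term_params(dx, dy, do, R=20, L=80, M=80, N=80)
    print(f"full bilinear tensor: {full:,} parameters")
    print(f"block-term sketch:    {block:,} parameters")
```

Under this layout, setting R = 1 leaves a single dense core block (a Tucker-style factorization), while setting L = M = N = 1 leaves R scalar blocks (a CP-style factorization). The number and size of the blocks are therefore the knobs that trade expressiveness against complexity.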
Depth CNNs for RGB-D scene recognition: learning from scratch better than transferring from RGB-CNNs
Scene recognition with RGB images has been extensively studied and has
reached very remarkable recognition levels, thanks to convolutional neural
networks (CNN) and large scene datasets. In contrast, current RGB-D scene data
is much more limited, so methods often leverage large RGB datasets by transferring
pretrained RGB CNN models and fine-tuning them on the target RGB-D dataset.
However, we show that this approach has the limitation of hardly reaching the
bottom layers, which are key to learning modality-specific features. In contrast,
we focus on the bottom layers, and propose an alternative strategy to learn
depth features combining local weakly supervised training from patches followed
by global fine tuning with images. This strategy is capable of learning very
discriminative depth-specific features with limited depth images, without
resorting to Places-CNN. In addition we propose a modified CNN architecture to
further match the complexity of the model and the amount of data available. For
RGB-D scene recognition, depth and RGB features are combined by projecting them
into a common space and further learning a multilayer classifier, which is jointly
optimized in an end-to-end network. Our framework achieves state-of-the-art
accuracy on NYU2 and SUN RGB-D in both depth only and combined RGB-D data.
Comment: AAAI Conference on Artificial Intelligence 201