Extrinsic Methods for Coding and Dictionary Learning on Grassmann Manifolds
Sparsity-based representations have recently led to notable results in
various visual recognition tasks. In a separate line of research, Riemannian
manifolds have been shown useful for dealing with features and models that do
not lie in Euclidean spaces. With the aim of building a bridge between the two
realms, we address the problem of sparse coding and dictionary learning over
the space of linear subspaces, which form Riemannian structures known as
Grassmann manifolds. To this end, we propose to embed Grassmann manifolds into
the space of symmetric matrices by an isometric mapping. This in turn enables
us to extend two sparse coding schemes to Grassmann manifolds. Furthermore, we
propose closed-form solutions for learning a Grassmann dictionary, atom by
atom. Lastly, to handle non-linearity in data, we extend the proposed Grassmann
sparse coding and dictionary learning algorithms through embedding into Hilbert
spaces.
Experiments on several classification tasks (gender recognition, gesture
classification, scene analysis, face recognition, action recognition and
dynamic texture classification) show that the proposed approaches achieve
considerable improvements in discrimination accuracy, in comparison to
state-of-the-art methods such as kernelized Affine Hull Method and
graph-embedding Grassmann discriminant analysis.
Comment: Appearing in International Journal of Computer Vision
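The key step the abstract describes — embedding a Grassmann manifold into the space of symmetric matrices — is commonly realized by the projection mapping, which sends a subspace basis X to the symmetric matrix XX^T. A minimal numpy sketch of this idea, with an illustrative ISTA-style sparse coder over embedded dictionary atoms (the function names and solver here are assumptions for illustration, not the paper's implementation):

```python
import numpy as np

def grassmann_embed(X):
    """Map a subspace basis X (n x p, orthonormal columns) to the symmetric
    projection matrix X X^T -- an isometric embedding of the Grassmannian
    into the space of symmetric matrices."""
    return X @ X.T

def chordal_dist(X, Y):
    """Subspace distance: Frobenius norm between projection embeddings."""
    return np.linalg.norm(grassmann_embed(X) - grassmann_embed(Y), "fro")

def sparse_code(X, dictionary, lam=0.1, iters=200, lr=0.01):
    """Illustrative ISTA solver (an assumption, not the paper's closed-form
    scheme): represent the embedded query as a sparse combination of
    embedded dictionary atoms."""
    p = grassmann_embed(X).ravel()
    A = np.stack([grassmann_embed(D).ravel() for D in dictionary], axis=1)
    w = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ w - p)          # least-squares gradient
        w = w - lr * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft threshold
    return w
```

Note that the embedding is basis-invariant: any orthonormal basis of the same subspace yields the same projection matrix, which is exactly what makes Euclidean sparse-coding machinery applicable after the mapping.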
Video Storytelling: Textual Summaries for Events
Bridging vision and natural language is a longstanding goal in computer
vision and multimedia research. While earlier works focus on generating a
single-sentence description for visual content, recent works have studied
paragraph generation. In this work, we introduce the problem of video
storytelling, which aims at generating coherent and succinct stories for long
videos. Video storytelling introduces new challenges, mainly due to the
diversity of the story and the length and complexity of the video. We propose
novel methods to address the challenges. First, we propose a context-aware
framework for multimodal embedding learning, where we design a Residual
Bidirectional Recurrent Neural Network to leverage contextual information from
past and future. Second, we propose a Narrator model to discover the underlying
storyline. The Narrator is formulated as a reinforcement learning agent which
is trained by directly optimizing the textual metric of the generated story. We
evaluate our method on the Video Story dataset, a new dataset that we have
collected to enable the study. We compare our method with multiple
state-of-the-art baselines and show that our method achieves better
performance on both quantitative measures and a user study.
Comment: Published in IEEE Transactions on Multimedia
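The context-aware embedding described above hinges on a residual bidirectional recurrence: each clip's representation is augmented with hidden states summarizing both past and future, added back to the input. A toy numpy sketch of that structure (plain tanh cells and random weights as stand-ins; the residual form requires hidden size to match input size here, an assumption for brevity):

```python
import numpy as np

def rnn_pass(xs, W, U, b):
    """Simple tanh RNN scan over a sequence of feature vectors."""
    h = np.zeros(W.shape[0])
    out = []
    for x in xs:
        h = np.tanh(W @ h + U @ x + b)
        out.append(h)
    return out

def residual_birnn(xs, params_f, params_b):
    """Hypothetical sketch of a residual bidirectional recurrence:
    a forward pass over the sequence, a backward pass over the reversed
    sequence, and a residual connection adding the input back, so each
    clip embedding carries context from both past and future."""
    fwd = rnn_pass(xs, *params_f)
    bwd = rnn_pass(xs[::-1], *params_b)[::-1]
    return [x + hf + hb for x, hf, hb in zip(xs, fwd, bwd)]
```

Because the backward pass starts from the end of the video, even the first clip's embedding reflects future content — the contextual property the abstract emphasizes.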
Self-Attentive Pooling for Efficient Deep Learning
Efficient custom pooling techniques that can aggressively trim the dimensions
of a feature map and thereby reduce inference compute and memory footprint for
resource-constrained computer vision applications have recently gained
significant traction. However, prior pooling works extract only the local
context of the activation maps, limiting their effectiveness. In contrast, we
propose a novel non-local self-attentive pooling method that can be used as a
drop-in replacement to the standard pooling layers, such as max/average pooling
or strided convolution. The proposed self-attention module uses patch
embedding, multi-head self-attention, and spatial-channel restoration, followed
by sigmoid activation and exponential soft-max. This self-attention mechanism
efficiently aggregates dependencies between non-local activation patches during
down-sampling. Extensive experiments on standard object classification and
detection tasks with various convolutional neural network (CNN) architectures
demonstrate the superiority of our proposed mechanism over the state-of-the-art
(SOTA) pooling techniques. In particular, we surpass the test accuracy of
existing pooling techniques on different variants of MobileNet-V2 on ImageNet
by an average of 1.2%. With the aggressive down-sampling of the activation maps
in the initial layers (providing up to 22x reduction in memory consumption),
our approach achieves 1.43% higher test accuracy compared to SOTA techniques
with iso-memory footprints. This enables the deployment of our models on
memory-constrained devices, such as micro-controllers, without significant
accuracy loss, because the initial activation maps consume a large share of
on-chip memory for the high-resolution images that complex vision tasks
require. Our proposed pooling method also leverages the idea of channel
pruning to further reduce memory footprints.
Comment: 9 pages, 4 figures, conference
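The pipeline the abstract outlines — patch embedding, self-attention across all patches, a sigmoid gate, and restoration to the spatial-channel layout — can be sketched in numpy. This is a single-head toy version with random weights standing in for learned parameters (the paper uses multi-head attention; treat every name and design detail here as an assumption, not the authors' implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attentive_pool(fmap, stride=2, rng=None):
    """Illustrative single-head sketch of non-local self-attentive pooling.
    fmap: (H, W, C) activation map; returns (H//stride, W//stride, C).
    Weights are random stand-ins for learned projections."""
    rng = rng or np.random.default_rng(0)
    H, W, C = fmap.shape
    s = stride
    # patch embedding: flatten each s x s patch into one token
    patches = fmap.reshape(H // s, s, W // s, s, C).transpose(0, 2, 1, 3, 4)
    tokens = patches.reshape(-1, s * s * C)            # (N, s*s*C)
    d = tokens.shape[1]
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))               # non-local mixing of patches
    mixed = attn @ v
    gate = 1.0 / (1.0 + np.exp(-mixed))                # sigmoid activation
    pooled = patches.mean(axis=(2, 3))                 # per-patch average, (H//s, W//s, C)
    # spatial-channel restoration of the gate back to the pooled layout
    gate_c = gate.reshape(H // s, W // s, s * s, C).mean(axis=2)
    return pooled * gate_c
```

The point of the attention step is that each output location is modulated by information from every other patch, rather than only its local s x s window as in max or average pooling.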