90 research outputs found
Collaborative Spatio-temporal Feature Learning for Video Action Recognition
Spatio-temporal feature learning is of central importance for action
recognition in videos. Existing deep neural network models either learn spatial
and temporal features independently (C2D) or jointly with unconstrained
parameters (C3D). In this paper, we propose a novel neural operation which
encodes spatio-temporal features collaboratively by imposing a weight-sharing
constraint on the learnable parameters. In particular, we perform 2D
convolution along three orthogonal views of volumetric video data, which learns
spatial appearance and temporal motion cues respectively. By sharing the
convolution kernels of different views, spatial and temporal features are
collaboratively learned and thus benefit from each other. The complementary
features are subsequently fused by a weighted summation whose coefficients are
learned end-to-end. Our approach achieves state-of-the-art performance on
large-scale benchmarks and won the 1st place in the Moments in Time Challenge
2018. Moreover, based on the learned coefficients of different views, we are
able to quantify the contributions of spatial and temporal features. This
analysis sheds light on the interpretability of the model and may also guide the
future design of algorithms for video recognition. Comment: CVPR 2019
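The core operation, a single 2D kernel shared across the three orthogonal views of the video volume and fused by learned coefficients, can be sketched as follows. This is a minimal illustration assuming a PyTorch setting; the module name CoSTBlock, the single shared kernel per block, and the scalar softmax-normalized fusion weights are simplifying assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoSTBlock(nn.Module):
    """Apply one shared 2D kernel along the H-W, T-H and T-W views of a video
    clip and fuse the three responses with learned coefficients."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # A single 2D kernel shared across the three orthogonal views.
        self.conv = nn.Conv2d(channels, channels, kernel_size, padding=padding)
        # One fusion coefficient per view, normalized by softmax.
        self.view_logits = nn.Parameter(torch.zeros(3))

    def forward(self, x):                    # x: (N, C, T, H, W)
        n, c, t, h, w = x.shape
        # H-W view: treat every frame as an image (spatial appearance).
        hw = self.conv(x.permute(0, 2, 1, 3, 4).reshape(n * t, c, h, w))
        hw = hw.reshape(n, t, c, h, w).permute(0, 2, 1, 3, 4)
        # T-W view: slide the same kernel over time and width (temporal motion).
        tw = self.conv(x.permute(0, 3, 1, 2, 4).reshape(n * h, c, t, w))
        tw = tw.reshape(n, h, c, t, w).permute(0, 2, 3, 1, 4)
        # T-H view: slide the same kernel over time and height.
        th = self.conv(x.permute(0, 4, 1, 2, 3).reshape(n * w, c, t, h))
        th = th.reshape(n, w, c, t, h).permute(0, 2, 3, 4, 1)
        weights = F.softmax(self.view_logits, dim=0)
        return weights[0] * hw + weights[1] * tw + weights[2] * th
```

The learned weights also serve the analysis described above: inspecting the softmax-normalized coefficients indicates how much the spatial versus temporal views contribute at each block.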
A Layer Decomposition-Recomposition Framework for Neuron Pruning towards Accurate Lightweight Networks
Neuron pruning is an efficient method to compress a network into a slimmer
one, reducing the computational cost and storage overhead. Most
state-of-the-art results are obtained in a layer-by-layer optimization mode,
which discards the unimportant input neurons and uses the surviving ones to
reconstruct output neurons that approximate the original ones, one layer at a
time. However, an often-overlooked problem arises: the information loss
accumulates as the network goes deeper, since the surviving neurons no longer
encode the entire information of the preceding layers. A better alternative is
to propagate the entire useful information to reconstruct the pruned layer
instead of directly discarding the less important neurons. To this end, we
propose a novel Layer Decomposition-Recomposition Framework (LDRF) for neuron
pruning, by which each layer's output information is recovered in an embedding
space and then propagated to reconstruct the following pruned layers with
useful information preserved. We mainly conduct our experiments on the
ILSVRC-12 benchmark with VGG-16 and ResNet-50. Notably, our results before
end-to-end fine-tuning are already significantly superior owing to the
information-preserving property of the proposed framework. With end-to-end
fine-tuning, we achieve state-of-the-art results of 5.13x and 3x speed-ups with
only 0.5% and 0.65% top-5 accuracy drops respectively, outperforming existing
neuron pruning methods. Comment: accepted by AAAI 2019 as oral
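To make the reconstruction idea concrete, the toy sketch below shows how a pruned linear layer can be recomposed by least squares so that its outputs approximate the full-information outputs, rather than simply dropping the discarded neurons. It is a hedged illustration of the general layer-wise reconstruction principle the paper builds on; the function name prune_and_recompose, the plain NumPy lstsq solver, and the magnitude-based selection of surviving neurons are assumptions made for clarity and do not reproduce LDRF's embedding-space decomposition.

```python
import numpy as np

def prune_and_recompose(X, W, keep_idx):
    """X: (n_samples, n_in) inputs to a linear layer with weights W (n_in, n_out).
    keep_idx: indices of the input neurons that survive pruning.
    Returns a recomposed weight matrix for the pruned layer that reconstructs
    the original outputs in the least-squares sense, so information from the
    discarded neurons is propagated rather than simply dropped."""
    Y_full = X @ W                      # outputs computed with all input neurons
    X_kept = X[:, keep_idx]             # activations of the surviving neurons
    # Solve min_W' ||X_kept @ W' - Y_full||_F: recompose the layer so the
    # surviving neurons approximate the full-information outputs.
    W_new, *_ = np.linalg.lstsq(X_kept, Y_full, rcond=None)
    return W_new

# Usage on random data: the relative error stays small when the kept neurons
# span most of the useful information of the original layer.
rng = np.random.default_rng(0)
X = rng.standard_normal((512, 64))
W = rng.standard_normal((64, 32))
keep = np.argsort(np.abs(W).sum(axis=1))[-48:]    # keep the 48 "strongest" inputs
W_new = prune_and_recompose(X, W, keep)
err = np.linalg.norm(X[:, keep] @ W_new - X @ W) / np.linalg.norm(X @ W)
print(f"relative reconstruction error: {err:.3f}")
```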
AON: Towards Arbitrarily-Oriented Text Recognition
Recognizing text from natural images is a hot research topic in computer
vision due to its various applications. Despite several decades of research on
optical character recognition (OCR), recognizing text from natural images is
still a challenging task. This is because scene texts are
often in irregular (e.g., curved, arbitrarily oriented or seriously distorted)
arrangements, which have not yet been well addressed in the literature.
Existing methods on text recognition mainly work with regular (horizontal and
frontal) texts and cannot be trivially generalized to handle irregular texts.
In this paper, we develop the arbitrary orientation network (AON) to directly
capture the deep features of irregular texts, which are then fed into an
attention-based decoder to generate character sequences. The whole network can
be trained end-to-end by using only images and word-level annotations.
Extensive experiments on various benchmarks, including the CUTE80,
SVT-Perspective, IIIT5k, SVT and ICDAR datasets, show that the proposed
AON-based method achieves state-of-the-art performance on irregular datasets
and is comparable to major existing methods on regular datasets. Comment: Accepted by CVPR 2018
Learned Quality Enhancement via Multi-Frame Priors for HEVC Compliant Low-Delay Applications
Networked video applications, e.g., video conferencing, often suffer from
poor visual quality due to unexpected network fluctuation and limited
bandwidth. In this paper, we develop a Quality Enhancement Network (QENet) to
reduce video compression artifacts, leveraging spatial priors generated by
multi-scale convolutions and temporal priors generated by recurrently warped
temporal predictions. We integrate this QENet as a stand-alone post-processing
subsystem into a High Efficiency Video Coding (HEVC) compliant decoder.
Experimental results show that our QENet achieves state-of-the-art performance
compared with the default in-loop filters in HEVC and other deep-learning-based
methods, with noticeable objective gains in Peak Signal-to-Noise Ratio (PSNR)
and clear subjective visual improvements.
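The sketch below, again assuming PyTorch, illustrates one recurrent enhancement step in this spirit: the previous enhanced frame is warped toward the current decoded frame with a dense motion field and fused with multi-scale spatial features to predict a residual correction. The layer sizes, the externally supplied flow, and the residual formulation are illustrative assumptions rather than the QENet design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp(frame, flow):
    """Backward-warp the previous enhanced frame toward the current one using a
    dense flow field of shape (N, 2, H, W) given in pixel units."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=frame.device, dtype=frame.dtype),
        torch.arange(w, device=frame.device, dtype=frame.dtype),
        indexing="ij")
    grid_x = 2.0 * (xs + flow[:, 0]) / (w - 1) - 1.0     # normalize to [-1, 1]
    grid_y = 2.0 * (ys + flow[:, 1]) / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)          # (N, H, W, 2), x first
    return F.grid_sample(frame, grid, align_corners=True)

class QEStep(nn.Module):
    """One recurrent step: fuse multi-scale spatial features of the decoded
    frame with the motion-warped previous enhanced frame and predict a residual."""
    def __init__(self, ch=32):
        super().__init__()
        self.s1 = nn.Conv2d(2, ch, 3, padding=1)          # fine spatial scale
        self.s2 = nn.Conv2d(2, ch, 5, padding=2)          # coarser spatial scale
        self.fuse = nn.Conv2d(2 * ch, 1, 3, padding=1)

    def forward(self, decoded, prev_enhanced, flow):      # frames: (N, 1, H, W)
        temporal_prior = warp(prev_enhanced, flow)
        x = torch.cat((decoded, temporal_prior), dim=1)   # (N, 2, H, W)
        feats = torch.cat((F.relu(self.s1(x)), F.relu(self.s2(x))), dim=1)
        return decoded + self.fuse(feats)                 # residual-corrected frame
```

The enhanced output would then serve as prev_enhanced for the next frame, giving the recurrent propagation of temporal priors described in the abstract.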