422 research outputs found

    Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification

    Despite the steady progress in video analysis led by the adoption of convolutional neural networks (CNNs), the relative improvement has been less drastic than that in 2D static image classification. Three main challenges exist: spatial (image) feature representation, temporal information representation, and model/computation complexity. It was recently shown by Carreira and Zisserman that 3D CNNs, inflated from 2D networks and pretrained on ImageNet, could be a promising way for spatial and temporal representation learning. However, in terms of model/computation complexity, 3D CNNs are much more expensive than 2D CNNs and prone to overfitting. We seek a balance between speed and accuracy by building an effective and efficient video classification system through systematic exploration of critical network design choices. In particular, we show that it is possible to replace many of the 3D convolutions with low-cost 2D convolutions. Rather surprisingly, the best result (in both speed and accuracy) is achieved when replacing the 3D convolutions at the bottom of the network, suggesting that temporal representation learning on high-level semantic features is more useful. Our conclusion generalizes to datasets with very different properties. When combined with several other cost-effective designs, including separable spatial/temporal convolution and feature gating, our approach results in an effective video classification system that produces very competitive results on several action classification benchmarks (Kinetics, Something-something, UCF101 and HMDB), as well as two action detection (localization) benchmarks (JHMDB and UCF101-24). Comment: ECCV 2018 camera ready
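
    As a rough illustration of the separable spatial/temporal convolution and feature gating designs mentioned in this abstract, the sketch below factors a 3D convolution into a per-frame spatial convolution followed by a temporal convolution and adds a simple self-gating head. It assumes PyTorch; the module names, channel sizes, and gating form are illustrative stand-ins, not the authors' exact architecture.

```python
# Minimal sketch of a separable spatio-temporal convolution block:
# a kxkxk 3D convolution factored into a 1xkxk spatial convolution
# followed by a kx1x1 temporal convolution, plus a simple self-gating
# head. Channel sizes and layer choices are illustrative only.
import torch
import torch.nn as nn


class SepConv3d(nn.Module):
    """Factor a full 3D convolution into spatial + temporal parts."""

    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch,
                                 kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2),
                                 bias=False)
        self.temporal = nn.Conv3d(out_ch, out_ch,
                                  kernel_size=(k, 1, 1),
                                  padding=(k // 2, 0, 0),
                                  bias=False)
        self.bn = nn.BatchNorm3d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):          # x: (N, C, T, H, W)
        x = self.spatial(x)        # 2D convolution applied per frame
        x = self.temporal(x)       # 1D convolution along the time axis
        return self.relu(self.bn(x))


class FeatureGate(nn.Module):
    """Simple self-gating: reweight channels by a global context signal."""

    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Linear(channels, channels)

    def forward(self, x):                       # x: (N, C, T, H, W)
        context = x.mean(dim=(2, 3, 4))         # global average pooling
        weights = torch.sigmoid(self.fc(context))
        return x * weights.view(x.size(0), -1, 1, 1, 1)


if __name__ == "__main__":
    clip = torch.randn(2, 64, 8, 28, 28)        # batch of video feature maps
    block = nn.Sequential(SepConv3d(64, 128), FeatureGate(128))
    print(block(clip).shape)                    # torch.Size([2, 128, 8, 28, 28])
```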

    Optical Flow Guided Feature: A Fast and Robust Motion Representation for Video Action Recognition

    Motion representation plays a vital role in human action recognition in videos. In this study, we introduce a novel compact motion representation for video action recognition, named Optical Flow guided Feature (OFF), which enables the network to distill temporal information through a fast and robust approach. The OFF is derived from the definition of optical flow and is orthogonal to the optical flow. The derivation also provides theoretical support for using the difference between two frames. By directly calculating pixel-wise spatiotemporal gradients of the deep feature maps, the OFF can be embedded in any existing CNN-based video action recognition framework with only a slight additional cost. It enables the CNN to extract spatiotemporal information, especially the temporal information between frames, simultaneously. This simple but powerful idea is validated by experimental results. The network with OFF fed only by RGB inputs achieves a competitive accuracy of 93.3% on UCF-101, which is comparable with the result obtained by two streams (RGB and optical flow), but is 15 times faster. Experimental results also show that OFF is complementary to other motion modalities such as optical flow. When the proposed method is plugged into the state-of-the-art video action recognition framework, it achieves 96.0% and 74.2% accuracy on UCF-101 and HMDB-51, respectively. The code for this project is available at https://github.com/kevin-ssy/Optical-Flow-Guided-Feature. Comment: CVPR 2018; code available at https://github.com/kevin-ssy/Optical-Flow-Guided-Feature
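
    The sketch below illustrates the kind of pixel-wise spatiotemporal gradients of deep feature maps that the abstract describes: Sobel-style spatial gradients of one frame's feature map plus the temporal difference to the next frame's. It assumes PyTorch; the function name, the Sobel operator, and the tensor shapes are illustrative assumptions rather than the authors' exact OFF formulation.

```python
# Minimal sketch of OFF-style terms: pixel-wise spatial gradients of a
# deep feature map plus the temporal difference between feature maps of
# two consecutive frames. Names and the Sobel operator are illustrative.
import torch
import torch.nn.functional as F

# 3x3 Sobel kernels for horizontal / vertical spatial gradients.
SOBEL_X = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]]).view(1, 1, 3, 3)
SOBEL_Y = SOBEL_X.transpose(2, 3)


def off_features(feat_t, feat_t1):
    """Compute OFF-style terms from feature maps of frames t and t+1.

    feat_t, feat_t1: tensors of shape (N, C, H, W) taken from the same
    layer of a 2D CNN for two consecutive frames.
    Returns the spatial gradients of frame t and the temporal difference.
    """
    n, c, h, w = feat_t.shape
    # Apply the Sobel filters channel-wise (depthwise convolution).
    kx = SOBEL_X.to(feat_t).repeat(c, 1, 1, 1)
    ky = SOBEL_Y.to(feat_t).repeat(c, 1, 1, 1)
    grad_x = F.conv2d(feat_t, kx, padding=1, groups=c)
    grad_y = F.conv2d(feat_t, ky, padding=1, groups=c)
    grad_t = feat_t1 - feat_t                 # temporal difference term
    return grad_x, grad_y, grad_t


if __name__ == "__main__":
    f_t = torch.randn(2, 32, 28, 28)          # feature maps, frame t
    f_t1 = torch.randn(2, 32, 28, 28)         # feature maps, frame t+1
    gx, gy, gt = off_features(f_t, f_t1)
    print(gx.shape, gy.shape, gt.shape)
```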

    Deep learning for feature extraction in remote sensing: A case-study of aerial scene classification

    Scene classification relying on images is essential in many systems and applications related to remote sensing. The scientific interest in scene classification from remotely collected images is increasing, and many datasets and algorithms are being developed. The introduction of convolutional neural networks (CNNs) and other deep learning techniques contributed to vast improvements in the accuracy of image scene classification in such systems. To classify the scene from aerial images, we used a two-stream deep architecture. We performed the first part of the classification, the feature extraction, using pre-trained CNNs that extract deep features of aerial images from different network layers: the average pooling layer or some of the preceding convolutional layers. Next, we applied feature concatenation to the features extracted from the various neural networks, after performing dimensionality reduction on the large feature vectors. We experimented extensively with different CNN architectures to get optimal results. Finally, we used a Support Vector Machine (SVM) for the classification of the concatenated features. The competitiveness of the examined technique was evaluated on two real-world datasets: UC Merced and WHU-RS. The obtained classification accuracies demonstrate that the considered method is competitive with other cutting-edge techniques.
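
    A minimal sketch of the pipeline described above (deep features from pre-trained CNNs, dimensionality reduction, concatenation of the streams, and an SVM classifier) might look like the following. It assumes PyTorch/torchvision and scikit-learn; the choice of ResNet backbones, the PCA size, and the random stand-in data are illustrative assumptions, not the exact setup used in the study.

```python
# Minimal sketch: extract pooled deep features from two pre-trained CNNs,
# reduce their dimensionality, concatenate them, and train an SVM.
# Backbones, PCA size, and the random data are illustrative only.
import numpy as np
import torch
from torchvision import models
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def extract_features(backbone, images):
    """Return pooled deep features (N, D) from a pre-trained backbone."""
    backbone.eval()
    # Drop the classification head, keep everything up to global pooling.
    feature_net = torch.nn.Sequential(*list(backbone.children())[:-1])
    with torch.no_grad():
        feats = feature_net(images)           # (N, D, 1, 1) for ResNets
    return feats.flatten(1).numpy()


if __name__ == "__main__":
    images = torch.randn(8, 3, 224, 224)      # stand-in for aerial scenes
    labels = np.random.randint(0, 4, size=8)  # stand-in scene labels

    # Two feature-extraction streams from ImageNet-pretrained CNNs.
    f1 = extract_features(models.resnet50(weights="IMAGENET1K_V1"), images)
    f2 = extract_features(models.resnet18(weights="IMAGENET1K_V1"), images)

    # Reduce each stream's feature vectors before concatenating them.
    f1 = PCA(n_components=4).fit_transform(f1)
    f2 = PCA(n_components=4).fit_transform(f2)
    fused = np.concatenate([f1, f2], axis=1)

    # Classify the concatenated features with an SVM.
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    clf.fit(fused, labels)
    print("train accuracy:", clf.score(fused, labels))
```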