
    Two-Stream Action Recognition-Oriented Video Super-Resolution

    We study the video super-resolution (SR) problem for facilitating video analytics tasks, e.g. action recognition, rather than for visual quality. Popular action recognition methods based on convolutional networks, exemplified by two-stream networks, are not directly applicable to video of low spatial resolution. This can be remedied by performing video SR prior to recognition, which motivates us to improve the SR procedure for recognition accuracy. Tailored for two-stream action recognition networks, we propose two video SR methods, for the spatial and temporal streams respectively. On the one hand, we observe that regions with action are more important to recognition, and we propose an optical-flow guided weighted mean-squared-error loss for our spatial-oriented SR (SoSR) network to emphasize the reconstruction of moving objects. On the other hand, we observe that existing video SR methods incur temporal discontinuity between frames, which also worsens the recognition accuracy, and we propose a Siamese network for our temporal-oriented SR (ToSR) training that emphasizes the temporal continuity between consecutive frames. We perform experiments using two state-of-the-art action recognition networks and two well-known datasets, UCF101 and HMDB51. Results demonstrate the effectiveness of our proposed SoSR and ToSR in improving recognition accuracy. Comment: Accepted to ICCV 2019. Code: https://github.com/AlanZhang1995/TwoStreamS
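The optical-flow guided weighted MSE loss mentioned in the abstract above can be illustrated with a small sketch. This is not the authors' SoSR loss: the abstract does not give the weighting formula, so the weight used here (1 plus the flow magnitude normalised by its maximum) and the function name `flow_weighted_mse` are assumptions chosen only to show the idea that pixels with more motion contribute more to the reconstruction error.

```python
from math import hypot


def flow_weighted_mse(sr, hr, flow_u, flow_v, eps=1e-8):
    """Weighted MSE in which pixels with larger optical flow count more.

    sr, hr         -- 2D lists of floats: super-resolved and ground-truth frames
    flow_u, flow_v -- 2D lists of floats: horizontal and vertical flow, same shape
    The weight 1 + |flow| / max|flow| is an illustrative assumption,
    not the formula from the paper.
    """
    # Per-pixel flow magnitude.
    mags = [[hypot(u, v) for u, v in zip(row_u, row_v)]
            for row_u, row_v in zip(flow_u, flow_v)]
    max_mag = max(max(row) for row in mags)

    total = 0.0
    weight_sum = 0.0
    for row_sr, row_hr, row_m in zip(sr, hr, mags):
        for s, h, m in zip(row_sr, row_hr, row_m):
            w = 1.0 + m / (max_mag + eps)  # static pixels keep weight 1
            total += w * (s - h) ** 2
            weight_sum += w
    return total / weight_sum
```

With zero flow everywhere the function reduces to the ordinary MSE; an error located on a moving pixel raises the loss relative to the same error on a static pixel.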

    Efficient Human Activity Recognition in Large Image and Video Databases

    Vision-based human action recognition has attracted considerable interest in recent research for its applications to video surveillance, content-based search, healthcare, and interactive games. Most existing research deals with building informative feature descriptors, designing efficient and robust algorithms, proposing versatile and challenging datasets, and fusing multiple modalities. Often, these approaches build on certain conventions such as the use of motion cues to determine video descriptors, application of off-the-shelf classifiers, and single-factor classification of videos. In this thesis, we deal with important but overlooked issues such as efficiency, simplicity, and scalability of human activity recognition in different application scenarios: controlled video environments (e.g. indoor surveillance), unconstrained videos (e.g. YouTube), depth or skeletal data (e.g. captured by Kinect), and person images (e.g. Flickr). In particular, we are interested in answering questions like (a) is it possible to efficiently recognize human actions in controlled videos without temporal cues? (b) given that large-scale unconstrained video data are often of a high-dimension, low-sample-size (HDLSS) nature, how can human actions be recognized efficiently in such data? (c) considering the rich 3D motion information available from depth or motion capture sensors, is it possible to recognize both the actions and the actors using only the motion dynamics of the underlying activities? and (d) can motion information from monocular videos be used for automatically determining salient regions for recognizing actions in still images?

    Adaptive nonlocal and structured sparse signal modeling and applications

    Features based on sparse representation, especially using the synthesis dictionary model, have been heavily exploited in signal processing and computer vision. Many applications such as image and video denoising, inpainting, demosaicing, super-resolution, magnetic resonance imaging (MRI), and computed tomography (CT) reconstruction have been shown to benefit from adaptive sparse signal modeling. However, synthesis dictionary learning typically involves expensive sparse coding and learning steps. Recently, sparsifying transform learning received interest for its cheap computation and its optimal updates in the alternating algorithms. Prior works on transform learning have certain limitations, including (1) limited model richness and structure for handling diverse data, (2) lack of non-local structure, and (3) lack of effective extension to high-dimensional or streaming data. This dissertation focuses on advanced data-driven sparse modeling techniques, especially with nonlocal and structured sparse signal modeling. In the first work of this dissertation, we propose a methodology for learning, dubbed Flipping and Rotation Invariant Sparsifying Transforms (FRIST), to better represent natural images that contain textures with various geometrical directions. The proposed alternating FRIST learning algorithm involves efficient optimal updates. We provide a convergence guarantee, and demonstrate the empirical convergence behavior of the proposed FRIST learning approach. Preliminary experiments show the promising performance of FRIST learning for image sparse representation, segmentation, denoising, robust inpainting, and compressed sensing-based magnetic resonance image reconstruction. Next, we present an online high-dimensional sparsifying transform learning method for spatio-temporal data, and demonstrate its usefulness with a novel video denoising framework, dubbed VIDOSAT. 
The proposed method is based on our previous work on online sparsifying transform learning, which has low computational and memory costs, and can potentially handle streaming video. By combining with a block matching (BM) technique, the learned model can effectively adapt to video data with various motions. The patches are constructed either from corresponding 2D patches in successive frames or using an online block matching technique. The proposed online video denoising requires little memory and offers efficient processing. Numerical experiments are used to analyze the contribution of the various components of the proposed video denoising scheme by "switching off" these components, for example, fixing the transform to be 3D DCT rather than a learned transform. Other experiments compare against the performance of prior schemes such as dictionary learning-based schemes, and the state-of-the-art VBM3D and VBM4D, on several video data sets, demonstrating the promising performance of the proposed methods. In the third part of the dissertation, we propose a joint sparse and low-rank model, dubbed STROLLR, to better represent natural images. Patch-based methods exploit local patch sparsity, whereas other works apply low-rankness of grouped patches to exploit image non-local structures. However, using either approach alone usually limits performance in image restoration applications. In order to fully utilize both the local and non-local image properties, we develop an image restoration framework using a transform learning scheme with joint low-rank regularization. The approach owes some of its computational efficiency and good performance to the use of transform learning for adaptive sparse representation rather than the popular synthesis dictionary learning algorithms, which involve approximation of NP-hard sparse coding and expensive learning steps. 
We demonstrate the proposed framework in various applications to image denoising, inpainting, and compressed sensing-based magnetic resonance imaging. Results show promising performance compared to state-of-the-art competing methods. Last, we extend the effective joint sparsity and low-rankness model from image to video applications. We propose a novel video denoising method, based on an online tensor reconstruction scheme with a joint adaptive sparse and low-rank model, dubbed SALT. An efficient and unsupervised online unitary sparsifying transform learning method is introduced to impose adaptive sparsity on the fly. We develop an efficient 3D spatio-temporal data reconstruction framework based on the proposed online learning method, which exhibits low latency and can potentially handle streaming videos. To the best of our knowledge, this is the first work that combines adaptive sparsity and low-rankness for video denoising, and the first work that solves the proposed problem in an online fashion. We demonstrate video denoising results over commonly used videos from public datasets. Numerical experiments show that the proposed video denoising method outperforms competing methods.
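The transform-model denoising idea running through this abstract can be sketched in its simplest form: apply an orthonormal transform to a patch, zero the small coefficients, and invert the transform. The sketch below is an assumption-laden illustration, not the dissertation's method: it uses a fixed 1D DCT (analogous to the "switched off" 3D DCT baseline the abstract mentions) instead of a learned transform, and omits block matching, online updates, and low-rank regularization. The names `dct_matrix` and `hard_threshold_denoise` are hypothetical.

```python
import math


def dct_matrix(n):
    """Return the n x n orthonormal DCT-II matrix as a list of rows."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def hard_threshold_denoise(patch, threshold):
    """Denoise a 1D patch by hard thresholding in a fixed DCT domain.

    A learned sparsifying transform would replace dct_matrix here;
    the fixed DCT is an illustrative stand-in.
    """
    n = len(patch)
    W = dct_matrix(n)
    # Forward transform: coeffs = W @ patch.
    coeffs = [sum(W[k][i] * patch[i] for i in range(n)) for k in range(n)]
    # Sparse coding step: keep only coefficients at or above the threshold.
    sparse = [c if abs(c) >= threshold else 0.0 for c in coeffs]
    # W is orthonormal, so its inverse is its transpose: patch = W^T @ sparse.
    return [sum(W[k][i] * sparse[k] for k in range(n)) for i in range(n)]
```

With a threshold of zero the patch is reconstructed exactly (up to floating-point error), which is a quick check that the transform is orthonormal; raising the threshold removes progressively more of the signal's small transform coefficients.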