
    Adaptive nonlocal and structured sparse signal modeling and applications

    Features based on sparse representation, especially using the synthesis dictionary model, have been heavily exploited in signal processing and computer vision. Many applications such as image and video denoising, inpainting, demosaicing, super-resolution, magnetic resonance imaging (MRI), and computed tomography (CT) reconstruction have been shown to benefit from adaptive sparse signal modeling. However, synthesis dictionary learning typically involves expensive sparse coding and learning steps. Recently, sparsifying transform learning has received interest for its cheap computation and the optimal updates in its alternating algorithms. Prior works on transform learning have certain limitations, including (1) limited model richness and structure for handling diverse data, (2) lack of non-local structure, and (3) lack of effective extension to high-dimensional or streaming data. This dissertation focuses on advanced data-driven sparse modeling techniques, especially with nonlocal and structured sparse signal modeling.

    In the first work of this dissertation, we propose a learning methodology, dubbed Flipping and Rotation Invariant Sparsifying Transforms (FRIST), to better represent natural images that contain textures with various geometrical directions. The proposed alternating FRIST learning algorithm involves efficient optimal updates. We provide a convergence guarantee and demonstrate the empirical convergence behavior of the proposed FRIST learning approach. Preliminary experiments show the promising performance of FRIST learning for image sparse representation, segmentation, denoising, robust inpainting, and compressed sensing-based magnetic resonance image reconstruction.

    Next, we present an online high-dimensional sparsifying transform learning method for spatio-temporal data, and demonstrate its usefulness with a novel video denoising framework, dubbed VIDOSAT. The proposed method is based on our previous work on online sparsifying transform learning, which has low computational and memory costs and can potentially handle streaming video. By combining it with a block matching (BM) technique, the learned model can effectively adapt to video data with various motions. The patches are constructed either from corresponding 2D patches in successive frames or using an online block matching technique. The proposed online video denoising requires little memory and offers efficient processing. Numerical experiments analyze the contribution of the various components of the proposed video denoising scheme by "switching off" these components, for example, fixing the transform to be the 3D DCT rather than a learned transform. Other experiments compare the performance against prior schemes, such as dictionary learning-based schemes and the state-of-the-art VBM3D and VBM4D, on several video data sets, demonstrating the promising performance of the proposed methods.

    In the third part of the dissertation, we propose a joint sparse and low-rank model, dubbed STROLLR, to better represent natural images. Patch-based methods exploit local patch sparsity, whereas other works apply low-rankness of grouped patches to exploit image non-local structures. However, using either approach alone usually limits performance in image restoration applications. In order to fully utilize both the local and non-local image properties, we develop an image restoration framework using a transform learning scheme with joint low-rank regularization. The approach owes some of its computational efficiency and good performance to the use of transform learning for adaptive sparse representation, rather than the popular synthesis dictionary learning algorithms, which involve approximating NP-hard sparse coding and expensive learning steps. We demonstrate the proposed framework in applications to image denoising, inpainting, and compressed sensing-based magnetic resonance imaging. Results show promising performance compared to state-of-the-art competing methods.

    Last, we extend the effective joint sparsity and low-rankness model from images to video applications. We propose a novel video denoising method, based on an online tensor reconstruction scheme with a joint adaptive sparse and low-rank model, dubbed SALT. An efficient and unsupervised online unitary sparsifying transform learning method is introduced to impose adaptive sparsity on the fly. We develop an efficient 3D spatio-temporal data reconstruction framework based on the proposed online learning method, which exhibits low latency and can potentially handle streaming videos. To the best of our knowledge, this is the first work that combines adaptive sparsity and low-rankness for video denoising, and the first that solves the proposed problem in an online fashion. We demonstrate video denoising results on commonly used videos from public datasets. Numerical experiments show that the proposed video denoising method outperforms competing methods.
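
    For concreteness, here is a minimal sketch of the transform learning building block these works share, in the unitary case used by SALT: under a transform model, sparse coding reduces to cheap hard thresholding, and the transform update is a closed-form orthogonal Procrustes solution. This batch formulation omits the online, block matching, and low-rank components of the actual methods, and every name and parameter below is illustrative.

```python
import numpy as np

def learn_unitary_transform(X, lam=0.1, n_iter=30):
    """Alternating unitary sparsifying transform learning (a sketch).

    X   : (n, N) matrix of vectorized patches, one patch per column.
    lam : hard-thresholding level (assumes roughly normalized patches).
    Returns the learned unitary transform W and the sparse codes Z.
    """
    n = X.shape[0]
    W = np.eye(n)  # start from the identity transform
    for _ in range(n_iter):
        # Sparse coding step: the exact minimizer is hard thresholding.
        WX = W @ X
        Z = WX * (np.abs(WX) > lam)
        # Transform update: argmin over unitary W of ||W X - Z||_F^2 is the
        # orthogonal Procrustes solution W = V U^T, where X Z^T = U S V^T.
        U, _, Vt = np.linalg.svd(X @ Z.T)
        W = Vt.T @ U.T
    # Recompute the codes once more so Z matches the final transform.
    WX = W @ X
    Z = WX * (np.abs(WX) > lam)
    return W, Z
```

    Both steps solve their subproblems exactly in closed form, which is the source of the cheap computation and optimal updates contrasted above with NP-hard synthesis sparse coding.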

    Latent Semantic Learning with Structured Sparse Representation for Human Action Recognition

    This paper proposes a novel latent semantic learning method for extracting high-level features (i.e., latent semantics) from a large vocabulary of abundant mid-level features (i.e., visual keywords) with structured sparse representation, which can help to bridge the semantic gap in the challenging task of human action recognition. To discover the manifold structure of mid-level features, we develop a spectral embedding approach to latent semantic learning based on an L1-graph, without the need to tune any parameter for graph construction, which is a key step of manifold learning. More importantly, we construct the L1-graph with structured sparse representation, obtained by structured sparse coding whose structured sparsity is ensured by a novel L1-norm hypergraph regularization over mid-level features. In the new embedding space, we learn latent semantics automatically from abundant mid-level features through spectral clustering. The learnt latent semantics can be readily used for human action recognition with an SVM by defining a histogram intersection kernel. Different from traditional latent semantic analysis based on topic models, our latent semantic learning method can explore the manifold structure of mid-level features in both L1-graph construction and spectral embedding, which results in compact but discriminative high-level features. Experimental results on the commonly used KTH action dataset and the unconstrained YouTube action dataset show the superior performance of our method. (A short version of this paper appears in ICCV.)
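
    As a rough illustration of this pipeline (not the paper's exact algorithm), the sketch below builds an L1-graph by sparsely coding each mid-level feature over the remaining ones and then applies spectral clustering to obtain latent semantics. It substitutes plain Lasso for the paper's hypergraph-regularized structured sparse coding, and the scikit-learn calls and parameter values are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import SpectralClustering

def l1_graph_latent_semantics(F, n_clusters=50, alpha=0.01):
    """Sketch: L1-graph construction followed by spectral clustering.

    F : (d, m) matrix holding m mid-level feature vectors as columns.
    Returns one cluster label (latent semantic) per mid-level feature.
    """
    d, m = F.shape
    W = np.zeros((m, m))
    for i in range(m):
        # Code feature i as a sparse combination of all other features;
        # plain Lasso stands in for structured sparse coding here.
        idx = [j for j in range(m) if j != i]
        coder = Lasso(alpha=alpha, max_iter=5000).fit(F[:, idx], F[:, i])
        W[i, idx] = np.abs(coder.coef_)  # sparse codes become edge weights
    A = 0.5 * (W + W.T)  # symmetrize to get an undirected L1-graph
    return SpectralClustering(n_clusters=n_clusters,
                              affinity='precomputed').fit_predict(A)
```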

    Unmanned aerial vehicle video-based target tracking algorithm using sparse representation

    Target tracking based on unmanned aerial vehicle (UAV) video is a significant technique in intelligent urban surveillance systems for smart city applications, such as smart transportation, road traffic monitoring, and inspection of stolen vehicles. In this paper, a vision-based target tracking algorithm aimed at locating UAV-captured targets, such as pedestrians and vehicles, is proposed using sparse representation theory. First, each target candidate is sparsely represented in the subspace spanned by a joint dictionary. Then, the sparse representation coefficient is further constrained by an L2 regularization based on temporal consistency. To cope with the partial occlusion appearing in UAV videos, a Markov Random Field (MRF)-based binary support vector with a contiguous occlusion constraint is introduced into our sparse representation model. For long-term tracking, a particle filter framework along with a dynamic template update scheme is designed. Both qualitative and quantitative experiments on visible (Vis) and infrared (IR) UAV videos show that the presented tracker achieves better precision rate and success rate than other state-of-the-art trackers.
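
    A minimal sketch of the core coding step may help: the model below represents a candidate y over a joint dictionary D with an L1 sparsity penalty and an L2 temporal-consistency penalty tying the code to the previous frame's code, solved by plain ISTA. The MRF occlusion support and the particle filter are omitted, and all names and parameter values are hypothetical.

```python
import numpy as np

def candidate_code(y, D, c_prev, lam=0.05, mu=0.1, n_iter=100):
    """Sketch: sparse code of a target candidate over a joint dictionary.

    Solves min_c 0.5||y - Dc||^2 + lam*||c||_1 + 0.5*mu*||c - c_prev||^2
    with ISTA, where c_prev is the previous frame's code (temporal term).
    """
    c = c_prev.copy()
    L = np.linalg.norm(D, 2) ** 2 + mu  # Lipschitz constant of the smooth part
    for _ in range(n_iter):
        grad = D.T @ (D @ c - y) + mu * (c - c_prev)
        z = c - grad / L
        c = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return c
```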

    Two-stage sparse representation based abnormal crowd event detection in videos

    Ubiquitous surveillance has become part of our lives to increase security and safety. Despite the wide application of surveillance systems, their efficiency is limited by human factors, such as boredom and fatigue, because most of the time nothing unusual happens. In safety-critical applications, time is essential and it is vital to act fast to prevent costly incidents. This thesis proposes a two-stage abnormal crowd event detection framework, based on k-means clustering in the first stage and sparse representation based methods in the second stage, to alleviate the laborious task of video monitoring. We conduct a literature review of 18 studies, focusing specifically on sparse representation based methods. Accordingly, we choose the spatio-temporal gradient feature for its simplicity, efficiency, and effectiveness in motion representation. After extracting features only from normal events, k-means clustering is applied to separate different motion feature clusters. Then, clusters with fewer samples, which are deemed to contain mostly abnormal features, are removed according to a threshold. In the second stage, we learn a dictionary for each remaining cluster using the approximate K-SVD algorithm. In testing, the reconstruction error of a feature against a learned dictionary and its sparse representation is used to determine an abnormality, as sketched below. We conduct extensive experiments on a standard dataset to evaluate the detection performance of the method, and also investigate the effect of its hyper-parameters. We further compare our method with alternatives to examine its effectiveness. Results indicate that our framework can successfully detect abnormal events in a scene while running in real time at 161 frames per second. With a few exceptions, no significant advantage of the two-stage sparse representation approach over a single large dictionary was found; we speculate that these results may be influenced by the small sample size. Nevertheless, because our approach is unsupervised, it can be adapted to different contexts without additional annotation effort, using only normal events from videos, which motivates further development.
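
    To make the two-stage test concrete, here is a sketch under stated assumptions: the surviving k-means centroids and the per-cluster dictionaries (trained offline with approximate K-SVD) are given, scikit-learn's OMP stands in for the sparse coding step, and the sparsity level and error threshold are illustrative, not the thesis's tuned values.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def is_abnormal(f, centroids, dicts, sparsity=5, err_thresh=0.25):
    """Sketch: flag a spatio-temporal gradient feature f as abnormal.

    centroids : (K, d) k-means centers kept after the first stage.
    dicts     : list of (d, n_atoms) dictionaries, one per kept cluster.
    """
    # Stage 1: route the feature to its nearest surviving cluster.
    k = int(np.argmin(np.linalg.norm(centroids - f, axis=1)))
    D = dicts[k]
    # Stage 2: sparse reconstruction error against that cluster's dictionary.
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=sparsity,
                                    fit_intercept=False).fit(D, f)
    residual = np.linalg.norm(f - D @ omp.coef_) / np.linalg.norm(f)
    return residual > err_thresh  # large error: not explained by normal events
```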

    Video anomaly detection and localization by local motion based joint video representation and OCELM

    Human-based video analysis is becoming increasingly exhausting due to the ubiquitous use of surveillance cameras and the explosive growth of video data. This paper proposes a novel approach to detect and localize video anomalies automatically. For video feature extraction, video volumes are jointly represented by two novel local motion based video descriptors, SL-HOF and ULGP-OF. The SL-HOF descriptor captures the spatial distribution of 3D local regions' motion in the spatio-temporal cuboid extracted from video, which implicitly reflects the structural information of the foreground and depicts foreground motion more precisely than the standard HOF descriptor. To locate the video foreground more accurately, we propose a new Robust PCA based foreground localization scheme. The ULGP-OF descriptor, which seamlessly combines the classic 2D texture descriptor LGP with optical flow, is proposed to describe the motion statistics of local region texture in the areas located by the foreground localization scheme. Both SL-HOF and ULGP-OF are shown to be more discriminative than existing video descriptors in anomaly detection. To model the features of normal video events, we introduce the newly-emergent one-class Extreme Learning Machine (OCELM) as the data description algorithm. With a tremendous reduction in training time, OCELM can yield comparable or better performance than existing algorithms such as the classic OCSVM, which makes our approach easier to update and more applicable to fast learning from rapidly generated surveillance data. The proposed approach is tested on the UCSD Ped1, Ped2, and UMN datasets, and experimental results show that it achieves state-of-the-art results in both video anomaly detection and localization. This work was supported by the National Natural Science Foundation of China (Project nos. 60970034, 61170287, 61232016).
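
    Since OCELM may be less familiar than OCSVM, the sketch below shows a minimal one-class ELM: a fixed random hidden layer, output weights fitted by ridge regression to map normal features to 1, and a score quantile as the rejection threshold. The threshold rule and all parameters are assumptions, not necessarily the paper's choices.

```python
import numpy as np

class OneClassELM:
    """Minimal one-class extreme learning machine (a sketch)."""

    def __init__(self, n_hidden=500, C=1.0, quantile=0.05, seed=0):
        self.n_hidden, self.C, self.quantile = n_hidden, C, quantile
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)  # fixed random feature map

    def fit(self, X):
        # X: (n_samples, d) features extracted from normal events only.
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        t = np.ones(len(X))
        # Ridge regression mapping every normal sample to the target 1.
        self.beta = np.linalg.solve(H.T @ H + np.eye(self.n_hidden) / self.C,
                                    H.T @ t)
        # Reject the lowest few percent of training scores at test time.
        self.theta = np.quantile(H @ self.beta, self.quantile)
        return self

    def predict_abnormal(self, X):
        return self._hidden(X) @ self.beta < self.theta
```

    Training here is a single regularized linear solve, which is where the large training-time reduction relative to OCSVM comes from.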

    Learned Spatio-Temporal Texture Descriptors for RGB-D Human Action Recognition

    With the recent arrival of the Kinect, action recognition with depth images has attracted wide attention from researchers, and various descriptors have been proposed, among which Local Binary Patterns (LBP) texture descriptors possess appearance-invariance properties. However, LBP and its variants are mostly hand-designed, demanding strong prior knowledge from engineers and not discriminative enough for recognition tasks. To this end, this paper develops compact spatio-temporal texture descriptors, namely 3D-compact LBP (3D-CLBP) and local depth patterns (3D-CLDP), for color and depth videos, in the light of compact binary face descriptor learning in face recognition. Extensive experiments on three standard datasets, 3D Online Action, MSR Action Pairs, and MSR Daily Activity 3D, demonstrate that our method is superior to most comparative methods in terms of performance and can capture spatio-temporal texture cues in videos.
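
    For contrast with the learned descriptors, here is a sketch of the hand-designed baseline they improve on: plain 8-neighbour LBP codes over a single grayscale frame. The learned binary projections of 3D-CLBP and 3D-CLDP are not reproduced here.

```python
import numpy as np

def lbp_8neighbors(img):
    """Plain 8-neighbour LBP on one grayscale frame (hand-crafted baseline)."""
    h, w = img.shape
    center = img[1:-1, 1:-1]
    code = np.zeros_like(center, dtype=np.uint8)
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(shifts):
        # Compare each pixel with its shifted neighbour; set one bit per result.
        neighbor = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        code |= (neighbor >= center).astype(np.uint8) << bit
    # Per-pixel 8-bit patterns; histograms of these form the texture descriptor.
    return code
```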