566 research outputs found

    Action Recognition in Videos: from Motion Capture Labs to the Web

    This paper surveys human action recognition approaches based on visual data recorded from a single video camera. We propose an organizing framework that highlights the evolution of the area, with techniques moving from heavily constrained motion capture scenarios towards more challenging, realistic, "in the wild" videos. The proposed organization is based on the representation used as input for the recognition task, emphasizing the hypotheses assumed and thus the constraints imposed on the type of video that each technique is able to address. Making the hypotheses and constraints explicit makes the framework particularly useful for selecting a method for a given application. Another advantage of the proposed organization is that it allows the newest approaches to be categorized seamlessly alongside traditional ones, while providing an insightful perspective on the evolution of the action recognition task to date. That perspective is the basis for the discussion at the end of the paper, where we also present the main open issues in the area. Comment: Preprint submitted to CVIU, survey paper, 46 pages, 2 figures, 4 tables

    Interactive videos: Plausible video editing using sparse structure points

    Video remains the method of choice for capturing temporal events. However, without access to the underlying 3D scene models, it remains difficult to make object-level edits in a single video or across multiple videos. While it may be possible to explicitly reconstruct the 3D geometry to facilitate these edits, such a workflow is cumbersome, expensive, and tedious. In this work, we present a much simpler workflow for plausibly editing and mixing raw video footage using only sparse structure points (SSPs) recovered directly from the raw sequences. First, we use user scribbles to structure the point representations obtained by running structure-from-motion on the input videos. The resulting structure points, even when noisy and sparse, are then used to enable various video edits in 3D, including view perturbation, keyframe animation, and object duplication and transfer across videos. Specifically, we describe how to synthesize object images from new views by adopting a novel image-based rendering technique that uses the SSPs as a proxy for the missing 3D scene information. We propose a structure-preserving image warp on multiple input frames adaptively selected from the object video, followed by spatio-temporally coherent image stitching to compose the final object image. Simple planar shadows and depth maps are synthesized for objects to generate plausible video sequences mimicking real-world interactions. We demonstrate our system on a variety of input videos to produce complex edits that are otherwise difficult to achieve.

    Motion-Aware Gradient Domain Video Composition


    Denoising time-resolved microscopy image sequences with singular value thresholding.

    Time-resolved imaging in microscopy is important for the direct observation of a range of dynamic processes in both the physical and life sciences. However, the image sequences are often corrupted by noise, either as a result of high frame rates or a need to limit the radiation dose received by the sample. Here we exploit both spatial and temporal correlations using low-rank matrix recovery methods to denoise microscopy image sequences. We also make use of an unbiased risk estimator to address the issue of how much thresholding to apply in a robust and automated manner. The performance of the technique is demonstrated using simulated image sequences, as well as experimental scanning transmission electron microscopy data, where surface adatom motion and nanoparticle structural dynamics are recovered at rates of up to 32 frames per second. Junior Research Fellowship from Clare College. This is the final version of the article. It first appeared from Elsevier via http://dx.doi.org/10.1016/j.ultramic.2016.05.00

    How Not to Be Seen -- Inpainting Dynamic Objects in Crowded Scenes

    Removing dynamic objects from videos is an extremely challenging problem that even visual effects professionals often solve with time-consuming manual frame-by-frame editing. We propose a new approach to video completion that can deal with complex scenes containing dynamic backgrounds and non-periodic moving objects. We build upon the idea that the spatio-temporal hole left by a removed object can be filled with data available in other regions of the video where the occluded objects were visible. Video completion is performed by solving a large combinatorial problem that searches for an optimal pattern of pixel offsets from occluded to unoccluded regions. Our contribution includes an energy functional that generalizes well over different scenes with stable parameters, and that has the desirable convergence properties for a graph-cut-based optimization. We provide an interface to guide the completion process that both reduces computation time and allows for efficient correction of small errors in the result. We demonstrate that our approach can effectively complete complex, high-resolution occlusions that are more difficult than those handled by existing methods.

    Action Recognition: From Static Datasets to Moving Robots

    Deep learning models have achieved state-of-the-art performance in recognizing human activities, but often rely on background cues present in typical computer vision datasets, which are predominantly recorded with a stationary camera. If these models are to be employed by autonomous robots in real-world environments, they must be adapted to perform independently of background cues and camera motion effects. To address these challenges, we propose a new method that first generates generic action region proposals with good potential to locate one human action in unconstrained videos regardless of camera motion, and then uses the action proposals to extract and classify effective shape and motion features with a ConvNet framework. In a range of experiments, we demonstrate that by actively proposing action regions during both training and testing, state-of-the-art or better performance is achieved on benchmarks. We show that our approach outperforms the state of the art on two new datasets: one emphasizes irrelevant background, the other highlights camera motion. We also validate our action recognition method in an abnormal behavior detection scenario to improve workplace safety. The results verify a higher success rate for our method, owing to the ability of our system to recognize human actions regardless of environment and camera motion.

    AUTOMATED ESTIMATION, REDUCTION, AND QUALITY ASSESSMENT OF VIDEO NOISE FROM DIFFERENT SOURCES

    Estimating and removing noise from video signals is important for increasing either the visual quality of video signals or the performance of video processing algorithms, such as compression or segmentation, in which noise estimation or reduction is a pre-processing step. To estimate and remove noise, effective methods use both spatial and temporal information to increase the reliability of signal extraction from noise. The objective of this thesis is to introduce a video system with three novel techniques that estimate and reduce video noise from different sources, both effectively and efficiently, and assess video quality without a noise-free reference video. The first technique (intensity-variances based homogeneity classification) estimates visual noise of different types in images and video signals. The noise can be white Gaussian noise, mixed Poissonian-Gaussian (signal-dependent white) noise, or processed (frequency-dependent) noise. The method is based on the classification of intensity-variances of signal patches in order to find homogeneous regions that best represent the noise signal in the input signal. The method assumes that noise is signal-independent in each intensity class. To find homogeneous regions, the method works on the downsampled input image and divides it into patches. Each patch is assigned to an intensity class, whereas outlier patches are rejected. Then the most homogeneous cluster is selected and its noise variance is considered as the peak of noise variance. To account for processed noise, we estimate the degree of spatial correlation. To account for temporal noise variations, a stabilization process is proposed. We show that the proposed method is competitive with related state-of-the-art methods in noise estimation. The second technique provides solutions for removing real-world camera noise, such as signal-independent, signal-dependent, and frequency-dependent noise.
Firstly, we propose a noise equalization method in the intensity and frequency domains which enables a white Gaussian noise filter to handle real noise. Our experiments confirm the quality improvement under real noise when a white Gaussian noise filter is used with our equalization method. Secondly, we propose a band-limited time-space video denoiser which reduces video noise of different types. This denoiser consists of: 1) intensity-domain noise equalization to account for signal dependency, 2) band-limited anti-blocking time-domain filtering of the current frame using motion-compensated previous and subsequent frames, 3) spatial filtering combined with a noise frequency equalizer to remove residual noise left from temporal filtering, and 4) intensity de-equalization to invert the first step. To decrease the chance of motion blur, temporal weights are calculated using two levels of error estimation: coarse (block-level) and fine (pixel-level). We correct erroneous motion vectors by creating a homography from reliable motion vectors. To eliminate blockiness in the block-based temporal filter, we propose three ideas: interpolation of block-level error, band-limited filtering by subtracting the back-signal beforehand, and two-band motion compensation. The proposed time-space filter is parallelizable and can be significantly accelerated on a GPU. We show that the proposed method is competitive with related state-of-the-art methods in video denoising. The third technique (sparsity and dominant orientation quality index) is a new method to assess the quality of the denoised video frames without a reference (clean frames). In many image and video applications, a quantitative measure of image content, noise, and blur is required to facilitate quality assessment when the ground truth is not available. We propose a fast method to find the dominant orientation of image patches, which is used to decompose them into singular values.
Combining singular values with the sparsity of the patch in the transform domain, we measure the possible image content and noise of the patches and of the whole image. To measure the effect of noise accurately, our method takes both low- and high-textured patches into account. Before analyzing the patches, we apply shrinkage in the transform domain to increase the contrast of genuine image structure. We show that the proposed method is useful for automatically selecting the parameters of denoising algorithms in different noise scenarios, such as white Gaussian and real noise. Our objective and subjective results confirm the correspondence between the measured quality and the ground truth, and show that the proposed method rivals related state-of-the-art approaches.
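
The patch-based noise estimation idea in the abstract above (split the image into patches, then read the noise level from the most homogeneous ones) can be illustrated with a heavily simplified sketch. This is not the thesis's method: it omits the intensity classes, outlier rejection, downsampling, and spatial-correlation steps, and the function names, patch size, and `keep_fraction` parameter are all assumptions made for illustration. Because it keeps only the lowest-variance patches, the estimate tends to fall somewhat below the true noise variance.

```python
import statistics


def patch_variances(image, patch):
    """Variance of each non-overlapping patch x patch block of a 2D list."""
    height, width = len(image), len(image[0])
    variances = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            pixels = [
                image[y][x]
                for y in range(top, top + patch)
                for x in range(left, left + patch)
            ]
            variances.append(statistics.pvariance(pixels))
    return variances


def estimate_noise_variance(image, patch=8, keep_fraction=0.25):
    """Rough noise variance: median variance of the flattest patches.

    Assumption (hypothetical simplification): the lowest-variance patches
    contain little image structure, so their variance is mostly noise.
    """
    variances = sorted(patch_variances(image, patch))
    keep = max(1, int(len(variances) * keep_fraction))
    return statistics.median(variances[:keep])
```

Textured patches have large variance and fall outside the kept fraction, so they do not inflate the estimate as long as enough flat patches exist in the image.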

    Spatial Pyramid Context-Aware Moving Object Detection and Tracking for Full Motion Video and Wide Aerial Motion Imagery

    A robust and fast automatic moving object detection and tracking system is essential for characterizing target objects and extracting spatial and temporal information for different functionalities, including video surveillance systems, urban traffic monitoring and navigation, and robotics. In this dissertation, I present a collaborative Spatial Pyramid Context-aware moving object detection and Tracking (SPCT) system. The proposed visual tracker is composed of one master tracker, which usually relies on visual object features, and two auxiliary trackers based on object temporal motion information, which are called dynamically to assist the master tracker. SPCT uses image spatial context at different levels to make the video tracking system resistant to occlusion and background noise and to improve target localization accuracy and robustness. We chose seven pre-selected complementary feature channels, including RGB color, intensity, and a spatial pyramid of HoG, to encode object color, shape, and spatial layout information. We exploit the integral histogram as a building block to meet the demands of real-time performance. A novel fast algorithm is presented to accurately evaluate spatially weighted local histograms in constant time complexity using an extension of the integral histogram method. Different techniques are explored to efficiently compute the integral histogram on GPU architectures and are applied to fast spatio-temporal median computation and 3D face reconstruction texturing. We propose a multi-component framework based on semantic fusion of motion information with a projected building footprint map to significantly reduce the false alarm rate in urban scenes with many tall structures. Experiments on the extensive VOTC2016 benchmark dataset and on aerial video confirm that combining complementary tracking cues in an intelligent fusion framework enables persistent tracking for Full Motion Video and Wide Aerial Motion Imagery. Comment: PhD Dissertation (162 pages)
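
The integral histogram that the abstract above uses as a building block can be sketched as follows. This is the plain, unweighted integral histogram only, not the dissertation's spatially weighted extension or its GPU implementation. The function names, the `max_value` parameter, and the nested-list image format are assumptions chosen for illustration.

```python
def integral_histogram(image, num_bins, max_value=256):
    """Build H where H[y][x][b] counts bin-b pixels in rows < y, columns < x.

    `image` is a 2D list of integers in [0, max_value).
    """
    height, width = len(image), len(image[0])
    H = [[[0] * num_bins for _ in range(width + 1)] for _ in range(height + 1)]
    for y in range(height):
        for x in range(width):
            b = image[y][x] * num_bins // max_value  # bin index of this pixel
            for k in range(num_bins):
                # Inclusion-exclusion over the three already-filled neighbours.
                H[y + 1][x + 1][k] = (
                    H[y][x + 1][k] + H[y + 1][x][k] - H[y][x][k]
                )
            H[y + 1][x + 1][b] += 1
    return H


def region_histogram(H, top, left, bottom, right):
    """Histogram of rows top..bottom-1, columns left..right-1.

    Uses four corner lookups per bin, so the cost does not depend on the
    size of the region.
    """
    num_bins = len(H[0][0])
    return [
        H[bottom][right][k] - H[top][right][k]
        - H[bottom][left][k] + H[top][left][k]
        for k in range(num_bins)
    ]
```

After the one-time build, the histogram of any axis-aligned rectangle costs four lookups per bin, which is what makes repeated local histogram queries cheap enough for tracking.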