
    Action Recognition in Videos: from Motion Capture Labs to the Web

    This paper presents a survey of human action recognition approaches based on visual data recorded from a single video camera. We propose an organizing framework that highlights the evolution of the area, with techniques moving from heavily constrained motion capture scenarios towards more challenging, realistic, "in the wild" videos. The proposed organization is based on the representation used as input for the recognition task, emphasizing the hypotheses assumed and thus the constraints imposed on the type of video that each technique is able to address. Making these hypotheses and constraints explicit renders the framework particularly useful for selecting a method for a given application. Another advantage of the proposed organization is that it allows the newest approaches to be categorized seamlessly alongside traditional ones, while providing an insightful perspective on the evolution of the action recognition task up to now. That perspective is the basis for the discussion at the end of the paper, where we also present the main open issues in the area. Comment: Preprint submitted to CVIU, survey paper, 46 pages, 2 figures, 4 tables.

    Spatio-Temporal Action Detection with Cascade Proposal and Location Anticipation

    In this work, we address the problem of spatio-temporal action detection in temporally untrimmed videos. It is an important and challenging task, as finding accurate human actions in both temporal and spatial space is essential for analyzing large-scale video data. To tackle this problem, we propose a cascade proposal and location anticipation (CPLA) model for frame-level action detection. There are several salient points of our model: (1) a cascade region proposal network (casRPN) is adopted for action proposal generation and shows better localization accuracy compared with a single region proposal network (RPN); (2) action spatio-temporal consistencies are exploited via a location anticipation network (LAN), so frame-level action detection is not conducted independently. Frame-level detections are then linked by solving a linking score maximization problem, and temporally trimmed into spatio-temporal action tubes. We demonstrate the effectiveness of our model on the challenging UCF101 and LIRIS-HARL datasets, achieving state-of-the-art performance on both. Comment: Accepted at BMVC 2017 (oral).
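The abstract does not spell out the linking score, but a common formulation in the action-tube literature scores a link between detections in consecutive frames as the sum of their confidences plus the overlap of their boxes, and maximizes the total over the video by dynamic programming. A minimal sketch under that assumption (the box format, IoU weighting, and function names are illustrative, not taken from the paper):

```python
def iou(a, b):
    # a, b: boxes as (x1, y1, x2, y2)
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter) if inter else 0.0

def link_detections(frames):
    """frames: per-frame detection lists [(box, score), ...].
    Returns one detection index per frame, chosen to maximize the sum
    of detection scores plus IoU between consecutive boxes."""
    # best[t][i]: best accumulated score of a tube ending at detection i of frame t
    best = [[s for _, s in frames[0]]]
    back = []
    for t in range(1, len(frames)):
        cur, ptr = [], []
        for box, score in frames[t]:
            cands = [best[t - 1][j] + iou(frames[t - 1][j][0], box)
                     for j in range(len(frames[t - 1]))]
            j = max(range(len(cands)), key=cands.__getitem__)
            cur.append(cands[j] + score)
            ptr.append(j)
        best.append(cur)
        back.append(ptr)
    # backtrack the highest-scoring path
    i = max(range(len(best[-1])), key=best[-1].__getitem__)
    tube = [i]
    for ptr in reversed(back):
        i = ptr[i]
        tube.append(i)
    return tube[::-1]
```

Temporal trimming would then be applied to the resulting tube, e.g. by dropping low-scoring frames at either end.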

    Indoor Semantic Segmentation using depth information

    This work addresses multi-class segmentation of indoor scenes with RGB-D inputs. While this area of research has gained much attention recently, most works still rely on hand-crafted features. In contrast, we apply a multiscale convolutional network to learn features directly from the images and the depth information. We obtain state-of-the-art results on the NYU-v2 depth dataset with an accuracy of 64.5%. We illustrate the labeling of indoor scenes in video sequences that could be processed in real-time using appropriate hardware such as an FPGA. Comment: 8 pages, 3 figures.
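The multiscale idea — running the same learned filters over the image at several scales and fusing the per-pixel results — can be sketched with NumPy. The box-filter "features" and the scale set below are placeholders for the learned network, which the abstract does not specify:

```python
import numpy as np

def conv_features(img):
    # stand-in for a learned convolutional feature extractor:
    # a 3x3 box filter applied per channel (illustrative only)
    h, w, c = img.shape
    pad = np.pad(img, ((1, 1), (1, 1), (0, 0)), mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in (0, 1, 2):
        for dx in (0, 1, 2):
            out += pad[dy:dy + h, dx:dx + w]
    return out / 9.0

def multiscale_features(rgbd, scales=(1, 2, 4)):
    """Apply the same extractor to an RGB-D image at several scales and
    upsample each result back to full resolution before concatenating."""
    h, w, _ = rgbd.shape
    feats = []
    for s in scales:
        small = rgbd[::s, ::s]                       # naive downsampling
        f = conv_features(small)
        f = np.repeat(np.repeat(f, s, axis=0), s, axis=1)[:h, :w]
        feats.append(f)
    return np.concatenate(feats, axis=2)             # per-pixel multiscale feature
```

A per-pixel classifier over the concatenated feature vector would then produce the class labels.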

    A Neural System for Automated CCTV Surveillance

    This paper overviews a new system, the “Owens Tracker,” for automated identification of suspicious pedestrian activity in a car-park. Centralized CCTV systems relay multiple video streams to a central point for monitoring by an operator. The operator receives a continuous stream of information, mostly related to normal activity, making it difficult to maintain concentration at a sufficiently high level. While it is difficult to place quantitative bounds on the number of scenes and the time period over which effective monitoring can be performed, Wallace and Diffley [1] give some guidance, based on empirical and anecdotal evidence, suggesting that the number of cameras monitored by an operator be no greater than 16, and that the period of effective monitoring may be as low as 30 minutes before recuperation is required.

    An intelligent video surveillance system should therefore act as a filter, censoring inactive scenes and scenes showing normal activity. By presenting the operator only with unusual activity, his/her attention is effectively focussed, and the ratio of cameras to operators can be increased. The Owens Tracker learns to recognize environment-specific normal behaviour, and refers sequences of unusual behaviour for operator attention. The system was developed using standard low-resolution CCTV cameras operating in the car-parks of Doxford Park Industrial Estate (Sunderland, Tyne and Wear), and targets unusual pedestrian behaviour. The modus operandi of the system is to highlight excursions from a learned model of normal behaviour in the monitored scene. The system tracks objects and extracts their centroids; behaviour is defined as the trajectory traced by an object centroid, and normality as the trajectories typically encountered in the scene.

    The essential stages in the system are: segmentation of objects of interest; disambiguation and tracking of multiple contacts, including the handling of occlusion and noise and the successful tracking of objects that “merge” during motion; and identification of unusual trajectories. These three stages are discussed in more detail in the following sections, and the system performance is then evaluated.
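The final stage — flagging trajectories that are excursions from the learned model of normality — can be sketched as a nearest-neighbour comparison against stored exemplars of normal centroid trajectories. The abstract does not describe the Owens Tracker's actual model, so the arc-length resampling, distance measure, and threshold here are illustrative assumptions:

```python
import math

def resample(traj, n=16):
    """Resample a centroid trajectory [(x, y), ...] to n points by arc length."""
    # cumulative arc length at each vertex
    d = [0.0]
    for (x0, y0), (x1, y1) in zip(traj, traj[1:]):
        d.append(d[-1] + math.hypot(x1 - x0, y1 - y0))
    total = d[-1] or 1.0
    out, j = [], 0
    for i in range(n):
        target = total * i / (n - 1)
        while j < len(d) - 2 and d[j + 1] < target:
            j += 1
        span = d[j + 1] - d[j] or 1.0
        t = (target - d[j]) / span
        x = traj[j][0] + t * (traj[j + 1][0] - traj[j][0])
        y = traj[j][1] + t * (traj[j + 1][1] - traj[j][1])
        out.append((x, y))
    return out

def traj_dist(a, b):
    """Mean point-to-point distance between two resampled trajectories."""
    return sum(math.hypot(p[0] - q[0], p[1] - q[1]) for p, q in zip(a, b)) / len(a)

def is_unusual(traj, normal_trajs, threshold=20.0):
    """Flag a trajectory whose nearest normal exemplar exceeds the threshold."""
    t = resample(traj)
    return min(traj_dist(t, resample(nt)) for nt in normal_trajs) > threshold
```

A deployed system would learn the exemplar set (and a per-scene threshold) from an observation period rather than hand-pick them.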

    Semantic analysis of field sports video using a petri-net of audio-visual concepts

    The most common approach to automatic summarisation and highlight detection in sports video is to train an automatic classifier to detect semantic highlights based on occurrences of low-level features such as action replays, excited commentators or changes in a scoreboard. We propose an alternative approach based on the detection of perception concepts (PCs) and the construction of Petri-Nets, which can be used for both semantic description and event detection within sports videos. Low-level algorithms for the detection of perception concepts using visual, aural and motion characteristics are proposed, and a series of Petri-Nets composed of perception concepts is formally defined to describe video content. We call this a Perception Concept Network-Petri Net (PCN-PN) model. Using PCN-PNs, personalized high-level semantic descriptions of video highlights can be generated and queries on high-level semantics can be answered. A particular strength of this framework is that we can easily build semantic detectors based on PCN-PNs to search within sports videos and locate interesting events. Experimental results based on recorded sports video data across three types of sports games (soccer, basketball and rugby), each from multiple broadcasters, are used to illustrate the potential of this framework.
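The PCN-PN model composes perception-concept detections into Petri nets for event detection. A minimal place/transition interpreter shows the firing semantics such a detector relies on; the "goal" net below is a hypothetical example, not one of the paper's PCN-PNs:

```python
class PetriNet:
    """Minimal place/transition Petri net: a transition is enabled when
    every input place holds at least one token."""
    def __init__(self, marking):
        self.marking = dict(marking)   # place -> token count
        self.transitions = {}          # name -> (input places, output places)

    def add_transition(self, name, inputs, outputs):
        self.transitions[name] = (inputs, outputs)

    def enabled(self, name):
        return all(self.marking.get(p, 0) > 0 for p in self.transitions[name][0])

    def fire(self, name):
        if not self.enabled(name):
            return False
        inputs, outputs = self.transitions[name]
        for p in inputs:               # consume one token per input place
            self.marking[p] -= 1
        for p in outputs:              # deposit one token per output place
            self.marking[p] = self.marking.get(p, 0) + 1
        return True

# Hypothetical soccer-goal detector: the "goal" transition fires only once
# both low-level perception concepts have deposited tokens.
net = PetriNet({"crowd_cheer": 0, "scoreboard_change": 0})
net.add_transition("goal", inputs=["crowd_cheer", "scoreboard_change"],
                   outputs=["goal_event"])
net.marking["crowd_cheer"] += 1        # aural concept detected
assert not net.fire("goal")            # not yet enabled
net.marking["scoreboard_change"] += 1  # visual concept detected
assert net.fire("goal")                # event recognized
```

Each detected perception concept deposits a token into its place, and a completed event corresponds to a transition firing — which is what makes the net usable both as a description of the event and as its detector.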