8,984 research outputs found
Action Recognition in Videos: from Motion Capture Labs to the Web
This paper presents a survey of human action recognition approaches based on
visual data recorded from a single video camera. We propose an organizing
framework which puts in evidence the evolution of the area, with techniques
moving from heavily constrained motion capture scenarios towards more
challenging, realistic, "in the wild" videos. The proposed organization is
based on the representation used as input for the recognition task, emphasizing
the hypotheses assumed and, thus, the constraints imposed on the type of video
that each technique is able to address. Making these hypotheses and constraints
explicit renders the framework particularly useful for selecting a method for a
given application. Another advantage of the proposed organization is that it
allows categorizing the newest approaches seamlessly alongside traditional
ones, while providing an insightful perspective on the evolution of the action
recognition task to date. That perspective is the basis for the discussion at the end of
the paper, where we also present the main open issues in the area.
Comment: Preprint submitted to CVIU, survey paper, 46 pages, 2 figures, 4 tables
Spatio-Temporal Action Detection with Cascade Proposal and Location Anticipation
In this work, we address the problem of spatio-temporal action detection in
temporally untrimmed videos. This is an important and challenging task, as
localizing human actions accurately in both space and time is essential for
analyzing large-scale video data. To tackle this problem, we propose a cascade
proposal and location anticipation (CPLA) model for frame-level action
detection. There are several salient points of our model: (1) a cascade region
proposal network (casRPN) is adopted for action proposal generation and shows
better localization accuracy compared with single region proposal network
(RPN); (2) action spatio-temporal consistencies are exploited via a location
anticipation network (LAN) and thus frame-level action detection is not
conducted independently. Frame-level detections are then linked by solving a
linking score maximization problem, and temporally trimmed into spatio-temporal
action tubes. We demonstrate the effectiveness of our model on the challenging
UCF101 and LIRIS-HARL datasets, achieving state-of-the-art performance on both.
Comment: Accepted at BMVC 2017 (oral)
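The linking step described in this abstract can be illustrated with a small sketch. This is not the paper's implementation: the box format, the score (detection confidence plus IoU between consecutive boxes), and the dynamic-programming formulation are assumptions for the example.

```python
# Hypothetical sketch: link per-frame detections into one action tube by
# maximizing cumulative detection confidence plus spatial overlap (IoU)
# between boxes in consecutive frames. Details are illustrative only.

def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def link_detections(frames):
    """frames: list of per-frame detections [(box, confidence), ...].
    Returns one tube (list of boxes) maximizing the linking score."""
    # score[i][j]: best cumulative score ending at detection j of frame i
    score = [[conf for _, conf in frames[0]]]
    back = []
    for i in range(1, len(frames)):
        cur, ptr = [], []
        for box, conf in frames[i]:
            cands = [score[-1][k] + iou(frames[i - 1][k][0], box)
                     for k in range(len(frames[i - 1]))]
            best = max(range(len(cands)), key=lambda k: cands[k])
            cur.append(cands[best] + conf)
            ptr.append(best)
        score.append(cur)
        back.append(ptr)
    # backtrack the highest-scoring path into a tube
    j = max(range(len(score[-1])), key=lambda k: score[-1][k])
    tube = [frames[-1][j][0]]
    for i in range(len(back) - 1, -1, -1):
        j = back[i][j]
        tube.insert(0, frames[i][j][0])
    return tube
```

Given three frames each containing one high-confidence, spatially consistent detection and one distractor, the returned tube follows the consistent chain.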
Indoor Semantic Segmentation using depth information
This work addresses multi-class segmentation of indoor scenes with RGB-D
inputs. While this area of research has gained much attention recently, most
works still rely on hand-crafted features. In contrast, we apply a multiscale
convolutional network to learn features directly from the images and the depth
information. We obtain state-of-the-art results on the NYU-v2 depth dataset
with an accuracy of 64.5%. We illustrate the labeling of indoor scenes in video
sequences that could be processed in real-time using appropriate hardware such
as an FPGA.
Comment: 8 pages, 3 figures
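The input side of the approach above (combining RGB with depth, processed at multiple scales) can be sketched minimally. This is a hedged illustration only: the stacking of depth as a fourth channel and the strided downsampling are assumptions, and the convolutional network itself is out of scope here.

```python
import numpy as np

# Illustrative sketch: stack depth as a fourth input channel and build a
# multiscale pyramid of the RGB-D input. Real systems would interpolate
# rather than stride, and would feed each scale to a shared conv net.

def rgbd_pyramid(rgb, depth, scales=(1, 2, 4)):
    """rgb: HxWx3 array, depth: HxW array.
    Returns one HxWx4 array per scale, downsampled by striding."""
    rgbd = np.concatenate([rgb, depth[..., None]], axis=-1)
    return [rgbd[::s, ::s] for s in scales]
```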
A Neural System for Automated CCTV Surveillance
This paper overviews a new system, the "Owens
Tracker," for automated identification of suspicious
pedestrian activity in a car-park.
Centralized CCTV systems relay multiple video streams
to a central point for monitoring by an operator. The
operator receives a continuous stream of information,
mostly related to normal activity, making it difficult to
maintain concentration at a sufficiently high level.
While it is difficult to place quantitative boundaries on
the number of scenes and time period over which
effective monitoring can be performed, Wallace and
Diffley [1] give some guidance, based on empirical and
anecdotal evidence, suggesting that the number of
cameras monitored by an operator be no greater than 16,
and that the period of effective monitoring may be as
low as 30 minutes before recuperation is required.
An intelligent video surveillance system should
therefore act as a filter, censoring inactive scenes and
scenes showing normal activity. By presenting the
operator only with unusual activity, his/her attention is
effectively focussed, and the ratio of cameras to
operators can be increased.
The Owens Tracker learns to recognize environment-specific
normal behaviour, and refers sequences of
unusual behaviour for operator attention. The system
was developed using standard low-resolution CCTV
cameras operating in the car-parks of Doxford Park
Industrial Estate (Sunderland, Tyne and Wear), and
targets unusual pedestrian behaviour.
The modus operandi of the system is to highlight
excursions from a learned model of normal behaviour in
the monitored scene. The system tracks objects and
extracts their centroids; behaviour is defined as the
trajectory traced by an object centroid; normality as the
trajectories typically encountered in the scene. The
essential stages in the system are: segmentation of
objects of interest; disambiguation and tracking of
multiple contacts, including the handling of occlusion
and noise, and successful tracking of objects that
"merge" during motion; identification of unusual
trajectories. These three stages are discussed in more
detail in the following sections, and the system
performance is then evaluated.
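The trajectory-based notion of normality described above (behaviour as a centroid trajectory, normality as the trajectories typically seen in the scene) can be sketched as a nearest-neighbour test. This is not the Owens Tracker itself: the fixed-length resampling, the mean point-to-point distance, and the threshold are assumptions for the example.

```python
# Illustrative sketch: flag a centroid trajectory as unusual when it is
# far from every trajectory observed during training. Resampling length
# and the distance threshold are invented for this example.

def resample(traj, n=8):
    """Resample a trajectory [(x, y), ...] to n evenly spaced points
    by linear interpolation along the index axis."""
    pts = []
    for i in range(n):
        t = i * (len(traj) - 1) / (n - 1)
        j, frac = int(t), t - int(t)
        if j + 1 < len(traj):
            x = traj[j][0] + frac * (traj[j + 1][0] - traj[j][0])
            y = traj[j][1] + frac * (traj[j + 1][1] - traj[j][1])
        else:
            x, y = traj[j]
        pts.append((x, y))
    return pts

def mean_dist(a, b):
    """Mean Euclidean distance between corresponding points."""
    return sum(((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
               for p, q in zip(a, b)) / len(a)

def is_unusual(traj, normal_trajs, threshold=20.0):
    """True if traj's nearest learned trajectory exceeds the threshold."""
    t = resample(traj)
    return min(mean_dist(t, resample(m)) for m in normal_trajs) > threshold
```

A trajectory running parallel to a learned path is accepted; one crossing a region never seen in training is referred for operator attention.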
Semantic analysis of field sports video using a petri-net of audio-visual concepts
The most common approach to automatic summarisation and highlight detection in sports video is to train an automatic classifier to detect semantic highlights based on occurrences of low-level features such as action replays, excited commentators or changes in a scoreboard. We propose an alternative approach based on the detection of perception concepts (PCs) and the construction of Petri-Nets which can be used for both semantic description and event detection within sports videos. Low-level algorithms for the detection of perception concepts using visual, aural and motion characteristics are proposed, and a series of Petri-Nets composed of perception concepts is formally defined to describe video content. We call this a Perception Concept Network-Petri Net (PCN-PN) model. Using PCN-PNs, personalized high-level semantic descriptions of video highlights can be facilitated and queries on high-level semantics can be achieved. A particular strength of this framework is that we can easily build semantic detectors based on PCN-PNs to search within sports videos and locate interesting events. Experimental results based on recorded sports
video data across three types of sports games (soccer, basketball and rugby), each from multiple broadcasters, are used to illustrate the potential of this framework.
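The core Petri-net mechanism this abstract relies on (a high-level event firing once all of its constituent perception concepts have been detected) can be shown with a minimal sketch. The concept and event names below are invented for illustration, not taken from the paper.

```python
# Minimal Petri-net sketch: places hold tokens for detected perception
# concepts; a transition (high-level event) fires when every input place
# holds at least one token. Names are hypothetical examples.

class PetriNet:
    def __init__(self):
        self.marking = {}        # place -> token count
        self.transitions = {}    # name -> (input places, output places)

    def add_transition(self, name, inputs, outputs):
        self.transitions[name] = (inputs, outputs)

    def put(self, place, n=1):
        self.marking[place] = self.marking.get(place, 0) + n

    def fire(self, name):
        """Fire the transition if enabled; return True on success."""
        inputs, outputs = self.transitions[name]
        if any(self.marking.get(p, 0) < 1 for p in inputs):
            return False
        for p in inputs:
            self.marking[p] -= 1
        for p in outputs:
            self.put(p)
        return True

net = PetriNet()
# hypothetical "goal" event composed of two low-level concepts
net.add_transition("goal", ["excited_speech", "action_replay"], ["goal_event"])
```

Firing "goal" fails while only the excited-speech concept has been observed, and succeeds once the replay concept is also marked.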
Learning Behavioural Context
The original publication is available at www.springerlink.co
- …