NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding
Research on depth-based human activity analysis has achieved outstanding
performance and demonstrated the effectiveness of 3D representations for action
recognition. However, the existing depth-based and RGB+D-based action recognition
benchmarks have a number of limitations, including a lack of large-scale
training samples, a realistic number of distinct class categories, diversity in
camera views, varied environmental conditions, and variety of human subjects.
In this work, we introduce a large-scale dataset for RGB+D human action
recognition, which is collected from 106 distinct subjects and contains more
than 114 thousand video samples and 8 million frames. This dataset contains 120
different action classes including daily, mutual, and health-related
activities. We evaluate the performance of a series of existing 3D activity
analysis methods on this dataset, and show the advantage of applying deep
learning methods for 3D-based human action recognition. Furthermore, we
investigate a novel one-shot 3D activity recognition problem on our dataset,
and a simple yet effective Action-Part Semantic Relevance-aware (APSR)
framework is proposed for this task, which yields promising results for
recognition of the novel action classes. We believe the introduction of this
large-scale dataset will enable the community to apply, adapt, and develop
various data-hungry learning techniques for depth-based and RGB+D-based human
activity understanding. [The dataset is available at:
http://rose1.ntu.edu.sg/Datasets/actionRecognition.asp]
Comment: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
Energy Consumption Of Visual Sensor Networks: Impact Of Spatio-Temporal Coverage
Wireless visual sensor networks (VSNs) are expected to play a major role in
future IEEE 802.15.4 personal area networks (PAN) under recently-established
collision-free medium access control (MAC) protocols, such as the IEEE
802.15.4e-2012 MAC. In such environments, the VSN energy consumption is
affected by the number of camera sensors deployed (spatial coverage), as well
as the number of captured video frames out of which each node processes and
transmits data (temporal coverage). In this paper, we explore this aspect for
uniformly-formed VSNs, i.e., networks comprising identical wireless visual
sensor nodes connected to a collection node via a balanced cluster-tree
topology, with each node producing independent identically-distributed
bitstream sizes after processing the video frames captured within each network
activation interval. We derive analytic results for the energy-optimal
spatio-temporal coverage parameters of such VSNs under a-priori known bounds
for the number of frames to process per sensor and the number of nodes to
deploy within each tier of the VSN. Our results are parametric to the
probability density function characterizing the bitstream size produced by each
node and the energy consumption rates of the system of interest. Experimental
results reveal that our analytic results are always within 7% of the energy
consumption measurements for a wide range of settings. In addition, results
obtained via a multimedia subsystem show that the optimal spatio-temporal
settings derived by the proposed framework allow for substantial reduction of
energy consumption in comparison to ad-hoc settings. As such, our analytic
modeling is useful for early-stage studies of possible VSN deployments under
collision-free MAC protocols prior to costly and time-consuming experiments in
the field.
Comment: to appear in IEEE Transactions on Circuits and Systems for Video Technology, 201
Large-Scale Mapping of Human Activity using Geo-Tagged Videos
This paper is the first work to perform spatio-temporal mapping of human
activity using the visual content of geo-tagged videos. We utilize a recent
deep-learning-based video analysis framework, termed hidden two-stream
networks, to recognize a range of activities in YouTube videos. This framework
is efficient and can run in real time or faster, which is important for
recognizing events as they occur in streaming video or for reducing latency in
analyzing already captured video. This is, in turn, important for using video
in smart-city applications. We perform a series of experiments to show our
approach is able to accurately map activities both spatially and temporally. We
also demonstrate the advantages of using the visual content over the
tags/titles.
Comment: Accepted at ACM SIGSPATIAL 201