81 research outputs found
Temporal sparse feature auto-combination deep network for video action recognition
In order to deal with action recognition for large‐scale video data, we present a spatio‐temporal auto‐combination deep network, which is able to extract deep features from short video segments by making full use of temporal contextual correlation of corresponding pixels among successive video frames. Based on conventional sparse encoding, we further consider the representative features in adjacent nodes of the hidden layers according to activation states similarities. A sparse auto‐combination strategy is applied to multiple input maps in each convolution stage. An information constraint of the representative features of hidden layer nodes is imposed to handle the adaptive sparse encoding of the topology. As a result, the learned features can represent the spatio‐temporal transition relationships better and the number of hidden nodes can be restricted to a certain range.
We conduct a series of experiments on two public data sets. The experimental results show that our approach is more effective and robust in video action recognition compared with traditional methods
Much Ado About Time: Exhaustive Annotation of Temporal Data
Large-scale annotated datasets allow AI systems to learn from and build upon
the knowledge of the crowd. Many crowdsourcing techniques have been developed
for collecting image annotations. These techniques often implicitly rely on the
fact that a new input image takes a negligible amount of time to perceive. In
contrast, we investigate and determine the most cost-effective way of obtaining
high-quality multi-label annotations for temporal data such as videos. Watching
even a short 30-second video clip requires a significant time investment from a
crowd worker; thus, requesting multiple annotations following a single viewing
is an important cost-saving strategy. But how many questions should we ask per
video? We conclude that the optimal strategy is to ask as many questions as
possible in a HIT (up to 52 binary questions after watching a 30-second video
clip in our experiments). We demonstrate that while workers may not correctly
answer all questions, the cost-benefit analysis nevertheless favors consensus
from multiple such cheap-yet-imperfect iterations over more complex
alternatives. When compared with a one-question-per-video baseline, our method
is able to achieve a 10% improvement in recall 76.7% ours versus 66.7%
baseline) at comparable precision (83.8% ours versus 83.0% baseline) in about
half the annotation time (3.8 minutes ours compared to 7.1 minutes baseline).
We demonstrate the effectiveness of our method by collecting multi-label
annotations of 157 human activities on 1,815 videos.Comment: HCOMP 2016 Camera Read
- …