Evaluating Example-based Pose Estimation: Experiments on the HumanEva Sets
We present an example-based approach to pose recovery, using histograms of oriented gradients as image descriptors. Tests on the HumanEva-I and HumanEva-II data sets provide insight into the strengths and limitations of an example-based approach. We report mean relative 3D errors of approximately 65 mm per joint on HumanEva-I, and 175 mm on HumanEva-II. We discuss our results using single and multiple views. We also perform experiments to assess the algorithm's generalization to unseen subjects, actions and viewpoints. We plan to incorporate the temporal aspect of human motion analysis to reduce orientation ambiguities and increase pose recovery accuracy.
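To make the example-based pipeline concrete, here is a minimal sketch: HOG descriptors index a database of exemplar poses, and the pose of the nearest exemplar is returned for a query image. The database, image size, joint count and all function names are illustrative assumptions, not the authors' code.

```python
import numpy as np
from skimage.feature import hog
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Synthetic exemplar database: 500 grayscale crops with known 3D joint poses.
images = rng.random((500, 128, 64))        # (N, H, W), hypothetical crops
poses = rng.random((500, 15, 3))           # (N, joints, xyz), 15 joints assumed

def describe(image):
    """HOG descriptor of a single grayscale image."""
    return hog(image, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

# Index every exemplar by its HOG descriptor.
index = NearestNeighbors(n_neighbors=1).fit(
    np.stack([describe(im) for im in images]))

def recover_pose(image):
    """Return the pose of the exemplar with the closest HOG descriptor."""
    _, idx = index.kneighbors(describe(image).reshape(1, -1))
    return poses[idx[0, 0]]

print(recover_pose(rng.random((128, 64))).shape)   # (15, 3)
```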
Online backchannel synthesis evaluation with the switching Wizard of Oz
In this paper, we evaluate a backchannel synthesis algorithm in an online conversation between a human speaker and a virtual listener. We adopt the Switching Wizard of Oz (SWOZ) approach to assess behavior synthesis algorithms online. A human speaker watches a virtual listener that is controlled either by a human listener or by an algorithm, with the source switching at random intervals. Speakers indicate when they feel they are no longer talking to a human listener. Analysis of these responses reveals patterns of inappropriate behavior in terms of the quantity and timing of backchannels.
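As a rough illustration of the SWOZ protocol, the sketch below simulates a session timeline in which the listener source flips between human and algorithm at random intervals, and speaker judgements ("no longer a human") are logged against the source active at that moment. The session length, switch intervals and judgement times are illustrative assumptions, not values from the study.

```python
import random

random.seed(1)
SESSION_LEN = 300.0                          # session length in seconds (assumed)

# Build the switching schedule as (start_time, source) segments.
schedule, t, source = [], 0.0, "human"
while t < SESSION_LEN:
    schedule.append((t, source))
    t += random.uniform(15.0, 45.0)          # random switch interval (assumed)
    source = "algorithm" if source == "human" else "human"

def source_at(time, schedule):
    """Return the listener source active at a given time."""
    active = schedule[0][1]
    for start, src in schedule:
        if start <= time:
            active = src
    return active

# Simulated speaker judgements: moments a "not a human" response was given.
presses = sorted(random.uniform(0, SESSION_LEN) for _ in range(8))
for p in presses:
    print(f"{p:6.1f}s -> listener source was {source_at(p, schedule)}")
```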
Automatic behavior analysis in tag games: from traditional spaces to interactive playgrounds
Tag is a popular children's playground game. It revolves around taggers who chase and then tag runners, upon which their roles switch. There are many variations of the game that aim to keep children engaged by presenting them with challenges and different types of gameplay. We argue that the introduction of sensing and floor-projection technology in the playground can help provide both variation and challenge. To this end, we need to understand players' behavior in the playground and steer the interactions using projections accordingly. In this paper, we first analyze the behavior of taggers and runners in a traditional tag setting, focusing on behavioral cues that differ between the two roles. Based on these cues, we present a probabilistic role recognition model. We then move to an interactive setting and evaluate the model on tag sessions in an interactive tag playground. Our model achieves 77.96% accuracy, which demonstrates the feasibility of our approach. We identify several avenues for improvement; eventually, these should lead to a more thorough understanding of what happens in the playground, not only regarding player roles but also regarding when play breaks down, for example when players are bored or cheat.
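The sketch below illustrates one plausible form of such a probabilistic role recognition model: a Gaussian naive Bayes classifier over simple per-window movement cues. The two cues used here (mean speed and distance to the nearest other player) and the synthetic data are illustrative stand-ins for the paper's actual cues and model.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(2)

# Synthetic training windows: taggers move faster and close in on others.
n = 200
taggers = np.column_stack([rng.normal(1.8, 0.3, n),    # mean speed (m/s)
                           rng.normal(1.0, 0.4, n)])   # nearest-player dist (m)
runners = np.column_stack([rng.normal(1.2, 0.3, n),
                           rng.normal(2.5, 0.6, n)])
X = np.vstack([taggers, runners])
y = np.array([1] * n + [0] * n)              # 1 = tagger, 0 = runner

model = GaussianNB().fit(X, y)

# Posterior role probabilities for a new observation window.
window = np.array([[1.7, 1.2]])              # fast and close to another player
print(model.predict_proba(window))           # [[P(runner), P(tagger)]]
```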
Learn to cycle: Time-consistent feature discovery for action recognition
Generalizing over temporal variations is a prerequisite for effective action recognition in videos. Despite significant advances in deep neural networks, it remains a challenge to focus on short-term discriminative motions in relation to the overall performance of an action. We address this challenge by allowing some flexibility in discovering relevant spatio-temporal features. We introduce Squeeze and Recursion Temporal Gates (SRTG), an approach that favors inputs with similar activations, allowing for potential temporal variations. We implement this idea with a novel CNN block that uses an LSTM to encapsulate feature dynamics, in conjunction with a temporal gate that is responsible for evaluating the consistency of the discovered dynamics and the modeled features. We show consistent improvement when using SRTG blocks, with only a minimal increase in the number of GFLOPs. On Kinetics-700, we perform on par with current state-of-the-art models, and outperform these on HACS, Moments in Time, UCF-101 and HMDB-51.
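A hedged PyTorch sketch of a squeeze-and-recursion style block follows: spatial squeeze, an LSTM over time to model channel dynamics, and a gate that applies the recalibration only when the modeled dynamics agree (by cosine similarity) with the original activations. The threshold, gating rule and wiring are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRTGBlock(nn.Module):
    """Squeeze-and-recursion with a temporal consistency gate (sketch)."""

    def __init__(self, channels, threshold=0.5):
        super().__init__()
        self.lstm = nn.LSTM(channels, channels, batch_first=True)
        self.threshold = threshold                 # assumed gating threshold

    def forward(self, x):                          # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        squeezed = x.mean(dim=(3, 4)).transpose(1, 2)      # squeeze: (B, T, C)
        dynamics, _ = self.lstm(squeezed)                  # modeled dynamics
        # Gate: mean cosine similarity between modeled and observed dynamics.
        sim = F.cosine_similarity(dynamics, squeezed, dim=-1).mean(dim=1)
        gate = (sim > self.threshold).float().view(b, 1, 1, 1, 1)
        weights = torch.sigmoid(dynamics).permute(0, 2, 1).reshape(b, c, t, 1, 1)
        # Recalibrate only when dynamics and features are consistent.
        return gate * (x * weights) + (1 - gate) * x

x = torch.randn(2, 64, 8, 14, 14)
print(SRTGBlock(64)(x).shape)                      # torch.Size([2, 64, 8, 14, 14])
```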
Example-based pose estimation in monocular images using compact Fourier descriptors
Automatically estimating human poses from visual input is useful but challenging due to variations in image space and the high dimensionality of the pose space. In this paper, we assume that a human silhouette can be extracted from monocular visual input. We compare the recovery performance of Fourier descriptors with between 8 and 128 coefficients, and two different sampling methods. An example-based approach is taken to recover upper-body poses from the descriptors. We test the robustness of our approach by investigating how shape deformations due to changes in body dimensions, viewpoint and noise affect the recovery of the pose. The average error per joint is approximately 16-17° for equidistant sampling and slightly higher for extreme-point sampling. Increasing the number of descriptors does not influence performance. Noise and small changes in viewpoint have only a very small effect on recovery performance, but we obtain higher error scores when recovering poses using silhouettes from a person with different body dimensions.
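For concreteness, the sketch below computes compact Fourier descriptors for a closed silhouette contour: sample the boundary equidistantly by arc length, treat the points as complex numbers, take the FFT, and keep only the lowest-frequency coefficients. The normalization here (dropping the DC term for translation invariance, scaling by |F[1]| for scale invariance) is a standard simplification, not necessarily the paper's exact scheme.

```python
import numpy as np

def fourier_descriptors(contour, n_coeffs=16, n_samples=128):
    """Compact Fourier descriptors of an ordered (N, 2) boundary contour."""
    z = contour[:, 0] + 1j * contour[:, 1]         # boundary as complex signal
    # Equidistant resampling by arc length (closed contour).
    d = np.abs(np.diff(z, append=z[:1]))           # segment lengths
    arc = np.concatenate([[0.0], np.cumsum(d)])[:-1]
    u = np.linspace(0, arc[-1], n_samples, endpoint=False)
    samples = np.interp(u, arc, z.real) + 1j * np.interp(u, arc, z.imag)
    F = np.fft.fft(samples)
    F = F / np.abs(F[1])                           # scale invariance (assumed)
    return np.abs(F[1:n_coeffs + 1])               # drop DC -> translation invariant

# Toy example: a unit circle; energy concentrates in the first coefficient.
t = np.linspace(0, 2 * np.pi, 256, endpoint=False)
circle = np.column_stack([np.cos(t), np.sin(t)])
print(fourier_descriptors(circle, n_coeffs=8).round(3))
```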
Multi-Temporal Convolutions for Human Action Recognition in Videos
Effective extraction of temporal patterns is crucial for the recognition of temporally varying actions in video. We argue that the fixed-size spatio-temporal convolution kernels used in convolutional neural networks (CNNs) can be improved to extract informative motions that are executed at different time scales. To address this challenge, we present a novel spatio-temporal convolution block that is capable of extracting spatio-temporal patterns at multiple temporal resolutions. Our proposed multi-temporal convolution (MTConv) blocks utilize two branches that focus on brief and prolonged spatio-temporal patterns, respectively. The extracted time-varying features are aligned in a third branch, with respect to global motion patterns, through recurrent cells. The proposed blocks are lightweight and can be integrated into any 3D-CNN architecture, introducing a substantial reduction in computational costs. Extensive experiments on the Kinetics, Moments in Time and HACS action recognition benchmark datasets demonstrate competitive performance of MTConvs compared to the state-of-the-art, with a significantly lower computational footprint.
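The following PyTorch sketch captures the multi-branch idea: two 3D convolution branches with short and long temporal kernel extents for brief and prolonged motions, and a lightweight recurrent branch over globally pooled features that aligns them with global motion. The kernel sizes, GRU choice and sigmoid fusion are illustrative assumptions, not the paper's exact block.

```python
import torch
import torch.nn as nn

class MTConvBlock(nn.Module):
    """Multi-temporal convolution block (sketch)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        half = out_ch // 2
        self.brief = nn.Conv3d(in_ch, half, kernel_size=(3, 3, 3),
                               padding=(1, 1, 1))        # short temporal extent
        self.prolonged = nn.Conv3d(in_ch, out_ch - half, kernel_size=(7, 3, 3),
                                   padding=(3, 1, 1))    # long temporal extent
        self.gru = nn.GRU(out_ch, out_ch, batch_first=True)

    def forward(self, x):                                # x: (B, C, T, H, W)
        y = torch.cat([self.brief(x), self.prolonged(x)], dim=1)
        b, c, t, h, w = y.shape
        pooled = y.mean(dim=(3, 4)).transpose(1, 2)      # (B, T, C') global motion
        g, _ = self.gru(pooled)
        weights = torch.sigmoid(g).permute(0, 2, 1).reshape(b, c, t, 1, 1)
        return y * weights                               # align branches globally

x = torch.randn(2, 32, 16, 28, 28)
print(MTConvBlock(32, 64)(x).shape)          # torch.Size([2, 64, 16, 28, 28])
```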
AdaPool: Exponential Adaptive Pooling for Information-Retaining Downsampling
Pooling layers are essential building blocks of convolutional neural networks (CNNs) that reduce computational overhead and increase the receptive fields of subsequent convolutional operations. They aim to produce downsampled volumes that closely resemble the input volume while, ideally, also being computationally and memory efficient. Meeting both requirements jointly is a challenge. To this end, we propose an adaptive and exponentially weighted pooling method named adaPool. Our proposed method uses a parameterized fusion of two sets of pooling kernels that are based on the exponent of the Dice-Sørensen coefficient and the exponential maximum, respectively. A key property of adaPool is its bidirectional nature: in contrast to common pooling methods, the weights can be used to upsample a downsampled activation map, a method we term adaUnPool. We demonstrate how adaPool improves the preservation of detail across a range of tasks including image and video classification and object detection. We then evaluate adaUnPool on image and video frame super-resolution and frame interpolation tasks. For benchmarking, we introduce Inter4K, a novel high-quality, high-frame-rate video dataset. Our combined experiments demonstrate that adaPool systematically achieves better results across tasks and backbone architectures, while introducing only minor additional computational and memory overhead.
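As a simplified, single-scale reading of the idea, the sketch below pools non-overlapping 2x2 regions with two exponential weightings, a softmax of the activations (exponential maximum) and the exponent of each activation's Dice-Sørensen similarity to the region mean, fused by a learnable parameter. This is an illustrative re-implementation, not the reference kernels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaPool2x2(nn.Module):
    """Exponentially weighted adaptive pooling over 2x2 regions (sketch)."""

    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(0.0))      # learnable fusion logit

    def forward(self, x):                                # x: (B, C, H, W)
        b, c, h, w = x.shape
        # Gather each non-overlapping 2x2 region as a length-4 vector.
        r = F.unfold(x.reshape(b * c, 1, h, w), kernel_size=2, stride=2)
        # Exponential maximum: softmax-weighted average of the region.
        em = (r.softmax(dim=1) * r).sum(dim=1)
        # Exponential Dice-Sorensen weights against the region mean.
        mu = r.mean(dim=1, keepdim=True)
        dsc = 2 * (r * mu) / (r * r + mu * mu + 1e-6)    # pairwise similarity
        w_dsc = torch.exp(dsc)
        edscw = (w_dsc * r).sum(dim=1) / w_dsc.sum(dim=1)
        beta = torch.sigmoid(self.beta)                  # keep fusion in [0, 1]
        return (beta * edscw + (1 - beta) * em).reshape(b, c, h // 2, w // 2)

x = torch.randn(2, 3, 8, 8)
print(AdaPool2x2()(x).shape)                             # torch.Size([2, 3, 4, 4])
```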