145 research outputs found

    Evaluating Example-based Pose Estimation: Experiments on the HumanEva Sets

    We present an example-based approach to pose recovery, using histograms of oriented gradients as image descriptors. Tests on the HumanEva-I and HumanEva-II data sets give us insight into the strengths and limitations of an example-based approach. We report mean relative 3D errors of approximately 65 mm per joint on HumanEva-I, and 175 mm on HumanEva-II. We discuss our results using single and multiple views. We also perform experiments to assess the algorithm’s generalization to unseen subjects, actions and viewpoints. We plan to incorporate the temporal aspect of human motion analysis to reduce orientation ambiguities and increase the pose recovery accuracy.
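As a rough illustration of what "example-based" recovery can mean, the sketch below looks up the stored examples whose descriptors are closest to a query descriptor and averages their poses. The abstract does not state the distance measure, the number of neighbours, or how poses are combined, so the Euclidean distance, `k=3`, the unweighted average, and the function name `recover_pose` are all assumptions made here for illustration.

```python
import math


def recover_pose(query_descriptor, examples, k=3):
    """Estimate a pose from the k stored examples nearest to the query.

    examples: list of (descriptor, pose) pairs, where a descriptor is a
    sequence of floats (e.g. a flattened HOG vector) and a pose is a list
    of (x, y, z) joint positions. All names here are hypothetical.
    """
    if not examples:
        raise ValueError("at least one example is required")
    # Rank stored examples by Euclidean distance in descriptor space
    # (an assumed choice; the abstract does not name the metric).
    ranked = sorted(examples, key=lambda ex: math.dist(query_descriptor, ex[0]))
    nearest = ranked[:k]
    n_joints = len(nearest[0][1])
    # Unweighted average of the neighbours' joint positions.
    return [
        tuple(
            sum(pose[j][axis] for _, pose in nearest) / len(nearest)
            for axis in range(3)
        )
        for j in range(n_joints)
    ]
```

With two of three stored examples close to the query, `recover_pose(query, examples, k=2)` returns the mean of those two poses and ignores the distant one.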

    Preface: Facial and Bodily Expressions for Control and Adaptation of Games

    Online backchannel synthesis evaluation with the switching Wizard of Oz

    In this paper, we evaluate a backchannel synthesis algorithm in an online conversation between a human speaker and a virtual listener. We adopt the Switching Wizard of Oz (SWOZ) approach to assess behavior synthesis algorithms online. A human speaker watches a virtual listener that is controlled either by a human listener or by an algorithm. The source switches at random intervals. Speakers indicate when they feel they are no longer talking to a human listener. Analysis of these responses reveals patterns of inappropriate behavior in terms of quantity and timing of backchannels.

    Automatic behavior analysis in tag games: from traditional spaces to interactive playgrounds

    Tag is a popular children’s playground game. It revolves around taggers that chase and then tag runners, upon which their roles switch. There are many variations of the game that aim to keep children engaged by presenting them with challenges and different types of gameplay. We argue that the introduction of sensing and floor projection technology in the playground can aid in providing both variation and challenge. To this end, we need to understand players’ behavior in the playground and steer the interactions using projections accordingly. In this paper, we first analyze the behavior of taggers and runners in a traditional tag setting. We focus on behavioral cues that differ between the two roles. Based on these, we present a probabilistic role recognition model. We then move to an interactive setting and evaluate the model on tag sessions in an interactive tag playground. Our model achieves 77.96% accuracy, which demonstrates the feasibility of our approach. We identify several avenues for improvement. Eventually, these should lead to a more thorough understanding of what happens in the playground, not only regarding player roles but also when the play breaks down, for example when players are bored or cheat.

    Learn to cycle: Time-consistent feature discovery for action recognition

    Generalizing over temporal variations is a prerequisite for effective action recognition in videos. Despite significant advances in deep neural networks, it remains a challenge to focus on short-term discriminative motions in relation to the overall performance of an action. We address this challenge by allowing some flexibility in discovering relevant spatio-temporal features. We introduce Squeeze and Recursion Temporal Gates (SRTG), an approach that favors inputs with similar activations with potential temporal variations. We implement this idea with a novel CNN block that uses an LSTM to encapsulate feature dynamics, in conjunction with a temporal gate that is responsible for evaluating the consistency of the discovered dynamics and the modeled features. We show consistent improvement when using SRTG blocks, with only a minimal increase in the number of GFLOPs. On Kinetics-700, we perform on par with current state-of-the-art models, and outperform these on HACS, Moments in Time, UCF-101 and HMDB-51.

    Example-based pose estimation in monocular images using compact Fourier descriptors

    Automatically estimating human poses from visual input is useful but challenging due to variations in image space and the high dimensionality of the pose space. In this paper, we assume that a human silhouette can be extracted from monocular visual input. We compare the recovery performance of Fourier descriptors with a number of coefficients between 8 and 128, and two different sampling methods. An example-based approach is taken to recover upper body poses from the descriptors. We test the robustness of our approach by investigating how shape deformations due to changes in body dimensions, viewpoint and noise affect the recovery of the pose. The average error per joint is approximately 16-17° for equidistant sampling and slightly higher for extreme point sampling. Increasing the number of descriptors does not have any influence on the performance. Noise and small changes in viewpoint have only a very small effect on the recovery performance, but we obtain higher error scores when recovering poses using silhouettes from a person with different body dimensions.
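For readers unfamiliar with Fourier descriptors, the following minimal sketch computes them for a closed contour that has already been sampled at equidistant points. It treats each boundary point as a complex number, takes a discrete Fourier transform, and keeps the magnitudes of the first few positive-frequency coefficients. Dropping the zero-frequency term, keeping only magnitudes, and dividing by the first coefficient are common normalization choices assumed here; the abstract does not say which normalization the authors used, and the function name is hypothetical.

```python
import cmath
import math


def fourier_descriptors(contour, n_coefficients=8):
    """Return normalized Fourier descriptor magnitudes for a closed contour.

    contour: list of (x, y) boundary points, assumed to be equally spaced
    along the silhouette outline and traversed counter-clockwise.
    """
    n = len(contour)
    if not 1 <= n_coefficients < n:
        raise ValueError("n_coefficients must be in the range 1..len(contour)-1")
    points = [complex(x, y) for x, y in contour]
    magnitudes = []
    # Skip frequency 0, which only encodes the contour's position.
    for k in range(1, n_coefficients + 1):
        coeff = sum(
            z * cmath.exp(-2j * math.pi * k * t / n) for t, z in enumerate(points)
        ) / n
        magnitudes.append(abs(coeff))
    scale = magnitudes[0]
    if scale < 1e-12:
        raise ValueError("first coefficient is zero; check contour orientation")
    # Dividing by the first magnitude removes the effect of overall size.
    return [m / scale for m in magnitudes]
```

Under these choices, translating or uniformly scaling the silhouette leaves the descriptor unchanged, which is one reason such descriptors are attractive for matching against stored examples.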

    Multi-Temporal Convolutions for Human Action Recognition in Videos

    Effective extraction of temporal patterns is crucial for the recognition of temporally varying actions in video. We argue that the fixed-sized spatio-temporal convolution kernels used in convolutional neural networks (CNNs) can be improved to extract informative motions that are executed at different time scales. To address this challenge, we present a novel spatio-temporal convolution block that is capable of extracting spatio-temporal patterns at multiple temporal resolutions. Our proposed multi-temporal convolution (MTConv) blocks utilize two branches that focus on brief and prolonged spatio-temporal patterns, respectively. The extracted time-varying features are aligned in a third branch, with respect to global motion patterns through recurrent cells. The proposed blocks are lightweight and can be integrated into any 3D-CNN architecture. This introduces a substantial reduction in computational costs. Extensive experiments on Kinetics, Moments in Time and HACS action recognition benchmark datasets demonstrate competitive performance of MTConvs compared to the state-of-the-art with a significantly lower computational footprint.

    AdaPool: Exponential Adaptive Pooling for Information-Retaining Downsampling

    Pooling layers are essential building blocks of Convolutional Neural Networks (CNNs) that reduce computational overhead and increase the receptive fields of subsequent convolutional operations. They aim to produce downsampled volumes that closely resemble the input volume while, ideally, also being computationally and memory efficient. It is a challenge to meet both requirements jointly. To this end, we propose an adaptive and exponentially weighted pooling method named adaPool. Our proposed method uses a parameterized fusion of two sets of pooling kernels that are based on the exponent of the Dice-Sorensen coefficient and the exponential maximum, respectively. A key property of adaPool is its bidirectional nature. In contrast to common pooling methods, weights can be used to upsample a downsampled activation map. We term this method adaUnPool. We demonstrate how adaPool improves the preservation of detail through a range of tasks including image and video classification and object detection. We then evaluate adaUnPool on image and video frame super-resolution and frame interpolation tasks. For benchmarking, we introduce Inter4K, a novel high-quality, high frame-rate video dataset. Our combined experiments demonstrate that adaPool systematically achieves better results across tasks and backbone architectures, while introducing a minor additional computational and memory overhead.
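The fusion described in this abstract can be sketched for a single pooling region as follows. This is one possible reading, not the authors' implementation: it works on a flat list of scalar activations, uses softmax weights over the activations for the exponential maximum, uses softmax weights over each activation's Dice-Sorensen coefficient with the region mean for the second kernel, and blends the two with a fixed scalar `beta`. The scalar form of the coefficient, the fixed `beta`, and all function names are assumptions made for illustration.

```python
import math


def softmax_weights(scores):
    """Exponential weights that sum to 1 (shifted by the max for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def exp_max_pool(region):
    """Exponential maximum: softmax-weighted sum of the activations."""
    weights = softmax_weights(region)
    return sum(w * a for w, a in zip(weights, region))


def exp_dice_pool(region):
    """Weights from the exponent of a scalar Dice-Sorensen coefficient
    between each activation and the region mean (assumed formulation)."""
    mean = sum(region) / len(region)

    def dice(a):
        denom = a * a + mean * mean
        return 2 * a * mean / denom if denom else 1.0

    weights = softmax_weights([dice(a) for a in region])
    return sum(w * a for w, a in zip(weights, region))


def ada_pool(region, beta=0.5):
    """Blend the two pooled values; beta is a fixed scalar in this sketch."""
    return beta * exp_dice_pool(region) + (1 - beta) * exp_max_pool(region)
```

Because both kernels are convex combinations of the inputs, the pooled value always lies between the region's minimum and maximum; `beta=0` reduces to the exponential maximum alone and `beta=1` to the Dice-based kernel alone.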