Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization
We propose a weakly-supervised structured learning approach for recognition and spatio-temporal localization of actions in video. As part of the proposed approach, we develop a generalization of the Max-Path search algorithm which allows us to efficiently search over a structured space of multiple spatio-temporal paths while also incorporating context information into the model. Instead of using spatial annotations in the form of bounding boxes to guide the latent model during training, we utilize human gaze data in the form of a weak supervisory signal. This is achieved by incorporating eye gaze, along with the classification, into the structured loss within the latent SVM learning framework. Experiments on a challenging benchmark dataset, UCF-Sports, show that our model is more accurate, in terms of classification, and achieves state-of-the-art results in localization. In addition, our model can produce top-down saliency maps conditioned on the classification label and localized latent paths.
Study of Compression Statistics and Prediction of Rate-Distortion Curves for Video Texture
Encoding textural content remains a challenge for current standardised video
codecs. It is therefore beneficial to understand video textures in terms of
both their spatio-temporal characteristics and their encoding statistics in
order to optimize encoding performance. In this paper, we analyse the
spatio-temporal features and statistics of video textures, explore the
rate-quality performance of different texture types and investigate models to
mathematically describe them. For all considered theoretical models, we employ
machine-learning regression to predict the rate-quality curves based solely on
selected spatio-temporal features extracted from uncompressed content. All
experiments were performed on homogeneous video textures to ensure validity of
the observations. The results of the regression indicate that using an
exponential model we can more accurately predict the expected rate-quality
curve (with a mean Bjøntegaard Delta rate of 0.46% over the considered
dataset) while maintaining a low relative complexity. This approach is expected
to be adopted by in-loop processes for faster encoding decisions such as
rate-distortion optimisation, adaptive quantization, partitioning, etc.
Velocity-Based LOD Reduction in Virtual Reality: A Psychometric Approach
Virtual Reality headsets enable users to explore the environment by
performing self-induced movements. The retinal velocity produced by such motion
reduces the visual system's ability to resolve fine detail. We measured the
impact of self-induced head rotations on the ability to detect quality changes
of a realistic 3D model in an immersive virtual reality environment. We varied
the Level-of-Detail (LOD) as a function of rotational head velocity with
different degrees of severity. Using a psychophysical method, we asked 17
participants to identify which of the two presented intervals contained the
higher quality model under two different maximum velocity conditions. After
fitting psychometric functions to data relating the percentage of correct
responses to the aggressiveness of LOD manipulations, we identified the
threshold severity for which participants could reliably (75%) detect the
lower LOD model. Participants accepted an approximately four-fold LOD reduction
even in the low maximum velocity condition without a significant impact on
perceived quality, which suggests that there is considerable potential for
optimisation when users are moving (increased range of perceptual uncertainty).
Moreover, LOD could be degraded significantly more in the maximum head velocity
condition, suggesting these effects are indeed speed dependent.
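As a rough illustration of the threshold step described in this abstract, the sketch below assumes a logistic psychometric function for a two-interval forced-choice task. The abstract does not state which functional form was fitted, so the logistic shape, the function names `p_correct` and `threshold`, and the parameters `alpha` and `beta` are all hypothetical. With chance performance at 50%, the 75% point falls at the midpoint `alpha` of the curve.

```python
import math


def p_correct(severity, alpha, beta):
    """Assumed 2AFC logistic psychometric function.

    Rises from 0.5 (chance) towards 1.0 as LOD severity grows.
    alpha is the midpoint and beta the slope scale (both hypothetical).
    """
    return 0.5 + 0.5 / (1.0 + math.exp(-(severity - alpha) / beta))


def threshold(alpha, beta, target=0.75):
    """Severity at which p_correct reaches target (0.5 < target < 1)."""
    q = (target - 0.5) / 0.5  # rescale target to the 0..1 range of the logistic
    return alpha + beta * math.log(q / (1.0 - q))


if __name__ == "__main__":
    # Illustrative parameter values only, not taken from the study.
    print(threshold(alpha=4.0, beta=1.5))
```

In practice `alpha` and `beta` would be estimated from the observed proportions of correct responses, for example by maximum likelihood; that fitting step is omitted here.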
Content-Adaptive Variable Framerate Encoding Scheme for Green Live Streaming
Adaptive live video streaming applications use a fixed predefined
configuration for the bitrate ladder with constant framerate and encoding
presets in a session. However, selecting optimized framerates and presets for
every bitrate ladder representation can enhance perceptual quality, improve
computational resource allocation, and thus, the streaming energy efficiency.
In particular, low framerates for low-bitrate representations reduce
compression artifacts and decrease encoding energy consumption. In addition, an
optimized preset may lead to improved compression efficiency. In this light,
this paper proposes a Content-adaptive Variable Framerate (CVFR) encoding
scheme, which offers two modes of operation: ecological (ECO) and high-quality
(HQ). CVFR-ECO optimizes for the highest encoding energy savings by predicting
the optimized framerate for each representation in the bitrate ladder. CVFR-HQ
takes it further by predicting each representation's optimized
framerate-encoding preset pair using low-complexity discrete cosine transform
energy-based spatial and temporal features for compression efficiency and
sustainable storage. We demonstrate the advantage of CVFR using the x264
open-source video encoder. The results show that CVFR-ECO yields an average
PSNR and VMAF increase of 0.02 dB and 2.50 points, respectively, for the same
bitrate, compared to fastest-preset, highest-framerate encoding. CVFR-ECO
also yields an average encoding and storage energy consumption reduction of
34.54% and 76.24%, considering a just noticeable difference (JND) of six VMAF
points. In comparison, CVFR-HQ yields an average increase in PSNR and VMAF of
2.43 dB and 10.14 points, respectively, for the same bitrate. Finally, CVFR-HQ
resulted in an average reduction in storage energy consumption of 83.18%,
considering a JND of six VMAF points.
Dance-the-music : an educational platform for the modeling, recognition and audiovisual monitoring of dance steps using spatiotemporal motion templates
In this article, a computational platform is presented, entitled “Dance-the-Music”, that can be used in a dance educational context to explore and learn the basics of dance steps. By introducing a method based on spatiotemporal motion templates, the platform facilitates the training of basic step models from sequentially repeated dance figures performed by a dance teacher. Movements are captured with an optical motion capture system. The teacher’s models can be visualized from a first-person perspective to instruct students how to perform the specific dance steps in the correct manner. Moreover, recognition algorithms based on a template-matching method can determine the quality of a student’s performance in real time by means of multimodal monitoring techniques. The results of an evaluation study suggest that the Dance-the-Music platform is effective in helping dance students to master the basics of dance figures.