Wavelets and their use
This review paper is intended as a useful guide for those who want to apply discrete wavelets in their practice. The notion of wavelets and their use in practical computing and various applications are briefly described; rigorous proofs of mathematical statements are omitted, and the reader is referred to the corresponding literature. Multiresolution analysis and the fast wavelet transform have become standard procedures for dealing with discrete wavelets. The proper choice of a wavelet and the use of nonstandard matrix multiplication are often crucial for achieving a goal. Analyzing various functions with the help of wavelets makes it possible to reveal fractal structures, singularities, etc. The wavelet transform of operator expressions helps solve some equations. In practical applications one often deals with discretized functions, so the stability of the wavelet transform and of the corresponding numerical algorithms becomes important. After discussing these topics we turn to practical applications of the wavelet machinery; they are so numerous that we have to limit ourselves to a few examples only. The authors would be grateful for any comments that improve this review paper and move us closer to the goal proclaimed in the first phrase of the abstract.
Comment: 63 pages with 22 ps-figures, to be published in Physics-Uspekhi
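As a concrete illustration of the fast wavelet transform and multiresolution analysis the review surveys, here is a minimal sketch in Python using the PyWavelets library; the choice of the Daubechies-4 wavelet, the decomposition depth, and the test signal are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of a multilevel fast wavelet transform with PyWavelets.
# The Daubechies-4 wavelet and the synthetic signal are illustrative
# choices, not taken from the review itself.
import numpy as np
import pywt

# A signal with a localized singularity, the kind of feature wavelets expose.
t = np.linspace(0, 1, 1024)
signal = np.sin(2 * np.pi * 8 * t)
signal[512] += 5.0  # isolated spike

# Multiresolution decomposition: approximation + detail coefficients per level.
coeffs = pywt.wavedec(signal, 'db4', level=4)

# Large detail coefficients flag the location of the singularity.
for level, detail in enumerate(coeffs[1:], start=1):
    print(f"detail band {level}: max |coeff| = {np.abs(detail).max():.3f}")

# Perfect reconstruction from the coefficients (up to numerical error).
reconstructed = pywt.waverec(coeffs, 'db4')
assert np.allclose(signal, reconstructed)
```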
How Much Temporal Long-Term Context is Needed for Action Segmentation?
Modeling long-term context in videos is crucial for many fine-grained tasks
including temporal action segmentation. An interesting question that is still
open is how much long-term temporal context is needed for optimal performance.
While transformers can model the long-term context of a video, this becomes
computationally prohibitive for long videos. Recent works on temporal action
segmentation thus combine temporal convolutional networks with self-attentions
that are computed only for a local temporal window. While these approaches show
good results, their performance is limited by their inability to capture the
full context of a video. In this work, we try to answer how much long-term
temporal context is required for temporal action segmentation by introducing a
transformer-based model that leverages sparse attention to capture the full
context of a video. We compare our model with the current state of the art on
three datasets for temporal action segmentation, namely 50Salads, Breakfast,
and Assembly101. Our experiments show that modeling the full context of a video
is necessary to obtain the best performance for temporal action segmentation.
Comment: ICCV 2023
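As a rough sketch of the kind of sparse attention described above, the following Python snippet builds a local-plus-strided attention mask so every frame can reach the full video context through a few global positions; the window and stride sizes, the single-head formulation, and the dense score matrix are simplifying assumptions, not the paper's actual model.

```python
# Sketch of strided sparse self-attention over video frame features: each
# frame attends to a local window plus a strided set of global positions,
# approximating full-context attention at reduced cost. Window/stride sizes
# and feature dimensions are illustrative assumptions.
import torch

def sparse_attention_mask(seq_len: int, window: int = 16, stride: int = 64) -> torch.Tensor:
    """Boolean mask: True where attention is allowed."""
    idx = torch.arange(seq_len)
    local = (idx[:, None] - idx[None, :]).abs() <= window   # local window
    strided = (idx[None, :] % stride) == 0                  # global strided keys
    return local | strided

def sparse_self_attention(x: torch.Tensor, window: int = 16, stride: int = 64) -> torch.Tensor:
    """x: (seq_len, dim) per-frame features; single head for clarity.
    The full score matrix is materialized here only for readability; a real
    implementation would compute the sparse entries directly."""
    seq_len, dim = x.shape
    scores = (x @ x.T) / dim ** 0.5
    mask = sparse_attention_mask(seq_len, window, stride)
    scores = scores.masked_fill(~mask, float('-inf'))
    return torch.softmax(scores, dim=-1) @ x

frames = torch.randn(2048, 64)   # e.g., per-frame features of a long video
out = sparse_self_attention(frames)
print(out.shape)                 # torch.Size([2048, 64])
```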
Semantic guided multi-future human motion prediction
The goal of this thesis is to investigate the potential of a pre-existing neural network model, originally designed for multi-future prediction of human agent motion in a static camera scene, adapted to forecast rotational trajectories of human joints. Given a trajectory with spatial information (in the form of relative joint angles) of a simplified human skeleton, the aim is to improve the prediction accuracy of the model by incorporating semantic information, i.e., the high-level meaning of the action the human agent is performing. The study made use of the AMASS and BABEL datasets for this purpose.
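A minimal sketch of how semantic conditioning of this kind might look, assuming an embedding of the action label is concatenated with the joint-angle features of a simple recurrent forecaster; the architecture, layer sizes, and label count are hypothetical and do not reproduce the thesis's multi-future model.

```python
# Sketch of conditioning a joint-angle forecaster on a semantic action label:
# the label is embedded and concatenated with the pose features at every
# timestep. All sizes and the GRU choice are illustrative assumptions.
import torch
import torch.nn as nn

class SemanticMotionPredictor(nn.Module):
    def __init__(self, n_joints: int = 22, n_actions: int = 60,
                 action_dim: int = 32, hidden: int = 256):
        super().__init__()
        self.action_emb = nn.Embedding(n_actions, action_dim)
        self.gru = nn.GRU(3 * n_joints + action_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 3 * n_joints)  # next-step joint angles

    def forward(self, angles: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # angles: (batch, time, 3 * n_joints) relative joint angles
        # action: (batch,) integer action label (e.g., a BABEL category)
        emb = self.action_emb(action)                  # (batch, action_dim)
        emb = emb[:, None, :].expand(-1, angles.size(1), -1)
        h, _ = self.gru(torch.cat([angles, emb], dim=-1))
        return self.head(h)                            # one-step-ahead prediction

model = SemanticMotionPredictor()
pred = model(torch.randn(4, 50, 66), torch.randint(0, 60, (4,)))
print(pred.shape)  # torch.Size([4, 50, 66])
```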
Optimized Camera Handover Scheme in Free Viewpoint Video Streaming
Free-viewpoint video (FVV) is a promising approach that allows users to control their viewpoint and generate virtual views from any desired perspective. The individual user viewpoints are synthesized from two or more camera streams and the corresponding depth sequences. In case of continuous viewpoint changes, the camera inputs of the view synthesis process must be changed seamlessly in order to avoid starvation of the viewpoint synthesizer algorithm. Starvation occurs when the desired user viewpoint cannot be synthesized from the currently streamed camera views, so the FVV playout interrupts. In this paper we propose three camera handover schemes (TCC, MA, SA) based on viewpoint prediction in order to minimize the probability of playout stalls and to find the tradeoff between image quality and camera handover frequency. Our simulation results show that the introduced camera switching methods can reduce the handover frequency by more than 40%, hence viewpoint synthesis starvation and playout interruption can be minimized. By providing seamless viewpoint changes, the quality of experience can be significantly improved, making the new FVV service more attractive in the future.
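A toy sketch of prediction-driven camera handover, assuming a one-dimensional camera arc and linear viewpoint extrapolation; the coverage model, look-ahead horizon, and switching rule are illustrative assumptions and do not implement the paper's TCC, MA, or SA schemes.

```python
# Sketch of prediction-based camera handover for free-viewpoint streaming:
# the user's viewpoint is linearly extrapolated, and the streamed camera
# pair is switched before the view synthesizer would starve.

CAMERA_POSITIONS = [0.0, 1.0, 2.0, 3.0, 4.0]  # assumed 1-D camera arc
LOOKAHEAD = 0.5                                # assumed prediction horizon (s)

def predict_viewpoint(pos: float, velocity: float) -> float:
    """Linear extrapolation of the user viewpoint."""
    return pos + velocity * LOOKAHEAD

def choose_camera_pair(viewpoint: float) -> tuple[int, int]:
    """Camera pair whose baseline spans the predicted viewpoint."""
    left = max(i for i, c in enumerate(CAMERA_POSITIONS[:-1]) if c <= viewpoint)
    return left, left + 1

def handover(current: tuple[int, int], pos: float, velocity: float) -> tuple[int, int]:
    target = choose_camera_pair(predict_viewpoint(pos, velocity))
    # Switch only when the prediction leaves the current baseline, trading
    # a little image quality for fewer handovers (and fewer stalls).
    return target if target != current else current

pair = (0, 1)
for pos in [0.2, 0.6, 1.1, 1.7]:
    pair = handover(pair, pos, velocity=1.0)
    print(pos, pair)
```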
Generating 3D faces using Convolutional Mesh Autoencoders
Learned 3D representations of human faces are useful for computer vision
problems such as 3D face tracking and reconstruction from images, as well as
graphics applications such as character generation and animation. Traditional
models learn a latent representation of a face using linear subspaces or
higher-order tensor generalizations. Due to this linearity, they cannot
capture extreme deformations and non-linear expressions. To address this, we
introduce a versatile model that learns a non-linear representation of a face
using spectral convolutions on a mesh surface. We introduce mesh sampling
operations that enable a hierarchical mesh representation that captures
non-linear variations in shape and expression at multiple scales within the
model. In a variational setting, our model samples diverse realistic 3D faces
from a multivariate Gaussian distribution. Our training data consists of 20,466
meshes of extreme expressions captured over 12 different subjects. Despite
limited training data, our trained model outperforms state-of-the-art face
models with 50% lower reconstruction error, while using 75% fewer parameters.
We also show that replacing the expression space of an existing state-of-the-art face model with our autoencoder achieves a lower
reconstruction error. Our data, model and code are available at
http://github.com/anuragranj/com
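A minimal sketch of the spectral mesh convolution such models build on, here a Chebyshev-polynomial filter of the scaled mesh Laplacian applied to a toy four-vertex mesh; the polynomial order, feature sizes, and mesh are illustrative assumptions rather than the paper's architecture.

```python
# Sketch of a spectral (Chebyshev) graph convolution on a mesh: filters are
# polynomials of the scaled mesh Laplacian, so they act locally on the mesh.
# The toy 4-vertex ring mesh and all sizes are illustrative assumptions.
import torch
import torch.nn as nn

class ChebConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, K: int = 3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(K, in_ch, out_ch) * 0.1)

    def forward(self, x: torch.Tensor, lap: torch.Tensor) -> torch.Tensor:
        # x: (n_vertices, in_ch); lap: scaled Laplacian, eigenvalues in [-1, 1]
        tx = [x, lap @ x]                      # Chebyshev recursion: T0, T1
        for _ in range(2, self.weight.size(0)):
            tx.append(2 * lap @ tx[-1] - tx[-2])
        return sum(t @ w for t, w in zip(tx, self.weight))

# Toy mesh: 4 vertices in a ring; combinatorial Laplacian rescaled via
# 2L/lambda_max - I (lambda_max = 4 for this ring).
adj = torch.tensor([[0., 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]])
lap = (torch.diag(adj.sum(1)) - adj) / 2 - torch.eye(4)

conv = ChebConv(in_ch=3, out_ch=8, K=3)
out = conv(torch.randn(4, 3), lap)   # per-vertex 3-D coords -> 8-D features
print(out.shape)                     # torch.Size([4, 8])
```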
Action Recognition in Videos: from Motion Capture Labs to the Web
This paper presents a survey of human action recognition approaches based on
visual data recorded from a single video camera. We propose an organizing framework which highlights the evolution of the area, with techniques moving from heavily constrained motion-capture scenarios towards more challenging, realistic, "in the wild" videos. The proposed organization is based on the representation used as input for the recognition task, emphasizing the hypotheses assumed and thus the constraints imposed on the type of video that each technique is able to address. Making these hypotheses and constraints explicit makes the framework particularly useful for selecting a method for a given application. Another advantage of the proposed organization is that it allows the newest approaches to be categorized seamlessly alongside traditional ones, while providing an insightful perspective on the evolution of the action recognition task to date. That perspective is the basis for the discussion at the end of the paper, where we also present the main open issues in the area.
Comment: Preprint submitted to CVIU, survey paper, 46 pages, 2 figures, 4 tables
Automated and Real Time Subtle Facial Feature Tracker for Automatic Emotion Elicitation
This thesis proposes a system for real-time detection of facial expressions that are subtle and are exhibited in spontaneous, real-world settings. The underlying framework of our system is an open-source implementation of the Active Appearance Model (AAM). Our algorithm operates by grouping the various points provided by the AAM into higher-level regions, constructing and updating a background statistical model of movement in each region, and testing whether the current movement in a given region substantially exceeds the expected value of movement in that region (computed from the statistical model). Movements that exceed the expected value by some threshold and do not appear to be false alarms due to artifacts (e.g., lighting changes) are considered valid changes in facial expression. These changes are expected to be rough indicators of facial activity that can be complemented by context-driven predictors of emotion derived from spontaneous settings.
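A minimal sketch of the region-level movement test described above, assuming a running mean/variance background model per region and a k-sigma threshold; the region count, decay factor, and threshold are illustrative assumptions.

```python
# Sketch of the region-level movement test: per-region movement statistics
# are tracked with an exponentially weighted mean/variance, and a region
# flags a candidate expression change when its current movement exceeds the
# expected value by k standard deviations. All parameters are assumptions.
import numpy as np

class RegionMovementModel:
    def __init__(self, n_regions: int, decay: float = 0.95, k: float = 3.0):
        self.mean = np.zeros(n_regions)
        self.var = np.ones(n_regions)
        self.decay, self.k = decay, k

    def update(self, movement: np.ndarray) -> np.ndarray:
        """movement: mean landmark displacement per region for this frame.
        Returns a boolean array of regions with significant movement."""
        # Test before updating so the current frame cannot mask itself.
        threshold = self.mean + self.k * np.sqrt(self.var)
        active = movement > threshold
        # Update the background model of "normal" movement.
        self.mean = self.decay * self.mean + (1 - self.decay) * movement
        self.var = self.decay * self.var + (1 - self.decay) * (movement - self.mean) ** 2
        return active

model = RegionMovementModel(n_regions=5)
for frame in range(100):
    movement = np.abs(np.random.randn(5)) * 0.1
    if frame == 60:
        movement[2] += 2.0   # e.g., a sudden eyebrow raise in region 2
    flags = model.update(movement)
    if flags.any():
        print(f"frame {frame}: expression change in regions {np.where(flags)[0]}")
```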