A Variability-Based Testing Approach for Synthesizing Video Sequences
A key problem when developing video processing software is
the difficulty of testing different input combinations. In this
paper, we present VANE, a variability-based testing
approach to derive video sequence variants. The ideas of
VANE are i) to encode in a variability model what can vary
within a video sequence; ii) to exploit the variability model to
generate testable configurations; iii) to synthesize variants of
video sequences corresponding to configurations. VANE
computes T-wise covering sets while optimizing a function
over attributes. We also present a preliminary validation of
the scalability and practicality of VANE in the context of an
industrial project involving the testing of video processing
algorithms.
Ministerio de Economía y Competitividad TIN2012-32273; Junta de Andalucía TIC-5906; Junta de Andalucía P12-TIC-186
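As a rough illustration of the T-wise covering computation mentioned above, the following Python sketch greedily builds a pairwise (2-wise) covering set over a toy video variability model. The feature names, their domains, and the greedy strategy are assumptions made for the sketch; VANE itself additionally optimizes an objective function over attributes.

```python
from itertools import combinations, product

# Hypothetical video variability model: each feature has a finite domain.
# Names and values are illustrative, not taken from the VANE paper.
model = {
    "illumination": ["day", "dusk", "night"],
    "vehicles":     [0, 1, 5],
    "camera":       ["static", "moving"],
}

def pairwise_covering_set(model):
    """Greedy 2-wise covering: pick configurations until every value pair
    of every feature pair appears in at least one chosen configuration."""
    features = sorted(model)
    # All (feature, value) pair interactions that must be covered.
    uncovered = {
        ((f, u), (g, v))
        for f, g in combinations(features, 2)
        for u, v in product(model[f], model[g])
    }
    all_configs = [dict(zip(features, vals))
                   for vals in product(*(model[f] for f in features))]
    suite = []
    while uncovered:
        # Choose the configuration covering the most still-uncovered pairs.
        def gain(cfg):
            return sum(((f, cfg[f]), (g, cfg[g])) in uncovered
                       for f, g in combinations(features, 2))
        best = max(all_configs, key=gain)
        uncovered -= {((f, best[f]), (g, best[g]))
                      for f, g in combinations(features, 2)}
        suite.append(best)
    return suite

for cfg in pairwise_covering_set(model):
    print(cfg)
```

A handful of configurations typically suffices to cover all pairs, which is why T-wise sampling scales far better than exhaustive enumeration of variants.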
ViViD: A Variability-Based Tool for Synthesizing Video Sequences
We present ViViD, a variability-based tool to synthesize
variants of video sequences. ViViD is developed and used in
the context of an industrial project involving consumers and
providers of video processing algorithms. The goal is to
synthesize video variants with a wide range of characteristics
to then test the algorithms. We describe the key
components of ViViD: (1) a variability language and an environment
to model what can vary within a video sequence; (2)
a reasoning back-end to generate relevant testing configurations;
(3) a video synthesizer in charge of producing variants
of video sequences corresponding to configurations. We
show how ViViD can synthesize realistic videos with different
characteristics, such as luminance, vehicles, and persons, that
cover a diversity of testing scenarios.
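To make component (3) concrete, here is a toy stand-in for a video synthesizer: it renders a clip whose luminance, number of moving objects, and camera motion follow a configuration. The configuration keys and the rendering scheme are hypothetical illustrations, not ViViD's actual synthesizer, which produces realistic footage.

```python
import numpy as np

def synthesize_variant(config, frames=30, h=120, w=160, seed=0):
    """Toy synthesizer sketch: renders grayscale frames whose global
    luminance and number of moving blobs follow the configuration."""
    rng = np.random.default_rng(seed)
    base = {"day": 0.8, "dusk": 0.5, "night": 0.2}[config["illumination"]]
    video = np.full((frames, h, w), base, dtype=np.float32)
    # One moving bright square stands in for each configured "vehicle".
    xs = rng.integers(0, w - 10, config["vehicles"])
    ys = rng.integers(0, h - 10, config["vehicles"])
    for t in range(frames):
        for i in range(config["vehicles"]):
            x = int((xs[i] + 2 * t) % (w - 10))   # constant horizontal motion
            video[t, ys[i]:ys[i] + 10, x:x + 10] = min(1.0, base + 0.4)
        if config["camera"] == "moving":
            video[t] = np.roll(video[t], t, axis=1)  # crude camera pan
    return video

clip = synthesize_variant({"illumination": "night", "vehicles": 3,
                           "camera": "moving"})
print(clip.shape, clip.min(), clip.max())
```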
Unsupervised Video Understanding by Reconciliation of Posture Similarities
Understanding human activity and being able to explain it in detail surpasses
mere action classification by far in both complexity and value. The challenge
is thus to describe an activity on the basis of its most fundamental
constituents, the individual postures and their distinctive transitions.
Supervised learning of such a fine-grained representation based on elementary
poses is very tedious and does not scale. Therefore, we propose a completely
unsupervised deep learning procedure based solely on video sequences, which
starts from scratch without requiring pre-trained networks, predefined body
models, or keypoints. A combinatorial sequence matching algorithm proposes
relations between frames from subsets of the training data, while a CNN is
reconciling the transitivity conflicts of the different subsets to learn a
single concerted pose embedding despite changes in appearance across sequences.
Without any manual annotation, the model learns a structured representation of
postures and their temporal development. The model not only enables retrieval
of similar postures but also temporal super-resolution. Additionally, based on
a recurrent formulation, next frames can be synthesized.
Comment: Accepted by ICCV 2017
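The interplay between proposed frame relations and transitivity can be illustrated with a small sketch. The frame identifiers and relation labels below are invented for the example; the paper's method resolves such conflicts while training a CNN embedding, whereas this sketch only detects them.

```python
from itertools import combinations

# Illustrative setup: each training subset independently matched frames
# and proposed pairwise relations (+1 similar posture, -1 dissimilar).
# Frame IDs and labels are made up for the sketch.
relations = {
    ("a", "b"): +1,   # proposed within subset 1
    ("b", "c"): +1,   # proposed within subset 2
    ("a", "c"): -1,   # proposed within subset 3 -- violates transitivity
}

def transitivity_conflicts(relations):
    """Return frame triplets where two 'similar' links imply a third link
    that another subset labeled 'dissimilar'."""
    frames = sorted({f for pair in relations for f in pair})
    def rel(x, y):
        return relations.get((x, y)) or relations.get((y, x))
    conflicts = []
    for a, b, c in combinations(frames, 3):
        links = [rel(a, b), rel(b, c), rel(a, c)]
        if None not in links and links.count(+1) == 2 and -1 in links:
            conflicts.append((a, b, c))
    return conflicts

print(transitivity_conflicts(relations))  # [('a', 'b', 'c')]
```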
Modeling Variability in the Video Domain: Language and Experience Report
This paper reports on a new domain-specific variability modeling language, called VM, resulting from a close collaboration with industrial partners in the video domain. We expose the requirements and advanced variability constructs required to characterize and realize variations of the physical properties of a video (such as objects' speed or scene illumination). The results of our experiments and industrial experience show that VM is effective for modeling complex variability information and can be exploited to synthesize video variants. We concluded that basic variability mechanisms are useful but not enough, that attributes and multi-features are of prime importance, and that meta-information is relevant for efficient variability analysis. In addition, we questioned the existence of a one-size-fits-all variability modeling solution applicable in any industry. Yet some common needs for modeling variability are becoming apparent, such as support for attributes and multi-features.
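As a sketch of why attributes and multi-features matter, the following hypothetical Python encoding models a scene with a real-valued attribute and a clonable multi-feature, plus a validity check over a configuration. This is not the VM language's syntax, only an illustration of the constructs the abstract argues for.

```python
# Hypothetical Python encoding of an attributed, multi-feature variability
# model in the spirit of VM; the real VM language has its own syntax.
video_model = {
    "scene": {
        "attributes": {"illumination": ("real", 0.0, 1.0)},  # (type, lo, hi)
    },
    "object": {
        "multi": True,                       # multi-feature: may be cloned
        "clones": (0, 5),                    # between 0 and 5 instances
        "attributes": {"speed": ("real", 0.0, 30.0)},
    },
}

def valid(config, model):
    """Check a configuration against attribute domains and clone counts."""
    for feature, spec in model.items():
        instances = config.get(feature, [])
        if spec.get("multi"):
            lo, hi = spec["clones"]
            if not lo <= len(instances) <= hi:
                return False
        for inst in instances:
            for attr, (_, lo, hi) in spec["attributes"].items():
                if not lo <= inst[attr] <= hi:
                    return False
    return True

cfg = {"scene": [{"illumination": 0.3}],
       "object": [{"speed": 12.5}, {"speed": 4.0}]}
print(valid(cfg, video_model))  # True
```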
Towards Automatic Speech Identification from Vocal Tract Shape Dynamics in Real-time MRI
Vocal tract configurations play a vital role in generating distinguishable
speech sounds, by modulating the airflow and creating different resonant
cavities in speech production. They contain abundant information that can be
utilized to better understand the underlying speech production mechanism. As a
step towards automatic mapping of vocal tract shape geometry to acoustics, this
paper employs effective video action recognition techniques, like Long-term
Recurrent Convolutional Networks (LRCN) models, to identify different
vowel-consonant-vowel (VCV) sequences from dynamic shaping of the vocal tract.
Such a model typically combines a CNN based deep hierarchical visual feature
extractor with Recurrent Networks, that ideally makes the network
spatio-temporally deep enough to learn the sequential dynamics of a short video
clip for video classification tasks. We use a database consisting of 2D
real-time MRI of vocal tract shaping during VCV utterances by 17 speakers. The
comparative performances of this class of algorithms under various parameter
settings and for various classification tasks are discussed. Interestingly, the
results show a marked difference in the model performance in the context of
speech classification with respect to generic sequence or video classification
tasks.
Comment: To appear in the INTERSPEECH 2018 Proceedings
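A minimal PyTorch sketch of the LRCN pattern the abstract describes: a small CNN encodes each frame, an LSTM integrates the per-frame features over time, and the final hidden state is classified. Layer sizes and the toy input are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MiniLRCN(nn.Module):
    """Minimal LRCN: a per-frame CNN feature extractor followed by an LSTM
    over time; the last hidden state is classified. Sizes are illustrative."""
    def __init__(self, n_classes):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # -> (batch, 32)
        )
        self.lstm = nn.LSTM(32, 64, batch_first=True)
        self.head = nn.Linear(64, n_classes)

    def forward(self, clips):                  # clips: (B, T, 1, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1))  # (B*T, 32), frames in a batch
        feats = feats.view(b, t, -1)           # (B, T, 32), back to sequences
        _, (h, _) = self.lstm(feats)           # h: (1, B, 64)
        return self.head(h[-1])                # (B, n_classes)

# Toy VCV classification: 2 clips of 8 MRI-like 64x64 frames each.
logits = MiniLRCN(n_classes=10)(torch.randn(2, 8, 1, 64, 64))
print(logits.shape)  # torch.Size([2, 10])
```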
Capture, Learning, and Synthesis of 3D Speaking Styles
Audio-driven 3D facial animation has been widely explored, but achieving
realistic, human-like performance is still unsolved. This is due to the lack of
available 3D datasets, models, and standard evaluation metrics. To address
this, we introduce a unique 4D face dataset with about 29 minutes of 4D scans
captured at 60 fps and synchronized audio from 12 speakers. We then train a
neural network on our dataset that factors identity from facial motion. The
learned model, VOCA (Voice Operated Character Animation) takes any speech
signal as input - even speech in languages other than English - and
realistically animates a wide range of adult faces. Conditioning on subject
labels during training allows the model to learn a variety of realistic
speaking styles. VOCA also provides animator controls to alter speaking style,
identity-dependent facial shape, and pose (i.e. head, jaw, and eyeball
rotations) during animation. To our knowledge, VOCA is the only realistic 3D
facial animation model that is readily applicable to unseen subjects without
retargeting. This makes VOCA suitable for tasks like in-game video, virtual
reality avatars, or any scenario in which the speaker, speech, or language is
not known in advance. We make the dataset and model available for research
purposes at http://voca.is.tue.mpg.de.
Comment: To appear in CVPR 2019
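The conditioning mechanism the abstract describes, a subject label steering speaking style, can be sketched as follows: audio features concatenated with a one-hot subject label are decoded into per-vertex offsets added to a template mesh. The dimensions, the simple MLP decoder, and the vertex count are illustrative assumptions, not VOCA's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechToOffsets(nn.Module):
    """VOCA-style conditioning sketch: audio features plus a one-hot subject
    label are decoded into per-vertex offsets added to a template mesh.
    All dimensions are illustrative assumptions."""
    def __init__(self, audio_dim=29, n_subjects=12, n_vertices=5023):
        super().__init__()
        self.n_subjects, self.n_vertices = n_subjects, n_vertices
        self.decoder = nn.Sequential(
            nn.Linear(audio_dim + n_subjects, 128), nn.ReLU(),
            nn.Linear(128, n_vertices * 3),
        )

    def forward(self, audio_feat, subject_id, template):
        # audio_feat: (B, audio_dim); subject_id: (B,); template: (V, 3)
        one_hot = F.one_hot(subject_id, self.n_subjects).float()
        offsets = self.decoder(torch.cat([audio_feat, one_hot], dim=1))
        # Changing subject_id alters the decoded "speaking style".
        return template + offsets.view(-1, self.n_vertices, 3)

model = SpeechToOffsets()
faces = model(torch.randn(4, 29), torch.tensor([0, 3, 3, 7]),
              torch.zeros(5023, 3))
print(faces.shape)  # torch.Size([4, 5023, 3])
```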
Speech-Driven 3D Face Animation with Composite and Regional Facial Movements
Speech-driven 3D face animation poses significant challenges due to the
intricacy and variability inherent in human facial movements. This paper
emphasizes the importance of considering both the composite and regional
natures of facial movements in speech-driven 3D face animation. The composite
nature pertains to how speech-independent factors globally modulate
speech-driven facial movements along the temporal dimension. Meanwhile, the
regional nature alludes to the notion that facial movements are not globally
correlated but are actuated by local musculature along the spatial dimension.
It is thus indispensable to incorporate both natures for engendering vivid
animation. To address the composite nature, we introduce an adaptive modulation
module that employs arbitrary facial movements to dynamically adjust
speech-driven facial movements across frames on a global scale. To accommodate
the regional nature, our approach ensures that each constituent of the facial
features for every frame focuses on the local spatial movements of 3D faces.
Moreover, we present a non-autoregressive backbone for translating audio to 3D
facial movements, which maintains high-frequency nuances of facial movements
and facilitates efficient inference. Comprehensive experiments and user studies
demonstrate that our method surpasses contemporary state-of-the-art approaches
both qualitatively and quantitatively.Comment: Accepted by MM 2023, 9 pages, 7 figures. arXiv admin note: text
overlap with arXiv:2303.0979
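The two natures the abstract distinguishes can be sketched as a FiLM-style global scale and shift applied per frame (composite nature) plus separate decoding heads for disjoint vertex regions (regional nature). The region split, dimensions, and modulation form are assumptions made for illustration, not the paper's modules.

```python
import torch
import torch.nn as nn

class CompositeRegionalDecoder(nn.Module):
    """Sketch of the two properties named in the abstract: a global per-frame
    modulation of speech-driven features (composite) and separate decoding
    heads for local face regions (regional). FiLM-style scale/shift and the
    region split are illustrative assumptions."""
    def __init__(self, feat_dim=64, style_dim=16,
                 region_sizes=(1000, 1500, 2523)):   # e.g. mouth/eyes/rest
        super().__init__()
        self.film = nn.Linear(style_dim, 2 * feat_dim)  # -> scale, shift
        self.heads = nn.ModuleList(
            nn.Linear(feat_dim, 3 * n) for n in region_sizes)
        self.region_sizes = region_sizes

    def forward(self, speech_feat, style):
        # speech_feat: (B, T, feat_dim); style: (B, style_dim)
        scale, shift = self.film(style).unsqueeze(1).chunk(2, dim=-1)
        h = speech_feat * (1 + scale) + shift         # global modulation
        parts = [head(h).view(*h.shape[:2], n, 3)     # per-region vertices
                 for head, n in zip(self.heads, self.region_sizes)]
        return torch.cat(parts, dim=2)                # (B, T, sum(n), 3)

out = CompositeRegionalDecoder()(torch.randn(2, 10, 64), torch.randn(2, 16))
print(out.shape)  # torch.Size([2, 10, 5023, 3])
```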
- …