
    A Variability-Based Testing Approach for Synthesizing Video Sequences

    A key problem when developing video processing software is the difficulty to test different input combinations. In this paper, we present VANE, a variability-based testing approach to derive video sequence variants. The ideas of VANE are i) to encode in a variability model what can vary within a video sequence; ii) to exploit the variability model to generate testable configurations; iii) to synthesize variants of video sequences corresponding to configurations. VANE computes T-wise covering sets while optimizing a function over attributes. Also, we present a preliminary validation of the scalability and practicality of VANE in the context of an industrial project involving the test of video processing algorithms. Ministerio de Economía y Competitividad TIN2012-32273; Junta de Andalucía TIC-5906; Junta de Andalucía P12-TIC-186
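To make the notion of a T-wise covering set concrete, here is a minimal sketch for T = 2 (pairwise). It is not VANE's algorithm: it is a plain greedy heuristic, it ignores constraints between features, and it does not optimize any function over attributes. The feature names and values (`luminance`, `vehicles`, `persons`) are hypothetical, chosen only for illustration.

```python
from itertools import combinations, product


def pairwise_cover(features):
    """Greedily pick configurations until every pair of feature values
    appears together in at least one selected configuration."""
    names = sorted(features)

    # Every value pair, across every two distinct features, that must be covered.
    uncovered = {
        ((a, va), (b, vb))
        for a, b in combinations(names, 2)
        for va in features[a]
        for vb in features[b]
    }

    # Candidate pool: the full cartesian product (fine for tiny models only).
    candidates = [
        dict(zip(names, values))
        for values in product(*(features[n] for n in names))
    ]

    def pairs_of(config):
        # All value pairs that a single configuration covers.
        return {((a, config[a]), (b, config[b])) for a, b in combinations(names, 2)}

    selected = []
    while uncovered:
        # Take the candidate that covers the most still-uncovered pairs.
        best = max(candidates, key=lambda c: len(pairs_of(c) & uncovered))
        selected.append(best)
        uncovered -= pairs_of(best)
    return selected


# Hypothetical variability model of a video sequence (illustrative values only).
features = {
    "luminance": ["low", "high"],
    "vehicles": [0, 1, 2],
    "persons": [0, 1],
}
configs = pairwise_cover(features)
for config in configs:
    print(config)
```

The exhaustive product contains 12 configurations here, while a pairwise covering set needs fewer. Enumerating the full product does not scale to realistic variability models, which is why this is only a sketch of the idea.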

    ViViD: A Variability-Based Tool for Synthesizing Video Sequences

    We present ViViD, a variability-based tool to synthesize variants of video sequences. ViViD is developed and used in the context of an industrial project involving consumers and providers of video processing algorithms. The goal is to synthesize synthetic video variants with a wide range of characteristics to then test the algorithms. We describe the key components of ViViD: (1) a variability language and an environment to model what can vary within a video sequence; (2) a reasoning back-end to generate relevant testing configurations; (3) a video synthesizer in charge of producing variants of video sequences corresponding to configurations. We show how ViViD can synthesize realistic videos with different characteristics such as luminances, vehicles and persons that cover a diversity of testing scenarios.

    Unsupervised Video Understanding by Reconciliation of Posture Similarities

    Understanding human activity and being able to explain it in detail surpasses mere action classification by far in both complexity and value. The challenge is thus to describe an activity on the basis of its most fundamental constituents, the individual postures and their distinctive transitions. Supervised learning of such a fine-grained representation based on elementary poses is very tedious and does not scale. Therefore, we propose a completely unsupervised deep learning procedure based solely on video sequences, which starts from scratch without requiring pre-trained networks, predefined body models, or keypoints. A combinatorial sequence matching algorithm proposes relations between frames from subsets of the training data, while a CNN is reconciling the transitivity conflicts of the different subsets to learn a single concerted pose embedding despite changes in appearance across sequences. Without any manual annotation, the model learns a structured representation of postures and their temporal development. The model not only enables retrieval of similar postures but also temporal super-resolution. Additionally, based on a recurrent formulation, next frames can be synthesized. Comment: Accepted by ICCV 201

    Modeling Variability in the Video Domain: Language and Experience Report

    This paper reports on a new domain-specific variability modeling language, called VM, resulting from close collaboration with industrial partners in the video domain. We expose the requirements and advanced variability constructs required to characterize and realize variations of physical properties of a video (such as objects' speed or scene illumination). The results of our experiments and industrial experience show that VM is effective to model complex variability information and can be exploited to synthesize video variants. We concluded that basic variability mechanisms are useful but not enough, attributes and multi-features are of prior importance, and meta-information is relevant for efficient variability analysis. In addition, we questioned the existence of a one-size-fits-all variability modeling solution applicable in any industry. Yet, some common needs for modeling variability are becoming apparent, such as support for attributes and multi-features.

    Towards Automatic Speech Identification from Vocal Tract Shape Dynamics in Real-time MRI

    Vocal tract configurations play a vital role in generating distinguishable speech sounds, by modulating the airflow and creating different resonant cavities in speech production. They contain abundant information that can be utilized to better understand the underlying speech production mechanism. As a step towards automatic mapping of vocal tract shape geometry to acoustics, this paper employs effective video action recognition techniques, like Long-term Recurrent Convolutional Networks (LRCN) models, to identify different vowel-consonant-vowel (VCV) sequences from dynamic shaping of the vocal tract. Such a model typically combines a CNN-based deep hierarchical visual feature extractor with Recurrent Networks, which ideally makes the network spatio-temporally deep enough to learn the sequential dynamics of a short video clip for video classification tasks. We use a database consisting of 2D real-time MRI of vocal tract shaping during VCV utterances by 17 speakers. The comparative performances of this class of algorithms under various parameter settings and for various classification tasks are discussed. Interestingly, the results show a marked difference in the model performance in the context of speech classification with respect to generic sequence or video classification tasks. Comment: To appear in the INTERSPEECH 2018 Proceeding

    Capture, Learning, and Synthesis of 3D Speaking Styles

    Audio-driven 3D facial animation has been widely explored, but achieving realistic, human-like performance is still unsolved. This is due to the lack of available 3D datasets, models, and standard evaluation metrics. To address this, we introduce a unique 4D face dataset with about 29 minutes of 4D scans captured at 60 fps and synchronized audio from 12 speakers. We then train a neural network on our dataset that factors identity from facial motion. The learned model, VOCA (Voice Operated Character Animation), takes any speech signal as input, even speech in languages other than English, and realistically animates a wide range of adult faces. Conditioning on subject labels during training allows the model to learn a variety of realistic speaking styles. VOCA also provides animator controls to alter speaking style, identity-dependent facial shape, and pose (i.e. head, jaw, and eyeball rotations) during animation. To our knowledge, VOCA is the only realistic 3D facial animation model that is readily applicable to unseen subjects without retargeting. This makes VOCA suitable for tasks like in-game video, virtual reality avatars, or any scenario in which the speaker, speech, or language is not known in advance. We make the dataset and model available for research purposes at http://voca.is.tue.mpg.de. Comment: To appear in CVPR 201

    Speech-Driven 3D Face Animation with Composite and Regional Facial Movements

    Speech-driven 3D face animation poses significant challenges due to the intricacy and variability inherent in human facial movements. This paper emphasizes the importance of considering both the composite and regional natures of facial movements in speech-driven 3D face animation. The composite nature pertains to how speech-independent factors globally modulate speech-driven facial movements along the temporal dimension. Meanwhile, the regional nature alludes to the notion that facial movements are not globally correlated but are actuated by local musculature along the spatial dimension. It is thus indispensable to incorporate both natures for engendering vivid animation. To address the composite nature, we introduce an adaptive modulation module that employs arbitrary facial movements to dynamically adjust speech-driven facial movements across frames on a global scale. To accommodate the regional nature, our approach ensures that each constituent of the facial features for every frame focuses on the local spatial movements of 3D faces. Moreover, we present a non-autoregressive backbone for translating audio to 3D facial movements, which maintains high-frequency nuances of facial movements and facilitates efficient inference. Comprehensive experiments and user studies demonstrate that our method surpasses contemporary state-of-the-art approaches both qualitatively and quantitatively. Comment: Accepted by MM 2023, 9 pages, 7 figures. arXiv admin note: text overlap with arXiv:2303.0979
