
    Audio-Visual Learning for Scene Understanding

    Multimodal deep learning aims to combine the complementary information of different modalities. Among all modalities, audio and video are the predominant ones that humans use to explore the world. In this thesis, we therefore focus on audio-visual deep learning, designing networks that mimic how humans perceive the world. Our research involves images, audio signals, and acoustic images. The latter provide spatial audio information and are obtained from a planar array of microphones by combining their raw signals with a beamforming algorithm. They better mimic the human auditory system, whose spatial sound cues cannot be replicated with a single microphone. However, since microphone arrays are not widespread, we also study how to handle the missing spatialized audio modality at test time. As a solution, we propose to distill the content of acoustic images into audio features during training, so that their absence at test time can be handled. This is done for supervised audio classification using the generalized distillation framework, which we also extend to self-supervised learning. Next, we devise a method for reconstructing acoustic images given a single microphone and an RGB frame; thus, when only a standard video is available, we can synthesize spatial audio, which is useful for many audio-visual tasks, including sound localization. Lastly, as another example of restoring one modality from the available ones, we inpaint degraded images using audio features, reconstructing the missing region so that it is not only visually plausible but also semantically consistent with the related sound. This also covers cross-modal generation in the limit case of a completely missing or hidden visual modality: our method naturally handles it, being able to generate images from sound. In summary, we show how audio can help visual learning and vice versa, by transferring knowledge between the two modalities at training time in order to distill, reconstruct, or restore the missing modality at test time.
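    As a rough illustration of the generalized distillation idea described above (hypothetical layer names and hyper-parameters; the thesis' actual architectures and losses are not given here), a teacher branch that sees the privileged acoustic images at training time can supervise an audio-only student through softened logits, so the student can still run when the microphone array is missing at test time:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of a distillation objective: hard-label cross-entropy on the
# student plus a KL term towards the teacher's softened predictions. The
# teacher sees privileged acoustic images during training only; the student
# sees single-microphone audio features and is the branch deployed at test time.

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of hard-label cross-entropy and soft-target KL divergence."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * hard + (1.0 - alpha) * soft

# Toy usage with random tensors standing in for the two branches' outputs.
student_logits = torch.randn(8, 10, requires_grad=True)   # audio-only branch
teacher_logits = torch.randn(8, 10)                        # acoustic-image branch
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```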

    Machine translation and post-editing in wildlife documentaries: challenges and possible solutions

    This article presents some of the challenges that may arise when machine translation (MT) is introduced into the translation process for wildlife documentaries. So far, MT has been used to translate general and specialized written texts. Nevertheless, in recent years, EU-funded projects have begun to work in the field of audiovisual translation with the aim of using MT to translate subtitles, and it has already been shown that post-edited subtitles can reach adequate quality levels. But documentaries are not translated only through subtitles: in countries where subtitling is not the main mode of audiovisual transfer, voice-over and off-screen dubbing are used instead. For this reason, we believe it is necessary to investigate the introduction of MT for translating wildlife documentaries through voice-over and off-screen dubbing. This article describes the challenges involved in machine-translating documentary scripts, presenting a preliminary analysis of the translations produced by different machine translation engines. First, we provide an overview of the characteristics of voice-over and off-screen dubbing, as well as a brief summary of previous research attempting to introduce MT into audiovisual translation. We then present the methodology used to analyse a corpus of documentary scripts, on the one hand, and a corpus of machine translations of these same scripts, on the other. Finally, before summarizing possible further research arising from this article, we clarify the challenges we might face in obtaining quality translations of documentary scripts using MT, present the results of the analyses, and suggest possible solutions to these challenges.

    Face Active Appearance Modeling and Speech Acoustic Information to Recover Articulation


    Chapter From the Lab to the Real World: Affect Recognition Using Multiple Cues and Modalities

    The interdisciplinary concept of the dissipative soliton is unfolded in connection with ultrafast fibre lasers. Different mode-locking techniques, as well as experimental realizations of dissipative-soliton fibre lasers, are surveyed briefly with an emphasis on their energy scalability. Basic topics of dissipative soliton theory are elucidated in connection with the concepts of energy scalability and stability. It is shown that the parametric space of the dissipative soliton has reduced dimension and a comparatively simple structure, which simplifies the analysis and optimization of ultrafast fibre lasers. The main destabilization scenarios are described, and the limits of energy scalability are connected with the impact of optical turbulence and stimulated Raman scattering. The fast and slow dynamics of vector dissipative solitons are also presented.
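    The abstract does not spell out the underlying model, but a master equation commonly used in the dissipative-soliton literature (included here only as standard background; sign and parameter conventions vary between papers) is the cubic-quintic complex Ginzburg-Landau equation:

```latex
i\psi_z + \frac{D}{2}\,\psi_{tt} + |\psi|^{2}\psi + \nu\,|\psi|^{4}\psi
  = i\delta\,\psi + i\varepsilon\,|\psi|^{2}\psi + i\beta\,\psi_{tt} + i\mu\,|\psi|^{4}\psi
```

    where \psi(z,t) is the slowly varying field envelope, D the group-velocity dispersion, \nu the quintic refractive nonlinearity, and \delta, \varepsilon, \beta, \mu the linear loss/gain, cubic nonlinear gain, spectral filtering, and quintic gain-saturation coefficients. Solitary solutions exist only where the gain and loss terms balance, which is the kind of constraint behind the reduced-dimension parametric space and the stability analysis mentioned above.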

    PERFORMANCE ANALYSIS OF AUDIO AND VIDEO SYNCHRONIZATION USING SPREADED CODE DELAY MEASUREMENT TECHNIQUE

    Audio and video synchronization plays an important role in speech recognition and multimedia communication. Audio-video sync is a significant problem in live video conferencing, caused by the various hardware components and software environments that introduce variable delays. Loss of synchronization between audio and video prevents viewers from enjoying a program and decreases its effectiveness. The objective of synchronization is to preserve the temporal alignment between the audio and video signals. This paper proposes audio-video synchronization using a spreading-code delay measurement technique. The proposed method is evaluated on a home database and achieves 99% synchronization efficiency. The audio-visual signature technique provides a significant reduction in audio-video sync problems and enables an effective performance analysis of audio and video synchronization. This paper also implements an audio-video synchronizer and analyses its performance in terms of synchronization efficiency, audio-video time drift, and audio-video delay. The simulations are carried out using MATLAB and Simulink. The synchronizer automatically estimates and corrects the timing relationship between the audio and video signals, maintaining the Quality of Service.
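    As a minimal sketch of the delay-measurement principle (not the paper's exact pipeline; the sample rate, code length, and noise level below are assumed), a known pseudo-random spreading code embedded in the audio stream can be cross-correlated against the received stream to recover the audio-video offset:

```python
import numpy as np

# Sketch of spreading-code delay estimation: a known pseudo-random code is
# embedded in the audio; cross-correlating the received stream with the code
# yields a sharp peak at the unknown delay, because the code's autocorrelation
# is nearly impulse-like even in the presence of background audio.

rng = np.random.default_rng(0)
fs = 48_000                                    # audio sample rate (assumed)
code = rng.choice([-1.0, 1.0], size=4096)      # pseudo-random spreading code

true_delay = 733                               # offset to recover, in samples
received = np.zeros(fs)                        # one second of "received" audio
received[true_delay:true_delay + code.size] += code
received += 0.5 * rng.standard_normal(received.size)  # background noise

# The argmax of the cross-correlation gives the estimated delay.
corr = np.correlate(received, code, mode="valid")
est_delay = int(np.argmax(corr))
print(f"estimated delay: {est_delay} samples ({1000 * est_delay / fs:.1f} ms)")
```

    The recovered offset (in samples, convertible to milliseconds) is then what a synchronizer would apply to re-align the audio and video streams.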

    Computational Multimedia for Video Self Modeling

    Video self modeling (VSM) is a behavioral intervention technique in which a learner models a target behavior by watching a video of himself or herself. This is the idea behind the psychological theory of self-efficacy: you can learn or model to perform certain tasks because you see yourself doing them, which provides the most ideal form of behavior modeling. The effectiveness of VSM has been demonstrated for many different types of disabilities and behavioral problems, ranging from stuttering, inappropriate social behaviors, autism, and selective mutism to sports training. However, there is an inherent difficulty associated with the production of VSM material: prolonged and persistent video recording is required to capture the rare, if not entirely absent, snippets that can be strung together to form novel video sequences of the target skill. To solve this problem, this dissertation uses computational multimedia techniques to facilitate the creation of synthetic visual content for self-modeling that can be used by a learner and his/her therapist with a minimal amount of training data. There are three major technical contributions in this research. First, I developed an adaptive video re-sampling algorithm to synthesize realistic, lip-synchronized video with minimal motion jitter. Second, to denoise and complete the depth maps captured by structured-light sensing systems, I introduced a layer-based probabilistic model that accounts for various types of uncertainty in the depth measurement. Third, I developed a simple and robust bundle-adjustment-based framework for calibrating a network of multiple wide-baseline RGB and depth cameras.
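    The dissertation's layer-based probabilistic depth model is not reproduced here; as a generic stand-in, the sketch below fills missing depth pixels (reported as zeros by a structured-light sensor) with the median of valid neighbours, illustrating the kind of hole-filling that such a model replaces with a principled treatment of measurement uncertainty (the window size and iteration count are arbitrary choices):

```python
import numpy as np

# Generic depth-map hole filling: missing pixels (value 0) are replaced by the
# median of valid pixels in a small neighbourhood, repeated until the map has
# no holes left or the iteration budget is exhausted.

def fill_depth_holes(depth, k=5, max_iters=10):
    depth = depth.astype(float).copy()
    r = k // 2
    for _ in range(max_iters):
        holes = np.argwhere(depth == 0)
        if holes.size == 0:
            break
        for y, x in holes:
            patch = depth[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1]
            valid = patch[patch > 0]
            if valid.size:
                depth[y, x] = np.median(valid)
    return depth

# Toy usage: a smooth depth ramp with a missing 4x4 block.
d = np.tile(np.linspace(1.0, 2.0, 32), (32, 1))
d[10:14, 10:14] = 0.0
filled = fill_depth_holes(d)
assert (filled > 0).all()
```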

    MediaSync: Handbook on Multimedia Synchronization

    This book provides an approachable overview of the most recent advances in the fascinating field of media synchronization (mediasync), gathering contributions from the most representative and influential experts. Understanding the challenges of this field in the current multi-sensory, multi-device, and multi-protocol world is not an easy task. The book revisits the foundations of mediasync, including theoretical frameworks and models, highlights ongoing research efforts, like hybrid broadband broadcast (HBB) delivery and users' perception modeling (i.e., Quality of Experience or QoE), and paves the way for the future (e.g., towards the deployment of multi-sensory and ultra-realistic experiences). Although many advances around mediasync have been devised and deployed, this area of research is getting renewed attention to overcome remaining challenges in the next-generation (heterogeneous and ubiquitous) media ecosystem. Given the significant advances in this research area, its current relevance, and the multiple disciplines it involves, a reference book on mediasync has become necessary, and this book fills that gap. In particular, it addresses key aspects and reviews the most relevant contributions within the mediasync research space, from different perspectives. MediaSync: Handbook on Multimedia Synchronization is the perfect companion for scholars and practitioners who want to acquire strong knowledge about this research area, and also approach the challenges behind ensuring the best mediated experiences, by providing adequate synchronization between the media elements that constitute these experiences.