5 research outputs found

    BatVision: Learning to See 3D Spatial Layout with Two Ears

    Many species have evolved advanced non-visual perception, while artificial systems lag behind. Radar and ultrasound complement camera-based vision, but they are often too costly and complex to set up for the limited information they provide. In nature, sound is used effectively by bats, dolphins, whales, and humans for navigation and communication, yet it is unclear how best to harness sound for machine perception. Inspired by bats' echolocation mechanism, we design a low-cost BatVision system that can see the 3D spatial layout of the space ahead just by listening with two ears. Our system emits short chirps from a speaker and records the returning echoes through microphones mounted in a pair of artificial human pinnae. During training, we additionally use a stereo camera to capture color images for computing scene depth. We train a model to predict depth maps, and even grayscale images, from the sound alone. During testing, the trained BatVision produces surprisingly good predictions of 2D visual scenes from two 1D audio signals. Such a sound-to-vision system would benefit robot navigation and machine vision, especially in low-light or no-light conditions. Our code and data are publicly available.
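    As a rough illustration of the sound-to-vision idea above, the sketch below maps binaural echo spectrograms to a depth map with a tiny encoder-decoder in PyTorch. The input representation, layer sizes, and L1 supervision against stereo-camera depth are assumptions for illustration only, not the authors' exact BatVision architecture.

```python
# Minimal sketch (assumed architecture, not the BatVision model itself):
# two-channel echo spectrograms in, single-channel depth map out.
import torch
import torch.nn as nn

class EchoToDepth(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: left/right-ear echo spectrograms -> latent feature map
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder: upsample the latent features to a depth map
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, echo_spec):
        # echo_spec: (batch, 2, freq, time) spectrograms of the returning chirp echoes
        return self.decoder(self.encoder(echo_spec))

model = EchoToDepth()
echoes = torch.randn(4, 2, 64, 64)                 # dummy binaural echo spectrograms
depth = model(echoes)                              # (4, 1, 64, 64) predicted depth maps
loss = nn.functional.l1_loss(depth, torch.rand_like(depth))  # supervised by camera-derived depth
```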

    Audio-Visual Learning for Scene Understanding

    Multimodal deep learning aims to combine the complementary information of different modalities. Among all modalities, audio and video are the predominant ones that humans use to explore the world. In this thesis, we therefore focus on audio-visual deep learning, so that our networks mimic how humans perceive the world. Our research involves images, audio signals, and acoustic images. The latter provide spatial audio information and are obtained from a planar array of microphones by combining their raw audio with a beamforming algorithm. Acoustic images mimic the human auditory system better than a single microphone, which on its own cannot provide spatial sound cues. However, since microphone arrays are not widespread, we also study how to handle the missing spatialized audio modality at test time. As a solution, we propose to distill acoustic-image content into audio features during training in order to handle their absence at test time. We do this for supervised audio classification using the generalized distillation framework, which we also extend to self-supervised learning. Next, we devise a method for reconstructing acoustic images given a single microphone and an RGB frame; thus, when only a standard video is available, we can synthesize spatial audio, which is useful for many audio-visual tasks, including sound localization. Lastly, as another example of restoring one modality from the available ones, we inpaint degraded images using audio features, reconstructing the missing region so that it is not only visually plausible but also semantically consistent with the related sound. This also covers cross-modal generation in the limiting case of a completely missing or hidden visual modality: our method handles it naturally, generating images from sound alone. In summary, we show how audio can help visual learning and vice versa, by transferring knowledge between the two modalities at training time in order to distill, reconstruct, or restore the missing modality at test time.
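    To make the acoustic-image notion above concrete, here is a minimal delay-and-sum beamforming sketch: each steering direction of an assumed 4x4 planar microphone array yields one beam-energy value, and the grid of values forms an "acoustic image". The array geometry, sample rate, and grid resolution are illustrative assumptions, not the thesis' actual sensor or pipeline.

```python
# Delay-and-sum beamforming sketch over a hypothetical 4x4 planar array.
import numpy as np

FS = 12000.0          # sample rate in Hz (assumed)
C = 343.0             # speed of sound in m/s
N_MICS = 16

# Assumed 4x4 planar array, 5 cm spacing, lying in the z = 0 plane
xs, ys = np.meshgrid(np.arange(4) * 0.05, np.arange(4) * 0.05)
mic_pos = np.stack([xs.ravel(), ys.ravel(), np.zeros(N_MICS)], axis=1)  # (16, 3)

def acoustic_image(signals, az_grid, el_grid):
    """signals: (N_MICS, T) raw audio; returns beam energy per steering direction."""
    image = np.zeros((len(el_grid), len(az_grid)))
    for i, el in enumerate(el_grid):
        for j, az in enumerate(az_grid):
            # Unit vector toward the steering direction
            d = np.array([np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)])
            delays = mic_pos @ d / C                      # per-microphone time delays (s)
            shifts = np.round(delays * FS).astype(int)    # integer-sample approximation
            summed = np.zeros(signals.shape[1])
            for m in range(N_MICS):
                summed += np.roll(signals[m], -shifts[m]) # align and sum
            image[i, j] = np.mean(summed ** 2)            # beam energy = one "pixel"
    return image

sig = np.random.randn(N_MICS, 2048)                       # dummy raw microphone signals
img = acoustic_image(sig, np.linspace(-0.6, 0.6, 48), np.linspace(-0.4, 0.4, 36))
print(img.shape)                                          # (36, 48) acoustic image
```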

    Mirror Neurons and Empathy - A Biomarker for the Complementary System

    The discovery of the mirror neuron system gave rise to many new paths within neuroscience, psychology, kinematics, and philosophy. This neurological system helped foster a broader understanding of social interaction, as these neurons fire during both action observation and action performance. To this day, the mirror neuron system is understood to be related to action understanding, comprehension of others' intentions, and the ability to recognise another person's mental state. Furthermore, a link between the mirror neuron system, empathy, and prosocial behaviour is well established, while the complementary system enables us to understand and complete other individuals' actions. Hence, the objective of this study was to assess the relationship between the mirror neuron system, trait empathy, and ADM muscle activation in a complementary action setting, which is strongly related to the grasping movement in these contexts. Results revealed a relationship between low empathic traits and muscle activation in non-social conditions, suggesting that individuals scoring low on empathy are less willing to help other people in a complementary action interplay.

    Audio-visual model distillation using acoustic images

    In this paper, we investigate how to learn rich and robust feature representations for audio classification from visual data and acoustic images, a novel audio data modality. Previous models learn audio representations from raw signals or spectral data acquired by a single microphone, with remarkable results in classification and retrieval. However, such representations are not robust to variable environmental sound conditions. We tackle this drawback by exploiting a new multimodal labeled action-recognition dataset acquired by a hybrid audio-visual sensor that provides RGB video, raw audio signals, and spatialized acoustic data, also known as acoustic images, where the visual and acoustic images are aligned in space and synchronized in time. Using this richer information, we train audio deep learning models in a teacher-student fashion. In particular, we distill knowledge into audio networks from both visual and acoustic-image teachers. Our experiments suggest that the learned representations are more powerful and generalize better than features learned from models trained on single-microphone audio data alone.
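    Below is a minimal sketch of the kind of teacher-student distillation described above, assuming a frozen acoustic-image teacher whose features a single-microphone audio student learns to imitate alongside a classification loss. The paper also distills from a visual teacher; the network sizes, feature dimension, and loss weighting here are illustrative assumptions, not the paper's exact setup.

```python
# Feature-level distillation sketch: audio student imitates an acoustic-image teacher.
import torch
import torch.nn as nn
import torch.nn.functional as F

def encoder(in_ch, dim=128):
    # Tiny CNN encoder used for both modalities (illustrative only)
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
    )

teacher = encoder(in_ch=16)        # acoustic-image teacher (spatialized audio, training only)
student = encoder(in_ch=1)         # single-microphone spectrogram student
classifier = nn.Linear(128, 10)    # action-class head on top of the student features
for p in teacher.parameters():
    p.requires_grad_(False)        # teacher assumed pretrained and frozen

opt = torch.optim.Adam(list(student.parameters()) + list(classifier.parameters()), lr=1e-4)

acoustic_img = torch.randn(8, 16, 36, 48)   # dummy acoustic images (training only)
audio_spec = torch.randn(8, 1, 257, 200)    # dummy single-mic spectrograms (also available at test time)
labels = torch.randint(0, 10, (8,))

with torch.no_grad():
    t_feat = teacher(acoustic_img)           # privileged-modality features
s_feat = student(audio_spec)

# Distillation: imitate the teacher's features while also fitting the class labels
loss = F.mse_loss(s_feat, t_feat) + F.cross_entropy(classifier(s_feat), labels)
opt.zero_grad(); loss.backward(); opt.step()
```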