
    Sounding objects

    Taxonomy of philosophical theories of sound: proximal theories; medial theories; distal theories. A distal theory: the Located Event Theory (LET) of sound. Understanding sound and the cognition of sounding objects; the ontology of sound according to the LET; the epistemology of the perception of sound and sounding objects; auditory images according to the LET; conceptual revisions entailed by distal theories and the LET; replies to objections.

    A Virtual Reality Platform for Musical Creation

    Virtual reality aims at interacting with a computer in a form similar to interacting with an object of the real world. This research presents a VR platform allowing the user (1) to interactively create physically-based musical instruments and sounding objects, and (2) to play them in real time through multisensory interaction, by way of haptics, 3D visualisation during playing, and real-time physically-based sound synthesis. In doing so, our system exhibits the two main properties expected of VR systems: the ability to design any type of object and to manipulate it in a multisensory, real-time fashion. In presenting our environment, we discuss the underlying scientific questions: (1) for real-time simulation, how to manage simultaneous audio-haptic-visual cooperation during real-time multisensory simulation, and (2) the computer-aided design functionalities needed for the creation of new physically-based musical instruments and sounding objects.
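
    As an illustration of the kind of physically-based sound synthesis the abstract refers to, the sketch below simulates a plucked chain of masses and springs and reads the displacement of one mass as an audio signal. It is a minimal stand-in, not the platform's actual model; the mass count, stiffness, damping and sample rate are assumed values chosen only so the example runs.

    ```python
    import numpy as np

    def mass_spring_synth(n_masses=8, k=1e6, damping=2.0, fs=44100, dur=1.0):
        """Toy physically-based synthesis: a plucked mass-spring-damper chain.

        All parameter values are illustrative assumptions, not taken from the paper.
        """
        dt = 1.0 / fs
        pos = np.zeros(n_masses)
        vel = np.zeros(n_masses)
        pos[0] = 1e-3                              # initial displacement ("pluck")
        out = np.zeros(int(dur * fs))
        for n in range(out.size):
            # spring force from both neighbours (fixed ends), plus viscous damping
            ext = np.concatenate(([0.0], pos, [0.0]))
            force = k * (ext[:-2] - 2.0 * ext[1:-1] + ext[2:]) - damping * vel
            vel += force * dt                      # semi-implicit (symplectic) Euler step
            pos += vel * dt
            out[n] = pos[-1]                       # "listen" at the last mass
        return out / np.max(np.abs(out))

    signal = mass_spring_synth()                   # one second of audio at 44.1 kHz
    ```

    In a real-time, multisensory setting such as the one described, the same state update would also have to feed the haptic and visual loops at their own rates.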

    Audio-aware Query-enhanced Transformer for Audio-Visual Segmentation

    The goal of the audio-visual segmentation (AVS) task is to segment the sounding objects in video frames using audio cues. However, current fusion-based methods suffer performance limitations due to the small receptive field of convolution and inadequate fusion of audio-visual features. To overcome these issues, we propose a novel Audio-aware query-enhanced TRansformer (AuTR) to tackle the task. Unlike existing methods, our approach introduces a multimodal transformer architecture that enables deep fusion and aggregation of audio-visual features. Furthermore, we devise an audio-aware query-enhanced transformer decoder that explicitly helps the model focus on segmenting the sounding objects pinpointed by the audio signal, while disregarding silent yet salient objects. Experimental results show that our method outperforms previous methods and demonstrates better generalization ability in multi-sound and open-set scenarios.
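
    The sketch below shows, in generic form, what an audio-aware query-enhanced transformer decoder can look like: learnable object queries are offset by an audio embedding before cross-attending to visual features, and masks are read out by a query-pixel dot product. It is not the AuTR implementation; the feature dimensions, query count and 128-d audio input are assumptions.

    ```python
    import torch
    import torch.nn as nn

    class AudioQueryDecoder(nn.Module):
        """Generic sketch (not the paper's code): audio-conditioned object queries
        cross-attend to visual features so the decoder favours sounding regions."""

        def __init__(self, d_model=256, n_queries=20, n_heads=8, n_layers=3):
            super().__init__()
            self.queries = nn.Embedding(n_queries, d_model)
            self.audio_proj = nn.Linear(128, d_model)   # assumes a 128-d audio feature
            layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, n_layers)
            self.mask_embed = nn.Linear(d_model, d_model)

        def forward(self, visual_feats, audio_feat):
            # visual_feats: (B, H*W, d_model) flattened pixel features
            # audio_feat:   (B, 128) clip-level audio embedding
            B = visual_feats.size(0)
            q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
            q = q + self.audio_proj(audio_feat).unsqueeze(1)   # audio-aware queries
            q = self.decoder(q, visual_feats)                  # deep audio-visual fusion
            # per-query mask logits over pixels
            return torch.einsum("bqc,bpc->bqp", self.mask_embed(q), visual_feats)
    ```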

    3D audio as an information-environment: manipulating perceptual significance for differentiation and pre-selection

    Contemporary use of sound as an artificial information display is rudimentary, with little 'depth of significance' to facilitate users' selective attention. We believe that this is due to conceptual neglect of 'context', or perceptual background information. This paper describes a systematic approach to developing 3D audio information environments that utilise known cognitive characteristics in order to promote rapidity and ease of use. The key concepts are perceptual space, perceptual significance, ambience labelling information, and cartoonification.
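
    To make the idea of 'depth of significance' concrete, the toy function below maps a significance value onto rendering parameters: more significant cues come out louder and closer, less significant ones recede toward the ambient background. The panning law, distance model and names are assumptions for illustration only and are not taken from the paper.

    ```python
    import math

    def render_cue(significance, azimuth_deg, base_gain=0.5):
        """Toy mapping of perceptual significance to rendering parameters.

        significance in [0, 1]; azimuth_deg in [-90, 90]. Purely illustrative."""
        gain = base_gain * significance               # foreground cues are louder
        distance = 1.0 + 4.0 * (1.0 - significance)   # background cues placed farther away
        theta = math.radians(azimuth_deg) / 2.0 + math.pi / 4.0
        # constant-power stereo panning with 1/distance attenuation,
        # standing in for a full 3D (e.g. HRTF-based) renderer
        left = gain * math.cos(theta) / distance
        right = gain * math.sin(theta) / distance
        return left, right
    ```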

    BAVS: Bootstrapping Audio-Visual Segmentation by Integrating Foundation Knowledge

    Given an audio-visual pair, audio-visual segmentation (AVS) aims to locate sounding sources by predicting pixel-wise maps. Previous methods assume that each sound component in an audio signal always has a visual counterpart in the image. However, this assumption overlooks that off-screen sounds and background noise often contaminate audio recordings in real-world scenarios. Such contamination poses significant challenges for building a consistent semantic mapping between audio and visual signals in AVS models and thus impedes precise sound localization. In this work, we propose a two-stage bootstrapping audio-visual segmentation framework that incorporates multi-modal foundation knowledge. In a nutshell, our BAVS is designed to eliminate the interference of background noise or off-screen sounds in segmentation by establishing audio-visual correspondences in an explicit manner. In the first stage, we employ a segmentation model to localize potential sounding objects from visual data without being affected by contaminated audio signals. Meanwhile, we also utilize a foundation audio classification model to discern audio semantics. Since the audio tags provided by the audio foundation model are noisy, associating object masks with audio tags is not trivial. Thus, in the second stage, we develop an audio-visual semantic integration strategy (AVIS) to localize the authentic sounding objects. Here, we construct an audio-visual tree based on the hierarchical correspondence between sounds and object categories. We then examine the label concurrency between the localized objects and the classified audio tags by tracing the audio-visual tree. With AVIS, we can effectively segment the real sounding objects. Extensive experiments demonstrate the superiority of our method on AVS datasets, particularly in scenarios involving background noise. Our project website is https://yenanliu.github.io/AVSS.github.io/
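
    The second-stage integration step can be pictured with a small sketch: a localized object is kept only if its category, or its parent category in an audio-visual hierarchy, also appears among the confidently classified audio tags, which filters out off-screen sounds and background noise. The hierarchy, threshold and function names below are made up for illustration and do not reproduce the authors' AVIS code.

    ```python
    # Toy audio-visual tree: child category -> parent. Labels are illustrative.
    HIERARCHY = {
        "guitar": "music", "piano": "music",
        "dog": "animal", "cat": "animal",
    }

    def integrate(object_masks, audio_tags, min_score=0.5):
        """object_masks: list of (category, mask); audio_tags: dict tag -> confidence."""
        confident = {t for t, s in audio_tags.items() if s >= min_score}
        parents = {HIERARCHY[t] for t in confident if t in HIERARCHY}
        kept = []
        for category, mask in object_masks:
            # concurrency check: exact tag match, or a shared parent in the tree
            if category in confident or (category in HIERARCHY and HIERARCHY[category] in parents):
                kept.append((category, mask))      # treat as an authentic sounding object
        return kept
    ```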

    Annotation-free Audio-Visual Segmentation

    The objective of Audio-Visual Segmentation (AVS) is to locate sounding objects within visual scenes by accurately predicting pixel-wise segmentation masks. In this paper, we present the following contributions: (i) we propose a scalable and annotation-free pipeline for generating artificial data for the AVS task. We leverage existing image segmentation and audio datasets to draw links between category labels, image-mask pairs, and audio samples, which allows us to easily compose (image, audio, mask) triplets for training AVS models; (ii) we introduce a novel Audio-Aware Transformer (AuTR) architecture that features an audio-aware query-based transformer decoder. This architecture enables the model to search for sounding objects under the guidance of audio signals, resulting in more accurate segmentation; (iii) we present extensive experiments conducted on both synthetic and real datasets, which demonstrate the effectiveness of training AVS models with synthetic data generated by our proposed pipeline. Additionally, our proposed AuTR architecture exhibits superior performance and strong generalization ability on public benchmarks. The project page is https://jinxiang-liu.github.io/anno-free-AVS/
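
    The data-generation idea, pairing an image-mask example with an audio clip that shares its category label, can be sketched in a few lines. The sketch below is an assumed, simplified version of such a pipeline; the data-structure layout and function name are illustrative, not the authors' code.

    ```python
    import random

    def compose_triplets(image_masks, audio_bank, n=1000, seed=0):
        """Compose up to n (image, audio, mask) triplets by matching category labels.

        image_masks: list of (category, image, mask) from a segmentation dataset.
        audio_bank:  dict mapping category -> list of audio clips.
        Illustrative only; a real pipeline would also balance categories, etc."""
        rng = random.Random(seed)
        triplets = []
        for _ in range(n):
            category, image, mask = rng.choice(image_masks)
            clips = audio_bank.get(category)
            if not clips:
                continue                       # no audio of this category available
            triplets.append((image, rng.choice(clips), mask))
        return triplets
    ```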

    Objects that Sound

    In this paper our objectives are, first, networks that can embed audio and visual inputs into a common space suitable for cross-modal retrieval; and second, a network that can localize the object that sounds in an image, given the audio signal. We achieve both objectives by training from unlabelled video, using only audio-visual correspondence (AVC) as the objective function. This is a form of cross-modal self-supervision from video. To this end, we design new network architectures that can be trained for cross-modal retrieval and for localizing the sound source in an image by using the AVC task. We make the following contributions: (i) we show that audio and visual embeddings can be learnt that enable both within-mode (e.g. audio-to-audio) and between-mode retrieval; (ii) we explore various architectures for the AVC task, including visual-stream architectures that ingest a single image, multiple images, or a single image plus multi-frame optical flow; (iii) we show that the semantic object that sounds within an image can be localized (using only the sound, with no motion or flow information); and (iv) we give a cautionary tale on how to avoid undesirable shortcuts in data preparation.
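
    A sketch of the AVC training signal is given below: one sub-network embeds a video frame, another embeds an audio clip (e.g. a spectrogram), and a small head predicts from the distance between the two embeddings whether the frame and the audio come from the same video. The layer sizes and architecture are placeholders, not the networks published in the paper.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AVCNet(nn.Module):
        """Sketch of audio-visual correspondence (AVC) learning; dimensions are assumptions."""

        def __init__(self, dim=128):
            super().__init__()
            self.vision = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
                                        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(32, dim))
            self.audio = nn.Sequential(nn.Conv2d(1, 32, 3, stride=2), nn.ReLU(),
                                       nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                       nn.Linear(32, dim))
            self.head = nn.Linear(1, 2)        # "correspond" vs "mismatch" from the distance

        def forward(self, frame, spectrogram):
            v = F.normalize(self.vision(frame), dim=-1)        # shared embedding space
            a = F.normalize(self.audio(spectrogram), dim=-1)
            dist = (v - a).norm(dim=-1, keepdim=True)          # small if frame and audio match
            return self.head(dist)                             # logits for the AVC decision
    ```

    Trained with matching and mismatched frame-audio pairs, the visual embedding can then be scored against the audio embedding at each spatial position to localize the object that sounds.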