
    An audio-visual system for object-based audio : from recording to listening

    Object-based audio is an emerging representation for audio content, where content is represented in a reproduction-format-agnostic way and thus produced once for consumption on many different kinds of devices. This affords new opportunities for immersive, personalized, and interactive listening experiences. This paper introduces an end-to-end object-based spatial audio pipeline, from sound recording to listening. A high-level system architecture is proposed, which includes novel audio-visual interfaces to support object-based capture and listener-tracked rendering, and incorporates a proposed component for objectification, that is, recording content directly into an object-based form. Text-based and extensible metadata enable communication between the system components. An open architecture for object rendering is also proposed. The system's capabilities are evaluated in two parts. First, listener-tracked reproduction of metadata automatically estimated from two moving talkers is evaluated using an objective binaural localization model. Second, object-based scene capture, with audio extracted using blind source separation (to remix between two talkers) and beamforming (to remix a recording of a jazz group), is evaluated.
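
    The "text-based and extensible metadata" that ties the components together can be pictured as simple serialisable per-object records. The sketch below is purely illustrative and is not the paper's actual schema: the field names (obj_id, azimuth_deg, gain_db, extra) and the JSON serialisation are assumptions chosen only to show the general idea.

    # Minimal sketch: a hypothetical, extensible text-based metadata record for one
    # audio object. Field names are assumptions, not the schema used by the system above.
    import json
    from dataclasses import dataclass, asdict, field

    @dataclass
    class AudioObjectMetadata:
        obj_id: str                  # identifier of the captured/objectified source
        start_time_s: float          # when the object becomes active in the scene
        azimuth_deg: float           # horizontal direction relative to the listener
        elevation_deg: float = 0.0
        gain_db: float = 0.0
        extra: dict = field(default_factory=dict)   # room for renderer-specific extensions

        def to_json(self) -> str:
            """Serialise to a text form a renderer could consume."""
            return json.dumps(asdict(self), indent=2)

    if __name__ == "__main__":
        talker = AudioObjectMetadata("talker_1", start_time_s=0.0, azimuth_deg=-30.0,
                                     extra={"source_width_deg": 10})
        print(talker.to_json())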

    Sound Source Separation

    This is the author's accepted pre-print of the article, first published as: G. Evangelista, S. Marchand, M. D. Plumbley and E. Vincent, "Sound source separation," in U. Zölzer (ed.), DAFX: Digital Audio Effects, 2nd edition, Chapter 14, pp. 551-588. John Wiley & Sons, March 2011. ISBN 9781119991298. DOI: 10.1002/9781119991298.ch14

    A Parametric Sound Object Model for Sound Texture Synthesis

    This thesis deals with the analysis and synthesis of sound textures based on parametric sound objects. An overview is provided of the acoustic and perceptual principles of textural acoustic scenes, and technical challenges for analysis and synthesis are considered. Four essential processing steps for sound texture analysis are identified, and existing sound texture systems are reviewed, using the four-step model as a guideline. A theoretical framework for analysis and synthesis is proposed. A parametric sound object synthesis (PSOS) model is introduced, which is able to describe individual recorded sounds through a fixed set of parameters. The model, which applies to harmonic and noisy sounds, is an extension of spectral modeling and uses spline curves to approximate spectral envelopes, as well as the evolution of parameters over time. In contrast to standard spectral modeling techniques, this representation uses the concept of objects instead of concatenated frames, and it provides a direct mapping between sounds of different length. Methods for automatic and manual conversion are shown. An evaluation is presented in which the ability of the model to encode a wide range of different sounds has been examined. Although there are aspects of sounds that the model cannot accurately capture, such as polyphony and certain types of fast modulation, the results indicate that high-quality synthesis can be achieved for many different acoustic phenomena, including instruments and animal vocalizations. In contrast to many other forms of sound encoding, the parametric model facilitates various techniques of machine learning and intelligent processing, including sound clustering and principal component analysis. Strengths and weaknesses of the proposed method are reviewed, and possibilities for future development are discussed.
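
    The spline-based spectral envelopes at the core of the PSOS model can be illustrated with a generic smoothing-spline fit to a magnitude spectrum. The following sketch only conveys the general idea; the function name, control-point count and smoothing factor are assumptions and do not reproduce the thesis's actual parameterization.

    # Hedged sketch: approximate a log-magnitude spectral envelope with a smoothing spline
    # and keep a small set of control points as the compact parameter set.
    import numpy as np
    from scipy.interpolate import UnivariateSpline

    def spline_envelope(magnitude, n_points=20, smooth=1.0):
        freqs = np.arange(len(magnitude))
        log_mag = 20 * np.log10(magnitude + 1e-12)
        spline = UnivariateSpline(freqs, log_mag, s=smooth * len(magnitude))
        control_freqs = np.linspace(0, len(magnitude) - 1, n_points)
        return control_freqs, spline(control_freqs)   # (frequency, level) control points

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        spectrum = np.abs(np.fft.rfft(rng.standard_normal(2048)))
        freqs, envelope = spline_envelope(spectrum)
        print(envelope.shape)   # 20 control points summarising the envelope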

    Subjective evaluation of spatial distortions induced by a sound source separation process

    The fields of video games, simulations and virtual reality are now tending to develop increasingly high-performance, realistic and immersive technologies. Efforts are made in terms of sound devices and sound processing to synthesize realistic sound scenes in a 3-D environment. One of the greatest challenges is the ability to analyze a 3-D audio stream corresponding to a complex sound scene into its basic components (i.e. individual sound sources), to modify the scene (e.g. to change the locations of sound sources) and to resynthesize a modified 3-D audio stream. This situation is referred to as "spatial remix".

    The spatial remix problem is still an open field. Work in progress relies on sound separation algorithms to analyze a sound scene, but these techniques are not perfect and can damage the reconstructed source signals. The resulting degradations are referred to as "separation artefacts", including transient alteration of the target source and rejections of other sources into the target source. Objective and subjective evaluations of separation artefacts have been conducted [1], but these studies usually consider the separated source signals alone, i.e. when each source is listened to separately. This differs from the spatial remix problem, where all sources are listened to simultaneously. In that case, one may wonder whether the separation artefacts can affect the spatial image of the synthesized 3-D sound scene. According to the perceptual mechanisms involved in spatial hearing, hypotheses can be made about the kinds of spatial distortions that could occur in this context. Indeed, as transients are important cues for precisely localizing sound sources, their alteration may result in localization blur or source widening. On the other hand, when separated sources are spatialized and played simultaneously, rejections of one source into another may also produce unwanted effects such as a feeling of moving sources and the emergence of "phantom" sources.

    This paper presents a new methodology to perceptually evaluate the spatial distortions that can occur in a spatial remix context. It consists in carrying out a localization test on complex scenes composed of three synthetic musical instruments played on a set of loudspeakers. In order to eliminate possible issues related to the spatial audio rendering device, we consider a simple case: only three spatial positions, each corresponding to a single loudspeaker. The spatial remix is then restricted to a simple permutation of the source locations. The test is run through a virtual interface, using a head-mounted display. The subject is placed in a simple visual virtual environment and is asked to surround with a remote the areas where each instrument is perceived. This experimental device allows the subject to report precisely both instrument position and size. A single instrument can also be spotted at multiple locations. Perceived source positions are approximated as ellipses, from which centre position and dimensions can easily be deduced. In order to quantify spatial distortions, the localization task is performed on both clean and degraded versions of the same musical extract. Localization performances in both cases are then compared, taking the clean sources as a reference.

    In this paper, the methodology is applied to assess the quality of the non-negative matrix factorization source separation algorithm developed by Leglaive et al. [2], which performs separation on convolutive mixtures. Our study reveals that the source separation process leads to perceptible degradations of the spatial image. Three main kinds of spatial distortions have been characterized. First, in the majority of degraded cases, "phantom" sources have been observed; this artifact mainly concerns percussive sources. The results also show a significant increase in the perceived width of the degraded sources. Finally, azimuth and elevation localization error is significantly higher in the case of scenes composed of separated sources.

    [1] V. Emiya, E. Vincent, N. Harlander and V. Hohmann, "Subjective and Objective Quality Assessment of Audio Source Separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2046-2057, Sept. 2011.
    [2] S. Leglaive, R. Badeau and G. Richard, "Separating time-frequency sources from time-domain convolutive mixtures using non-negative matrix factorization," 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, 2017, pp. 264-268.
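
    For readers unfamiliar with the decompose-then-mask idea behind NMF separation, the sketch below shows a generic single-channel magnitude-spectrogram version. It is deliberately not the convolutive, multichannel algorithm of Leglaive et al. [2] assessed in the paper; the component count and the naive grouping of components into sources are illustrative assumptions.

    # Generic NMF separation sketch (not the method evaluated above).
    import numpy as np
    import librosa
    from sklearn.decomposition import NMF

    def nmf_separate(y, n_components=8, n_sources=2):
        stft = librosa.stft(y)
        mag, phase = np.abs(stft), np.angle(stft)
        model = NMF(n_components=n_components, init="nndsvda", max_iter=400)
        W = model.fit_transform(mag)             # spectral templates (freq x components)
        H = model.components_                    # activations (components x time)
        groups = np.array_split(np.arange(n_components), n_sources)   # naive grouping
        sources = []
        for idx in groups:
            approx = W[:, idx] @ H[idx, :]
            mask = approx / (W @ H + 1e-12)      # soft, Wiener-like mask
            sources.append(librosa.istft(mask * mag * np.exp(1j * phase)))
        return sources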

    Virtual acoustics displays

    The real-time acoustic display capabilities developed for the Virtual Environment Workstation (VIEW) Project at NASA-Ames are described. The acoustic display is capable of generating localized acoustic cues in real time over headphones. An auditory symbology, a related collection of representational auditory 'objects' or 'icons', can be designed using ACE (Auditory Cue Editor), which links both discrete and continuously varying acoustic parameters with information or events in the display. During a given display scenario, the symbology can be dynamically coordinated in real time with 3-D visual objects, speech, and gestural displays. The types of displays feasible with the system range from simple warnings and alarms to the acoustic representation of multidimensional data or events.

    Musical Source Separation: An Introduction


    An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

    Speech enhancement and speech separation are two related tasks, whose purpose is to extract one or more target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. Since the visual aspect of speech is essentially unaffected by the acoustic environment, visual information from the target speakers, such as lip movements and facial expressions, has also been used for speech enhancement and speech separation systems. In order to efficiently fuse acoustic and visual information, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving strong performance. The ceaseless proposal of a large number of techniques to extract features and fuse multimodal information has highlighted the need for an overview that comprehensively describes and discusses audio-visual speech enhancement and separation based on deep learning. In this paper, we provide a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: acoustic features; visual features; deep learning methods; fusion techniques; training targets and objective functions. In addition, we review deep-learning-based methods for speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, since these methods can be more or less directly applied to audio-visual speech enhancement and separation. Finally, we survey commonly employed audio-visual speech datasets, given their central role in the development of data-driven approaches, and evaluation methods, because they are generally used to compare different systems and determine their performance.
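
    As a concrete illustration of the "fusion techniques" axis discussed above, the sketch below concatenates per-frame audio and visual embeddings and predicts a time-frequency mask. It is a minimal, hypothetical architecture: the layer sizes, the 512-dimensional visual embedding, the GRU fusion and the sigmoid mask head are all assumptions, not a model from the surveyed literature.

    # Minimal audio-visual fusion sketch for mask-based speech enhancement (illustrative only).
    import torch
    import torch.nn as nn

    class AVFusionMaskNet(nn.Module):
        def __init__(self, n_freq_bins=257, audio_dim=256, video_dim=128, hidden=256):
            super().__init__()
            self.audio_enc = nn.Sequential(nn.Linear(n_freq_bins, audio_dim), nn.ReLU())
            self.video_enc = nn.Sequential(nn.Linear(512, video_dim), nn.ReLU())
            self.fusion = nn.GRU(audio_dim + video_dim, hidden, batch_first=True)
            self.mask_head = nn.Sequential(nn.Linear(hidden, n_freq_bins), nn.Sigmoid())

        def forward(self, audio_feats, video_feats):
            # audio_feats: (batch, time, n_freq_bins) noisy magnitude frames
            # video_feats: (batch, time, 512) visual embeddings, assumed resampled to the audio frame rate
            fused = torch.cat([self.audio_enc(audio_feats), self.video_enc(video_feats)], dim=-1)
            out, _ = self.fusion(fused)
            return self.mask_head(out)           # mask to apply to the noisy spectrogram

    if __name__ == "__main__":
        net = AVFusionMaskNet()
        mask = net(torch.rand(2, 100, 257), torch.rand(2, 100, 512))
        print(mask.shape)                        # torch.Size([2, 100, 257])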

    Qualitative evaluation of media device orchestration for immersive spatial audio reproduction

    The challenge of installing and setting up dedicated spatial audio systems can make it difficult to deliver immersive listening experiences to the general public. However, the proliferation of smart mobile devices and the rise of the Internet of Things mean that there are increasing numbers of connected devices capable of producing audio in the home. "Media device orchestration" (MDO) is the concept of utilizing an ad hoc set of devices to deliver or augment a media experience. In this paper, the concept is evaluated by implementing MDO for augmented spatial audio reproduction using object-based audio with semantic metadata. A thematic analysis of positive and negative listener comments about the system revealed three main categories of response: perceptual, technical, and content-dependent aspects. MDO performed particularly well in terms of immersion/envelopment, but the quality of listening experience was partly dependent on loudspeaker quality and listener position. Suggestions for further development based on these categories are given.

    Perceptually motivated blind source separation of convolutive audio mixtures


    Application of sound source separation methods to advanced spatial audio systems

    This thesis is related to the field of Sound Source Separation (SSS). It addresses the development and evaluation of these techniques for their application in the resynthesis of high-realism sound scenes by means of Wave Field Synthesis (WFS). Because the vast majority of audio recordings are preserved in two-channel stereo format, special up-converters are required to use advanced spatial audio reproduction formats such as WFS. This is because WFS needs the original source signals to be available in order to accurately synthesize the acoustic field inside an extended listening area; thus, an object-based mix is required. Source separation problems in digital signal processing are those in which several signals have been mixed together and the objective is to find out what the original signals were. Therefore, SSS algorithms can be applied to existing two-channel mixtures to extract the different objects that compose the stereo scene. Unfortunately, most stereo mixtures are underdetermined, i.e., there are more sound sources than audio channels. This condition makes the SSS problem especially difficult, and stronger assumptions have to be made, often related to the sparsity of the sources under some signal transformation.

    This thesis is focused on the application of SSS techniques to the spatial sound reproduction field. As a result, its contributions can be categorized within these two areas. First, two underdetermined SSS methods are proposed to deal efficiently with the separation of stereo sound mixtures. These techniques are based on a multi-level thresholding segmentation approach, which enables fast, unsupervised separation of sound sources in the time-frequency domain. Although both techniques rely on the same clustering type, the features considered by each of them are related to different localization cues that enable separation of either instantaneous or real mixtures. Additionally, two post-processing techniques aimed at improving the isolation of the separated sources are proposed. The performance achieved by several SSS methods in the resynthesis of WFS sound scenes is afterwards evaluated by means of listening tests, paying special attention to the change observed in the perceived spatial attributes. Although the estimated sources are distorted versions of the original ones, the masking effects involved in their spatial remixing make artifacts less perceptible, which improves the overall assessed quality. Finally, some novel developments related to the application of time-frequency processing to source localization and enhanced sound reproduction are presented.

    Cobos Serrano, M. (2009). Application of sound source separation methods to advanced spatial audio systems [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8969
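
    The localization-cue clustering described in this thesis can be pictured with a much simpler stand-in: compute a panning cue per time-frequency bin of the stereo mixture and partition the bins with fixed thresholds. The sketch below is only that stand-in; the uniform thresholds replace the thesis's multi-level thresholding segmentation, and the function name and source count are assumptions.

    # Hedged sketch: panning-cue time-frequency masking for a stereo mixture (illustrative only).
    import numpy as np
    import librosa

    def pan_based_separation(left, right, n_sources=3):
        L, R = librosa.stft(left), librosa.stft(right)
        # Panning cue per bin: 0 = energy fully in the left channel, 1 = fully in the right.
        pan = np.abs(R) / (np.abs(L) + np.abs(R) + 1e-12)
        edges = np.linspace(0.0, 1.0, n_sources + 1)
        edges[-1] += 1e-9                        # make the last interval include pan == 1
        sources = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (pan >= lo) & (pan < hi)      # binary mask selecting bins in this pan range
            sources.append(librosa.istft(0.5 * (L + R) * mask))
        return sources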