
    Deep-sound field analysis for upscaling ambisonic signals

    Higher Order Ambisonics (HOA) is a popular technique for high-quality spatial audio reproduction. Several time- and frequency-domain methods that exploit sparsity have been proposed in the literature; these use an overcomplete spherical harmonics dictionary to compute the direction of arrival (DOA) of the source. Spherical harmonic decomposition has also been used to render spatial sound. However, at lower ambisonic orders the desired sound field can only be reproduced over a small reproduction area. This technique is also limited by low spatial resolution, which can be improved by increasing the number of loudspeakers during spatial sound reproduction. Simply adding loudspeakers is not a good choice, however, since it entails solving an underdetermined system of equations. A joint method that upscales the Ambisonics order while simultaneously increasing the number of loudspeakers is a feasible solution to this problem. Deep Neural Networks have hitherto not been investigated in detail in the context of upscaling Ambisonics. In this work, a novel Sequential Multi-Stage DNN (SMS-DNN) is developed for upscaling Ambisonic signals. The SMS-DNN consists of sequentially stacked DNNs, where each stacked DNN upscales the order of the signal by one. This structure is motivated by the fact that the spherical harmonic components of the encoded signal are independent of each other. Additionally, for a particular source direction (θ, φ), increasing the spherical harmonic order only appends higher-order spherical harmonic coefficients to the encoding of the previous order, while the lower-order coefficients remain unchanged. Hence the individual DNNs in the SMS-DNN can be trained independently for any upscaling order. Monophonic sound is acquired using a B-format (first-order) ambisonic microphone, and these signals are upscaled to order-N HOA-encoded plane-wave sounds using the SMS-DNN. The SMS-DNN allows a very large number of layers to be trained, since training proceeds in blocks consisting of a fixed number of layers and each stage can be trained independently. The vanishing-gradient problem that affects DNNs with many layers is also effectively handled by the proposed SMS-DNN due to its sequential nature.
This method does not require prior estimation of the source locations and works in multiple-source scenarios. Experiments on Ambisonics upscaling are conducted to evaluate the performance of the proposed method. The SMS-DNN architecture used in the experiments consists of N−1 fully connected feedforward neural networks, each trained separately, where N is the Ambisonics order up to which upscaling is to be performed. An input training dataset, in which each example is a combination of five randomly located sound sources, is developed for training the SMS-DNN; the output training dataset consists of higher-order encodings of the same sound mixtures at the same locations as the input data. Reconstructed sound field analysis and subjective and objective evaluations are conducted on the upscaled Ambisonic sound scenes. Mean squared error analysis of the upscaled higher-order reproduced fields indicates an error of up to −10 dB. As the upscaling order increases, the error-free reproduction area (sweet spot) is observed to grow. Average error distribution plots also indicate the significance of the proposed method. MUSHRA and MOS tests (subjective evaluation) and PEAQ tests (objective evaluation) illustrate the perceptual quality of the reproduced sounds compared to benchmark HOA reproduction.
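The per-stage channel bookkeeping described above can be illustrated with a minimal numpy sketch. This is a hypothetical reconstruction, not the authors' implementation: the random linear map in `UpscaleStage` stands in for a trained network, and only the structural idea follows the abstract, namely that each stage predicts only the new higher-order coefficients while passing the lower orders through unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

def n_coeffs(order):
    # number of spherical-harmonic channels at a given ambisonic order
    return (order + 1) ** 2

class UpscaleStage:
    """One stage of a sequential multi-stage upscaler: predicts only the
    new order-(n+1) coefficients; lower-order channels pass through.
    The random weight matrix is a stand-in for a trained DNN."""
    def __init__(self, order):
        n_in = n_coeffs(order)
        n_new = n_coeffs(order + 1) - n_in      # 2*(order+1)+1 new channels
        self.W = 0.1 * rng.standard_normal((n_new, n_in))

    def __call__(self, x):
        new = np.tanh(self.W @ x)               # predicted higher-order part
        return np.concatenate([x, new])         # lower orders untouched

def upscale(x, from_order, to_order):
    # chain one stage per order increment (the "sequential multi-stage" idea)
    for order in range(from_order, to_order):
        x = UpscaleStage(order)(x)
    return x

first_order = rng.standard_normal(n_coeffs(1))  # 4 B-format channels
hoa = upscale(first_order, 1, 4)                # upscale to order 4
print(hoa.shape)                                # (25,)
```

Because each stage only appends channels, the original first-order signal survives unchanged at the front of the upscaled vector, which is what allows the stages to be trained independently.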

    Parametric first-order ambisonic decoding for headphones utilising the cross-pattern coherence algorithm

    For the reproduction of recorded or synthesised spatial sound scenes, perhaps the most convenient and flexible approach is the Ambisonics framework, which allows for linear and non-parametric storage, manipulation and reproduction of sound-fields described using spherical harmonics up to a given order of expansion. Binaural Ambisonic reproduction can be realised by matching the spherical harmonic patterns to a set of binaural filters, in a manner that is frequency-dependent, linear and time-invariant. However, the perceptual performance of this approach is largely dependent on the spatial resolution of the input format. When employing lower-order material as input, perceptual deficiencies such as poor localisation accuracy and colouration may easily occur. This is especially problematic, as the vast majority of existing Ambisonic recordings are made available as first-order only. The detrimental effects associated with lower-order Ambisonic reproduction have been well studied and documented. To improve the perceived spatial accuracy of the method, the simplest solution is to increase the spherical harmonic order at the recording stage. However, microphone arrays capable of capturing higher-order components are generally much more expensive than first-order arrays, while more affordable options tend to offer higher-order components only over limited frequency ranges. Additionally, an increase in spherical harmonic order requires more channels and storage and, in the case of transmission, more bandwidth. Furthermore, this solution does not aid the reproduction of existing lower-order recordings. For these reasons, this work focuses on alternative methods that improve the reproduction of first-order material for headphone playback.
For the task of binaural sound-field reproduction, an alternative is to employ a parametric approach, which divides the sound-field decoding into analysis and synthesis stages. Unlike Ambisonic reproduction, which operates via a linear combination of the input signals, parametric approaches operate in the time-frequency domain and rely on the extraction of spatial parameters during their analysis stage. These spatial parameters are then utilised to conduct a more informed reproduction in the synthesis stage. Parametric methods can reproduce sounds at a spatial resolution that far exceeds their linear, time-invariant counterparts, as they are not bounded by the resolution of the input format. For example, they can directly convolve the analysed source signals with the Head-Related Transfer Functions (HRTFs) corresponding to their analysed directions; an infinite order of spherical harmonic components would be required to attain the same resolution with a binaural Ambisonic decoder. The most well-known and established parametric reproduction method is Directional Audio Coding (DirAC), which employs a sound-field model consisting of one plane-wave and one diffuseness estimate per time-frequency tile. In the case of first-order input, these parameters are derived from the active-intensity vector. More recent formulations allow for multiple plane-wave and diffuseness estimates via spatially-localised active-intensity vectors, using higher-order input. Another parametric method is High Angular Resolution plane-wave Expansion (HARPEX), which extracts two plane-waves per frequency and is first-order only. The Sparse-Recovery method extracts a number of plane-waves up to half the number of input channels, for input of arbitrary order.
The COding and Multi-Parameterisation of Ambisonic Sound Scenes (COMPASS) method also extracts source components up to half the number of input channels, but employs an additional residual stream that encapsulates the remaining diffuse and ambient components in the scene. In this paper, a new binaural parametric decoder for first-order input is proposed. The method employs a sound-field model of one plane-wave and one diffuseness estimate per frequency, much like the DirAC model. However, the source component directions are identified via a plane-wave decomposition using a dense scanning grid and peak-finding, which is shown to be more robust than the active-intensity vector for multiple narrow-band sources. The source and ambient components per time-frequency tile are then segregated, and their relative energetic contributions established, using the Cross-Pattern Coherence (CroPaC) spatial filter. This approach is shown to be more robust than deriving the energy information from active-intensity-based diffuseness estimates. A real-time audio plug-in implementation of the proposed approach is also described. A multiple-stimulus listening test was conducted to evaluate the perceived spatial accuracy and fidelity of the proposed method, alongside both first-order and third-order Ambisonics reproduction. The results indicate that the proposed parametric decoder, using only first-order signals, is capable of delivering perceptual accuracy that matches or surpasses that of third-order Ambisonics decoding.
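As a concrete illustration of the analysis stage, the active-intensity parameters that first-order parametric decoders such as DirAC rely on can be sketched as follows. This is a simplified sketch under stated assumptions: the function name is hypothetical, the sign convention for the DOA depends on the velocity convention of the input format, and real implementations operate on STFT tiles with temporal averaging rather than single bins.

```python
import numpy as np

def intensity_parameters(W, X, Y, Z):
    """Per-bin DOA and diffuseness estimates from first-order (B-format)
    STFT coefficients, following the standard active-intensity
    formulation. Inputs are complex arrays of shape (n_bins,)."""
    # active intensity vector per bin (up to a constant physical factor)
    I = np.stack([np.real(np.conj(W) * X),
                  np.real(np.conj(W) * Y),
                  np.real(np.conj(W) * Z)], axis=-1)
    # energy density per bin (same constant factor)
    E = 0.5 * (np.abs(W) ** 2 + np.abs(X) ** 2
               + np.abs(Y) ** 2 + np.abs(Z) ** 2)
    # with the B-format dipole convention assumed here, I points toward
    # the source; with true particle velocity the sign flips
    doa = I / (np.linalg.norm(I, axis=-1, keepdims=True) + 1e-12)
    # diffuseness in [0, 1]: 0 for a single plane wave, near 1 for diffuse
    psi = 1.0 - np.linalg.norm(I, axis=-1) / (E + 1e-12)
    return doa, psi

# a single plane wave from the +x direction yields psi ~ 0
W = np.array([1.0 + 0.0j]); X = np.array([1.0 + 0.0j])
Y = np.array([0.0j]); Z = np.array([0.0j])
doa, psi = intensity_parameters(W, X, Y, Z)
```

The plane-wave decomposition with a dense scanning grid proposed in the paper replaces exactly this intensity-based DOA step, which is where its robustness advantage for multiple narrow-band sources arises.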

    Proceedings of the EAA Spatial Audio Signal Processing symposium: SASP 2019


    High Frequency Reproduction in Binaural Ambisonic Rendering

    Humans can localise sounds in all directions using three main auditory cues: the differences in time and level between signals arriving at the left and right eardrums (interaural time difference and interaural level difference, respectively), and the spectral characteristics of the signals due to reflections and diffractions off the body and ears. These auditory cues can be recorded for a position in space using the head-related transfer function (HRTF), and binaural synthesis at this position can then be achieved through convolution of a sound signal with the measured HRTF. However, reproducing soundfields with multiple sources, or at multiple locations, requires a highly dense set of HRTFs. Ambisonics is a spatial audio technology that decomposes a soundfield into a weighted set of directional functions, which can be utilised binaurally in order to spatialise audio at any direction using far fewer HRTFs. A limitation of low-order Ambisonic rendering is poor high frequency reproduction, which reduces the accuracy of the resulting binaural synthesis. This thesis presents novel HRTF pre-processing techniques, such that when using the augmented HRTFs in the binaural Ambisonic rendering stage, the high frequency reproduction is a closer approximation of direct HRTF rendering. These techniques include Ambisonic Diffuse-Field Equalisation, to improve spectral reproduction over all directions; Ambisonic Directional Bias Equalisation, to further improve spectral reproduction toward a specific direction; and Ambisonic Interaural Level Difference Optimisation, to improve lateralisation and interaural level difference reproduction. Evaluation of the presented techniques compares binaural Ambisonic rendering to direct HRTF rendering numerically, using perceptually motivated spectral difference calculations, auditory cue estimations and localisation prediction models, and perceptually, using listening tests assessing similarity and plausibility. 
The results show that the individual pre-processing techniques produce modest improvements to the high frequency reproduction of binaural Ambisonic rendering, and that combining multiple pre-processing techniques can produce cumulative, statistically significant improvements.
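The first of the pre-processing ideas above, diffuse-field equalisation, can be illustrated with a minimal sketch. This is an assumption-laden simplification: the function name and uniform direction weighting are illustrative, and a real implementation would weight directions by a spherical quadrature rule and smooth the inverse filter perceptually.

```python
import numpy as np

def diffuse_field_eq(hrtf_bank, weights=None):
    """Diffuse-field equalisation sketch: estimate the direction-averaged
    (RMS) magnitude response of an HRTF set and flatten it with a
    regularised inverse filter.
    hrtf_bank: (n_directions, n_freq) complex transfer functions."""
    n_dirs = hrtf_bank.shape[0]
    if weights is None:
        weights = np.full(n_dirs, 1.0 / n_dirs)   # uniform direction weights
    # diffuse-field response: power average over all measured directions
    df = np.sqrt(np.sum(weights[:, None] * np.abs(hrtf_bank) ** 2, axis=0))
    inv = 1.0 / np.maximum(df, 1e-6)              # regularised inversion
    return hrtf_bank * inv[None, :], inv

# after equalisation the diffuse-field response is flat (= 1) at every bin
rng = np.random.default_rng(0)
bank = rng.standard_normal((50, 64)) + 1j * rng.standard_normal((50, 64))
eq_bank, inv = diffuse_field_eq(bank)
```

Applying the inverse filter to every HRTF in the set leaves direction-dependent differences intact while removing the shared spectral colouration, which is the behaviour the thesis targets over all directions.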

    Dislocations in sound design for 3-d films: sound design and the 3-d cinematic experience

    Since the success of James Cameron's Avatar (2009), the feature film industry has embraced 3-D feature film technology. With 3-D films now setting a new benchmark for contemporary cinemagoers, the primary focus is directed towards these new stunning visuals. Sound is often neglected until the final stages of the filmmaking process, as the visuals take up much of the film budget. 3-D has changed the relationship between the imagery and the accompanying soundtrack, losing aspects of the cohesive union found in 2-D film. Having designed sound effects on Australia's first digital animated 3-D film, Legend of the Guardians: The Owls of Ga'Hoole (2010), and several internationally released 3-D films since, it became apparent to me that the visuals are evolving technologically and artistically at a rate far greater than the soundtrack. This is creating a dislocation between the image and the soundtrack. Although cinema sound technology companies are trialling and releasing new 'immersive' technologies, they are not necessarily addressing the spatial relationship between the images and soundtracks of 3-D digital films. Through first-hand experience, I question many of the working methodologies currently employed in the production and creation of the soundtrack for 3-D films. There is limited documentation on sound design within the 3-D feature film context, and as such, there are no rules or standards associated with this new practice. Sound designers and film sound mixers continue to use previous 2-D work practices in cinema sound, with limited and cautious experimentation in spatial sound design for 3-D. Although emerging technologies are capable of providing a superior and 'more immersive' soundtrack than previous formats, this does not necessarily mean that they provide an ideal solution for 3-D film.
Indeed, the film industry and cinema managers are showing some resistance to adopting these technologies, despite the push from technology vendors. Through practice-led research, I propose to research and question the following: Does the contemporary soundtrack suit 3-D films? Has the sound technology used in 2-D film changed with the introduction of 3-D film? If it has, is this technology an ideal solution, or are further technical developments needed to allow greater creativity and cohesiveness in 3-D film sound design? How might industry practices need to develop in order to accommodate the increased dimension and image depth of 3-D visuals? Does a language exist to describe spatial sound design in 3-D cinema? What is the audience awareness of emerging film technologies, and what does this mean for filmmakers and the cinema? Looking beyond contemporary cinema practices, is there an alternative approach to creating a soundtrack that better represents the accompanying 3-D imagery?

    Acceleration Techniques for Sparse Recovery Based Plane-wave Decomposition of a Sound Field

    Sparse recovery is a reliable and accurate technique for plane-wave decomposition of a sound field, which can be used for source localization, beamforming, etc. In this work, we introduce techniques to accelerate plane-wave decomposition by sparse recovery. The method consists of two main algorithms: the spherical Fourier transform (SFT) and sparse recovery, of which sparse recovery is the more computationally intensive. We implement the SFT on an FPGA and the sparse recovery on a multithreaded computing platform, so that the multithreaded platform can be devoted entirely to sparse recovery. Implementing the SFT on an FPGA also helps to flexibly integrate the microphones and improves the portability of the microphone array. To this end, we develop a scalable FPGA design model that enables quick design of SFT architectures on FPGAs. The model takes the number of microphones, the number of SFT channels and the cost of the FPGA into account, and outputs a resource-optimized, cost-effective FPGA architecture. We then investigate the performance of the sparse recovery algorithm executed on various multithreaded computing platforms (i.e., chip-multiprocessor, multiprocessor, GPU, manycore). Finally, we investigate the influence of the dictionary size on the computational performance and accuracy of the sparse recovery algorithms, and introduce novel sparse-recovery techniques that use non-uniform dictionaries to improve performance on parallel architectures.
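The sparse recovery step can be sketched with a generic greedy solver. This is an illustrative stand-in, not the authors' accelerated implementation: Orthogonal Matching Pursuit is one common choice of sparse-recovery algorithm, and the dictionary here is random (with two atoms planted orthogonal so the toy recovery is unambiguous) rather than a true spherical-harmonic steering matrix.

```python
import numpy as np

def omp(A, b, n_sources):
    """Orthogonal Matching Pursuit: recover a few active dictionary atoms
    (candidate plane-wave directions) from an SH-domain snapshot.
    A: (n_sh, n_grid) column-normalised dictionary, b: (n_sh,) SFT output."""
    residual = b.copy()
    support = []
    for _ in range(n_sources):
        # pick the atom most correlated with the current residual
        idx = int(np.argmax(np.abs(A.conj().T @ residual)))
        support.append(idx)
        # least-squares refit on the enlarged support
        coef, *_ = np.linalg.lstsq(A[:, support], b, rcond=None)
        residual = b - A[:, support] @ coef
    return support, coef

# toy example: 49 SH channels (order 6), 200 candidate "directions"
rng = np.random.default_rng(1)
A = rng.standard_normal((49, 200))
A /= np.linalg.norm(A, axis=0)
A[:, 10] = np.eye(49)[:, 0]              # planted orthogonal atoms
A[:, 77] = np.eye(49)[:, 1]
b = 3.0 * A[:, 10] + 2.0 * A[:, 77]      # two plane waves, gains 3 and 2
support, coef = omp(A, b, 2)
print(sorted(support))                   # [10, 77]
```

The dictionary size trade-off studied in the abstract shows up directly here: the `A.conj().T @ residual` correlation is the dominant cost per iteration, and it scales linearly with the number of grid directions.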

    METROPOLITAN ENCHANTMENT AND DISENCHANTMENT. METROPOLITAN ANTHROPOLOGY FOR THE CONTEMPORARY LIVING MAP CONSTRUCTION

    We can no longer interpret the contemporary metropolis as we did in the last century. Civil-economy thinking about the contemporary metropolis conflicts, more or less radically, with the merely acquisitive dimension of the behaviour of its citizens. What is needed is therefore a new capacity for imagining the economic-productive future of the city: hybrid social enterprises, economically sustainable, structured and capable of using technologies, could be a solution for producing value and distributing it fairly and inclusively. Metropolitan urbanity is another issue to establish: the metropolis needs new spaces where inclusion can occur and where a repository of the imagery can be recreated. What is the ontology behind the technique of metropolitan planning and management, its vision and its symbols? Competitiveness, speed and meritocracy are political words, not technical ones. Metropolitan urbanity is the characteristic of a polis that expresses itself in its public places. Today, however, public places are private ones destined for public use. The Common Good has always had a space of representation in the city, which was the public space. Today, the Green-Grey Infrastructure is the metropolitan city's monument, communicating a value to future generations; it must therefore be recognised and imagined, for it is the production of the metropolitan symbolic imagery, the new magic of the city.

    A frequency-domain algorithm to upscale ambisonic sound scenes

    No full text