First-order ambisonic coding with quaternion-based interpolation of PCA rotation matrices

Abstract

Conversational applications such as telephony are mostly restricted to mono. With the emergence of VR/XR applications and new products with spatial audio, there is a need to extend traditional voice and audio codecs to enable immersive communication.

The present work is motivated by recent activities in 3GPP standardization around the development of a new codec called Immersive Voice and Audio Services (IVAS). The IVAS codec will address a wide variety of use cases, e.g. immersive telephony, spatial audio conferencing, and live content sharing. There are two main design goals for IVAS. One goal is the versatility of the codec in terms of input (scene-based, channel-based, object-based audio, etc.) and output (mono, stereo, binaural, various multichannel loudspeaker setups). The second goal is to re-use as much as possible, and extend, the Enhanced Voice Services (EVS) mono codec.

In this work, we focus on the first-order ambisonic (FOA) format, which is a good candidate for the internal representation in an immersive audio codec at low bit rates, due to the flexibility of the underlying sound field decomposition. We propose a new coding method which can extend existing core codecs such as EVS. The proposed method consists in adaptively pre-processing the ambisonic components prior to multi-mono coding by a core codec.

The first part of this work investigates the basic multi-mono coding approach for FOA, which is for instance used in the Opus codec (in the so-called channel mapping family 2). In this approach, the ambisonic components are coded separately with different instances of the (mono) core codec. We present the results of a subjective test (MUSHRA) which show that this direct approach is not satisfactory for low-bit-rate coding: the signal structure is degraded, which produces many spatial artifacts (e.g. wrong panning, ghost sources, etc.).

In the second part of this work, we propose a new method to exploit the correlation of the ambisonic components. The pre-processing (prior to multi-mono coding) operates in the time domain to allow maximum compatibility with many codecs, especially low-bit-rate codecs such as EVS and Opus, and to minimize extra delay.

The proposed method applies Principal Component Analysis (PCA) on a 20 ms frame basis. For each frame, eigenvectors are computed and the eigenvector matrix is defined as a 4D rotation matrix. For complex sound scenes (with many audio sources, sudden changes, etc.), rotation parameters may change dramatically between consecutive frames and audio sources may move from one principal component to another, which may cause discontinuities or other artifacts. Solutions such as the interpolation of eigenvectors (after inter-frame realignment) are not optimal. In the proposed method, we ensure smooth transitions between inter-frame PCA rotations thanks to two complementary methods. The first one is a matching algorithm for eigenvectors between the current and the previous frame, which avoids signal inversion and permutation across frames. The second one is an interpolation of the 4D rotation matrices in the quaternion domain: we use the Cayley factorization of the 4D rotation matrices into a double quaternion for the current and previous frames and apply quaternion spherical linear interpolation (QSLERP) on a subframe basis.
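As a rough illustration of the per-frame PCA and the inter-frame eigenvector matching described above, the following sketch computes a 4D rotation matrix for one 20 ms frame and aligns it with the previous frame. It is written for this summary only; the function and variable names are illustrative assumptions and do not come from the IVAS or EVS code bases.

```python
import numpy as np

def frame_pca_rotation(frame, prev_V=None):
    """Per-frame PCA of the four FOA components (illustrative sketch only).

    frame  : (4, N) array with one 20 ms frame of W, X, Y, Z samples.
    prev_V : rotation matrix kept from the previous frame, used to keep the
             ordering and polarity of eigenvectors consistent across frames.
    Returns a 4x4 orthogonal matrix with determinant +1.
    """
    cov = frame @ frame.T / frame.shape[1]        # 4x4 covariance estimate
    _, V = np.linalg.eigh(cov)                    # eigenvalues in ascending order
    V = V[:, ::-1]                                # principal components first

    if prev_V is not None:
        # Greedy matching: pair each previous eigenvector with the most
        # similar current one, then fix polarity, so that sources do not
        # permute or flip sign from one frame to the next.
        corr = prev_V.T @ V
        order, used = [], set()
        for i in range(4):
            j = max((c for c in range(4) if c not in used),
                    key=lambda c: abs(corr[i, c]))
            used.add(j)
            order.append(j)
        V = V[:, order]
        dots = np.sum(prev_V * V, axis=0)
        V = V * np.where(dots < 0, -1.0, 1.0)     # align polarities

    if np.linalg.det(V) < 0:                      # force a proper 4D rotation;
        V[:, -1] = -V[:, -1]                      # the actual method reconciles
    return V                                      # this with the matching step
```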
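To make the quaternion-domain interpolation concrete, the sketch below interpolates the two quaternion pairs with QSLERP and rebuilds one 4D rotation matrix per subframe. It assumes the double quaternion (qL, qR) of the previous and current frames is already available from the Cayley factorization (not shown here), and it uses the convention v ↦ qL·v·qR, which is one common way to represent a 4D rotation by a quaternion pair; all helper names are illustrative.

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between two unit quaternions."""
    q0 = q0 / np.linalg.norm(q0)
    q1 = q1 / np.linalg.norm(q1)
    dot = float(np.dot(q0, q1))
    if dot < 0.0:                       # take the shorter arc on the 3-sphere
        q1, dot = -q1, -dot
    if dot > 0.9995:                    # nearly parallel: linear interpolation
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    return (np.sin((1.0 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)

def left_matrix(p):
    """4x4 matrix of v -> p*v in the (1, i, j, k) basis."""
    a, b, c, d = p
    return np.array([[a, -b, -c, -d],
                     [b,  a, -d,  c],
                     [c,  d,  a, -b],
                     [d, -c,  b,  a]])

def right_matrix(q):
    """4x4 matrix of v -> v*q in the (1, i, j, k) basis."""
    a, b, c, d = q
    return np.array([[a, -b, -c, -d],
                     [b,  a,  d, -c],
                     [c, -d,  a,  b],
                     [d,  c, -b,  a]])

def subframe_rotations(qL_prev, qR_prev, qL_cur, qR_cur, n_subframes):
    """One interpolated 4D rotation matrix per subframe.

    The pairs (qL, qR) are assumed to come from a Cayley factorization of the
    previous and current PCA rotation matrices.
    """
    mats = []
    for s in range(1, n_subframes + 1):
        t = s / n_subframes
        qL = slerp(qL_prev, qL_cur, t)
        qR = slerp(qR_prev, qR_cur, t)
        # v -> qL * v * qR is a 4D rotation with matrix left_matrix(qL) @ right_matrix(qR)
        mats.append(left_matrix(qL) @ right_matrix(qR))
    return mats
```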
The interpolated rotation matrices are then applied to the ambisonic components, and the decorrelated components are coded with a multi-mono coding approach. We present the results of a subjective evaluation (MUSHRA) showing that the proposed method brings significant improvements over the naive multi-mono method, especially in terms of spatial quality.