    First-order ambisonic coding with quaternion-based interpolation of PCA rotation matrices

Conversational applications such as telephony are mostly restricted to mono. With the emergence of VR/XR applications and new products with spatial audio, there is a need to extend traditional voice and audio codecs to enable immersive communication. The present work is motivated by recent activities in 3GPP standardization around the development of a new codec called immersive voice and audio services (IVAS). The IVAS codec will address a wide variety of use cases, e.g. immersive telephony, spatial audio conferencing, and live content sharing. There are two main design goals for IVAS. The first is the versatility of the codec in terms of input (scene-based, channel-based, object-based audio, etc.) and output (mono, stereo, binaural, various multichannel loudspeaker setups). The second is to re-use and extend, as much as possible, the enhanced voice services (EVS) mono codec. In this work, we focus on the first-order ambisonic (FOA) format, which is a good candidate for the internal representation in an immersive audio codec at low bit rates, due to the flexibility of the underlying sound field decomposition. We propose a new coding method that can extend existing core codecs such as EVS. The proposed method consists of adaptively pre-processing ambisonic components prior to multi-mono coding by a core codec. The first part of this work investigates the basic multi-mono coding approach for FOA, which is used, for instance, in the Opus codec (in the so-called channel mapping family 2). In this approach, ambisonic components are coded separately with different instances of the (mono) core codec. We present results of a subjective test (MUSHRA), which show that this direct approach is not satisfactory for low-bitrate coding. The signal structure is degraded, which produces many spatial artifacts (e.g. wrong panning, ghost sources). In the second part of this work, we propose a new method to exploit the correlation of ambisonic components.
The pre-processing (prior to multi-mono coding) operates in the time domain to allow maximum compatibility with many codecs, especially low-bitrate codecs such as EVS and Opus, and to minimize extra delay. The proposed method applies principal component analysis (PCA) on a 20 ms frame basis. For each frame, eigenvectors are computed and the eigenvector matrix is defined as a 4D rotation matrix. For complex sound scenes (with many audio sources, sudden changes, etc.), rotation parameters may change dramatically between consecutive frames and audio sources may move from one principal component to another, which may cause discontinuities or other artifacts. Solutions such as the interpolation of eigenvectors (after inter-frame realignment) are not optimal. In the proposed method, we ensure smooth transitions between inter-frame PCA rotations using two complementary methods. The first is a matching algorithm for eigenvectors between the current and the previous frame, which avoids signal inversion and permutation across frames. The second is an interpolation of the 4D rotation matrices in the quaternion domain. We use the Cayley factorization of 4D rotation matrices into a double quaternion for the current and previous frames and apply quaternion spherical linear interpolation (QSLERP) on a subframe basis. The interpolated rotation matrices are then applied to the ambisonic components, and the decorrelated components are coded with a multi-mono coding approach. We present results of a subjective evaluation (MUSHRA) showing that the proposed method brings significant improvements over the naive multi-mono method, especially in terms of spatial quality.
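
The quaternion interpolation step described above can be illustrated with a minimal sketch of spherical linear interpolation between two unit quaternions. This is not the authors' implementation: the function names, the tuple layout `(w, x, y, z)` and the number of subframes are assumptions made for illustration, and the Cayley factorization that produces the quaternion pair is not shown.

```python
import math


def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions q0 and q1.

    Quaternions are tuples (w, x, y, z). No sign flip is applied here:
    for a double quaternion, (ql, qr) and (-ql, -qr) describe the same
    4D rotation, so any sign choice has to be made jointly by the caller.
    """
    dot = sum(a * b for a, b in zip(q0, q1))
    dot = max(-1.0, min(1.0, dot))          # guard acos against rounding
    theta = math.acos(dot)                   # angle between q0 and q1
    s = math.sin(theta)
    if s < 1e-9:
        if dot > 0.0:
            return tuple(q0)                 # identical quaternions
        raise ValueError("antipodal quaternions: path is not unique")
    w0 = math.sin((1.0 - t) * theta) / s
    w1 = math.sin(t * theta) / s
    return tuple(w0 * a + w1 * b for a, b in zip(q0, q1))


def subframe_quaternions(q_prev, q_cur, n_sub=4):
    """One interpolated quaternion per subframe, ending exactly at q_cur.

    n_sub is a hypothetical subframe count, not a value from the abstract.
    """
    return [slerp(q_prev, q_cur, (k + 1) / n_sub) for k in range(n_sub)]


if __name__ == "__main__":
    q_prev = (1.0, 0.0, 0.0, 0.0)
    q_cur = (0.0, 1.0, 0.0, 0.0)
    for q in subframe_quaternions(q_prev, q_cur):
        print(tuple(round(c, 4) for c in q))
```

For a double quaternion, the same function would be applied to the left and right quaternions separately, once their signs have been chosen consistently between the previous and current frames.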

    The Bits of Silence : Redundant Traffic in VoIP

Human conversation is characterized by brief pauses and so-called turn-taking behavior between the speakers. In the context of VoIP, this means that there are frequent periods where the microphone captures only background noise, or even silence whenever the microphone is muted. The bits transmitted during such silence periods introduce overhead in terms of data usage, energy consumption, and network infrastructure costs. In this paper, we shed light on these costs for VoIP applications. We systematically measure the performance of six popular mobile VoIP applications with controlled human conversation and acoustic setup. Our analysis demonstrates that significant savings are indeed achievable, with the best-performing silence suppression technique being effective on 75% of silent pauses in a conversation held in a quiet place. This results in 2 to 5 times less data and 50-90% lower energy consumption compared to the next-best alternative. Even then, the effectiveness of silence suppression can be sensitive to the amount of background noise, the underlying speech codec, and the device being used. The codec characteristics and performance do not depend on the network type. However, silence suppression makes VoIP traffic as network-friendly as VoLTE traffic. Our results provide new insights into VoIP performance and offer a motivation for further enhancements, such as performance-aware codec selection, that can significantly benefit a wide variety of voice-assisted applications, such as intelligent home assistants and other speech-codec-enabled IoT devices.
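
As a rough illustration of what silence suppression does, the sketch below marks which audio frames would be transmitted using a simple energy threshold with a short hangover. It is not the technique used by any of the measured applications: the threshold, the hangover length and the function names are assumptions chosen only to make the idea concrete.

```python
import math


def frame_level_dbfs(frame):
    """RMS level of a frame of float samples in [-1, 1], in dB full scale."""
    if not frame:
        return float("-inf")
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return 20.0 * math.log10(rms) if rms > 0.0 else float("-inf")


def frames_to_send(frames, threshold_dbfs=-50.0, hangover=2):
    """Return one boolean per frame: True if the frame would be transmitted.

    A frame is sent when its level exceeds the (assumed) threshold, or when
    it is one of the first `hangover` quiet frames after an active frame.
    """
    decisions = []
    quiet_run = hangover            # start as if already in silence
    for frame in frames:
        if frame_level_dbfs(frame) > threshold_dbfs:
            quiet_run = 0
            decisions.append(True)
        else:
            quiet_run += 1
            decisions.append(quiet_run <= hangover)
    return decisions


if __name__ == "__main__":
    speech = [0.1] * 160            # constant-level stand-in for speech
    silence = [0.0] * 160
    sent = frames_to_send([speech] + [silence] * 4)
    print(sent, "suppressed:", sent.count(False), "of", len(sent))
```

Real codecs typically combine a more robust voice activity detector with comfort-noise updates, so the fraction of suppressed frames in practice depends on background noise, as the abstract notes.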

    Voice transmission over IEEE 802.15.4 wireless links

Advances in electronic and computing components have given rise to several projects such as CANet, which aims to improve living conditions, particularly for elderly people. The project allows elderly people to maintain their usual living conditions while their health is monitored without disturbing them. One of CANet's features is spoken dialogue with a person through their walking cane over a wireless link. In this paper, we study which codec is best suited for use with the IEEE 802.15.4 standard.

    The Opus audio codec in mobile networks

The latest generations of mobile networks have made it possible to include high-quality audio coding in data transmission. At the same time, an ongoing effort to move audio signal processing from dedicated hardware to data centers with general-purpose hardware introduces the challenge of providing enough computational power for the virtualized network elements. This thesis evaluates the use of a modern hybrid audio codec, Opus, in a virtualized network element. The evaluation is performed by integrating the codec, testing it for functionality and performance on a general-purpose processor, and comparing that performance with the performance of a digital signal processor. Functional testing showed that the codec was integrated successfully and that bit compliance with the Opus standard was met. The performance results showed that although the digital signal processor computes the encoder's algorithms in fewer clock cycles, the general-purpose processor performs more efficiently relative to its total capacity, owing to its higher clock frequency. For the decoder this was even clearer, as the general-purpose processor spent on average fewer clock cycles performing the algorithms.

    Acoustic compression in Zoom audio does not compromise voice recognition performance

Human voice recognition over telephone channels typically yields lower accuracy than recognition from higher-quality audio recorded in a studio environment. Here, we investigated the extent to which audio in video conferencing, subject to various lossy compression mechanisms, affects human voice recognition performance. Voice recognition performance was tested in an old–new recognition task under three audio conditions (telephone, Zoom, studio) across all matched (familiarization and test with the same audio condition) and mismatched (familiarization and test with different audio conditions) combinations. Participants were familiarized with female voices presented in either studio-quality (N = 22), Zoom-quality (N = 21), or telephone-quality (N = 20) stimuli. Subsequently, all listeners performed an identical voice recognition test containing a balanced stimulus set from all three conditions. Results revealed that voice recognition performance (dʹ) in Zoom audio was not significantly different from that in studio audio, but listeners performed significantly better in both Zoom and studio audio than in telephone audio. This suggests that the signal processing of the speech codec used by Zoom preserves information that is as relevant for voice recognition as that in studio audio. Interestingly, listeners familiarized with voices via Zoom audio showed a trend towards better recognition performance in the test (p = 0.056) compared to listeners familiarized with studio audio. We discuss future directions according to which a possible advantage of Zoom audio for voice recognition might be related to some of the speech coding mechanisms used by Zoom.
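
The sensitivity index dʹ reported above is conventionally computed as the difference between the z-transformed hit rate and false-alarm rate in an old–new task. The sketch below shows that calculation; the log-linear correction (adding 0.5 to each count) and the example counts are assumptions for illustration, since the abstract does not state how extreme rates were handled or give the raw counts.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    A log-linear correction (an assumption, not taken from the study) adds
    0.5 to each count so that rates of exactly 0 or 1 never reach inv_cdf.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf        # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)


if __name__ == "__main__":
    # Hypothetical counts: 20 "old" trials and 20 "new" trials.
    print(round(d_prime(hits=16, misses=4, false_alarms=5, correct_rejections=15), 3))
```

A value of 0 corresponds to chance performance (equal hit and false-alarm rates), and larger values indicate better discrimination between familiar and unfamiliar voices.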

    Emotion Recognition from Speech with Acoustic, Non-Linear and Wavelet-based Features Extracted in Different Acoustic Conditions

In recent years, there has been great progress in automatic speech recognition. The challenge now is to recognize not only the semantic content of speech but also its so-called "paralinguistic" aspects, including the emotions and personality of the speaker. This research work aims to develop a methodology for automatic emotion recognition from speech signals in non-controlled noise conditions. To that end, different sets of acoustic, non-linear, and wavelet-based features are used to characterize emotions in different databases created for this purpose.

    Multi-core platforms for audio and multimedia coding algorithms in telecommunications

Multimedia coding algorithms (codecs) used in telecommunications evolve constantly. USAC and Opus are new hybrid audio codecs suitable for both speech and music. Their benefits and key properties were reviewed at a high level as a literature study, and both were found to have ranked highly in subjective sound quality measurements. Codecs used for HD-level video processing in particular require high computational power. The Tilera TILEPro64 multicore platform and a related software library of optimized multimedia codecs were evaluated for multimedia coding performance using purpose-built test programs. The performance in video coding was found to increase with the number of processing cores up to a certain point. With the tested audio codecs, increasing the number of cores did not increase coding performance. Finally, the multicore products of Tilera, Texas Instruments, and Freescale were compared. The development tools of all three vendors were found to have many similar features, despite the differences in hardware architectures.

    Implementation of the Unified Modeling Language (UML) in the Design of the TCP/IP-Based WiFiTalkie Application

In the world of analog communication there is a device known as the HT (Handy Talkie). It works using electromagnetic signals at a particular radio frequency and acts as both a transmitter and a receiver of radio signals. The transmitted signal is a voice signal that has been converted into an electromagnetic signal. To communicate with one another, users must agree to tune their devices to the same frequency. One weakness of this device is its sound quality, which tends to be noisy and depends heavily on weather conditions. With the rapid development of semiconductor technology, digital devices with increasingly varied uses have emerged. Many digital devices have now been created that can completely replace their analog counterparts. One example is the television set: digital televisions deliver far better quality than analog sets in a much slimmer form. Another example is the smartphone, which packs a very complete set of features into a relatively small device. One of these features is Wi-Fi, which allows smartphones to connect to one another and even to the internet with ease. To digitize this analog device and take advantage of the Wi-Fi feature available in smartphones, an application called WiFiTalkie was created. It works as follows: a smartphone sends a voice signal using TCP/IP over a Wi-Fi network, and another smartphone on the same network receives the signal and converts it back into sound. The application was built by applying the Unified Modeling Language (UML) design method. The results of this study show that the sound quality produced by WiFiTalkie is far better than that of an HT based on analog signals.

    Web-Based Networked Music Performances via WebRTC: A Low-Latency PCM Audio Solution

Videoconferencing software, already widely used, has spread even further because of the social distancing measures adopted during the SARS-CoV-2 pandemic. However, none of the Web-based solutions currently available supports high-fidelity stereo audio streaming, which is a fundamental prerequisite for networked music applications. This is mainly because neither the WebRTC RTCPeerConnection standard nor Web-based audio streaming handles uncompressed audio formats. To overcome that limitation, an implementation of 16-bit pulse code modulation (PCM) stereo audio transmission on top of the WebRTC RTCDataChannel, leveraging Web Audio and AudioWorklets, is discussed. Results obtained with multiple configurations, browsers, and operating systems show that the proposed approach outperforms the WebRTC RTCPeerConnection standard in terms of audio quality and latency, which in the authors' best case to date has been reduced to only 40 ms between two MacBooks on a local area network.
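
To make the 16-bit PCM stereo payload concrete, the sketch below converts floating-point samples into interleaved little-endian 16-bit PCM bytes and computes the resulting raw bit rate. It is written in Python for illustration only; the actual solution runs in the browser on Web Audio and RTCDataChannel, and the function names, the byte order and the 48 kHz sample rate used in the example are assumptions rather than details from the abstract.

```python
import struct


def float_to_pcm16_interleaved(left, right):
    """Pack two channels of float samples in [-1, 1] as interleaved int16 bytes.

    Samples are clipped, scaled by 32767 and written little-endian in
    L, R, L, R order (layout and byte order are assumptions).
    """
    if len(left) != len(right):
        raise ValueError("channels must have the same number of samples")
    out = bytearray()
    for l_sample, r_sample in zip(left, right):
        for x in (l_sample, r_sample):
            clipped = max(-1.0, min(1.0, x))
            out += struct.pack("<h", int(round(clipped * 32767)))
    return bytes(out)


def pcm_bitrate_bps(sample_rate_hz, channels=2, bits_per_sample=16):
    """Raw PCM bit rate, before any transport overhead."""
    return sample_rate_hz * channels * bits_per_sample


if __name__ == "__main__":
    payload = float_to_pcm16_interleaved([0.0, 1.0], [-1.0, 0.0])
    print(len(payload), "bytes:", struct.unpack("<4h", payload))
    print(pcm_bitrate_bps(48000), "bit/s at an assumed 48 kHz sample rate")
```

At an assumed 48 kHz, uncompressed 16-bit stereo needs about 1.5 Mbit/s before packet overhead, which is why such a stream is far heavier than a typical compressed voice stream.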