Search CORE

124 research outputs found

Effect of Vertical Microphone Layer Spacing for a 3D Microphone Array

Author: Gribben Christopher
Lee Hyunkook
Publication venue: 'Audio Engineering Society'
Publication date: 01/12/2014
Field of study

Subjective listening tests were conducted to investigate how the spacing between main (lower) and height (upper) microphone layers in a 3D main microphone array affects perceived spatial impression and overall preference. Four different layer spacings of 0m, 0.5m, 1m, and 1.5m were compared for the sound sources of trumpet, acoustic guitar, percussion quartet, and string quartet using a nine-channel loudspeaker setup. It was generally found that there was no significant difference between any of the spaced layer configurations, whereas the 0m layer had slightly higher ratings than the more spaced layers in both spatial impression and preference. Acoustical properties of the original microphone channel signals as well as those of the reproduced signals, which were binaurally recorded, were analyzed in order to find possible physical causes for the perceived results. It is suggested that the perceived results were mainly associated with vertical interchannel crosstalk in the signals of each height layer and the magnitude and pattern of spectral change at the listener’s ear caused by each layer

University of Huddersfield Repository

Huddersfield Research Portal

An audio-visual system for object-based audio : from recording to listening

Author: Coleman P
Cox TJ
de Campos T
Fazi FM
Franck A
Francombe J
Galvez MFS
Hilton A
Hughes RJ
Jackson PJB
Liu Q
Melchior F
Menzies D
Pike C
Tang Y
Woodcock JS
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 17/01/2018
Field of study

Object-based audio is an emerging representation for audio content, where content is represented in a reproduction format-agnostic way and, thus, produced once for consumption on many different kinds of devices. This affords new opportunities for immersive, personalized, and interactive listening experiences. This paper introduces an end-to-end object-based spatial audio pipeline, from sound recording to listening. A high-level system architecture is proposed, which includes novel audiovisual interfaces to support object-based capture and listenertracked rendering, and incorporates a proposed component for objectification, that is, recording content directly into an object-based form. Text-based and extensible metadata enable communication between the system components. An open architecture for object rendering is also proposed. The system’s capabilities are evaluated in two parts. First, listener-tracked reproduction of metadata automatically estimated from two moving talkers is evaluated using an objective binaural localization model. Second, object-based scene capture with audio extracted using blind source separation (to remix between two talkers) and beamforming (to remix a recording of a jazz group) is evaluate

University of Salford Institutional Repository

Southampton (e-Prints Soton)

University of Surrey

Surrey Research Insight

Spatial sound generation and perception by amplitude panning techniques

Author: Pulkki Ville
Publication venue: Teknillinen korkeakoulu
Publication date: 03/08/2001
Field of study

Spatial audio aims to recreate or synthesize spatial attributes when reproducing audio over loudspeakers or headphones. Such spatial attributes include, for example, locations of perceived sound sources and an auditory sense of space. This thesis focuses on new methods of spatial audio for loudspeaker listening and on measuring the quality of spatial audio by subjective and objective tests. In this thesis the vector base amplitude panning (VBAP) method, which is an amplitude panning method to position virtual sources in arbitrary 2-D or 3-D loudspeaker setups, is introduced. In amplitude panning the same sound signal is applied to a number of loudspeakers with appropriate non-zero amplitudes. With 2-D setups VBAP is a reformulation of the existing pair-wise panning method. However, differing from earlier solutions it can be generalized for 3-D loudspeaker setups as a triplet-wise panning method. A sound signal is then applied to one, two, or three loudspeakers simultaneously. VBAP has certain advantages compared to earlier virtual source positioning methods in arbitrary layouts. Previous methods either used all loudspeakers to produce virtual sources, which results in some artefacts, or they used loudspeaker triplets with a non-generalizable 2-D user interface. The virtual sources generated with VBAP are investigated. The human directional hearing is simulated with a binaural auditory model adapted from the literature. The interaural time difference (ITD) cue and the interaural level difference (ILD) cue which are the main localization cues are simulated for amplitude-panned virtual sources and for real sources. Psychoacoustic listening tests are conducted to study the subjective quality of virtual sources. Statistically significant phenomena found in listening test data are explained by auditory model simulation results. To obtain a generic view of directional quality in arbitrary loudspeaker setups, directional cues are simulated for virtual sources with loudspeaker pairs and triplets in various setups. The directional qualities of virtual sources generated with VBAP can be stated as follows. Directional coordinates used for this purpose are the angle between a position vector and the median plane (θcc), and the angle between a projection of a position vector to the median plane and frontal direction (Φcc). The perceived θcc direction of a virtual source coincides well with the VBAP panning direction when a loudspeaker set is near the median plane. When the loudspeaker set is moved towards a side of a listener, the perceived θcc direction is biased towards the median plane. The perceived Φcc direction of an amplitude-panned virtual source is individual and cannot be predicted with any panning law.reviewe

Aaltodoc Publication Archive

Optimization and improvements in spatial sound reproduction systems through perceptual considerations

Author: Gutiérrez Parera Pablo
Publication venue: 'Universitat Politecnica de Valencia'
Publication date: 07/05/2020
Field of study

[ES] La reproducción de las propiedades espaciales del sonido es una cuestión cada vez más importante en muchas aplicaciones inmersivas emergentes. Ya sea en la reproducción de contenido audiovisual en entornos domésticos o en cines, en sistemas de videoconferencia inmersiva o en sistemas de realidad virtual o aumentada, el sonido espacial es crucial para una sensación de inmersión realista. La audición, más allá de la física del sonido, es un fenómeno perceptual influenciado por procesos cognitivos. El objetivo de esta tesis es contribuir con nuevos métodos y conocimiento a la optimización y simplificación de los sistemas de sonido espacial, desde un enfoque perceptual de la experiencia auditiva. Este trabajo trata en una primera parte algunos aspectos particulares relacionados con la reproducción espacial binaural del sonido, como son la escucha con auriculares y la personalización de la Función de Transferencia Relacionada con la Cabeza (Head Related Transfer Function - HRTF). Se ha realizado un estudio sobre la influencia de los auriculares en la percepción de la impresión espacial y la calidad, con especial atención a los efectos de la ecualización y la consiguiente distorsión no lineal. Con respecto a la individualización de la HRTF se presenta una implementación completa de un sistema de medida de HRTF y se introduce un nuevo método para la medida de HRTF en salas no anecoicas. Además, se han realizado dos experimentos diferentes y complementarios que han dado como resultado dos herramientas que pueden ser utilizadas en procesos de individualización de la HRTF, un modelo paramétrico del módulo de la HRTF y un ajuste por escalado de la Diferencia de Tiempo Interaural (Interaural Time Difference - ITD). En una segunda parte sobre reproducción con altavoces, se han evaluado distintas técnicas como la Síntesis de Campo de Ondas (Wave-Field Synthesis - WFS) o la panoramización por amplitud. Con experimentos perceptuales se han estudiado la capacidad de estos sistemas para producir sensación de distancia y la agudeza espacial con la que podemos percibir las fuentes sonoras si se dividen espectralmente y se reproducen en diferentes posiciones. Las aportaciones de esta investigación pretenden hacer más accesibles estas tecnologías al público en general, dada la demanda de experiencias y dispositivos audiovisuales que proporcionen mayor inmersión.[CA] La reproducció de les propietats espacials del so és una qüestió cada vegada més important en moltes aplicacions immersives emergents. Ja siga en la reproducció de contingut audiovisual en entorns domèstics o en cines, en sistemes de videoconferència immersius o en sistemes de realitat virtual o augmentada, el so espacial és crucial per a una sensació d'immersió realista. L'audició, més enllà de la física del so, és un fenomen perceptual influenciat per processos cognitius. L'objectiu d'aquesta tesi és contribuir a l'optimització i simplificació dels sistemes de so espacial amb nous mètodes i coneixement, des d'un criteri perceptual de l'experiència auditiva. Aquest treball tracta, en una primera part, alguns aspectes particulars relacionats amb la reproducció espacial binaural del so, com són l'audició amb auriculars i la personalització de la Funció de Transferència Relacionada amb el Cap (Head Related Transfer Function - HRTF). S'ha realitzat un estudi relacionat amb la influència dels auriculars en la percepció de la impressió espacial i la qualitat, dedicant especial atenció als efectes de l'equalització i la consegüent distorsió no lineal. Respecte a la individualització de la HRTF, es presenta una implementació completa d'un sistema de mesura de HRTF i s'inclou un nou mètode per a la mesura de HRTF en sales no anecoiques. A mès, s'han realitzat dos experiments diferents i complementaris que han donat com a resultat dues eines que poden ser utilitzades en processos d'individualització de la HRTF, un model paramètric del mòdul de la HRTF i un ajustament per escala de la Diferencià del Temps Interaural (Interaural Time Difference - ITD). En una segona part relacionada amb la reproducció amb altaveus, s'han avaluat distintes tècniques com la Síntesi de Camp d'Ones (Wave-Field Synthesis - WFS) o la panoramització per amplitud. Amb experiments perceptuals, s'ha estudiat la capacitat d'aquests sistemes per a produir una sensació de distància i l'agudesa espacial amb que podem percebre les fonts sonores, si es divideixen espectralment i es reprodueixen en diferents posicions. Les aportacions d'aquesta investigació volen fer més accessibles aquestes tecnologies al públic en general, degut a la demanda d'experiències i dispositius audiovisuals que proporcionen major immersió.[EN] The reproduction of the spatial properties of sound is an increasingly important concern in many emerging immersive applications. Whether it is the reproduction of audiovisual content in home environments or in cinemas, immersive video conferencing systems or virtual or augmented reality systems, spatial sound is crucial for a realistic sense of immersion. Hearing, beyond the physics of sound, is a perceptual phenomenon influenced by cognitive processes. The objective of this thesis is to contribute with new methods and knowledge to the optimization and simplification of spatial sound systems, from a perceptual approach to the hearing experience. This dissertation deals in a first part with some particular aspects related to the binaural spatial reproduction of sound, such as listening with headphones and the customization of the Head Related Transfer Function (HRTF). A study has been carried out on the influence of headphones on the perception of spatial impression and quality, with particular attention to the effects of equalization and subsequent non-linear distortion. With regard to the individualization of the HRTF a complete implementation of a HRTF measurement system is presented, and a new method for the measurement of HRTF in non-anechoic conditions is introduced. In addition, two different and complementary experiments have been carried out resulting in two tools that can be used in HRTF individualization processes, a parametric model of the HRTF magnitude and an Interaural Time Difference (ITD) scaling adjustment. In a second part concerning loudspeaker reproduction, different techniques such as Wave-Field Synthesis (WFS) or amplitude panning have been evaluated. With perceptual experiments it has been studied the capacity of these systems to produce a sensation of distance, and the spatial acuity with which we can perceive the sound sources if they are spectrally split and reproduced in different positions. The contributions of this research are intended to make these technologies more accessible to the general public, given the demand for audiovisual experiences and devices with increasing immersion.Gutiérrez Parera, P. (2020). Optimization and improvements in spatial sound reproduction systems through perceptual considerations [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/142696TESI

Crossref

RiuNet

Surround by Sound: A Review of Spatial Audio Recording and Reproduction

Author: Abhayapala Thushara
Chen Hanchi
Samarasinghe Prasanga
Zhang Wen
Publication venue: 'MDPI AG'
Publication date: 01/01/2017
Field of study

In this article, a systematic overview of various recording and reproduction techniques for spatial audio is presented. While binaural recording and rendering is designed to resemble the human two-ear auditory system and reproduce sounds specifically for a listener’s two ears, soundfield recording and reproduction using a large number of microphones and loudspeakers replicate an acoustic scene within a region. These two fundamentally different types of techniques are discussed in the paper. A recent popular area, multi-zone reproduction, is also briefly reviewed in the paper. The paper is concluded with a discussion of the current state of the field and open problemsThe authors acknowledge National Natural Science Foundation of China (NSFC) No. 61671380 and Australian Research Council Discovery Scheme DE 150100363

Multidisciplinary Digital Publishing Institute

Directory of Open Access Journals

The Australian National University

Ambisonics

Author: Frank Matthias
Zotter Franz
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 10/02/2021
Field of study

This open access book provides a concise explanation of the fundamentals and background of the surround sound recording and playback technology Ambisonics. It equips readers with the psychoacoustical, signal processing, acoustical, and mathematical knowledge needed to understand the inner workings of modern processing utilities, special equipment for recording, manipulation, and reproduction in the higher-order Ambisonic format. The book comes with various practical examples based on free software tools and open scientific data for reproducible research. The book’s introductory section offers a perspective on Ambisonics spanning from the origins of coincident recordings in the 1930s to the Ambisonic concepts of the 1970s, as well as classical ways of applying Ambisonics in first-order coincident sound scene recording and reproduction that have been practiced since the 1980s. As, from time to time, the underlying mathematics become quite involved, but should be comprehensive without sacrificing readability, the book includes an extensive mathematical appendix. The book offers readers a deeper understanding of Ambisonic technologies, and will especially benefit scientists, audio-system and audio-recording engineers. In the advanced sections of the book, fundamentals and modern techniques as higher-order Ambisonic decoding, 3D audio effects, and higher-order recording are explained. Those techniques are shown to be suitable to supply audience areas ranging from studio-sized to hundreds of listeners, or headphone-based playback, regardless whether it is live, interactive, or studio-produced 3D audio material

Directory of Open Access Books (DOAB)

Spatial Acoustic Vector Based Sound Field Reproduction

Author: Zuo Huanyu
Publication venue
Publication date: 01/01/2021
Field of study

Spatial sound field reproduction aims to recreate an immersive sound field over a spatial region. The existing sound pressure based approaches to spatial sound field reproduction focus on the accurate approximation of original sound pressure over space, which ignores the perceptual accuracy of the reproduced sound field. The acoustic vectors of particle velocity and sound intensity appear to be closely linked with human perception of sound localization in literature. Therefore, in this thesis, we explore the spatial distributions of the acoustic vectors, and seek to develop algorithms to perceptually reproduce the original sound field over a continuous spatial region based on the vectors. A theory of spatial acoustic vectors is first developed, where the spatial distributions of particle velocity and sound intensity are derived from sound pressure. To extract the desired sound pressure from a mixed sound field environment, a 3D sound field separation technique is also formulated. Based on this theory, a series of reproduction techniques are proposed to improve the perceptual performance. The outcomes resulting from this theory are: (i) derivation of a particle velocity assisted 3D sound field reproduction technique which allows for non-uniform loudspeaker geometry with a limited number of loudspeakers, (ii) design of particle velocity based mixed-source sound field translation technique for binaural reproduction that can provide sound field translation with good perceptual experience over a large space, (iii) derivation of an intensity matching technique that can reproduce the desired sound field in a spherical region by controlling the sound intensity on the surface of the region, and (iv) two intensity based multizone sound field reproduction algorithms that can reproduce the desired sound field over multiple spatial zones. Finally, these techniques are evaluated by comparing to the conventional approaches through numerical simulations and real-world experiments

The Australian National University

The role that sound spatialization plays in improving performance in an interactive installation : study of the correlation between gesture and localization of sound sources in space

Author: Franco Diogo Leichsenring
Publication venue
Publication date: 01/01/2018
Field of study

The main objective of this research work is to study the correlation between gesture and localization of sound sources in space within the framework of interactive installations, based on theories of hearing and gesture. We have therefore chosen the experimental method by developing an interactive installation with which we carry out three different experiments, in which a subject’s hand is tracked by a Microsoft Kinect depth camera (motion capture) and a deictic gesture is used to trigger recorded music sounds and identify their localization in the horizontal plane. Thus, we manipulate the direction of sound and we measure the percentage of correct perceptual sound source localizations resulting from the participant’s responses in an Inquiry Mode Questionnaire in comparison with the actual directions of the gesture and perceptual sound sources provided by software. Descriptive and inferential statistics is applied to the collected data. The main results show that it is easier to define the origin of sound and that auditory perception is more accurate when its incidence is frontal in the horizontal plane, just as sound source localization theory predicts. Whereas 86.1% of all volunteers consider that their gesture coincides with the origin of sound in experiment 1, in which the use of their gesture in a certain direction produces a sound from that direction, only 58.1% admit the same in experiment 3, in which the same gesture is used to identify the system-predetermined localization of a perceptual sound source in an angle of 260o around a subject. At least 55.9% of all participants do not perceive that their gesture cannot coincide with the origin of sound in experiment 2, since sound is produced from the opposite surround direction, which seems to demonstrate that, when sounds are produced frontally or from the back and a person has the task of controlling their motion with a deictic gesture at the same time, his or her ability to identify the origin of sound generally diminishes, in addition to the already well-known reduced ability to identify it when it is in the median plane, if the head is not rotated. We therefore conclude that there is a relatively high correlation between gesture and localization of sound sources in space, but this is not as perfect as it could be owing to the limitations of the human auditory system and to the natural dependence of head movement on gesture.O objectivo principal deste trabalho de pesquisa é o de estudar a correlação entre gesto e localização de fontes sonoras no espaço, no âmbito das instalações interactivas, com base nas teorias da audição e do gesto. Na ocasisão em que começamos a nossa investigação verificámos que havia vários estudos que abordavam os assuntos “gesto” e “localização de fontes sonoras” de diversas maneiras: 1) de forma independente um do outro e/ou noutros contextos distintos dos das instalações interactivas, como por exemplo em Blauert (1997), Pulkki (1999) Pulkki & Karjalainen (2001), Pulkki (2001a), Bates et al. (2007), Hammershøi (2009), McNeill (1992), Coutaz & Crowley (1995), Choi (2000), Cadoz & Wanderley (2000), Nehaniv (2005), Campbell (2005), ou Godøy & Leman (2010); 2) de um ponto de vista mais técnico, como por exemplo em Harada et al. (1992), Jensenius et al. (2006), Marshall et al. (2006), Schacher (2007), Neukom & Schacher (2008), Zelli (2009), Marshall et al. (2009), Bhuiyan & Picking (2009), ou Schumacher & Bresson (2010); ou 3) de um ponto de vista mais artístico, como em Bencina et al. (2008) ou Grigoriou & Floros (2010). Havia, no entanto, muito poucos estudos a envolver ou a abordar ambos os assuntos e a analisar de maneira conjugada as suas relações de um ponto de vista mais perceptual, como por exemplo em Gröhn (2002), de Götzen (2004) ou Marentakis et al. (2008). Foi esta última perspectiva que decidimos seguir e que aqui exploramos. Desta forma, optámos pelo método experimental, aplicando um desenho de medidas repetidas e desenvolvendo uma instalação interactiva com a qual realizamos três experiências diferentes, em que a mão de um sujeito é rastreada por uma câmara de profundidade Microsoft Kinect (captação de movimento) e um gesto díctico é usado para activar sons de música gravada e para identificar as suas localizações no plano de escuta horizontal. Assim, manipulamos a direcção do som e medimos a percentagem de localizações de fontes sonoras perceptuais correctas, resultante das respostas dos participantes num Inquérito Por Questionário em comparação com as direcções reais do gesto díctico e das fontes sonoras perceptuais fornecidas pelo software que utilizamos no nosso trabalho. Para população-alvo pensámos em pessoas com conhecimentos musicais e pessoas com poucos ou nenhuns conhecimentos musicais, o que nos levou a solicitar a um grande número de pessoas a sua participação voluntária, anónima e sem constrangimentos no nosso estudo. Isso foi levado a cabo sobretudo através do envio de correio electrónico para amigos, para estudantes de diferentes áreas a frequentar e para colegas a trabalhar na Escola de Artes da Universidade Católica Portuguesa (EA- -UCP), na Escola Superior de Música e Artes do Espetáculo do Instituto Politécnico do Porto e na Academia de Música de Espinho. Para além disso, foi também crucial falar-se com amigos e familiares e informar tantas pessoas quanto possíıvel sobre a nossa investigação, através da colocação de cartazes informativos nas paredes dos corredores da Universidade Católica, alguns dias antes de as experiências terem sido realizadas no Laboratório de Captação de Movimento da EA-UCP. Por fim, é efectuada uma análise estatística descritiva e inferencial dos dados recolhidos. Os principais resultados apontam no sentido de ser mais fácil definir a origem do som quando a sua incidência é frontal no plano de escuta horizontal, para além de a percepção auditiva ser mais precisa nessa direcção, tal como a teoria da localização de fontes sonoras prevê. Enquanto 86.1% de todos os participantes consideram que o seu gesto díctico coincide com a origem do som na experiência 1, em que o uso desse gesto numa determinada direcção faz despoletar um som proveniente dessa direcção, apenas 58.1% admitem o mesmo na experiência 3, em que o mesmo gesto é usado para identificar a localização de uma fonte sonora perceptual predeterminada pelo sistema num ângulo de 260º em torno de um sujeito. Esta última percentagem parece dever-se ao facto de a maior parte dos sons ser produzida a partir de direcções laterais na experiência 3, tendo a posição da cabeça voltada para a câmara como referência. Pelo menos 55.9% de todos os voluntários não percebem que o seu gesto não poderia ter coincidido com a origem do som na experiência 2, já que o som é produzido a partir da direcção envolvente oposta. Este facto parece demonstrar que, quando os sons são produzidos frontalmente ou de trás e uma pessoa tem a tarefa de controlar os seus movimentos com um gesto díctico ao mesmo tempo, a sua capacidade para identificar a origem do som é, em geral, ainda mais baixa, para além da já conhecida capacidade reduzida para identificá-la quando o som se encontra no plano mediano, se a cabeça não for rodada. A maior parte dos participantes sente um controlo imediato sobre o som nas experiências 1 e 2, mas os tempos estimados pelos próprios são bastante superiores aos aproximadamente 650 milissegundos necessários para o ser humano ouvir e reagir a um som na nossa instalação interactiva. Descobrimos também que o tempo médio necessário para localizar sons com o uso de um gesto díctico na nossa experiência 3 é de cerca de 10 segundos, o que corresponde a um tempo bastante mais longo do que os 3 segundos que supusemos. Para além disso, os voluntários fazem em média 2 tentativas para localizar sons com os seus gestos dícticos, tendo a necessidade de ouvir apenas uma vez em média cada som na íntegra para o localizar. Os desvios à esquerda e à direita efectuados pela maior parte dos participantes relativamente às direcções verdadeiras do som, quando estes tentam identificar as localizações predeterminadas pelo sistema das fontes sonoras perceptuais com os seus gestos dícticos na zona periférica do corpo, são em média de 7.97º e -7.19º, respectivamente. Desta forma, o desvio médio absoluto é de 7.76º. Comparando esses desvios com aqueles levados a cabo pelos participantes usando a mão esquerda (desvios de 6.86o para a esquerda e -6.35º para a direita das direcções verdadeiras do som) e com aqueles usando a mão direita (desvios de 8.46º para a esquerda e -7.38º para a direita das direcções verdadeiras do som), concluímos que os resultados são bastante parecidos entre si. Descobrimos que a maior parte dos voluntários estima um tempo muito mais longo do que os 2 segundos que supusemos experimentalmente para entender cada uma das três experiências. Para além disso, esse tempo estimado pelos participantes diminui da primeira para a última experiência, aparentemente devido à familiarização, conscientemente provocada por nós através da mesma sequência de realização das experiências imposta a cada participante, com o nosso sistema interactivo, embora considerem ter entendido cada uma das três experiências rapidamente. Acresce que a maioria dos voluntários interage facilmente com a nossa instalação e concorda que o gesto sugerido por nós foi adequadamente seleccionado para qualquer uma das três experiências. Também constatamos que os participantes consideram a resposta do sistema ao gesto como sendo imediata nas nossas três experiências, ou seja, estimam cerca de 1 segundo, o que é consistente com o resultado da medição da latência do sistema de cerca de 470 milissegundos. Além disso, verificamos que a maioria dos voluntários se sente envolvida pelo som na nossa instalação interactiva usando Ambisonics Equivalent Panning. Portanto, concluímos que, usando uma instalação interactiva como a nossa com um público-alvo semelhante aquele que tivemos, há uma correlação relativamente elevada entre o gesto e a localização de fontes sonoras no espaço, mas que esta não é tão perfeita como poderia ser devido às limitações do nosso sistema auditivo e aparentemente à dependência natural do movimento da cabeça do gesto. Assim, parece que a espacialização sonora pode melhorar o desempenho numa instalação interactiva, mas de forma moderada. Mesmo assim, defendemos que um sistema como o nosso pode vir a ser aplicado com vantagem em domínios diversos como os que apresentamos como exemplos

Repositório Institucional da Universidade Católica Portuguesa

Application of sound source separation methods to advanced spatial audio systems

Author: Cobos Serrano Máximo
Publication venue: 'Universitat Politecnica de Valencia'
Publication date: 03/12/2010
Field of study

This thesis is related to the field of Sound Source Separation (SSS). It addresses the development and evaluation of these techniques for their application in the resynthesis of high-realism sound scenes by means of Wave Field Synthesis (WFS). Because the vast majority of audio recordings are preserved in twochannel stereo format, special up-converters are required to use advanced spatial audio reproduction formats, such as WFS. This is due to the fact that WFS needs the original source signals to be available, in order to accurately synthesize the acoustic field inside an extended listening area. Thus, an object-based mixing is required. Source separation problems in digital signal processing are those in which several signals have been mixed together and the objective is to find out what the original signals were. Therefore, SSS algorithms can be applied to existing two-channel mixtures to extract the different objects that compose the stereo scene. Unfortunately, most stereo mixtures are underdetermined, i.e., there are more sound sources than audio channels. This condition makes the SSS problem especially difficult and stronger assumptions have to be taken, often related to the sparsity of the sources under some signal transformation. This thesis is focused on the application of SSS techniques to the spatial sound reproduction field. As a result, its contributions can be categorized within these two areas. First, two underdetermined SSS methods are proposed to deal efficiently with the separation of stereo sound mixtures. These techniques are based on a multi-level thresholding segmentation approach, which enables to perform a fast and unsupervised separation of sound sources in the time-frequency domain. Although both techniques rely on the same clustering type, the features considered by each of them are related to different localization cues that enable to perform separation of either instantaneous or real mixtures.Additionally, two post-processing techniques aimed at improving the isolation of the separated sources are proposed. The performance achieved by several SSS methods in the resynthesis of WFS sound scenes is afterwards evaluated by means of listening tests, paying special attention to the change observed in the perceived spatial attributes. Although the estimated sources are distorted versions of the original ones, the masking effects involved in their spatial remixing make artifacts less perceptible, which improves the overall assessed quality. Finally, some novel developments related to the application of time-frequency processing to source localization and enhanced sound reproduction are presented.Cobos Serrano, M. (2009). Application of sound source separation methods to advanced spatial audio systems [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8969Palanci

RiuNet