
    Multi-Resolution Fully Convolutional Neural Networks for Monaural Audio Source Separation

    In deep neural networks with convolutional layers, each layer typically has a fixed-size, single-resolution receptive field (RF). Convolutional layers with a large RF capture global information from the input features, while layers with a small RF capture local details at high resolution. In this work, we introduce novel deep multi-resolution fully convolutional neural networks (MR-FCNN), in which each layer has RFs of different sizes to extract multi-resolution features that capture both the global and the local details of its input. The proposed MR-FCNN is applied to separate a target audio source from a mixture of many audio sources. Experimental results show that MR-FCNN improves performance over feedforward deep neural networks (DNNs) and single-resolution deep fully convolutional neural networks (FCNNs) on the audio source separation problem.
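
    A minimal sketch of the multi-resolution idea, not the authors' implementation: parallel convolutional branches with different kernel sizes (receptive fields) whose feature maps are concatenated. The framework (PyTorch) and all layer sizes are illustrative assumptions.

```python
# Sketch only: multi-resolution convolution via parallel branches with
# different receptive-field sizes; framework and sizes are assumptions.
import torch
import torch.nn as nn

class MultiResConv(nn.Module):
    def __init__(self, in_ch, out_ch_per_branch=8, kernel_sizes=(3, 9, 15)):
        super().__init__()
        # One branch per receptive-field size; padding keeps the input resolution.
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch_per_branch, k, padding=k // 2)
            for k in kernel_sizes
        )

    def forward(self, x):
        # Concatenate local (small RF) and global (large RF) features per position.
        return torch.cat([torch.relu(b(x)) for b in self.branches], dim=1)

# Example input: a batch of magnitude spectrograms (batch, channel, freq, time).
x = torch.randn(4, 1, 513, 128)
print(MultiResConv(1)(x).shape)  # torch.Size([4, 24, 513, 128])
```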

    A machine-hearing system exploiting head movements for binaural sound localisation in reverberant conditions

    This paper is concerned with machine localisation of multiple active speech sources in reverberant environments using two (binaural) microphones. Such conditions typically present a problem for 'classical' binaural models. Inspired by the human ability to utilise head movements, the current study investigated the influence of different head-movement strategies on binaural sound localisation. A machine-hearing system that exploits a multi-step head-rotation strategy for sound localisation was found to produce the best performance in a simulated reverberant acoustic space. This paper also reports the public release of a free database of binaural room impulse responses (BRIRs) that allows the head rotations used in this study to be simulated.
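
    As a toy illustration of why head rotation helps (not the paper's system), the sketch below uses the Woodworth spherical-head approximation of the interaural time difference (ITD) to show that a front and a back position, indistinguishable for a static head, produce different ITDs once the head turns; head radius, speed of sound, and all angles are assumed values.

```python
# Sketch only: resolving a front-back confusion by comparing the ITD predicted
# for each hypothesis with the ITD observed after a head rotation.
import numpy as np

A, C = 0.0875, 343.0  # assumed head radius (m) and speed of sound (m/s)

def itd(world_azimuth_deg, head_azimuth_deg=0.0):
    """Woodworth ITD approximation for a horizontal-plane source."""
    rel = np.radians(world_azimuth_deg - head_azimuth_deg)
    lateral = np.arcsin(np.sin(rel))  # front and back fold onto the same lateral angle
    return (A / C) * (np.sin(lateral) + lateral)

true_azimuth = 30.0                       # the source really is in front
front_hyp, back_hyp = 30.0, 150.0         # same ITD for a static head at 0 degrees
observed = itd(true_azimuth, head_azimuth_deg=20.0)  # ITD after a 20-degree head turn
for name, hyp in (("front", front_hyp), ("back", back_hyp)):
    print(f"{name}: prediction error = {abs(itd(hyp, 20.0) - observed):.6f} s")
# The hypothesis with the smaller error (here "front") is retained.
```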

    Assessing localization accuracy in sound field synthesis


    Theory of Sound Field Synthesis

    PDF version of the Theory of Sound Field Synthesis presented at http://sfstoolbox.org/en/3.0/

    Probing Speech Emotion Recognition Transformers for Linguistic Knowledge

    Large, pre-trained neural networks consisting of self-attention layers (transformers) have recently achieved state-of-the-art results on several speech emotion recognition (SER) datasets. These models are typically pre-trained in a self-supervised manner with the goal of improving automatic speech recognition performance, and thus of understanding linguistic information. In this work, we investigate the extent to which this information is exploited during SER fine-tuning. Using a reproducible methodology based on open-source tools, we synthesise prosodically neutral speech utterances while varying the sentiment of the text. Valence predictions of the transformer model are very reactive to positive and negative sentiment content, as well as to negations, but not to intensifiers or reducers, while none of these linguistic features affect arousal or dominance. These findings show that transformers can successfully leverage linguistic information to improve their valence predictions, and that linguistic analysis should be included in their testing.
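
    The probing loop could look roughly like the following sketch; `synthesize_neutral` and `predict_vad` are hypothetical placeholders for a flat-prosody text-to-speech system and the fine-tuned SER transformer, and the carrier sentences are invented examples, not the stimuli used in the paper.

```python
# Sketch only: probe whether text sentiment moves the valence prediction while
# prosody is held neutral. The helper functions are hypothetical placeholders.
import numpy as np

def synthesize_neutral(text: str, sr: int = 16000) -> np.ndarray:
    # Placeholder: a real implementation would call a TTS with flattened prosody.
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(sr).astype(np.float32)

def predict_vad(waveform: np.ndarray) -> dict:
    # Placeholder for the SER transformer's valence/arousal/dominance output.
    return {"valence": float(waveform.mean()), "arousal": 0.0, "dominance": 0.0}

variants = {
    "neutral":     "The weather today is average.",
    "positive":    "The weather today is wonderful.",
    "negative":    "The weather today is terrible.",
    "negated":     "The weather today is not wonderful.",
    "intensified": "The weather today is extremely wonderful.",
}

for label, text in variants.items():
    vad = predict_vad(synthesize_neutral(text))
    print(f"{label:12s} valence={vad['valence']:+.3f}")
# If linguistic content is exploited, valence should track the sentiment variants
# while arousal and dominance remain roughly constant.
```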

    A metric for predicting binaural speech intelligibility in stationary noise and competing speech maskers

    One criterion in the design of binaural sound scenes in audio production is the extent to which the intended speech message is correctly understood. Object-based audio broadcasting systems have permitted sound editors to gain more access to the metadata (e.g., intensity and location) of each sound source, providing better control over speech intelligibility. The current study describes and evaluates a binaural distortion-weighted glimpse proportion metric, BiDWGP, which is motivated by better-ear glimpsing and binaural masking level differences. BiDWGP predicts intelligibility from two alternative input forms: either binaural recordings or monophonic recordings from each sound source along with their locations. Two listening experiments were performed with stationary noise and competing speech, one in the presence of a single masker, the other with multiple maskers, for a variety of spatial configurations. Overall, BiDWGP with both input forms predicts listener keyword scores with correlations of 0.95 and 0.91 for single- and multi-masker conditions, respectively. When considering masker type separately, correlations rise to 0.95 and above for both types of maskers. Predictions using the two input forms are very similar, suggesting that BiDWGP can be applied to the design of sound scenes where only individual sound sources and their locations are available.
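
    For intuition, the sketch below computes a plain monaural glimpse proportion: the fraction of time-frequency units whose local target-to-masker ratio exceeds a threshold. The STFT front end, the 3 dB criterion and the random test signals are assumptions; BiDWGP itself additionally applies distortion weighting and the binaural better-ear and unmasking terms mentioned above.

```python
# Sketch only: a simplified monaural glimpse-proportion metric (not BiDWGP).
import numpy as np

def stft_power(x, n_fft=512, hop=256):
    # Frame the signal, window it, and return the power spectrogram.
    frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
    window = np.hanning(n_fft)
    return np.abs(np.fft.rfft(frames * window, axis=-1)) ** 2

def glimpse_proportion(target, masker, threshold_db=3.0):
    # A time-frequency unit is a "glimpse" when the local SNR exceeds the threshold.
    snr_db = 10 * np.log10(stft_power(target) / (stft_power(masker) + 1e-12) + 1e-12)
    return float(np.mean(snr_db > threshold_db))

rng = np.random.default_rng(0)
speech_like = rng.standard_normal(16000)   # stand-ins for target and masker signals
noise = rng.standard_normal(16000)
print(glimpse_proportion(speech_like, 0.5 * noise))
```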

    The moving minimum audible angle is smaller during self motion than during source motion

    We are rarely perfectly still: our heads rotate in three axes and move in three dimensions, constantly varying the spectral and binaural cues at the ear drums. In spite of this motion, static sound sources in the world are typically perceived as stable objects. This argues that the auditory system (in a manner not unlike the vestibulo-ocular reflex) works to compensate for self motion and stabilize our sensory representation of the world. We tested a prediction arising from this postulate: that self motion should be processed more accurately than source motion. We used an infrared motion-tracking system to measure head angle, and real-time interpolation of head-related impulse responses to create "head-stabilized" signals that appeared to remain fixed in space as the head turned. After being presented with pairs of simultaneous signals consisting of a man and a woman speaking a snippet of speech, normal and hearing-impaired listeners were asked to report whether the female voice was to the left or the right of the male voice. In this way we measured the moving minimum audible angle (MMAA). This measurement was made while listeners were asked to turn their heads back and forth between ±15° and the signals were stabilized in space. After this "self-motion" condition we measured the MMAA in a second "source-motion" condition in which listeners remained still and the virtual locations of the signals were moved using the trajectories from the first condition. For both normal and hearing-impaired listeners, we found that the MMAA for signals moving relative to the head was approximately 1-2° smaller when the movement was the result of self motion than when it was the result of source motion, even though the motion with respect to the head was identical. These results, as well as the results of past experiments, suggest that spatial processing involves an ongoing and highly accurate comparison of spatial acoustic cues with self-motion cues.
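
    A minimal sketch of the head-stabilised rendering idea described above: the tracked head angle is subtracted from the intended world angle and an HRIR pair for the resulting head-relative angle is applied (a real system would interpolate measured HRIRs in real time). The HRIR set here is a synthetic placeholder built from pure interaural delays, purely for illustration.

```python
# Sketch only: keep a virtual source fixed in the world while the head turns.
import numpy as np

SR = 44100
ANGLES = np.arange(-180, 180, 5)              # assumed HRIR measurement grid (deg)

def fake_hrir(angle_deg, length=256):
    """Placeholder HRIR pair: a unit impulse delayed by a Woodworth-style ITD."""
    itd = 0.0875 / 343.0 * np.sin(np.radians(angle_deg))
    left, right = np.zeros(length), np.zeros(length)
    left[int(round(max(itd, 0) * SR))] = 1.0
    right[int(round(max(-itd, 0) * SR))] = 1.0
    return left, right

def render_stabilized(signal, world_angle_deg, head_angle_deg):
    """Select the HRIR pair for the head-relative angle of a world-fixed source."""
    relative = (world_angle_deg - head_angle_deg + 180) % 360 - 180
    nearest = ANGLES[np.argmin(np.abs(ANGLES - relative))]
    hl, hr = fake_hrir(nearest)
    return np.stack([np.convolve(signal, hl), np.convolve(signal, hr)])

x = np.random.default_rng(1).standard_normal(SR)   # 1 s of noise as a stand-in signal
binaural = render_stabilized(x, world_angle_deg=20.0, head_angle_deg=15.0)
print(binaural.shape)
```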

    Perceptual Investigation of Sound Field Synthesis (Perzeptive Untersuchung der Schallfeldsynthese)

    This thesis investigates the two sound field synthesis methods Wave Field Synthesis and near-field compensated higher-order Ambisonics. It summarizes their theory and provides a software toolkit for corresponding numerical simulations. Possible deviations of the synthesized sound field for real loudspeaker arrays and their perceptual relevance are discussed, first on the basis of theoretical considerations and then in several psychoacoustic experiments. These experiments investigate spatial and timbral fidelity and spectro-temporal artifacts in a systematic way for a large number of different loudspeaker setups. The experiments are conducted with the help of dynamic binaural synthesis in order to also simulate loudspeaker setups with an inter-loudspeaker spacing of under 1 cm. The results show that spatial fidelity can already be achieved with setups having an inter-loudspeaker spacing of 20 cm, whereas timbral fidelity is only possible for setups employing a spacing below 1 cm. Spectro-temporal artifacts are relevant only for the synthesis of focused sources. At the end of the thesis, a binaural auditory model is presented that is able to predict the spatial fidelity for any given loudspeaker setup.
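
    As a rough back-of-the-envelope link between the reported loudspeaker spacings and the bandwidth synthesised without spatial aliasing (a textbook rule of thumb, not a result taken from the thesis itself):

```latex
% Worst-case spatial aliasing frequency of a linear loudspeaker array with
% spacing \Delta x; c is the speed of sound (approx. 343 m/s).
f_{\mathrm{al}} \approx \frac{c}{2\,\Delta x},
\qquad
f_{\mathrm{al}}(\Delta x = 20\,\mathrm{cm}) \approx 0.86\,\mathrm{kHz},
\qquad
f_{\mathrm{al}}(\Delta x = 1\,\mathrm{cm}) \approx 17\,\mathrm{kHz}
```

    Under this rule of thumb, only spacings around 1 cm push aliasing towards the upper end of the audible range, which is broadly in line with the reported gap between the requirements for spatial and for timbral fidelity.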

    Code to reproduce the figures in the paper 'Assessing localization accuracy in sound field synthesis'

In this upload you find all the scripts and data you need in order to reproduce the figures from the paper Wierstorf et al., "Assessing localization accuracy in sound field synthesis" [1].

## Software Requirements

### Sound Field Synthesis Toolbox

From the [Sound Field Synthesis Toolbox](https://github.com/sfstoolbox/sfs-matlab) git repository you need to check out *commit 3730bc0*, which is identical with release 1.0.0. Under Linux this can be done the following way:

```
git clone https://github.com/sfstoolbox/sfs-matlab.git
cd sfs-matlab
git checkout 3730bc0
cd ..
```

### SOFA (Spatially Oriented Format for Acoustics) Matlab/Octave API

From the [SOFA Matlab/Octave API](https://github.com/sofacoustics/API_MO) git repository you need to check out *commit 260079a*, which is identical with release 1.0.1. Under Linux this can be done the following way:

```
git clone https://github.com/sofacoustics/API_MO.git sofa
cd sofa
git checkout 260079a
cd ..
```

### Two!Ears Auditory Front-end

From the [Two!Ears Auditory Front-end](https://github.com/TWOEARS/auditory-front-end) git repository you need to check out *commit ce47b54*, which is identical with release 1.0. Under Linux this can be done the following way:

```
git clone https://github.com/TWOEARS/auditory-front-end.git
cd auditory-front-end
git checkout ce47b54
cd ..
```

## Reproduce data

After downloading all needed packages, start Matlab. Then we have to initialize all the toolboxes; this can be done by running the following commands, changing into each toolbox directory in turn:

```Matlab
>> cd sfs-matlab
>> SFS_start;
>> cd ../sofa/API_MO
>> SOFAstart;
>> cd ../../auditory-front-end
>> startAuditoryFrontEnd;
>> cd ..
```

Now everything is prepared and you can recreate the data for the numerical simulation in Fig. 1 and the ITD estimation by the binaural model in Fig. 6:

```Matlab
>> cd fig01
>> fig01
>> cd ../fig06
>> fig06
```

## Reproduce figures

All figures were plotted using gnuplot 5.0. Every figure folder has a `figXX.plt` file that you can execute to obtain the resulting PDF file. Only Fig. 2 is generated via TikZ; compile the `fig02.tex` file using pdflatex.

## References

[1] H. Wierstorf, A. Raake, S. Spors, "Assessing localization accuracy in sound field synthesis," J. Acoust. Soc. Am. 141, p. 1111-1119 (2017), doi:10.1121/1.4976061.
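
If you prefer to script the checkout steps above, a small helper along these lines (not part of the original upload) clones each repository and pins it to the listed commit:

```python
# Sketch only: automate the clone-and-pin steps from the instructions above.
import subprocess

REPOS = [
    ("https://github.com/sfstoolbox/sfs-matlab.git", "sfs-matlab", "3730bc0"),
    ("https://github.com/sofacoustics/API_MO.git", "sofa", "260079a"),
    ("https://github.com/TWOEARS/auditory-front-end.git", "auditory-front-end", "ce47b54"),
]

for url, target, commit in REPOS:
    subprocess.run(["git", "clone", url, target], check=True)           # clone into target dir
    subprocess.run(["git", "checkout", commit], cwd=target, check=True)  # pin the version
```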
    • 

    corecore