Multi-Resolution Fully Convolutional Neural Networks for Monaural Audio Source Separation
In deep neural networks with convolutional layers, each layer typically has a fixed-size, single-resolution receptive field (RF). Convolutional layers with a large RF capture global information from the input features, while layers with a small RF capture local details at high resolution. In this work, we introduce novel deep multi-resolution fully convolutional neural networks (MR-FCNN), in which each layer has different RF sizes, extracting multi-resolution features that capture both the global structure and the local details of its input. The proposed MR-FCNN is applied to separate a target audio source from a mixture of many audio sources. Experimental results show that MR-FCNN improves performance over feedforward deep neural networks (DNNs) and single-resolution deep fully convolutional neural networks (FCNNs) on the audio source separation problem.
Comment: arXiv admin note: text overlap with arXiv:1703.0801
A machine-hearing system exploiting head movements for binaural sound localisation in reverberant conditions
This paper is concerned with machine localisation of multiple active speech sources in reverberant environments using two (binaural) microphones. Such conditions typically present a problem for 'classical' binaural models. Inspired by the human ability to utilise head movements, the current study investigated the influence of different head-movement strategies on binaural sound localisation. A machine-hearing system that exploits a multi-step head-rotation strategy for sound localisation was found to produce the best performance in a simulated reverberant acoustic space. This paper also reports the public release of a free database of binaural room impulse responses (BRIRs) that allows simulation of the head rotation used in this study.
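As background, the basic binaural cue such systems build on can be sketched in a few lines (a NumPy illustration of a single-source, anechoic case; not the paper's model): the interaural time difference (ITD) is estimated as the lag maximising the cross-correlation between the two ear signals, and head rotation changes this ITD, which is what a multi-step rotation strategy can exploit.

```python
import numpy as np

def estimate_itd(left, right, fs, max_lag_s=1e-3):
    """Estimate the interaural time difference (seconds) as the
    cross-correlation lag between the two ear signals.
    Illustrative sketch only -- real binaural models work per
    frequency band and handle reverberation explicitly."""
    max_lag = int(max_lag_s * fs)
    lags = np.arange(-max_lag, max_lag + 1)
    corr = [np.sum(left * np.roll(right, -lag)) for lag in lags]
    return lags[int(np.argmax(corr))] / fs

fs = 16000
t = np.arange(0, 0.05, 1 / fs)
sig = np.sin(2 * np.pi * 500 * t)
left = sig
right = np.roll(sig, 8)                 # simulate an 8-sample (0.5 ms) delay
print(estimate_itd(left, right, fs))    # 0.0005
```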
Theory of Sound Field Synthesis
PDF version of the Theory of Sound Field Synthesis presented at http://sfstoolbox.org/en/3.0/
Probing Speech Emotion Recognition Transformers for Linguistic Knowledge
Large, pre-trained neural networks consisting of self-attention layers (transformers) have recently achieved state-of-the-art results on several speech emotion recognition (SER) datasets. These models are typically pre-trained in a self-supervised manner with the goal of improving automatic speech recognition performance, and thus of understanding linguistic information. In this work, we investigate the extent to which this information is exploited during SER fine-tuning. Using a reproducible methodology based on open-source tools, we synthesise prosodically neutral speech utterances while varying the sentiment of the text. Valence predictions of the transformer model are very reactive to positive and negative sentiment content, as well as to negations, but not to intensifiers or reducers, while none of these linguistic features impact arousal or dominance. These findings show that transformers can successfully leverage linguistic information to improve their valence predictions, and that linguistic analysis should be included in their testing.
Comment: This work has been submitted for publication to Interspeech 202
A metric for predicting binaural speech intelligibility in stationary noise and competing speech maskers
One criterion in the design of binaural sound scenes in audio production is the extent to which the intended speech message is correctly understood. Object-based audio broadcasting systems have permitted sound editors to gain more access to the metadata (e.g., intensity and location) of each sound source, providing better control over speech intelligibility. The current study describes and evaluates a binaural distortion-weighted glimpse proportion metric, BiDWGP, which is motivated by better-ear glimpsing and binaural masking level differences. BiDWGP predicts intelligibility from two alternative input forms: either binaural recordings or monophonic recordings from each sound source along with their locations. Two listening experiments were performed with stationary noise and competing speech, one in the presence of a single masker, the other with multiple maskers, for a variety of spatial configurations. Overall, BiDWGP with both input forms predicts listener keyword scores with correlations of 0.95 and 0.91 for single- and multi-masker conditions, respectively. When considering masker type separately, correlations rise to 0.95 and above for both types of maskers. Predictions using the two input forms are very similar, suggesting that BiDWGP can be applied to the design of sound scenes where only individual sound sources and their locations are available.
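The "glimpsing" idea behind the metric can be sketched numerically (a simplified NumPy illustration, not the BiDWGP implementation): in a time-frequency representation, a glimpse is a unit where the target's local level exceeds the masker's by some threshold, and the glimpse proportion is the fraction of such units.

```python
import numpy as np

def glimpse_proportion(target_tf, masker_tf, threshold_db=3.0):
    """Fraction of time-frequency units where the target level
    exceeds the masker level by at least `threshold_db`.
    Simplified sketch -- BiDWGP additionally applies distortion
    weighting and better-ear/binaural-unmasking terms."""
    local_snr_db = 10 * np.log10(target_tf / masker_tf)
    return float(np.mean(local_snr_db >= threshold_db))

rng = np.random.default_rng(0)
target = rng.uniform(0.1, 10.0, size=(32, 100))   # toy spectro-temporal power
masker = rng.uniform(0.1, 10.0, size=(32, 100))
print(glimpse_proportion(target, masker))
```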
The moving minimum audible angle is smaller during self motion than during source motion
We are rarely perfectly still: our heads rotate in three axes and move in three dimensions, constantly varying the spectral and binaural cues at the ear drums. In spite of this motion, static sound sources in the world are typically perceived as stable objects. This argues that the auditory system, in a manner not unlike the vestibulo-ocular reflex, works to compensate for self motion and stabilize our sensory representation of the world. We tested a prediction arising from this postulate: that self motion should be processed more accurately than source motion. We used an infrared motion tracking system to measure head angle, and real-time interpolation of head related impulse responses to create "head-stabilized" signals that appeared to remain fixed in space as the head turned. After being presented with pairs of simultaneous signals consisting of a man and a woman speaking a snippet of speech, normal and hearing impaired listeners were asked to report whether the female voice was to the left or the right of the male voice. In this way we measured the moving minimum audible angle (MMAA). This measurement was made while listeners were asked to turn their heads back and forth between ± 15° and the signals were stabilized in space. After this "self-motion" condition we measured MMAA in a second "source-motion" condition when listeners remained still and the virtual locations of the signals were moved using the trajectories from the first condition. For both normal and hearing impaired listeners, we found that the MMAA for signals moving relative to the head was ~1-2° smaller when the movement was the result of self motion than when it was the result of source motion, even though the motion with respect to the head was identical. These results as well as the results of past experiments suggest that spatial processing involves an ongoing and highly accurate comparison of spatial acoustic cues with self-motion cues.
Perceptual Investigation of Sound Field Synthesis
This thesis investigates the two sound field synthesis methods Wave Field Synthesis and near-field compensated higher order Ambisonics. It summarizes their theory and provides a software toolkit for corresponding numerical simulations. Possible deviations of the synthesized sound field for real loudspeaker arrays and their perceptual relevance are discussed. This is done firstly based on theoretical considerations, and then addressed in several psychoacoustic experiments. These experiments investigate the spatial and timbral fidelity and spectro-temporal artifacts in a systematic way for a large number of different loudspeaker setups.
The experiments are conducted with the help of dynamic binaural synthesis in order to simulate loudspeaker setups with an inter-loudspeaker spacing of under 1 cm. The results show that spatial fidelity can already be achieved with setups having an inter-loudspeaker spacing of 20 cm, whereas timbral fidelity is only possible for setups employing a spacing below 1 cm. Spectro-temporal artifacts are relevant only for the synthesis of focused sources. At the end of the thesis, a binaural auditory model is presented that is able to predict the spatial fidelity for any given loudspeaker setup.
Code to reproduce the figures in the paper 'Assessing localization accuracy in sound field synthesis'
In this upload you find all the scripts and data needed to reproduce the
figures from the paper Wierstorf et al., "Assessing localization accuracy in
sound field synthesis" [1].
## Software Requirements
### Sound Field Synthesis Toolbox
From the [Sound Field Synthesis
Toolbox](https://github.com/sfstoolbox/sfs-matlab) git
repository you need to check out *commit 3730bc0*, which is identical
to release 1.0.0. Under Linux this can be done the following way:
```
git clone https://github.com/sfstoolbox/sfs-matlab.git
cd sfs-matlab
git checkout 3730bc0
cd ..
```
### SOFA (Spatially Oriented Format for Acoustics) Matlab/Octave API
From the [SOFA Matlab/Octave API](https://github.com/sofacoustics/API_MO) git
repository you need to check out *commit 260079a*, which is identical
to release 1.0.1. Under Linux this can be done the following way:
```
git clone https://github.com/sofacoustics/API_MO.git sofa
cd sofa
git checkout 260079a
cd ..
```
### Two!Ears Auditory Front-end
From the [Two!Ears Auditory
Front-end](https://github.com/TWOEARS/auditory-front-end) git repository you
need to check out *commit ce47b54*, which is identical to
release 1.0. Under Linux this can be done the following way:
```
git clone https://github.com/TWOEARS/auditory-front-end.git
cd auditory-front-end
git checkout ce47b54
cd ..
```
## Reproduce data
After downloading all needed packages, start Matlab. Then we have to initialize
all the toolboxes. This can be done by running the following commands from the
directory that contains the three checkouts:
```Matlab
>> cd sfs-matlab
>> SFS_start;
>> cd ../sofa/API_MO
>> SOFAstart;
>> cd ../../auditory-front-end
>> startAuditoryFrontEnd;
>> cd ..
```
Now everything is prepared and you can recreate the data from the numerical
simulation in Fig. 1 and the ITD estimation by the binaural model in Fig. 6:
```Matlab
>> cd fig01
>> fig01
>> cd ../fig06
>> fig06
```
## Reproduce figures
All figures were plotted using gnuplot 5.0. Every figure folder contains a
`figXX.plt` file; executing it with gnuplot produces the resulting PDF file.
Only Fig. 2 is generated via TikZ and has to be compiled from the `fig02.tex`
file using pdflatex.
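The per-figure commands can be collected with a small helper (a hypothetical convenience script, not part of the upload; it only prints the commands rather than running them, since gnuplot and pdflatex must be installed separately):

```python
from pathlib import Path

def figure_commands(root="."):
    """List the shell commands needed to rebuild each figure.
    fig02 is the only TikZ/pdflatex figure; all others use gnuplot.
    (Hypothetical helper; run the printed commands manually.)"""
    cmds = []
    for d in sorted(Path(root).glob("fig*")):
        if d.name == "fig02":
            cmds.append(f"cd {d.name} && pdflatex {d.name}.tex")
        else:
            cmds.append(f"cd {d.name} && gnuplot {d.name}.plt")
    return cmds

for cmd in figure_commands():
    print(cmd)
```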
## References
[1] H. Wierstorf, A. Raake, S. Spors, "Assessing localization accuracy in sound
field synthesis," J. Acoust. Soc. Am., 141, p. 1111-1119 (2017), doi:10.1121/1.4976061