A hybrid room acoustic modeling approach combining image source, acoustic diffusion equation, and time-domain discontinuous Galerkin methods
In this paper a hybrid model is introduced that constructs a broadband room impulse response using a geometrical method (image source method) and a statistical method (acoustic diffusion equation) for the high-frequency range, supported by a wave-based method (time-domain discontinuous Galerkin method) for the low-frequency range. A crucial element is the construction of the high-frequency impulse response, where a transition from a predominantly specular (image source) to a predominantly diffuse (diffusion equation) sound field is required. To achieve this transition, an analytical envelope is introduced. A key factor is the room-averaged scattering coefficient, which accounts for all scattering behavior of the room and determines the speed of the transition from a specular to a non-specular sound field. To evaluate its performance, the model is compared to a broadband wave-based solver for two reference scenarios. The hybrid model shows promising results in terms of reverberation time (T20), center time (Ts), and bass ratio (BR). Aspects such as the geometrical complexity used, the room-averaged scattering coefficients, and other model simplifications and assumptions are discussed.
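The specular-to-diffuse transition described above can be illustrated with a small sketch. The paper defines its own analytical envelope; the exponential crossfade below, the function names, and the mixing-time parameter `t_mix` are purely hypothetical stand-ins that only illustrate how a room-averaged scattering coefficient could control the transition speed:

```python
import numpy as np

def transition_envelope(t, t_mix, s_avg):
    """Hypothetical analytical envelope blending specular and diffuse
    energy: a larger room-averaged scattering coefficient s_avg gives
    a faster transition toward the diffuse field around the mixing
    time t_mix. The paper's actual envelope may differ; this
    exponential crossfade only illustrates the idea."""
    w_diffuse = 1.0 - np.exp(-s_avg * t / t_mix)
    return 1.0 - w_diffuse, w_diffuse  # (specular weight, diffuse weight)

def hybrid_highfreq_rir(h_ism, h_diffusion, fs, t_mix, s_avg):
    """Blend an image-source RIR with a diffusion-equation response
    using the transition weights (equal lengths assumed for both)."""
    t = np.arange(len(h_ism)) / fs
    w_spec, w_dif = transition_envelope(t, t_mix, s_avg)
    return w_spec * h_ism + w_dif * h_diffusion
```

At t = 0 the field is fully specular, and the diffuse weight approaches one well after the mixing time, so the two weights always sum to one.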
Time-domain Ad-hoc Array Speech Enhancement Using a Triple-path Network
Deep neural networks (DNNs) are very effective for multichannel speech enhancement with fixed array geometries. However, it is not trivial to use DNNs for ad-hoc arrays with unknown order and placement of microphones. We propose a novel triple-path network for ad-hoc array processing in the time domain. The key idea in the network design is to divide the overall processing into spatial processing and temporal processing and to use self-attention for the spatial processing. Using self-attention for spatial processing makes the network invariant to the order and the number of microphones. The temporal processing is done independently for all channels using a recently proposed dual-path attentive recurrent network. The proposed network is a multiple-input multiple-output architecture that can simultaneously enhance signals at all microphones. Experimental results demonstrate the excellent performance of the proposed approach. Further, we present analysis to demonstrate the effectiveness of the proposed network in utilizing multichannel information even from microphones at far locations.
Comment: Accepted for publication in INTERSPEECH 202
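The property that self-attention across the channel axis makes spatial processing independent of microphone ordering can be checked with a minimal numpy sketch. This is not the authors' triple-path architecture; it is a single self-attention layer with assumed weight shapes, shown only to demonstrate the permutation property:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_self_attention(X, Wq, Wk, Wv):
    """Self-attention across the microphone (channel) axis.
    X: (num_mics, feat_dim). Shared projection weights mean the layer
    accepts any number of microphones, and permuting the input rows
    permutes the output rows identically (permutation equivariance)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return A @ V
```

Pooling the output over the channel axis (e.g. a mean) would then yield a representation that is fully invariant to microphone order.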
Direct and Residual Subspace Decomposition of Spatial Room Impulse Responses
Psychoacoustic experiments have shown that directional properties of the direct sound, salient reflections, and the late reverberation of an acoustic room response can have a distinct influence on the auditory perception of a given room. Spatial room impulse responses (SRIRs) capture those properties and thus are used for direction-dependent room acoustic analysis and virtual acoustic rendering. This work proposes a subspace method that decomposes SRIRs into a direct part, which comprises the direct sound and the salient reflections, and a residual, to facilitate enhanced analysis and rendering methods by providing individual access to these components. The proposed method is based on the generalized singular value decomposition and interprets the residual as noise that is to be separated from the other components of the reverberation. Large generalized singular values are attributed to the direct part, which is then obtained as a low-rank approximation of the SRIR. By advancing from the end of the SRIR toward the beginning while iteratively updating the residual estimate, the method adapts to spatio-temporal variations of the residual. The method is evaluated using a spatio-spectral error measure and simulated SRIRs of different rooms, microphone arrays, and ratios of direct sound to residual energy. The proposed method yields lower errors than existing approaches in all tested scenarios, including a scenario with two simultaneous reflections. A case study with measured SRIRs shows the applicability of the method under real-world acoustic conditions. A reference implementation is provided.
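The core idea of attributing large singular values to the direct part and obtaining it as a low-rank approximation can be sketched in a few lines. Note this stand-in uses the plain SVD, whereas the paper's method uses the generalized SVD with an iteratively updated residual estimate that advances backward through the SRIR:

```python
import numpy as np

def subspace_split(frame, rank):
    """Split a multichannel SRIR frame (channels x samples) into a
    low-rank 'direct' part and a residual. NOTE: plain-SVD stand-in
    for illustration; the paper's method is based on the generalized
    SVD and adapts its residual estimate over time."""
    U, s, Vt = np.linalg.svd(frame, full_matrices=False)
    direct = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return direct, frame - direct
```

The two returned parts always sum back to the input frame, and a frame dominated by one coherent component leaves only a small residual after a rank-1 split.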
Perceptual Evaluation of Spatial Room Impulse Response Extrapolation by Direct and Residual Subspace Decomposition
Six-degrees-of-freedom rendering of an acoustic environment can be achieved by interpolating a set of measured spatial room impulse responses (SRIRs). However, the involved measurement effort and computational expense are high. This work compares novel ways of extrapolating a single measured SRIR to a target position. The novel extrapolation techniques are based on a recently proposed subspace method that decomposes SRIRs into a direct part, comprising direct sound and salient reflections, and a residual. We evaluate extrapolations between different positions in a shoebox-shaped room in a multi-stimulus comparison test. Extrapolation using a residual SRIR and salient reflections that match the reflections at the target position is rated as perceptually most similar to the measured reference
Spatial Subtraction of Reflections from Room Impulse Responses Measured With a Spherical Microphone Array
We propose a method for the decomposition of measured directional room impulse responses (DRIRs) into prominent reflections and a residual. The method comprises obtaining a fingerprint of the time-frequency signal that a given reflection carries, imposing this time-frequency fingerprint on a plane-wave prototype that exhibits the same propagation direction as the reflection, and finally subtracting this plane-wave prototype from the DRIR. Our main contributions are the formulation of the problem as a spatial subtraction as well as the incorporation of order truncation, spatial aliasing, and regularization of the radial filters into the definition of the underlying beamforming problem. We demonstrate, based on simulated as well as measured array impulse responses, that our method increases the accuracy of the model of the reflection under test and consequently decreases the energy of the residual that remains in a measured DRIR after the spatial subtraction.
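The subtraction step can be illustrated with a heavily simplified sketch: fit a single gain between the prototype and the measured response in the least-squares sense, then subtract. The paper's actual method instead imposes a time-frequency fingerprint on the prototype and regularizes the radial filters of the beamforming problem, none of which is modeled here:

```python
import numpy as np

def subtract_reflection(drir, prototype):
    """Remove a plane-wave prototype from a multichannel DRIR segment
    via a least-squares gain fit. A simplified stand-in for the
    paper's spatial subtraction, which uses a time-frequency
    fingerprint and a regularized beamforming formulation."""
    g = np.vdot(prototype, drir) / np.vdot(prototype, prototype)
    return drir - g * prototype
```

By construction the residual is orthogonal to the prototype, so any energy coherent with the modeled reflection is removed.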
Topological Sound Propagation with Reverberation Graphs
Reverberation graphs are a novel approach to estimate global sound pressure decay and auralize corresponding reverberation effects in interactive virtual environments. We use a 3D model to represent the geometry of the environment explicitly, and we subdivide it into a series of coupled spaces connected by portals. Off-line geometrical-acoustics techniques are used to precompute transport operators, which encode pressure decay characteristics within each space and between coupling interfaces. At run-time, during an interactive simulation, we traverse the adjacency graph corresponding to the spatial subdivision of the environment. We combine transport operators along different sound propagation routes to estimate the pressure decay envelopes from sources to the listener. Our approach compares well with off-line geometrical techniques, but computes reverberation decay envelopes at interactive rates, ranging from 12 to 100 Hz. We propose a scalable artificial reverberator that uses these decay envelopes to auralize reverberation effects, including room coupling. Our complete system can render as many as 30 simultaneous sources in large dynamic virtual environments.
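Combining transport operators along propagation routes through the adjacency graph can be sketched as follows. The operator representation here (a scalar gain and a delay per portal) and the route-selection criterion are assumed for illustration; the paper's operators encode full pressure decay characteristics:

```python
def combine_route(operators):
    """Combine simplified transport operators along one route:
    per-portal gains multiply, propagation delays add. (Assumed
    scalar stand-in for the paper's precomputed decay operators.)"""
    gain, delay = 1.0, 0.0
    for g, d in operators:
        gain *= g
        delay += d
    return gain, delay

def best_route(graph, src, dst, _seen=None):
    """Depth-first traversal of the room-adjacency graph, returning
    the route with the highest combined gain from src to dst.
    graph: {room: {neighbor: (gain, delay)}}."""
    _seen = (_seen or set()) | {src}
    if src == dst:
        return 1.0, 0.0
    best = None
    for nxt, (g, d) in graph.get(src, {}).items():
        if nxt in _seen:
            continue
        sub = best_route(graph, nxt, dst, _seen)
        if sub is not None:
            cand = (g * sub[0], d + sub[1])
            if best is None or cand[0] > best[0]:
                best = cand
    return best
```

A run-time system would combine the contributions of all routes rather than picking one; the single-route search above just shows the traversal and the operator composition.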
Towards Improved Room Impulse Response Estimation for Speech Recognition
We propose to characterize and improve the performance of blind room impulse
response (RIR) estimation systems in the context of a downstream application
scenario, far-field automatic speech recognition (ASR). We first draw the
connection between improved RIR estimation and improved ASR performance, as a
means of evaluating neural RIR estimators. We then propose a GAN-based
architecture that encodes RIR features from reverberant speech and constructs
an RIR from the encoded features, and uses a novel energy decay relief loss to
optimize for capturing energy-based properties of the input reverberant speech.
We show that our model outperforms the state-of-the-art baselines on acoustic
benchmarks (by 72% on the energy decay relief and 22% on an early-reflection
energy metric), as well as in an ASR evaluation task (by 6.9% in word error
rate)
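The energy decay relief underlying the loss above is the backward-integrated energy computed per frequency band. The sketch below shows one common way to compute such a quantity with plain numpy; the framing parameters and the L1-on-log-EDR loss form are assumptions, not necessarily the paper's exact formulation:

```python
import numpy as np

def energy_decay_relief(rir, n_fft=256, hop=128):
    """Energy decay relief (EDR): Schroeder backward integration of
    energy carried out per STFT frequency band, i.e. the energy
    remaining from each time frame onward in each band. The framing
    parameters here are assumed for illustration."""
    n_frames = 1 + (len(rir) - n_fft) // hop
    win = np.hanning(n_fft)
    frames = np.stack([rir[i*hop:i*hop + n_fft] * win
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (frames, bins)
    return np.cumsum(power[::-1], axis=0)[::-1]        # backward integration

def edr_loss(rir_a, rir_b):
    """Hypothetical energy-decay-relief loss: mean L1 distance
    between the log-EDRs of two impulse responses."""
    ea, eb = energy_decay_relief(rir_a), energy_decay_relief(rir_b)
    return np.mean(np.abs(np.log10(ea + 1e-12) - np.log10(eb + 1e-12)))
```

The EDR is non-increasing over time in every band by construction, which is what makes it a stable target for energy-based training objectives.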
Chat2Map: Efficient Scene Mapping from Multi-Ego Conversations
Can conversational videos captured from multiple egocentric viewpoints reveal
the map of a scene in a cost-efficient way? We seek to answer this question by
proposing a new problem: efficiently building the map of a previously unseen 3D
environment by exploiting shared information in the egocentric audio-visual
observations of participants in a natural conversation. Our hypothesis is that
as multiple people ("egos") move in a scene and talk among themselves, they
receive rich audio-visual cues that can help uncover the unseen areas of the
scene. Given the high cost of continuously processing egocentric visual
streams, we further explore how to actively coordinate the sampling of visual
information, so as to minimize redundancy and reduce power use. To that end, we
present an audio-visual deep reinforcement learning approach that works with
our shared scene mapper to selectively turn on the camera to efficiently chart
out the space. We evaluate the approach using a state-of-the-art audio-visual
simulator for 3D scenes as well as real-world video. Our model outperforms
previous state-of-the-art mapping methods, and achieves an excellent
cost-accuracy tradeoff. Project: http://vision.cs.utexas.edu/projects/chat2map.Comment: Accepted to CVPR 202