Modeling the Frequency-Dependent Sound Energy Decay of Acoustic Environments with Differentiable Feedback Delay Networks
Differentiable machine learning techniques have recently proved effective for finding the parameters of Feedback Delay Networks (FDNs) so that their output matches desired perceptual qualities of target room impulse responses. However, we show that, unless properly trained, existing methods tend to fail at modeling the frequency-dependent behavior of sound energy decay that characterizes real-world environments. In this paper, we introduce a novel perceptual loss function based on the mel-scale energy decay relief, which generalizes the well-known time-domain energy decay curve to multiple frequency bands. We also augment the prototype FDN by incorporating differentiable wideband attenuation and output filters, and train them via backpropagation along with the other model parameters. The proposed approach improves upon existing strategies for designing and training differentiable FDNs, making it better suited to audio processing applications where realistic and controllable artificial reverberation is desirable, such as gaming, music production, and virtual reality.
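To make the loss concrete, here is a minimal PyTorch sketch of a frequency-dependent energy decay loss in the spirit described above: it computes an energy decay relief (EDR) by backward time-integration of the power spectrogram and compares predicted and target EDRs in decibels. The mel-band mapping and exact weighting used in the paper are omitted, and all names are illustrative rather than the authors' code.

```python
import torch

def edr_db(ir, n_fft=1024, hop=256, eps=1e-8):
    # Power spectrogram of the impulse response.
    spec = torch.stft(ir, n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft),
                      return_complex=True).abs() ** 2
    # Backward cumulative sum along time: energy remaining from each
    # frame onward, i.e., the energy decay relief per frequency bin.
    rem = torch.flip(torch.cumsum(torch.flip(spec, dims=[-1]), dim=-1),
                     dims=[-1])
    return 10.0 * torch.log10(rem + eps)

def edr_loss(pred_ir, target_ir):
    # Mean absolute deviation between EDRs in dB; a mel filterbank
    # could be applied to the spectrograms first (omitted here).
    return torch.mean(torch.abs(edr_db(pred_ir) - edr_db(target_ir)))
```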
Differentiable MIMO Feedback Delay Networks for Multichannel Room Impulse Response Modeling
Recently, with the advent of increasingly capable headsets and goggles, the demand for Virtual and Augmented Reality applications has risen steeply. For users to navigate virtual rooms coherently, the acoustics of the scene must be emulated as accurately and efficiently as possible. Among other tools, Feedback Delay Networks (FDNs) have proved valuable for tackling this task. In this article, we expand and adapt a method recently proposed for the data-driven optimization of single-input-single-output FDNs to the multiple-input-multiple-output (MIMO) case, addressing spatial/space-time processing applications. By testing our methodology on items taken from two different datasets, we show that the parameters of MIMO FDNs can be jointly optimized to match perceptual characteristics of given multichannel room impulse responses, outperforming existing approaches in the literature and paving the way toward increasingly efficient and accurate real-time virtual room acoustics rendering.
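For readers unfamiliar with the structure being optimized, the toy time-domain MIMO FDN below shows where the quantities under optimization (delay lengths and the feedback, input, output, and direct matrices A, B, C, D) enter the recursion. This is a generic textbook FDN written for clarity, not the authors' differentiable implementation.

```python
import numpy as np

def mimo_fdn(x, delays, A, B, C, D):
    """x: (T, n_in) input; delays: N delay lengths in samples;
    A: (N, N) feedback, B: (N, n_in), C: (n_out, N), D: (n_out, n_in)."""
    T, _ = x.shape
    N = len(delays)
    n_out = C.shape[0]
    bufs = [np.zeros(m) for m in delays]        # circular delay buffers
    ptrs = [0] * N
    y = np.zeros((T, n_out))
    for t in range(T):
        s = np.array([bufs[i][ptrs[i]] for i in range(N)])  # delay outputs
        y[t] = C @ s + D @ x[t]                 # output + direct path
        v = A @ s + B @ x[t]                    # feedback + injected input
        for i in range(N):
            bufs[i][ptrs[i]] = v[i]             # write back into delay line
            ptrs[i] = (ptrs[i] + 1) % delays[i]
    return y
```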
A latent rhythm complexity model for attribute-controlled drum pattern generation
Most music listeners have an intuitive understanding of the notion of rhythm complexity. Musicologists and scientists, however, have long sought objective ways to measure and model such a distinctively perceptual attribute of music. Whereas previous research has mainly focused on monophonic patterns, this article presents a novel perceptually informed rhythm complexity measure specifically designed for polyphonic rhythms, i.e., patterns in which multiple simultaneous voices cooperate toward creating a coherent musical phrase. We focus on drum rhythms relating to the Western musical tradition and validate the proposed measure through a perceptual test in which users were asked to rate the complexity of real-life drumming performances. We then propose a latent vector model for rhythm complexity based on a recurrent variational autoencoder tasked with learning the complexity of input samples and embedding it along one latent dimension. Aided by an auxiliary adversarial loss term promoting disentanglement, this effectively regularizes the latent space, thus enabling explicit control over the complexity of newly generated patterns. Trained on a large corpus of MIDI files of polyphonic drum recordings, the proposed method proved capable of generating coherent and realistic samples at the desired complexity value. In our experiments, output and target complexities show a high correlation, and the latent space appears interpretable and continuously navigable. On the one hand, this model can readily contribute to a wide range of creative applications, including, for instance, assisted music composition and automatic music generation. On the other hand, it brings us one step closer to the ambitious goal of equipping machines with a human-like understanding of perceptual features of music.
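The following PyTorch sketch illustrates one plausible way to combine the loss terms mentioned above: a reconstruction term, the VAE's KL term, a regression term pinning complexity to a single latent dimension, and an adversarial term discouraging the remaining dimensions from encoding it. The decomposition and names are our guess at the general recipe, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def attribute_vae_loss(recon, target, mu, logvar, z, complexity, adversary):
    """z: (B, d) sampled latents; complexity: (B,) ground-truth scores;
    adversary: a small net trying to read complexity from z[:, 1:]."""
    recon_loss = F.binary_cross_entropy(recon, target)   # pattern reconstruction
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    attr_loss = F.mse_loss(z[:, 0], complexity)          # dim 0 <- complexity
    # The encoder is rewarded when the adversary fails, promoting
    # disentanglement; the adversary itself is trained in a separate
    # minimization step (not shown).
    adv_loss = -F.mse_loss(adversary(z[:, 1:]).squeeze(-1), complexity)
    return recon_loss + kl + attr_loss + adv_loss
```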
Hybrid Packet Loss Concealment for Real-Time Networked Music Applications
Real-time audio communications over IP have become essential to our daily lives. Packet-switched networks, however, are inherently prone to jitter and data losses, thus creating a strong need for effective packet loss concealment (PLC) techniques. Though deep-learning-based solutions have made significant progress in that direction as far as speech is concerned, extending such methods to Networked Music Performance (NMP) applications presents significant challenges, including high fidelity requirements, higher sampling rates, and the stringent temporal constraints associated with simultaneous interaction between remote musicians. In this article, we present PARCnet, a hybrid PLC method that utilizes a feed-forward neural network to estimate the time-domain residual signal of a parallel linear autoregressive model. Objective metrics and a listening test show that PARCnet provides state-of-the-art results while enabling real-time operation on CPU.
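A minimal sketch of the hybrid idea, assuming the AR coefficients are ordered most-recent-lag-first and the network is an opaque callable: the linear model extrapolates the lost frame from past samples, and the network supplies the residual the linear predictor cannot capture. Names and interfaces are illustrative, not PARCnet's actual API.

```python
import numpy as np

def conceal_frame(history, ar_coeffs, residual_net, frame_len):
    """history: past valid samples; ar_coeffs: length-p AR coefficients,
    most recent lag first; residual_net: callable mapping history to a
    predicted residual waveform (hypothetical interface)."""
    p = len(ar_coeffs)
    ctx = list(history[-p:])                       # last p samples, oldest first
    ar_pred = np.zeros(frame_len)
    for n in range(frame_len):
        ar_pred[n] = np.dot(ar_coeffs, ctx[::-1])  # linear extrapolation
        ctx = ctx[1:] + [ar_pred[n]]               # feed prediction back
    residual = residual_net(history)               # learned residual estimate
    return ar_pred + residual[:frame_len]
```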
Toward deep drum source separation
In the past, the field of drum source separation faced significant challenges due to limited data availability, hindering the adoption of cutting-edge deep learning methods that have found success in other related audio applications. In this letter, we introduce StemGMD, a large-scale audio dataset of isolated single-instrument drum stems. Each audio clip is synthesized from MIDI recordings of expressive drum performances using ten real-sounding acoustic drum kits. Totaling 1224 hours, StemGMD is the largest audio dataset of drums to date and the first to comprise isolated audio clips for every instrument in a canonical nine-piece drum kit. We leverage StemGMD to develop LarsNet, a novel deep drum source separation model. Through a bank of dedicated U-Nets, LarsNet can separate five stems from a stereo drum mixture faster than real time and is shown to considerably outperform state-of-the-art nonnegative spectro-temporal factorization methods.
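The sketch below shows one common way such a bank of U-Nets can be applied at inference time: each network predicts a soft magnitude mask on the mixture's STFT, and each stem is resynthesized by inverse STFT. This is a generic mask-based separation recipe under assumed shapes, not LarsNet's published code.

```python
import torch

def separate_drums(mixture, unets, n_fft=2048, hop=512):
    """mixture: (channels, samples) waveform; unets: dict mapping stem
    names (e.g., kick, snare, toms, hi-hat, cymbals) to mask networks."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(mixture, n_fft, hop_length=hop, window=window,
                      return_complex=True)
    mag = spec.abs()
    stems = {}
    for name, net in unets.items():
        mask = net(mag.unsqueeze(0)).squeeze(0)   # soft mask in [0, 1]
        stems[name] = torch.istft(mask * spec, n_fft, hop_length=hop,
                                  window=window,
                                  length=mixture.shape[-1])
    return stems
```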
A Deep Learning Approach for Low-Latency Packet Loss Concealment of Audio Signals in Networked Music Performance Applications
Networked Music Performance (NMP) is envisioned as a potential game changer among Internet applications: it aims at revolutionizing the traditional concept of musical interaction by enabling remote musicians to interact and perform together through a telecommunication network. Ensuring realistic conditions for music performance, however, constitutes a significant engineering challenge due to extremely strict requirements in terms of audio quality and, most importantly, network delay. To minimize the end-to-end delay experienced by the musicians, typical implementations of NMP applications use uncompressed, bidirectional audio streams and leverage UDP as the transport protocol. Since UDP is connectionless and unreliable, audio packets lost in transit are not retransmitted and thus cause glitches in the receiver's audio playout. This article describes a technique for predicting lost packet content in real time using a deep learning approach. The ability to conceal errors in real time can help mitigate audio impairments caused by packet losses, thus improving the quality of audio playout in real-world scenarios.
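As a rough illustration of where such a predictor sits in the receive chain, the toy loop below detects gaps in UDP sequence numbers and substitutes a neural prediction computed from recent context. The packet interface, context length, and predictor signature are all assumptions made for this example; real NMP clients also handle jitter buffering and out-of-order arrivals.

```python
import numpy as np

def playout_stream(next_packet, predictor, frame_len, ctx_frames=8):
    """next_packet: callable returning (seq, frame) pairs in arrival order;
    predictor: callable mapping past audio to a predicted frame."""
    expected = 0
    context = np.zeros(ctx_frames * frame_len)
    while True:
        seq, frame = next_packet()
        while seq > expected:                       # gap: packet(s) lost
            concealed = predictor(context)[:frame_len]
            yield concealed
            context = np.concatenate([context[frame_len:], concealed])
            expected += 1
        # Late packets (seq < expected) are simply dropped here.
        if seq == expected:
            yield frame
            context = np.concatenate([context[frame_len:], frame])
            expected = seq + 1
```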
Data-Driven Room Acoustic Modeling Via Differentiable Feedback Delay Networks With Learnable Delay Lines
Over the past few decades, extensive research has been devoted to the design of artificial reverberation algorithms aimed at emulating the room acoustics of physical environments. Despite significant advancements, automatic parameter tuning of delay-network models remains an open challenge. We introduce a novel method for finding the parameters of a feedback delay network (FDN) such that its output renders target attributes of a measured room impulse response. The proposed approach involves the implementation of a differentiable FDN with trainable delay lines, which, for the first time, allows us to simultaneously learn every delay-network parameter via backpropagation. The iterative optimization process seeks to minimize a perceptually motivated time-domain loss function incorporating differentiable terms accounting for energy decay and echo density. Through experimental validation, we show that the proposed method yields time-invariant, frequency-independent FDNs capable of closely matching the desired acoustical characteristics, and outperforms existing methods based on genetic algorithms and analytical FDN design.
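One way to make a delay length trainable, sketched below, is to apply it as a linear phase term in the frequency domain so that gradients flow through a real-valued delay parameter m. The abstract does not specify the paper's exact parameterization, so treat this as an illustration of the general trick rather than the authors' method.

```python
import torch

def fractional_delay(x, m, n_fft):
    """Delay signal x by m samples (m may be fractional and is a
    differentiable tensor), via multiplication with a linear phase."""
    X = torch.fft.rfft(x, n=n_fft)
    k = torch.arange(X.shape[-1], dtype=torch.float32)
    phase = torch.exp(-2j * torch.pi * k * m / n_fft)  # e^{-j*omega*m}
    return torch.fft.irfft(X * phase, n=n_fft)

# Usage: m participates in backpropagation like any other parameter.
m = torch.tensor(12.3, requires_grad=True)
y = fractional_delay(torch.randn(512), m, n_fft=1024)
y.sum().backward()   # populates m.grad
```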
Data-Driven Parameter Estimation of Lumped-Element Models via Automatic Differentiation
Lumped-element models (LEMs) provide a compact characterization of numerous real-world physical systems, including electrical, acoustic, and mechanical systems. However, even when the target topology is known, deriving model parameters that approximate a possibly distributed system often requires educated guesses or dedicated optimization routines. This article presents a general framework for the data-driven estimation of lumped parameters using automatic differentiation. Inspired by recent work on physical neural networks, we propose to explicitly embed a differentiable LEM in the forward pass of a learning algorithm and discover its parameters via backpropagation. The same approach could also be applied to blindly parameterize an approximating model that shares no isomorphism with the target system, for which it would thus be challenging to exploit prior knowledge of the underlying physics. We evaluate our framework on various linear and nonlinear systems, including time- and frequency-domain learning objectives, and consider real- and complex-valued differentiation strategies. In all our experiments, we were able to achieve a near-perfect match of the system state measurements and retrieve the true model parameters whenever possible. Besides its practical interest, the present approach provides a fully interpretable input-output mapping by exposing the topological structure of the underlying physical model, and it may therefore constitute an explainable ad hoc alternative to otherwise black-box methods.
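A self-contained toy of the framework's core idea, under our own choice of system: fit the R, L, C values of a series RLC circuit to a measured current trace by backpropagating through an explicit time-stepping simulation. The log-parameterization, integration scheme, and constants are choices made for this sketch, not the paper's setup.

```python
import torch

def simulate_rlc(v_in, R, L, C, dt):
    """Semi-implicit Euler simulation of a series RLC circuit:
    L di/dt = v - R*i - q/C,  dq/dt = i. Returns the current trace."""
    i = torch.zeros(())
    q = torch.zeros(())
    out = []
    for v in v_in:
        i = i + dt * (v - R * i - q / C) / L   # Kirchhoff voltage law
        q = q + dt * i                          # capacitor charge
        out.append(i)
    return torch.stack(out)

dt = 1e-5
t = torch.arange(0, 2e-3, dt)
v_in = torch.sin(2 * torch.pi * 1e3 * t)
with torch.no_grad():                           # synthesize "measurements"
    i_meas = simulate_rlc(v_in, torch.tensor(2.0),
                          torch.tensor(5e-3), torch.tensor(2e-6), dt)

# Learn log-parameters: keeps R, L, C positive and balances their scales.
log_p = torch.log(torch.tensor([1.0, 1e-3, 1e-6])).requires_grad_(True)
opt = torch.optim.Adam([log_p], lr=0.05)
for _ in range(300):
    opt.zero_grad()
    R, L, C = torch.exp(log_p)
    loss = torch.mean((simulate_rlc(v_in, R, L, C, dt) - i_meas) ** 2)
    loss.backward()
    opt.step()
```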
Zero-shot anomalous sound detection in domestic environments using large-scale pretrained audio pattern recognition models
Anomalous sound detection is central to audio-based surveillance and monitoring. In a domestic environment, however, the classes of sounds to be considered anomalous are situation-dependent and cannot be determined in advance. At the same time, it is not feasible to expect a demanding labeling effort from the end user. To address these problems, we present a novel zero-shot method relying on an auxiliary large-scale pretrained audio neural network in support of an unsupervised anomaly detector. The auxiliary module is tasked with generating a fingerprint for each sound occasionally registered by the user. These fingerprints are then compared with those extracted from the input audio stream, and the resulting similarity score is used to increase or reduce the sensitivity of the base detector. Experimental results on synthetic data show that the proposed method substantially improves upon the unsupervised base detector and is capable of outperforming existing few-shot learning systems developed for machine condition monitoring, without involving additional training.
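A compact sketch of the fingerprint-comparison step described above, with an additive score adjustment standing in for whatever sensitivity modulation the paper actually uses; embedding extraction (e.g., from a large pretrained audio tagger) is abstracted behind precomputed vectors, and all names are illustrative.

```python
import numpy as np

def adjusted_score(base_score, clip_emb, user_fingerprints, weight=1.0):
    """base_score: unsupervised detector output for the current clip;
    clip_emb: embedding of the clip; user_fingerprints: embeddings of
    sounds the user registered as relevant anomalies."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    similarity = max(cosine(clip_emb, f) for f in user_fingerprints)
    # Higher similarity to a registered sound raises the anomaly score,
    # i.e., makes the base detector more sensitive to that sound.
    return base_score + weight * similarity
```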