A Web Audio Node for the Fast Creation of Natural Language Interfaces for Audio Production
Audio production involves the use of tools such as reverberators, compressors, and equalizers to transform raw audio into a state ready for public consumption. These tools are widely used by both musicians and expert audio engineers. The typical interfaces for these tools use low-level signal parameters as controls for the audio effect. These parameters often have unintuitive names such as “feedback” or “low-high” that have little meaning to many people, which makes the tools difficult to learn and use. Such low-level interfaces are also common throughout audio production interfaces built with the Web Audio API. Recent work in bridging the semantic gap between verbal descriptions of audio effects (e.g. “underwater”, “warm”, “bright”) and low-level signal parameters has resulted in demonstrably better interfaces for a population of laypeople. In that work, a vocabulary of hundreds of descriptive terms was crowdsourced, along with their mappings to audio effect settings for reverberation and equalization. In this paper, we present a Web Audio node that lets web developers leverage this vocabulary to easily create web-based audio effects tools that use natural language interfaces. Our Web Audio node and additional documentation can be accessed at https://interactiveaudiolab.github.io/audealize_api
High-Fidelity Audio Compression with Improved RVQGAN
Language models have been successfully used to model natural signals, such as
images, speech, and music. A key component of these models is a high quality
neural compression model that can compress high-dimensional natural signals
into lower dimensional discrete tokens. To that end, we introduce a
high-fidelity universal neural audio compression algorithm that achieves ~90x
compression of 44.1 kHz audio into tokens at just 8 kbps bandwidth. We achieve
this by combining advances in high-fidelity audio generation with better vector
quantization techniques from the image domain, along with improved adversarial
and reconstruction losses. We compress all domains (speech, environment, music,
etc.) with a single universal model, making it widely applicable to generative
modeling of all audio. We compare with competing audio compression algorithms,
and find our method outperforms them significantly. We provide thorough
ablations for every design choice, as well as open-source code and trained
model weights. We hope our work can lay the foundation for the next generation
of high-fidelity audio modeling.
Comment: Accepted at NeurIPS 2023 (spotlight)
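The ~90x figure can be sanity-checked with a quick bitrate calculation. This is a sketch that assumes a 16-bit mono PCM baseline for the uncompressed audio; the abstract does not state the reference bit depth or channel count.

```python
# Sanity check on the reported ~90x compression ratio.
# Assumption (not stated in the abstract): the uncompressed
# baseline is 16-bit mono PCM at 44.1 kHz.
sample_rate_hz = 44_100
bits_per_sample = 16

raw_kbps = sample_rate_hz * bits_per_sample / 1000  # 705.6 kbps uncompressed
codec_kbps = 8                                      # token bitrate from the abstract

ratio = raw_kbps / codec_kbps
print(f"{raw_kbps} kbps / {codec_kbps} kbps = {ratio:.1f}x")  # 88.2x, i.e. ~90x
```

Under these assumptions the ratio comes out to about 88x, consistent with the "~90x" claim; a stereo or 24-bit baseline would yield a proportionally larger ratio.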
Sound Event Detection and Separation: a Benchmark on DESED Synthetic Soundscapes
We propose a benchmark of state-of-the-art sound event detection (SED) systems.
We designed synthetic evaluation sets to focus on specific sound event
detection challenges. We analyze the performance of the submissions to DCASE
2021 Task 4 as a function of time-related modifications (the time position of
an event and the length of clips), and we study the impact of non-target sound
events and reverberation. We show that the localization in time of sound events
is still a problem for SED systems. We also show that reverberation and
non-target sound events severely degrade the performance of SED systems. In the
latter case, sound separation seems like a promising solution.