A Web Audio Node for the Fast Creation of Natural Language Interfaces for Audio Production
Audio production involves the use of tools such as reverberators, compressors, and equalizers to transform raw audio into a state ready for public consumption. These tools are in wide use by both musicians and expert audio engineers. The typical interfaces for these tools use low-level signal parameters as controls for the audio effect. These parameters often have unintuitive names, such as “feedback” or “low-high”, that carry little meaning for many people, making the tools difficult to learn and use. Such low-level interfaces are also common throughout audio production interfaces built on the Web Audio API. Recent work on bridging the semantic gap between verbal descriptions of audio effects (e.g. “underwater”, “warm”, “bright”) and low-level signal parameters has resulted in provably better interfaces for a population of laypeople. In that work, a vocabulary of hundreds of descriptive terms was crowdsourced, along with mappings from those terms to audio effect settings for reverberation and equalization. In this paper, we present a Web Audio node that lets web developers leverage this vocabulary to easily create web-based audio effects tools with natural language interfaces. Our Web Audio node and additional documentation can be accessed at https://interactiveaudiolab.github.io/audealize_api
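The node's actual interface is documented at the URL above. As a rough illustration of the underlying idea only, the sketch below maps a descriptive word to equalizer settings using standard Web Audio API nodes; the descriptor table, band frequencies, and gain values are invented placeholders, not the crowdsourced Audealize vocabulary.

```typescript
// Minimal sketch: map a descriptive word to per-band peaking-filter gains
// and build an EQ chain from standard Web Audio API nodes. The descriptor
// table below is an illustrative placeholder, not the Audealize data.
const EQ_BANDS_HZ = [60, 250, 1000, 4000, 12000];

// Hypothetical word -> per-band gain (dB) mapping.
const DESCRIPTOR_GAINS: Record<string, number[]> = {
  warm:   [ 3,  2,  0, -2, -4],
  bright: [-3, -1,  1,  3,  4],
};

function buildDescriptorEQ(
  ctx: AudioContext,
  word: string,
): { input: AudioNode; output: AudioNode } {
  const gains = DESCRIPTOR_GAINS[word];
  if (!gains) throw new Error(`Unknown descriptor: ${word}`);
  const filters = EQ_BANDS_HZ.map((freq, i) => {
    const f = ctx.createBiquadFilter();
    f.type = "peaking";
    f.frequency.value = freq;
    f.Q.value = 1.0;
    f.gain.value = gains[i]; // boost/cut this band per the descriptor
    return f;
  });
  // Chain the filters in series: input -> f0 -> f1 -> ... -> output
  for (let i = 0; i < filters.length - 1; i++) filters[i].connect(filters[i + 1]);
  return { input: filters[0], output: filters[filters.length - 1] };
}

// Usage: route a source through a "warm" EQ to the speakers.
// const ctx = new AudioContext();
// const eq = buildDescriptorEQ(ctx, "warm");
// sourceNode.connect(eq.input);
// eq.output.connect(ctx.destination);
```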
High-Fidelity Neural Phonetic Posteriorgrams
A phonetic posteriorgram (PPG) is a time-varying categorical distribution over acoustic units of speech (e.g., phonemes). PPGs are a popular representation in speech generation due to their ability to disentangle pronunciation features from speaker identity, allowing accurate reconstruction of pronunciation (e.g., voice conversion) and coarse-grained pronunciation editing (e.g., foreign accent conversion). In this paper, we demonstrably improve the quality of PPGs to produce a state-of-the-art interpretable PPG representation. We train an off-the-shelf speech synthesizer using our PPG representation and show that high-quality PPGs yield independent control over pitch and pronunciation. We further demonstrate novel uses of PPGs, such as an acoustic pronunciation distance and fine-grained pronunciation control.

Comment: Accepted to ICASSP 2024 Workshop on Explainable Machine Learning for Speech and Audio
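To make the representation concrete, the sketch below models a PPG as a sequence of categorical distributions over phoneme classes and computes a simple frame-wise pronunciation distance between two time-aligned PPGs. The Jensen-Shannon divergence used here is an illustrative assumption; the paper's actual distance may be defined differently.

```typescript
// Sketch: a PPG as [frame][phoneme] -> probability, each row summing to 1.
type PPG = number[][];

function klDivergence(p: number[], q: number[]): number {
  let d = 0;
  for (let i = 0; i < p.length; i++) {
    if (p[i] > 0) d += p[i] * Math.log(p[i] / Math.max(q[i], 1e-12));
  }
  return d;
}

// Jensen-Shannon divergence: symmetric, bounded, and defined even when
// the two distributions have different supports.
function jsDivergence(p: number[], q: number[]): number {
  const m = p.map((pi, i) => 0.5 * (pi + q[i]));
  return 0.5 * klDivergence(p, m) + 0.5 * klDivergence(q, m);
}

// Mean frame-wise divergence between two time-aligned, equal-length PPGs.
function pronunciationDistance(a: PPG, b: PPG): number {
  if (a.length !== b.length) throw new Error("PPGs must be time-aligned");
  let total = 0;
  for (let t = 0; t < a.length; t++) total += jsDivergence(a[t], b[t]);
  return total / a.length;
}
```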
Exploring Musical Roots: Applying Audio Embeddings to Empower Influence Attribution for a Generative Music Model
Every artist has a creative process that draws inspiration from previous artists and their works. Today, "inspiration" has been automated by generative music models. The black-box nature of these models obscures the identity of the works that influence their creative output. As a result, users may inadvertently appropriate, misuse, or copy existing artists' works. We establish a replicable methodology to systematically identify similar pieces of music audio in a manner that is useful for understanding training data attribution. A key aspect of our approach is to harness an effective music audio similarity measure. We compare the effect of applying CLMR and CLAP embeddings to similarity measurement in a set of 5 million audio clips used to train VampNet, a recent open-source generative music model. We validate this approach with a human listening study. We also explore the effect that modifications of an audio example (e.g., pitch shifting, time stretching, background noise) have on similarity measurements. This work is foundational to incorporating automated influence attribution into generative modeling, which promises to let model creators and users move from ignorant appropriation to informed creation. Audio samples that accompany this paper are available at https://tinyurl.com/exploring-musical-roots.

Comment: 14 pages + references. Under conference review
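As a rough sketch of how embedding-based similarity retrieval works, the snippet below scores a query embedding (e.g., a CLMR or CLAP vector) against a corpus by cosine similarity and returns the closest training clips. The function names and brute-force scan are illustrative assumptions; at the paper's scale of roughly 5 million clips, an approximate nearest-neighbor index would replace the linear loop.

```typescript
// Cosine similarity between two embedding vectors of equal dimension.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Top-k training clips most similar to a query embedding (brute force).
function topKSimilar(
  query: Float32Array,
  corpus: { id: string; embedding: Float32Array }[],
  k: number,
): { id: string; score: number }[] {
  return corpus
    .map(({ id, embedding }) => ({ id, score: cosineSimilarity(query, embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```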