InstructME: An Instruction Guided Music Edit And Remix Framework with Latent Diffusion Models
Music editing primarily entails modifying individual instrument tracks or
remixing the piece as a whole, offering a novel reinterpretation of the
original through a series of operations. These music processing methods hold
immense potential across various applications but demand substantial expertise.
Prior methodologies, although effective for image and audio modifications,
falter when directly applied to music. This is attributed to music's
distinctive data nature, where such methods can inadvertently compromise the
intrinsic harmony and coherence of music. In this paper, we develop InstructME,
an Instruction guided Music Editing and remixing framework based on latent
diffusion models. Our framework fortifies the U-Net with multi-scale
aggregation in order to maintain consistency before and after editing. In
addition, we introduce a chord progression matrix as conditioning information
and incorporate it in the semantic space to improve melodic harmony while
editing.
For accommodating extended musical pieces, InstructME employs a chunk
transformer, enabling it to discern long-term temporal dependencies within
music sequences. We tested InstructME in instrument-editing, remixing, and
multi-round editing. Both subjective and objective evaluations indicate that
our proposed method significantly surpasses preceding systems in music quality,
text relevance and harmony. Demo samples are available at
https://musicedit.github.io
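To make the conditioning idea concrete, here is a minimal sketch of how a chord progression matrix could be projected into a semantic space that conditions a diffusion model. All shapes, the one-hot chord encoding, and the linear projection are illustrative assumptions, not the architecture described in the InstructME paper:

```python
import numpy as np

# Hypothetical shapes: one row per time frame, one column per pitch class.
# The paper's exact chord encoding is not specified here.
T_FRAMES, CHORD_DIM, EMBED_DIM = 64, 12, 256
rng = np.random.default_rng(0)

# In a real system this projection would be learned end to end; a random
# weight matrix stands in to illustrate the shapes only.
W = rng.standard_normal((CHORD_DIM, EMBED_DIM))

def condition_on_chords(chords: np.ndarray) -> np.ndarray:
    """Map a (frames, chord_dim) chord matrix to (frames, embed_dim)
    conditioning tokens for the denoising network."""
    return chords @ W

# One-hot chord roots per frame, as a toy chord progression.
chords = np.eye(CHORD_DIM)[rng.integers(0, CHORD_DIM, T_FRAMES)]
tokens = condition_on_chords(chords)
print(tokens.shape)  # (64, 256)
```

The resulting token sequence would then be attended to by the U-Net alongside the text instruction embedding.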
SoundBeam: Target sound extraction conditioned on sound-class labels and enrollment clues for increased performance and continuous learning
In many situations, we would like to hear desired sound events (SEs) while
being able to ignore interference. Target sound extraction (TSE) aims at
tackling this problem by estimating the sound of target SE classes in a mixture
while suppressing all other sounds. We can achieve this with a neural network
that extracts the target SEs by conditioning it on clues representing the
target SE classes. Two types of clues have been proposed, i.e., target SE class
labels and enrollment sound samples similar to the target sound. Systems based
on SE class labels can directly optimize embedding vectors representing the SE
classes, resulting in high extraction performance. However, extending these
systems to the extraction of new SE classes not encountered during training is
not easy. Enrollment-based approaches extract SEs by finding sounds in the
mixtures that share similar characteristics to the enrollment. These approaches
do not explicitly rely on SE class definitions and can thus handle new SE
classes. In this paper, we introduce a TSE framework, SoundBeam, that combines
the advantages of both approaches. We also perform an extensive evaluation of
the different TSE schemes using synthesized and real mixtures, which shows the
potential of SoundBeam.
Comment: Submitted to IEEE/ACM Trans. Audio, Speech, and Language Processing
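The two clue types described above can be sketched side by side. The dimensions, the average-pooled enrollment encoder, and the multiplicative conditioning scheme below are common choices assumed for illustration; SoundBeam's actual trained architecture differs:

```python
import numpy as np

# Toy dimensions for a target-sound-extraction conditioning sketch.
N_CLASSES, EMBED_DIM, N_FRAMES = 10, 16, 100
rng = np.random.default_rng(1)

# Label-based clue: a (here random, normally learned) embedding table
# indexed by sound-event class, directly optimizable during training.
label_embeddings = rng.standard_normal((N_CLASSES, EMBED_DIM))

def enrollment_clue(enrollment_feats: np.ndarray) -> np.ndarray:
    """Enrollment-based clue: average-pool features of an example sound
    into a fixed-size vector (stand-in for a learned clue encoder),
    so unseen classes can still be described."""
    return enrollment_feats.mean(axis=0)

def condition(mix_feats: np.ndarray, clue: np.ndarray) -> np.ndarray:
    """Element-wise multiplicative conditioning of mixture features by
    the clue vector, a widely used scheme assumed here."""
    return mix_feats * clue[None, :]

mix = rng.standard_normal((N_FRAMES, EMBED_DIM))
out_label = condition(mix, label_embeddings[3])         # known class
out_enroll = condition(mix, enrollment_clue(mix[:10]))  # new class
print(out_label.shape, out_enroll.shape)  # (100, 16) (100, 16)
```

Because both clues land in the same embedding space, a combined system can train with labels yet accept enrollments for new classes at inference time, which is the advantage SoundBeam exploits.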
Bass Accompaniment Generation via Latent Diffusion
Automatically generating music that appropriately matches an arbitrary input track is a challenging task. We present a novel controllable system for generating single stems to accompany musical mixes of arbitrary length. At the core of our method are audio autoencoders that efficiently compress audio waveform samples into invertible latent representations, and a conditional latent diffusion model that takes as input the latent encoding of a mix and generates the latent encoding of a corresponding stem. To provide control over the timbre of generated samples, we introduce a technique to ground the latent space to a user-provided reference style during diffusion sampling. To further improve audio quality, we adapt classifier-free guidance to avoid distortions at high guidance strengths when generating an unbounded latent space. We train our model on a dataset of pairs of mixes and matching bass stems. Quantitative experiments demonstrate that, given an input mix, the proposed system can generate basslines with user-specified timbres. Our controllable conditional audio generation framework represents a significant step forward in creating generative AI tools to assist musicians in music production.
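The guidance problem mentioned above can be illustrated with a small sketch: standard classifier-free guidance extrapolates between unconditional and conditional denoiser outputs, and at high scales the result can drift far outside the latent distribution. The rescaling remedy below is one common fix, assumed here for illustration; it is not necessarily the adaptation used in this paper:

```python
import numpy as np

def cfg(eps_uncond: np.ndarray, eps_cond: np.ndarray, scale: float) -> np.ndarray:
    """Standard classifier-free guidance extrapolation."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

def cfg_rescaled(eps_uncond: np.ndarray, eps_cond: np.ndarray, scale: float) -> np.ndarray:
    """Rescale the guided output to match the conditional branch's
    standard deviation, limiting distortion at high guidance scales
    (one common remedy; an assumption, not the paper's exact method)."""
    guided = cfg(eps_uncond, eps_cond, scale)
    return guided * (eps_cond.std() / (guided.std() + 1e-8))

rng = np.random.default_rng(2)
eu, ec = rng.standard_normal(512), rng.standard_normal(512)
g = cfg_rescaled(eu, ec, scale=7.5)
# The rescaled output's spread now tracks the conditional branch,
# even though scale=7.5 would normally inflate it.
print(g.std(), ec.std())
```

In a bounded latent space (e.g. one squashed through tanh) the same extrapolation would saturate instead of exploding, which is why the unbounded case needs explicit handling.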
Transfer Learning and Bias Correction with Pre-trained Audio Embeddings
Deep neural network models have become the dominant approach to a large
variety of tasks within music information retrieval (MIR). These models
generally require large amounts of (annotated) training data to achieve high
accuracy. Because not all applications in MIR have sufficient quantities of
training data, it is becoming increasingly common to transfer models across
domains. This approach allows representations derived for one task to be
applied to another, and can result in high accuracy with less stringent
training data requirements for the downstream task. However, the properties of
pre-trained audio embeddings are not fully understood. Specifically, and unlike
traditionally engineered features, the representations extracted from
pre-trained deep networks may embed and propagate biases from the model's
training regime. This work investigates the phenomenon of bias propagation in
the context of pre-trained audio representations for the task of instrument
recognition. We first demonstrate that three different pre-trained
representations (VGGish, OpenL3, and YAMNet) exhibit comparable performance
when constrained to a single dataset, but differ in their ability to generalize
across datasets (OpenMIC and IRMAS). We then investigate dataset identity and
genre distribution as potential sources of bias. Finally, we propose and
evaluate post-processing countermeasures to mitigate the effects of bias, and
improve generalization across datasets.
Comment: 7 pages, 3 figures, accepted to the conference of the International
Society for Music Information Retrieval (ISMIR 2023)
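As a concrete example of the kind of post-processing countermeasure discussed above, one simple option is to subtract each dataset's mean embedding so that dataset identity is less linearly recoverable. This is an illustrative sketch under that assumption, not necessarily the countermeasure evaluated in the paper:

```python
import numpy as np

def center_per_dataset(embs: np.ndarray, dataset_ids: np.ndarray) -> np.ndarray:
    """Subtract the per-dataset mean from pre-trained embeddings, a
    simple bias-mitigation post-processing step (assumed for
    illustration)."""
    out = embs.copy()
    for d in np.unique(dataset_ids):
        mask = dataset_ids == d
        out[mask] -= embs[mask].mean(axis=0)
    return out

# Two toy "datasets" whose embeddings carry a constant offset,
# simulating a dataset-identity bias inherited from pre-training.
rng = np.random.default_rng(3)
a = rng.standard_normal((50, 8)) + 5.0
b = rng.standard_normal((50, 8)) - 5.0
embs = np.vstack([a, b])
ids = np.array([0] * 50 + [1] * 50)

centered = center_per_dataset(embs, ids)
# After centering, both datasets share a zero mean, so a downstream
# instrument classifier can no longer exploit the offset.
print(np.allclose(centered[ids == 0].mean(axis=0), 0.0))  # True
```

A downstream classifier trained on `centered` rather than `embs` would then be forced to rely on content rather than dataset identity, which is the spirit of the generalization experiments described in the abstract.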