AUDIO QUERY-BASED MUSIC SOURCE SEPARATION
Thesis (M.S.) -- Seoul National University, Graduate School of Convergence Science and Technology, Department of Digital Information Convergence, August 2020. Advisor: Kyogu Lee.
In recent years, music source separation has been one of the most intensively studied research areas in music information retrieval. Improvements in deep learning have led to substantial progress in music source separation performance. However, most previous studies are restricted to separating a small, fixed set of sources, such as vocals, drums, bass, and "other".
In this study, we propose a network for audio query-based music source separation
that can explicitly encode the source information from a query signal regardless of the
number and/or kind of target signals. The proposed method consists of a Query-net
and a Separator: given a query and a mixture, the Query-net encodes the query into the
latent space, and the Separator estimates a mask conditioned on the latent vector, which is then applied to the mixture for separation. The Separator can also generate masks from latent vectors obtained from training samples, allowing separation in the absence of a query.
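A minimal sketch of this query-conditioned masking pipeline, including the latent-interpolation behavior described below. The `query_net` and `separator` functions here are toy stand-ins with random weights, not the thesis's trained networks; spectrogram sizes and the sigmoid masking rule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: F frequency bins, T time frames, D latent size.
F, T, D = 64, 32, 8

# Hypothetical stand-ins for the trained Query-net and Separator weights.
W_enc = rng.standard_normal((D, F))
W_cond = rng.standard_normal((F, D))

def query_net(query_spec):
    """Encode a query spectrogram into a latent vector (time-averaged)."""
    return np.tanh(W_enc @ query_spec.mean(axis=1))

def separator(mixture_spec, z):
    """Estimate a soft mask conditioned on latent z; apply it to the mixture."""
    bias = W_cond @ z  # per-frequency conditioning from the latent vector
    mask = 1.0 / (1.0 + np.exp(-(np.log(mixture_spec + 1e-8) + bias[:, None])))
    return mask * mixture_spec  # masked (separated) spectrogram

query = np.abs(rng.standard_normal((F, T)))    # magnitude spectrogram of query
mixture = np.abs(rng.standard_normal((F, T)))  # magnitude spectrogram of mix

z = query_net(query)
separated = separator(mixture, z)

# The mask lies in (0, 1), so the estimate never exceeds the mixture energy.
assert separated.shape == mixture.shape
assert np.all(separated <= mixture + 1e-12)

# Query-less operation: reuse a latent vector stored from training samples.
z_stored = np.tanh(rng.standard_normal(D))
separated_no_query = separator(mixture, z_stored)

# Continuous outputs via interpolation between two encoded queries.
z2 = query_net(np.abs(rng.standard_normal((F, T))))
for alpha in (0.0, 0.5, 1.0):
    z_mix = (1 - alpha) * z + alpha * z2
    _ = separator(mixture, z_mix)
```

The conditioning scheme (adding a latent-derived bias before a sigmoid) is one simple choice; the thesis's Separator may condition differently.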
We evaluate our method on the MUSDB18 dataset and the Slakh dataset, and experimental results show that the proposed method can separate multiple sources with a
single network. In addition, through further investigation of the latent space, we demonstrate that our method can generate continuous outputs via latent vector interpolation.

Table of Contents
Chapter 1: Introduction
1.1 Research Background
1.2 Research Goals
Chapter 2: Background Theory and Related Work
2.1 Background Theory
2.1.1 Music Source Separation
2.1.2 Variational Autoencoder
2.2 Related Work
2.2.1 Research on Music Source Separation
2.2.2 Research in Other Fields
Chapter 3: Proposed Method
3.1 Audio Query-based Music Source Separation
3.2 Training
3.2.1 Training Data Composition
3.2.2 Training Objectives
3.3 Testing
Chapter 4: Experiments
4.1 Datasets
4.2 Experimental Details
4.3 Query-net Behavior on Unseen Samples
4.4 Separating Specific Instruments with Audio Queries
4.5 Source Separation via Latent Vector Interpolation
4.6 Analysis of the Effect of Latent Vectors on Separation Performance
4.7 Comparative Experiments Using Fine-grained Class Information
4.8 Iterative Separation
4.9 Quantitative Evaluation
Chapter 5: Conclusion
5.1 Summary of the Study
5.2 Future Work
ABSTRACT
Self-Supervised Music Source Separation Using Vector-Quantized Source Category Estimates
Music source separation is focused on extracting distinct sonic elements from
composite tracks. Historically, many methods have been grounded in supervised
learning, necessitating labeled data, which is occasionally constrained in its
diversity. More recent methods have delved into N-shot techniques that utilize
one or more audio samples to aid in the separation. However, a challenge with
some of these methods is the necessity for an audio query during inference,
making them less suited for genres with varied timbres and effects. This paper
offers a proof-of-concept for a self-supervised music source separation system
that eliminates the need for audio queries at inference time. In the training
phase, while it adopts a query-based approach, we introduce a modification by
substituting the continuous embedding of query audios with Vector Quantized
(VQ) representations. Trained end-to-end with up to N classes as determined by
the VQ's codebook size, the model seeks to effectively categorise instrument
classes. During inference, the input is partitioned into N sources, with some
potentially left unutilized based on the mix's instrument makeup. This
methodology suggests an alternative avenue for considering source separation
across diverse music genres. We provide examples and additional results online.
Comment: 4 pages, 2 figures, 1 table; accepted at the 37th Conference on Neural Information Processing Systems (2023), Machine Learning for Audio Workshop.
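The core substitution this paper describes (replacing a continuous query embedding with its nearest entry in a size-N codebook) can be sketched as follows. The dimensions, the random codebook, and the `vector_quantize` helper are illustrative assumptions; in the paper the codebook is learned end-to-end.

```python
import numpy as np

rng = np.random.default_rng(1)

D, N = 8, 4  # embedding dimension; codebook size = max source categories

# Stand-in for a learned VQ codebook: one row per instrument category.
codebook = rng.standard_normal((N, D))

def vector_quantize(z):
    """Replace a continuous embedding with its nearest codebook entry."""
    dists = np.sum((codebook - z) ** 2, axis=1)  # squared L2 to each code
    idx = int(np.argmin(dists))
    return codebook[idx], idx

z = rng.standard_normal(D)  # continuous embedding of a query audio
z_q, idx = vector_quantize(z)

# The quantized vector is exactly one codebook row, i.e. a discrete category.
assert 0 <= idx < N
assert np.array_equal(z_q, codebook[idx])

# At inference, each of the N codes stands for one candidate source, so a
# mixture is partitioned into at most N outputs; codes absent from the
# mix's instrument makeup simply go unused.
chosen = [vector_quantize(rng.standard_normal(D))[1] for _ in range(16)]
assert set(chosen) <= set(range(N))
```

Quantizing to a finite codebook is what removes the need for an audio query at inference: the N discrete categories themselves enumerate the candidate sources.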
Gendering the Virtual Space: Sonic Femininities and Masculinities in Contemporary Top 40 Music
This dissertation analyzes vocal placement (the apparent location of a voice in the virtual space created by a recording) and its relationship to gender. When listening to a piece of recorded music through headphones or stereo speakers, one hears various sound sources as though they were located in a virtual space (Clarke 2013). For instance, a specific vocal performance, once manipulated by various technologies in a recording studio, might evoke a concert hall, an intimate setting, or an otherworldly space. The placement of the voice within this space is one of the central musical parameters through which listeners ascribe cultural meanings to popular music.
I develop an original methodology for analyzing vocal placement in recorded popular music. Combining close listening with music information retrieval tools, I precisely locate a voice's placement in virtual space according to five parameters: (1) Width, (2) Pitch Height, (3) Prominence, (4) Environment, and (5) Layering. I use the methodology to conduct close and distant readings of vocal placement in twenty-first-century Anglo-American popular music. First, an analysis of "Love the Way You Lie" (2010), by Eminem feat. Rihanna, showcases how the methodology can be used to support close readings of individual songs. Through my analysis, I suggest that Rihanna's wide vocal placement evokes a nexus of conflicting emotions in the wake of domestic violence. Eminem's narrow placement, conversely, expresses anger, frustration, and violence. Second, I use the analytical methodology to conduct a larger-scale study of vocal placement in a corpus of 113 post-2008 Billboard chart-topping collaborations between two or more artists. By stepping away from close readings of individual songs, I show how gender stereotypes are engineered en masse in the popular music industry. I show that women artists are generally assigned vocal placements that are wider, more layered, and more reverberated than those of men. This vocal placement configuration, exemplified in "Love the Way You Lie," creates a sonic contrast that presents women's voices as ornamental and diffuse, and men's voices as direct and relatable. I argue that these contrasting vocal placements sonically construct a gender binary, exemplifying one of the ways in which dichotomous conceptions of gender are reinforced through the sound of popular music.
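Of the five parameters, Width is the most directly computable from a stereo signal. As an illustrative proxy only (the dissertation's actual measurement procedure is not specified here), a side-to-mid energy ratio distinguishes a centered voice from a wide one:

```python
import numpy as np

def stereo_width(left, right, eps=1e-12):
    """Side-to-mid energy ratio in [0, 1]: 0 = mono (centered), 1 = fully wide."""
    mid = 0.5 * (left + right)    # content shared by both channels
    side = 0.5 * (left - right)   # content that differs between channels
    e_mid = np.sum(mid ** 2)
    e_side = np.sum(side ** 2)
    return e_side / (e_mid + e_side + eps)

t = np.linspace(0, 1, 8000)
voice = np.sin(2 * np.pi * 220 * t)

# Identical channels: the voice sits dead center, so width is (near) zero.
assert stereo_width(voice, voice) < 1e-6

# Out-of-phase channels: no mid content at all, so width approaches one.
assert stereo_width(voice, -voice) > 0.999
```

Real vocal-placement analysis would combine a measure like this with the other four parameters (reverberation for Environment, relative level for Prominence, and so on), each requiring its own estimator.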