4 research outputs found
Learning the helix topology of musical pitch
To explain the consonance of octaves, music psychologists represent pitch as
a helix where azimuth and axial coordinate correspond to pitch class and pitch
height respectively. This article addresses the problem of discovering this
helical structure from unlabeled audio data. We measure Pearson correlations in
the constant-Q transform (CQT) domain to build a K-nearest neighbor graph
between frequency subbands. Then, we run the Isomap manifold learning algorithm
to represent this graph in a three-dimensional space in which straight lines
approximate graph geodesics. Experiments on isolated musical notes demonstrate
that the resulting manifold resembles a helix which makes a full turn at every
octave. A circular shape is also found in English speech, but not in urban
noise. We discuss the impact of various design choices on the visualization:
instrumentarium, loudness mapping function, and number of neighbors K.
Comment: 5 pages, 6 figures. To appear in the Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Barcelona, Spain, May 2020.
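A minimal sketch of the pipeline this abstract describes, assuming librosa for the CQT and scikit-learn for Isomap; the file name, CQT resolution, loudness mapping, and neighbor count are illustrative placeholders, not the authors' exact settings.

```python
import librosa
import numpy as np
from sklearn.manifold import Isomap

# Load an audio file and compute a constant-Q transform (CQT).
# File name and CQT resolution are placeholders.
y, sr = librosa.load("isolated_notes.wav", sr=22050)
C = np.abs(librosa.cqt(y, sr=sr, n_bins=24 * 7, bins_per_octave=24))

# Log compression as one possible loudness mapping function.
features = np.log1p(C)  # shape: (n_subbands, n_frames)

# Pearson correlation between every pair of frequency subbands;
# np.corrcoef treats each row as one variable observed over time.
R = np.corrcoef(features)

# Turn correlations into dissimilarities so that strongly correlated
# subbands end up close together, then embed in 3-D. Isomap builds
# the K-nearest-neighbor graph internally and finds a space in which
# straight lines approximate graph geodesics.
D = 1.0 - R
coords = Isomap(n_neighbors=3, n_components=3,
                metric="precomputed").fit_transform(D)
# coords has shape (n_subbands, 3); plotting it should reveal
# the helical structure on suitable data.
```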
Helicality: An Isomap-based Measure of Octave Equivalence in Audio Data
Octave equivalence serves as domain knowledge in MIR systems, including
chromagram, spiral convolutional networks, and harmonic CQT. Prior work has
applied the Isomap manifold learning algorithm to unlabeled audio data to embed
frequency sub-bands in 3-D space where the Euclidean distances are inversely
proportional to the strength of their Pearson correlations. However,
discovering octave equivalence via Isomap requires visual inspection and is not
scalable. To address this problem, we define "helicality" as the goodness of
fit of the 3-D Isomap embedding to a Shepard-Risset helix. Our method is
unsupervised and uses a custom Frank-Wolfe algorithm to minimize a
least-squares objective inside a convex hull. Numerical experiments indicate
that isolated musical notes have a higher helicality than speech, followed by
drum hits.
Comment: 3 pages, 3 figures. To be presented at the 21st International Society for Music Information Retrieval (ISMIR) Conference. Montreal, Canada, October 2020.
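The custom Frank-Wolfe solver is not spelled out in this listing, but the generic scheme it builds on is easy to sketch: minimize a least-squares objective over the convex hull of a set of vertices, where each linear step reduces to a search over those vertices. Everything below (problem sizes, step rule) is an illustrative assumption, not the paper's exact variant.

```python
import numpy as np

def frank_wolfe_ls(A, b, V, n_iters=200):
    """Minimize ||A x - b||^2 over the convex hull of the rows of V.

    Frank-Wolfe needs only a linear minimization oracle over the
    feasible set; for a convex hull, that oracle is a search over
    the vertices. Textbook version, not the paper's custom one.
    """
    x = V.mean(axis=0)  # barycenter: always feasible
    for t in range(n_iters):
        grad = 2.0 * A.T @ (A @ x - b)
        # Vertex minimizing the linearized objective <grad, v>.
        v = V[np.argmin(V @ grad)]
        gamma = 2.0 / (t + 2.0)  # standard diminishing step size
        x = x + gamma * (v - x)
    return x

# Tiny usage example with random data (illustrative only).
rng = np.random.default_rng(0)
A, b = rng.normal(size=(10, 3)), rng.normal(size=10)
V = rng.normal(size=(5, 3))  # vertices spanning the feasible hull
x_hat = frank_wolfe_ls(A, b, V)
```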
I'm Sorry for Your Loss: Spectrally-Based Audio Distances Are Bad at Pitch
Growing research demonstrates that synthetic failure modes imply poor
generalization. We compare commonly used audio-to-audio losses on a synthetic
benchmark, measuring the pitch distance between two stationary sinusoids. The
results are surprising: many have a poor sense of pitch direction. These
shortcomings are exposed using simple rank assumptions. Our task is trivial for
humans but difficult for these audio distances, suggesting significant progress
can be made in self-supervised audio learning by improving current losses.
Comment: ICBINB@NeurIPS 2020.
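The benchmark idea is easy to reproduce in miniature. The sketch below is an assumption-laden illustration rather than the paper's protocol: it scores a simple L2 distance between magnitude spectra on tone pairs of increasing pitch distance and checks whether its ranking agrees with the true one.

```python
import numpy as np
from scipy.stats import spearmanr

SR = 16000  # sample rate (illustrative)

def sine(freq, dur=1.0):
    t = np.arange(int(SR * dur)) / SR
    return np.sin(2 * np.pi * freq * t)

def spectral_l2(x, y):
    # Stand-in for a spectrally-based audio distance: L2 between
    # magnitude spectra. Not one of the specific losses the paper
    # compares.
    return np.linalg.norm(np.abs(np.fft.rfft(x)) - np.abs(np.fft.rfft(y)))

# Probe tones at increasing pitch distance from a 440 Hz reference.
ref = sine(440.0)
cents = np.arange(0, 1250, 50)  # 0 to one octave, in 50-cent steps
dists = [spectral_l2(ref, sine(440.0 * 2 ** (c / 1200))) for c in cents]

# A distance with a good "sense of pitch" should rank these pairs
# exactly; Spearman rho = 1.0 means a perfect ranking.
rho, _ = spearmanr(cents, dists)
print(f"rank correlation with pitch distance: {rho:.3f}")
```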
Time-Frequency Scattering Accurately Models Auditory Similarities Between Instrumental Playing Techniques
Instrumental playing techniques such as vibratos, glissandos, and trills
often denote musical expressivity, both in classical and folk contexts.
However, most existing approaches to music similarity retrieval fail to
describe timbre beyond the so-called "ordinary" technique, use instrument
identity as a proxy for timbre quality, and do not allow for customization to
the perceptual idiosyncrasies of a new subject. In this article, we ask 31
human subjects to organize 78 isolated notes into a set of timbre clusters.
Analyzing their responses suggests that timbre perception operates within a
more flexible taxonomy than those provided by instruments or playing techniques
alone. In addition, we propose a machine listening model to recover the cluster
graph of auditory similarities across instruments, mutes, and techniques. Our
model relies on joint time-frequency scattering features to extract
spectrotemporal modulations as acoustic features. Furthermore, it minimizes
triplet loss in the cluster graph by means of the large-margin nearest neighbor
(LMNN) metric learning algorithm. Over a dataset of 9346 isolated notes, we
report a state-of-the-art average precision at rank five (AP@5). An ablation
study demonstrates that removing either the joint time-frequency scattering
transform or the metric learning algorithm noticeably degrades performance.
Comment: 32 pages, 5 figures. To appear in EURASIP Journal on Audio, Speech, and Music Processing (JASMP).
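Both model stages can be sketched compactly. The code below is an illustration under stated assumptions: random placeholder features stand in for the joint time-frequency scattering coefficients (in practice these would come from a scattering library), and a hand-rolled triplet objective on a learned linear map stands in for the full LMNN solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder features standing in for joint time-frequency
# scattering coefficients: one row per isolated note.
X = rng.normal(size=(60, 8))
# Cluster labels, standing in for the human-annotated cluster graph.
y = rng.integers(0, 3, size=60)

def triplet_step(L, X, y, margin=1.0, n_triplets=500):
    """One stochastic step of an LMNN-style triplet objective on a
    linear map L: pull same-cluster pairs together, push different-
    cluster points at least `margin` further away. A simplified
    stand-in for the LMNN algorithm named in the abstract."""
    Z = X @ L.T
    loss, grad = 0.0, np.zeros_like(L)
    for _ in range(n_triplets):
        a = rng.integers(len(X))
        pos_pool = np.flatnonzero(y == y[a])
        pos = rng.choice(pos_pool[pos_pool != a])
        neg = rng.choice(np.flatnonzero(y != y[a]))
        d_ap, d_an = Z[a] - Z[pos], Z[a] - Z[neg]
        hinge = margin + d_ap @ d_ap - d_an @ d_an
        if hinge > 0:  # triplet violates the margin
            loss += hinge
            # d/dL ||L(x_a - x_p)||^2 = 2 L (x_a - x_p)(x_a - x_p)^T
            grad += 2 * (np.outer(d_ap, X[a] - X[pos])
                         - np.outer(d_an, X[a] - X[neg]))
    return loss / n_triplets, grad / n_triplets

L = np.eye(X.shape[1])  # start from the identity metric
for _ in range(100):
    loss, grad = triplet_step(L, X, y)
    L -= 0.01 * grad
```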