4 research outputs found
Learning the helix topology of musical pitch
To explain the consonance of octaves, music psychologists represent pitch as
a helix where azimuth and axial coordinate correspond to pitch class and pitch
height respectively. This article addresses the problem of discovering this
helical structure from unlabeled audio data. We measure Pearson correlations in
the constant-Q transform (CQT) domain to build a K-nearest neighbor graph
between frequency subbands. Then, we run the Isomap manifold learning algorithm
to represent this graph in a three-dimensional space in which straight lines
approximate graph geodesics. Experiments on isolated musical notes demonstrate
that the resulting manifold resembles a helix which makes a full turn at every
octave. A circular shape is also found in English speech, but not in urban
noise. We discuss the impact of various design choices on the visualization:
instrumentarium, loudness mapping function, and number of neighbors K.
Comment: 5 pages, 6 figures. To appear in the Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Barcelona, Spain, May 2020.
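A minimal sketch of the pipeline this abstract describes, assuming librosa for the CQT and scikit-learn for Isomap; the file name, CQT resolution, loudness mapping, and neighbor count are illustrative placeholders, not the authors' exact settings.

```python
import librosa
import numpy as np
from sklearn.manifold import Isomap

# Load an audio file and compute a constant-Q transform (CQT).
# File name and CQT resolution are placeholders.
y, sr = librosa.load("isolated_notes.wav", sr=22050)
C = np.abs(librosa.cqt(y, sr=sr, n_bins=24 * 7, bins_per_octave=24))

# Log compression as one possible loudness mapping function.
features = np.log1p(C)  # shape: (n_subbands, n_frames)

# Pearson correlation between every pair of frequency subbands;
# np.corrcoef treats each row as one variable observed over time.
R = np.corrcoef(features)

# Turn correlations into dissimilarities so that strongly correlated
# subbands end up close together, then embed in 3-D. Isomap builds
# the K-nearest-neighbor graph internally and finds a space in which
# straight lines approximate graph geodesics.
D = 1.0 - R
coords = Isomap(n_neighbors=3, n_components=3,
                metric="precomputed").fit_transform(D)
# coords has shape (n_subbands, 3); plotting it should reveal
# the helical structure on suitable data.
```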
Helicality: An Isomap-based Measure of Octave Equivalence in Audio Data
Octave equivalence serves as domain knowledge in MIR systems, including
chromagram, spiral convolutional networks, and harmonic CQT. Prior work has
applied the Isomap manifold learning algorithm to unlabeled audio data to embed
frequency sub-bands in 3-D space where the Euclidean distances are inversely
proportional to the strength of their Pearson correlations. However,
discovering octave equivalence via Isomap requires visual inspection and is not
scalable. To address this problem, we define "helicality" as the goodness of
fit of the 3-D Isomap embedding to a Shepard-Risset helix. Our method is
unsupervised and uses a custom Frank-Wolfe algorithm to minimize a
least-squares objective inside a convex hull. Numerical experiments indicate
that isolated musical notes have a higher helicality than speech, followed by
drum hits.
Comment: 3 pages, 3 figures. To be presented at the 21st International Society for Music Information Retrieval (ISMIR) Conference. Montreal, Canada, October 2020.
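The custom Frank-Wolfe solver is not spelled out in this listing, but the generic scheme it builds on is easy to sketch: minimize a least-squares objective over the convex hull of a set of vertices, where each linear step reduces to a search over those vertices. Everything below (problem sizes, step rule) is an illustrative assumption, not the paper's exact variant.

```python
import numpy as np

def frank_wolfe_ls(A, b, V, n_iters=200):
    """Minimize ||A x - b||^2 over the convex hull of the rows of V.

    Frank-Wolfe needs only a linear minimization oracle over the
    feasible set; for a convex hull, that oracle is a search over
    the vertices. Textbook version, not the paper's custom one.
    """
    x = V.mean(axis=0)  # barycenter: always feasible
    for t in range(n_iters):
        grad = 2.0 * A.T @ (A @ x - b)
        # Vertex minimizing the linearized objective <grad, v>.
        v = V[np.argmin(V @ grad)]
        gamma = 2.0 / (t + 2.0)  # standard diminishing step size
        x = x + gamma * (v - x)
    return x

# Tiny usage example with random data (illustrative only).
rng = np.random.default_rng(0)
A, b = rng.normal(size=(10, 3)), rng.normal(size=10)
V = rng.normal(size=(5, 3))  # vertices spanning the feasible hull
x_hat = frank_wolfe_ls(A, b, V)
```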
I'm Sorry for Your Loss: Spectrally-Based Audio Distances Are Bad at Pitch
Growing research demonstrates that synthetic failure modes imply poor
generalization. We compare commonly used audio-to-audio losses on a synthetic
benchmark, measuring the pitch distance between two stationary sinusoids. The
results are surprising: many have a poor sense of pitch direction. These
shortcomings are exposed using simple rank assumptions. Our task is trivial for
humans but difficult for these audio distances, suggesting significant progress
can be made in self-supervised audio learning by improving current losses.
Comment: ICBINB@NeurIPS 2020.
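The benchmark idea is easy to reproduce in miniature. The sketch below is an assumption-laden illustration rather than the paper's protocol: it scores a simple L2 distance between magnitude spectra on tone pairs of increasing pitch distance and checks whether its ranking agrees with the true one.

```python
import numpy as np
from scipy.stats import spearmanr

SR = 16000  # sample rate (illustrative)

def sine(freq, dur=1.0):
    t = np.arange(int(SR * dur)) / SR
    return np.sin(2 * np.pi * freq * t)

def spectral_l2(x, y):
    # Stand-in for a spectrally-based audio distance: L2 between
    # magnitude spectra. Not one of the specific losses the paper
    # compares.
    return np.linalg.norm(np.abs(np.fft.rfft(x)) - np.abs(np.fft.rfft(y)))

# Probe tones at increasing pitch distance from a 440 Hz reference.
ref = sine(440.0)
cents = np.arange(0, 1250, 50)  # 0 to one octave, in 50-cent steps
dists = [spectral_l2(ref, sine(440.0 * 2 ** (c / 1200))) for c in cents]

# A distance with a good "sense of pitch" should rank these pairs
# exactly; Spearman rho = 1.0 means a perfect ranking.
rho, _ = spearmanr(cents, dists)
print(f"rank correlation with pitch distance: {rho:.3f}")
```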
Time-Frequency Scattering Accurately Models Auditory Similarities Between Instrumental Playing Techniques
Instrumental playing techniques such as vibratos, glissandos, and trills
often denote musical expressivity, both in classical and folk contexts.
However, most existing approaches to music similarity retrieval fail to
describe timbre beyond the so-called "ordinary" technique, use instrument
identity as a proxy for timbre quality, and do not allow for customization to
the perceptual idiosyncrasies of a new subject. In this article, we ask 31
human subjects to organize 78 isolated notes into a set of timbre clusters.
Analyzing their responses suggests that timbre perception operates within a
more flexible taxonomy than those provided by instruments or playing techniques
alone. In addition, we propose a machine listening model to recover the cluster
graph of auditory similarities across instruments, mutes, and techniques. Our
model relies on joint time-frequency scattering features to extract
spectrotemporal modulations as acoustic features. Furthermore, it minimizes
triplet loss in the cluster graph by means of the large-margin nearest neighbor
(LMNN) metric learning algorithm. Over a dataset of 9346 isolated notes, we
report a state-of-the-art average precision at rank five (AP@5). An ablation
study demonstrates that removing either the joint time-frequency scattering
transform or the metric learning algorithm noticeably degrades performance.
Comment: 32 pages, 5 figures. To appear in EURASIP Journal on Audio, Speech, and Music Processing (JASMP).
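Both model stages can be sketched compactly. The code below is an illustration under stated assumptions: random placeholder features stand in for the joint time-frequency scattering coefficients (in practice these would come from a scattering library), and a hand-rolled triplet objective on a learned linear map stands in for the full LMNN solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder features standing in for joint time-frequency
# scattering coefficients: one row per isolated note.
X = rng.normal(size=(60, 8))
# Cluster labels, standing in for the human-annotated cluster graph.
y = rng.integers(0, 3, size=60)

def triplet_step(L, X, y, margin=1.0, n_triplets=500):
    """One stochastic step of an LMNN-style triplet objective on a
    linear map L: pull same-cluster pairs together, push different-
    cluster points at least `margin` further away. A simplified
    stand-in for the LMNN algorithm named in the abstract."""
    Z = X @ L.T
    loss, grad = 0.0, np.zeros_like(L)
    for _ in range(n_triplets):
        a = rng.integers(len(X))
        pos_pool = np.flatnonzero(y == y[a])
        pos = rng.choice(pos_pool[pos_pool != a])
        neg = rng.choice(np.flatnonzero(y != y[a]))
        d_ap, d_an = Z[a] - Z[pos], Z[a] - Z[neg]
        hinge = margin + d_ap @ d_ap - d_an @ d_an
        if hinge > 0:  # triplet violates the margin
            loss += hinge
            # d/dL ||L(x_a - x_p)||^2 = 2 L (x_a - x_p)(x_a - x_p)^T
            grad += 2 * (np.outer(d_ap, X[a] - X[pos])
                         - np.outer(d_an, X[a] - X[neg]))
    return loss / n_triplets, grad / n_triplets

L = np.eye(X.shape[1])  # start from the identity metric
for _ in range(100):
    loss, grad = triplet_step(L, X, y)
    L -= 0.01 * grad
```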