Many tasks in music information retrieval (MIR) involve weakly aligned data,
where exact temporal correspondences are unknown. The connectionist temporal
classification (CTC) loss is a standard technique to learn feature
representations based on weakly aligned training data. However, CTC is limited
to discrete-valued target sequences and can be difficult to extend to
multi-label problems. In this article, we show how soft dynamic time warping
(SoftDTW), a differentiable variant of classical DTW, can be used as an
alternative to CTC. Using multi-pitch estimation as an example scenario, we
show that SoftDTW yields results on par with a state-of-the-art multi-label
extension of CTC. In addition to being more elegant in terms of its algorithmic
formulation, SoftDTW naturally extends to real-valued target sequences.

Comment: Accepted at ICASSP 202
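
To make the idea of a differentiable DTW concrete, the following is a minimal sketch of the standard SoftDTW recursion, in which the hard minimum of classical DTW is replaced by a smooth soft-minimum controlled by a temperature `gamma`. It is an illustration only, not the implementation used in this work: the function names `soft_min` and `soft_dtw`, the squared Euclidean local cost, and the toy sequences are assumptions made for this example. Because the cost is computed directly on real-valued vectors, the same code applies to real-valued target sequences.

```python
import numpy as np


def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    z = -np.asarray(values, dtype=float) / gamma
    z_max = z.max()
    return -gamma * (z_max + np.log(np.exp(z - z_max).sum()))


def soft_dtw(x, y, gamma=1.0):
    """SoftDTW cost between sequences x (n, d) and y (m, d), with gamma > 0.

    Assumption for this sketch: squared Euclidean distance as local cost.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)

    # Pairwise local costs between all frames of x and y.
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)

    # Accumulated cost matrix with an extra border row and column.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Classical DTW would take min() here; SoftDTW uses soft_min().
            acc[i, j] = cost[i - 1, j - 1] + soft_min(
                [acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]], gamma
            )
    return acc[n, m]


# Toy example: a short prediction sequence and a longer real-valued target.
prediction = [[0.0, 1.0], [1.0, 0.0]]
target = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0]]
loss = soft_dtw(prediction, target, gamma=0.1)
```

As `gamma` approaches zero, the soft-minimum approaches the hard minimum and the value approaches the classical DTW cost. For `gamma > 0`, the result is a smooth function of the inputs, which is what allows it to serve as a training loss.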