Performance-score synchronization is an integral task in signal processing,
which entails generating an accurate mapping between an audio recording of a
performance and the corresponding musical score. Traditional synchronization
methods compute alignment using knowledge-driven and stochastic approaches, and
are typically unable to generalize well to different domains and modalities. We
present a novel data-driven method for structure-aware performance-score
synchronization. We propose a convolutional-attentional architecture trained
with a custom loss based on time-series divergence. We conduct experiments for
the audio-to-MIDI and audio-to-image alignment tasks pertaining to different
score modalities. We validate the effectiveness of our method via ablation
studies and comparisons with state-of-the-art alignment approaches. We
demonstrate that our approach outperforms previous synchronization methods for
a variety of test settings across score modalities and acoustic conditions. Our
method is also robust to structural differences between the performance and
score sequences, a common failure case for standard alignment approaches.

Comment: Published in IEEE Signal Processing Letters, Volume 29, December 202