Recent data- and learning-based sound source localization (SSL) methods have
shown strong performance in challenging acoustic scenarios. However, little
work has been done on adapting such methods to consistently track multiple
sources that appear and disappear, as would occur in reality. In this paper,
we present a new training strategy for deep learning SSL models with a
straightforward implementation based on the mean squared error of the optimal
association between estimated and reference positions in the preceding time
frames. It optimizes the desired properties of a tracking system: handling a
time-varying number of sources and ordering localization estimates according to
their trajectories, minimizing identity switches (IDSs). Evaluation on
simulated data of multiple reverberant moving sources and on two model
architectures demonstrates its effectiveness in reducing identity switches without
compromising frame-wise localization accuracy.

Comment: Accepted for publication at the 31st European Signal Processing
Conference (EUSIPCO 2023)
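
The training criterion summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the window length, the brute-force search over permutations, the fallback used at the first frame, and the fixed number of sources are all assumptions made for illustration. The estimate-to-reference association is chosen as the one with the lowest mean squared error over the preceding frames, and that association is then used to score the current frame.

```python
from itertools import permutations

import numpy as np


def best_permutation(est, ref):
    """Return the output ordering with the lowest MSE over a block of frames.

    est, ref: arrays of shape (frames, sources, dims).
    Brute force over all permutations (an assumption; fine for few sources).
    """
    n_sources = ref.shape[1]
    best, best_cost = None, np.inf
    for perm in permutations(range(n_sources)):
        cost = np.mean((est[:, list(perm)] - ref) ** 2)
        if cost < best_cost:
            best, best_cost = perm, cost
    return best, best_cost


def history_assigned_mse(est, ref, t, window=3):
    """MSE at frame t under the association that is optimal on preceding frames.

    `window` is a hypothetical parameter: the number of past frames used to
    pick the association. At t == 0 there is no history, so frame 0 itself
    is used (also an assumption).
    """
    start = max(0, t - window)
    stop = t if t > 0 else 1
    perm, _ = best_permutation(est[start:stop], ref[start:stop])
    return float(np.mean((est[t, list(perm)] - ref[t]) ** 2))


# Toy example: two static 1-D sources at 0 and 10, four frames.
ref = np.tile(np.array([[0.0], [10.0]]), (4, 1, 1))

# The model outputs the sources in swapped order, consistently over time.
est_consistent = ref[:, ::-1].copy()
est_consistent[3] = [[10.5], [0.5]]
loss_consistent = history_assigned_mse(est_consistent, ref, t=3)

# Same model, but the output order flips at the last frame (identity switch).
est_switched = ref[:, ::-1].copy()
est_switched[3] = [[0.0], [10.0]]
loss_switched = history_assigned_mse(est_switched, ref, t=3)
```

In this toy case a frame-wise optimal association would give the switched estimate zero error, whereas the history-based association penalizes it heavily. That is the behavior a tracking-oriented loss is meant to encourage. Handling a time-varying number of sources is not covered by this sketch.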