Through solving pretext tasks, self-supervised learning leverages unlabeled
data to extract latent representations that can replace traditional input
features in downstream tasks. In audio and speech signal processing, a wide
range of features were engineered through decades of research effort. As it
turns out, learning to predict such features (a.k.a. pseudo-labels) is a
particularly relevant pretext task, leading to self-supervised
representations that are effective for downstream tasks. However,
methods and common practices for combining such pretext tasks to improve
downstream performance have not been properly explored and understood. In
fact, the process relies almost exclusively on a computationally heavy
experimental procedure, which becomes intractable as the number of pretext
tasks grows. This paper introduces a method to select a group
of pretext tasks among a set of candidates. The method we propose estimates
calibrated weights for the partial losses corresponding to the considered
pretext tasks during the self-supervised training process. Experiments
conducted on automatic speech recognition, speaker recognition, and emotion
recognition validate our approach: the groups of pretext tasks selected and
weighted with our method outperform classic baselines, facilitating the
selection and combination of relevant pseudo-labels for self-supervised
representation learning.
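
To make the weighting scheme concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of a multi-pretext-task objective in which each partial loss is scaled by a per-task weight. All names here (PSEUDO_LABELS, MultiPretextModel, the numeric weights) are illustrative assumptions; in the proposed method the weights are estimated during training rather than fixed by hand.

import torch
import torch.nn as nn

# Hypothetical pseudo-labels used as pretext-task targets.
PSEUDO_LABELS = ["f0", "energy", "mfcc"]

class MultiPretextModel(nn.Module):
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        # One regression head per considered pretext task.
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden, 1) for name in PSEUDO_LABELS}
        )

    def forward(self, x):
        h, _ = self.encoder(x)  # (batch, time, hidden)
        return {name: head(h) for name, head in self.heads.items()}

model = MultiPretextModel()
criterion = nn.L1Loss()
# Placeholder per-task weights; the paper's contribution is estimating
# calibrated values for these during self-supervised training.
weights = {"f0": 0.5, "energy": 0.2, "mfcc": 0.3}

x = torch.randn(4, 100, 80)  # dummy batch of acoustic feature frames
targets = {n: torch.randn(4, 100, 1) for n in PSEUDO_LABELS}  # dummy pseudo-labels

preds = model(x)
# Total self-supervised loss: weighted sum of the partial pretext losses.
loss = sum(weights[n] * criterion(preds[n], targets[n]) for n in PSEUDO_LABELS)
loss.backward()

A pretext task whose estimated weight is close to zero contributes little to this sum, which suggests how calibrated weights can also serve as a signal for selecting which pseudo-labels to keep.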