Numerous examples in the literature have shown that deep learning models can
work well with multimodal data. Recently, CLIP has enabled deep
learning systems to learn shared latent spaces between images and text
descriptions, with outstanding zero- or few-shot results in downstream tasks.
In this paper we explore the same idea proposed by CLIP but applied to the
speech domain, where the phonetic and acoustic spaces usually coexist. We train
a CLIP-based model with the aim of learning shared representations of phonetic and
acoustic spaces. The results show that the proposed model is sensitive to
phonetic changes, with a 91% score drop when 20% of the phonemes are replaced
at random, while remaining substantially robust to different kinds of
noise, with only a 10% performance drop when the audio is mixed with 75% Gaussian
noise. We also provide empirical evidence showing that the resulting embeddings
are useful for a variety of downstream applications, such as intelligibility
evaluation and the reuse of rich pre-trained phonetic embeddings in
speech generation tasks. Finally, we discuss potential applications with
interesting implications for the speech generation and recognition fields.

Comment: In proceedings of the 26th European Conference on Artificial
Intelligence (ECAI 2023). 8 pages + 1 appendix page.
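The abstract does not spell out the training objective, but a minimal, hypothetical sketch of what a CLIP-style dual encoder over phoneme sequences and acoustic frames could look like, trained with a symmetric InfoNCE loss, is shown below. The encoder architectures, dimensions, and names (e.g. PhoneticAcousticCLIP) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhoneticAcousticCLIP(nn.Module):
    """Hypothetical CLIP-style dual encoder: one branch embeds phoneme
    sequences, the other embeds acoustic frames, both projected into a
    shared, L2-normalized latent space."""

    def __init__(self, n_phonemes=50, acoustic_dim=80, embed_dim=256):
        super().__init__()
        # Assumed phonetic branch: phoneme embedding followed by a GRU.
        self.phoneme_embed = nn.Embedding(n_phonemes, embed_dim)
        self.phonetic_rnn = nn.GRU(embed_dim, embed_dim, batch_first=True)
        # Assumed acoustic branch: GRU over frame-level features (e.g. mel frames).
        self.acoustic_rnn = nn.GRU(acoustic_dim, embed_dim, batch_first=True)
        # Learnable temperature, initialized near log(1/0.07) as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(2.66))

    def forward(self, phonemes, acoustics):
        # phonemes: (B, Tp) int64 phoneme IDs; acoustics: (B, Ta, acoustic_dim) floats.
        _, hp = self.phonetic_rnn(self.phoneme_embed(phonemes))
        _, ha = self.acoustic_rnn(acoustics)
        zp = F.normalize(hp[-1], dim=-1)  # (B, embed_dim) phonetic embedding
        za = F.normalize(ha[-1], dim=-1)  # (B, embed_dim) acoustic embedding
        return zp, za

def clip_loss(zp, za, logit_scale):
    # Symmetric InfoNCE: matching phonetic/acoustic pairs lie on the diagonal
    # of the batch similarity matrix.
    logits = logit_scale.exp() * zp @ za.t()              # (B, B)
    targets = torch.arange(zp.size(0), device=zp.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random data, just to show the shapes involved.
model = PhoneticAcousticCLIP()
phonemes = torch.randint(0, 50, (8, 40))   # batch of phoneme ID sequences
acoustics = torch.randn(8, 200, 80)        # batch of mel-spectrogram-like frames
zp, za = model(phonemes, acoustics)
loss = clip_loss(zp, za, model.logit_scale)
loss.backward()
```

At inference time, the cosine similarity between a phonetic embedding and an acoustic embedding can serve as a matching score, which is the kind of quantity a phoneme-replacement or noise-mixing robustness test would track.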