Acoustic word embeddings are typically created by training a pooling function
using pairs of word-like units. For unsupervised systems, these pairs are mined using
k-nearest neighbor (KNN) search, which is slow. Recently, mean-pooled
representations from a pre-trained self-supervised English model were suggested
as a promising alternative, but their performance on target languages was not
fully competitive. Here, we explore improvements to both approaches: we use
continued pre-training to adapt the self-supervised model to the target
language, and we use a multilingual phone recognizer (MPR) to mine phone n-gram
pairs for training the pooling function. Evaluating on four languages, we show
that both methods outperform a recent approach on word discrimination.
Moreover, the MPR method is orders of magnitude faster than KNN and is highly
data-efficient. We also show a small improvement from performing learned
pooling on top of the representations from continued pre-training.

Comment: Accepted to Interspeech 202
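To make the mean-pooling approach concrete, here is a minimal sketch (not from the paper): frame-level features from a self-supervised speech model are averaged over a word segment to give a fixed-dimensional acoustic word embedding, and two embeddings are compared with cosine similarity as in word discrimination. The feature dimension, frame counts, and segment boundaries below are illustrative assumptions, with random vectors standing in for real model outputs.

```python
import numpy as np

# Stand-in for frame-level features from a pre-trained self-supervised
# speech model (e.g. ~50 frames/sec, 768-dim vectors); random values are
# used here purely for illustration.
rng = np.random.default_rng(0)
frames_a = rng.standard_normal((120, 768)).astype(np.float32)
frames_b = rng.standard_normal((95, 768)).astype(np.float32)

def mean_pool(features: np.ndarray, start: int, end: int) -> np.ndarray:
    """Average the frame vectors in [start, end) into one fixed-size
    acoustic word embedding."""
    return features[start:end].mean(axis=0)

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Similarity score used to judge same-word vs. different-word pairs."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Embed two word-like segments (boundaries assumed to come from, e.g.,
# a phone recognizer or forced alignment) and compare them.
emb_a = mean_pool(frames_a, 20, 55)
emb_b = mean_pool(frames_b, 10, 48)
print(cosine_similarity(emb_a, emb_b))
```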