The paper presents a method for spoken term detection based on the
Transformer architecture. We propose an encoder-encoder architecture employing
two BERT-like encoders with additional modifications, including convolutional
and upsampling layers, attention masking, and shared parameters. The encoders
project a recognized hypothesis and a searched term into a shared embedding
space, where the score of a putative hit is computed as a calibrated dot
product. In the experiments, we used the Wav2Vec 2.0 speech recognizer, and the
proposed system outperformed a baseline method based on deep LSTMs on the
English and Czech STD datasets derived from the USC Shoah Foundation Visual History Archive (MALACH).
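
For illustration only, the following is a minimal sketch of the scoring step described above: both encoders map their inputs into a shared embedding space, and the score of a putative hit is a calibrated dot product of the two embeddings. This is not the authors' implementation; the module names, the linear stand-ins for the BERT-like encoders, and the affine form of the calibration are all assumptions.

    # Sketch of shared-embedding scoring (assumed names and calibration form).
    import torch
    import torch.nn as nn

    class SharedSpaceScorer(nn.Module):
        def __init__(self, dim: int = 256):
            super().__init__()
            # Placeholders standing in for the two BERT-like encoders
            # (hypothesis side and term side).
            self.hyp_encoder = nn.Linear(dim, dim)
            self.term_encoder = nn.Linear(dim, dim)
            # Calibration of the raw dot product, assumed here to be affine.
            self.scale = nn.Parameter(torch.ones(1))
            self.bias = nn.Parameter(torch.zeros(1))

        def forward(self, hyp_feats, term_feats):
            e_hyp = self.hyp_encoder(hyp_feats)     # (batch, dim) hypothesis embedding
            e_term = self.term_encoder(term_feats)  # (batch, dim) term embedding
            raw = (e_hyp * e_term).sum(dim=-1)      # dot product in the shared space
            return self.scale * raw + self.bias     # calibrated score of the putative hit

    scorer = SharedSpaceScorer()
    scores = scorer(torch.randn(4, 256), torch.randn(4, 256))
    print(scores.shape)  # torch.Size([4]), one score per hypothesis-term pair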