Contrastive Siamese Network for Semi-supervised Speech Recognition
This paper introduces the contrastive siamese (c-siam) network, an architecture
for leveraging unlabeled acoustic data in speech recognition. c-siam is the
first network that extracts high-level linguistic information from speech by
matching the outputs of two identical transformer encoders. It contains an
augmented branch and a target branch, which are trained by: (1) masking inputs
and matching outputs with a contrastive loss, (2) incorporating a stop-gradient
operation on the target branch, (3) using an extra learnable transformation on
the augmented branch, and (4) introducing new temporal augment functions to
prevent the shortcut learning problem. We use the Libri-light 60k-hour
unsupervised data and the LibriSpeech 100-hour/960-hour supervised data to
compare c-siam with other best-performing systems. Our experiments show that
c-siam provides a 20% relative word error rate improvement over wav2vec
baselines. A c-siam network with 450M parameters achieves competitive results
compared to state-of-the-art networks with 600M parameters.
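
The training recipe in (1)-(4) can be pictured as a single forward/backward
step: the target branch encodes clean features under a stop gradient, the
augmented branch encodes masked features and passes them through an extra
learnable transformation, and a frame-wise contrastive loss ties the two
branches together. The PyTorch sketch below is only illustrative; the class
name, hyperparameters, masking scheme, and the exact form of the contrastive
loss are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CSiamSketch(nn.Module):
    """Minimal sketch of a c-siam-style training step (hypothetical design)."""

    def __init__(self, dim=512, nhead=8, nlayers=6, temperature=0.1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=nhead, batch_first=True)
        # One transformer encoder shared by both branches (assumption).
        self.encoder = nn.TransformerEncoder(layer, num_layers=nlayers)
        # Extra learnable transformation, applied on the augmented branch only.
        self.augment_head = nn.Linear(dim, dim)
        self.temperature = temperature

    def forward(self, features, masked_features):
        # Target branch: clean features, gradients stopped (stop-gradient op).
        with torch.no_grad():
            target = self.encoder(features)
        # Augmented branch: masked features plus the learnable transformation.
        augmented = self.augment_head(self.encoder(masked_features))
        return self.contrastive_loss(augmented, target)

    def contrastive_loss(self, augmented, target):
        # Frame-wise contrastive loss: each augmented frame should match the
        # target frame at the same position; all other frames act as negatives.
        b, t, d = augmented.shape
        a = F.normalize(augmented.reshape(b * t, d), dim=-1)
        z = F.normalize(target.reshape(b * t, d), dim=-1)
        logits = a @ z.t() / self.temperature
        labels = torch.arange(b * t, device=logits.device)
        return F.cross_entropy(logits, labels)


# Usage with random stand-in features; the zero-masking below is a crude
# placeholder for the paper's temporal augment functions.
model = CSiamSketch()
feats = torch.randn(2, 100, 512)      # (batch, frames, feature dim)
masked = feats.clone()
masked[:, 40:60, :] = 0.0             # mask a span of frames
loss = model(feats, masked)
loss.backward()                        # gradients flow only through the augmented branch
```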