64 research outputs found
Self-Supervised Contrastive Learning for Unsupervised Phoneme Segmentation
We propose a self-supervised representation learning model for the task of
unsupervised phoneme boundary detection. The model is a convolutional neural
network that operates directly on the raw waveform. It is optimized to identify
spectral changes in the signal using the Noise-Contrastive Estimation
principle. At test time, a peak detection algorithm is applied over the model
outputs to produce the final boundaries. As such, the proposed model is trained
in a fully unsupervised manner with no manual annotations in the form of target
boundaries nor phonetic transcriptions. We compare the proposed approach to
several unsupervised baselines using both TIMIT and Buckeye corpora. Results
suggest that our approach surpasses the baseline models and reaches
state-of-the-art performance on both data sets. Furthermore, we experimented
with expanding the training set with additional examples from the Librispeech
corpus. We evaluated the resulting model on distributions and languages that
were not seen during the training phase (English, Hebrew and German) and showed
that utilizing additional untranscribed data is beneficial for model
performance.Comment: Interspeech 2020 pape
Towards End-to-end Unsupervised Speech Recognition
Unsupervised speech recognition has shown great potential to make Automatic
Speech Recognition (ASR) systems accessible to every language. However,
existing methods still heavily rely on hand-crafted pre-processing. Similar to
the trend of making supervised speech recognition end-to-end, we introduce
\wvu~which does away with all audio-side pre-processing and improves accuracy
through better architecture. In addition, we introduce an auxiliary
self-supervised objective that ties model predictions back to the input.
Experiments show that \wvu~improves unsupervised recognition results across
different languages while being conceptually simpler.Comment: Preprin
- …