11 research outputs found
Efficient Transformers with Dynamic Token Pooling
Transformers achieve unrivalled performance in modelling language, but remain
inefficient in terms of memory and time complexity. A possible remedy is to
reduce the sequence length in the intermediate layers by pooling fixed-length
segments of tokens. Nevertheless, natural units of meaning, such as words or
phrases, display varying sizes. To address this mismatch, we equip language
models with a dynamic-pooling mechanism, which predicts segment boundaries in
an autoregressive fashion. We compare several methods to infer boundaries,
including end-to-end learning through stochastic re-parameterisation,
supervised learning (based on segmentations from subword tokenizers or spikes
in conditional entropy), as well as linguistically motivated boundaries. We
perform character-level evaluation on texts from multiple datasets and
morphologically diverse languages. The results demonstrate that dynamic
pooling, which jointly segments and models language, is both faster and more
accurate than vanilla Transformers and fixed-length pooling within the same
computational budget.
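The shortening step described above can be sketched in a few lines: given a binary mask of predicted segment starts, mean-pool the representations inside each variable-length segment. The `pool_segments` helper and the mask encoding are illustrative assumptions, not the paper's implementation (which learns the boundary predictor, e.g. via stochastic re-parameterisation):

```python
import numpy as np

def pool_segments(hidden, boundaries):
    """Mean-pool token representations within variable-length segments.

    hidden:     (T, d) array of token representations.
    boundaries: length-T binary array; 1 marks the first token of a
                new segment (boundaries[0] must be 1).

    Returns an (S, d) array with one pooled vector per segment, so the
    intermediate layers process a shorter sequence.
    """
    starts = np.flatnonzero(boundaries)        # segment start indices
    ends = np.append(starts[1:], len(hidden))  # exclusive end indices
    return np.stack([hidden[s:e].mean(axis=0) for s, e in zip(starts, ends)])
```

With boundaries placed at natural units (words, morphemes), the pooled sequence length tracks the number of units rather than the number of characters.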
Information Retrieval for ZeroSpeech 2021: The Submission by University of Wroclaw
We present a number of low-resource approaches to the tasks of the Zero
Resource Speech Challenge 2021. We build on the unsupervised representations of
speech proposed by the organizers as a baseline, derived from CPC and clustered
with the k-means algorithm. We demonstrate that simple methods of refining
those representations can narrow the gap, or even improve upon the solutions
which use a high computational budget. The results lead to the conclusion that
the CPC-derived representations are still too noisy for training language
models, but stable enough for simpler forms of pattern matching and retrieval.
Comment: Published in Interspeech 2021
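The discretisation step in the baseline pipeline, clustering frame-level CPC states with k-means so each frame gets a pseudo-phone unit id, might be sketched as follows. `kmeans_quantize` is a hypothetical helper using plain Lloyd iterations, not the challenge baseline code:

```python
import numpy as np

def kmeans_quantize(feats, k, iters=20, seed=0):
    """Cluster frame-level representations with Lloyd's k-means and
    return one discrete unit id per frame (a pseudo-phone transcript).

    feats: (T, d) array of per-frame features (e.g. CPC states).
    """
    rng = np.random.default_rng(seed)
    centroids = feats[rng.choice(len(feats), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each frame to its nearest centroid
        d2 = ((feats[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        # recompute centroids; keep the old centroid if a cluster empties
        for j in range(k):
            members = feats[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return assign, centroids
```

The resulting unit sequences are what downstream pattern matching and retrieval operate on; the refinement methods in the paper work on the representations before this step.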
Aligned Contrastive Predictive Coding
We investigate the possibility of forcing a self-supervised model trained
using a contrastive predictive loss to extract slowly varying latent
representations. Rather than producing individual predictions for each of the
future representations, the model emits a sequence of predictions shorter than
that of the upcoming representations to which they will be aligned. In this
way, the prediction network solves a simpler task of predicting the next
symbols, but not their exact timing, while the encoding network is trained to
produce piece-wise constant latent codes. We evaluate the model on a speech
coding task and demonstrate that the proposed Aligned Contrastive Predictive
Coding (ACPC) leads to higher linear phone prediction accuracy and lower ABX
error rates, while being slightly faster to train due to the reduced number of
prediction heads.
Comment: Published in Interspeech 2021
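The alignment idea can be illustrated with a small dynamic programme that monotonically assigns K predictions to M ≥ K future frames, each prediction covering a contiguous, non-empty run of frames. `align_predictions` is a sketch assuming a squared-distance cost, not the ACPC training code:

```python
import numpy as np

def align_predictions(preds, targets):
    """Monotonically align K predictions to M >= K future frames.

    Each prediction must cover a contiguous, non-empty run of frames,
    in order. Returns the per-frame prediction index minimising the
    total squared distance, found by dynamic programming.
    """
    K, M = len(preds), len(targets)
    cost = ((preds[:, None, :] - targets[None, :, :]) ** 2).sum(-1)  # (K, M)
    dp = np.full((K, M), np.inf)
    dp[0, 0] = cost[0, 0]
    for j in range(1, M):
        dp[0, j] = dp[0, j - 1] + cost[0, j]
    for i in range(1, K):
        for j in range(i, M):
            dp[i, j] = cost[i, j] + min(dp[i, j - 1], dp[i - 1, j - 1])
    # backtrack to recover which prediction each frame was assigned to
    assign = np.empty(M, dtype=int)
    i = K - 1
    for j in range(M - 1, -1, -1):
        assign[j] = i
        if j > 0 and i > 0 and dp[i - 1, j - 1] <= dp[i, j - 1]:
            i -= 1
    return assign
```

Because a prediction may stretch over several frames, the encoder is free to emit piece-wise constant codes without being penalised for imprecise timing.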
Variable-rate hierarchical CPC leads to acoustic unit discovery in speech
Accepted to the 36th Conference on Neural Information Processing Systems (NeurIPS 2022)
The success of deep learning comes from its ability to capture the hierarchical structure of data by learning high-level representations defined in terms of low-level ones. In this paper we explore self-supervised learning of hierarchical representations of speech by applying multiple levels of Contrastive Predictive Coding (CPC). We observe that simply stacking two CPC models does not yield significant improvements over single-level architectures. Inspired by the fact that speech is often described as a sequence of discrete units unevenly distributed in time, we propose a model in which the output of a low-level CPC module is non-uniformly downsampled to directly minimize the loss of a high-level CPC module. The latter is designed to also enforce a prior of separability and discreteness in its representations by enforcing dissimilarity of successive high-level representations through focused negative sampling, and by quantization of the prediction targets. Accounting for the structure of the speech signal improves upon single-level CPC features and enhances the disentanglement of the learned representations, as measured by downstream speech recognition tasks, while resulting in a meaningful segmentation of the signal that closely resembles phone boundaries.
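One way to picture the non-uniform downsampling between the two CPC levels is a heuristic that opens a new segment wherever successive low-level frames become dissimilar, then mean-pools each segment. The cosine-similarity threshold here is an illustrative stand-in for the model's learned boundary mechanism:

```python
import numpy as np

def downsample_by_boundaries(frames, threshold=0.9):
    """Non-uniformly downsample frame features: start a new segment
    wherever the cosine similarity of successive frames drops below
    `threshold`, then mean-pool each segment.

    frames: (T, d) array of low-level features; returns (S, d), S <= T.
    """
    norm = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    sim = (norm[1:] * norm[:-1]).sum(axis=1)      # (T-1,) successive sims
    is_start = np.concatenate([[True], sim < threshold])
    starts = np.flatnonzero(is_start)
    ends = np.append(starts[1:], len(frames))
    return np.stack([frames[s:e].mean(axis=0) for s, e in zip(starts, ends)])
```

The high-level CPC module then operates on the shorter, unit-rate sequence rather than on fixed-rate frames.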
Unsupervised Neural Segmentation and Clustering for Unit Discovery in Sequential Data
We study the problem of unsupervised segmentation and clustering of handwritten lines with applications to character discovery. We propose a constrained variant of the Vector Quantized Variational Autoencoder (VQ-VAE) which produces a discrete and piecewise-constant encoding of the data. We show that the constrained quantization task is dual to a Markovian dynamics prior placed on the latent codes. This view facilitates a probabilistic interpretation of the constraints and allows efficient inference. We demonstrate the effectiveness of the proposed method in the context of unsupervised handwriting character discovery in 17th-century scanned manuscripts.
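The duality with a Markovian prior suggests a Viterbi-style assignment: choose codebook indices that trade quantisation error against a cost for switching codes between consecutive steps, which favours piecewise-constant encodings. `constrained_quantize` is a minimal sketch of that idea, not the paper's VQ-VAE training procedure:

```python
import numpy as np

def constrained_quantize(latents, codebook, switch_cost=1.0):
    """Assign each latent vector a codebook index, minimising squared
    quantisation error plus a fixed penalty each time the code changes
    between consecutive steps (a Markovian prior on the latent codes).
    Solved exactly with Viterbi-style dynamic programming.
    """
    T, K = len(latents), len(codebook)
    err = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K)
    dp = err[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        stay = dp                       # keep the same code as t-1
        jump = dp.min() + switch_cost   # switch from the best previous code
        best_prev = dp.argmin()
        choose_jump = jump < stay
        back[t] = np.where(choose_jump, best_prev, np.arange(K))
        dp = err[t] + np.where(choose_jump, jump, stay)
    codes = np.empty(T, dtype=int)
    codes[-1] = dp.argmin()
    for t in range(T - 1, 0, -1):
        codes[t - 1] = back[t, codes[t]]
    return codes
```

A small `switch_cost` recovers per-step nearest-neighbour quantisation; a large one forces long constant runs, i.e. segments.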
A Convolutional Deep Markov Model for Unsupervised Speech Representation Learning
Probabilistic Latent Variable Models (LVMs) provide an alternative to self-supervised learning approaches for linguistic representation learning from speech. LVMs admit an intuitive probabilistic interpretation where the latent structure shapes the information extracted from the signal. Even though LVMs have recently seen a renewed interest due to the introduction of Variational Autoencoders (VAEs), their use for speech representation learning remains largely unexplored. In this work, we propose the Convolutional Deep Markov Model (ConvDMM), a Gaussian state-space model with non-linear emission and transition functions modelled by deep neural networks. This unsupervised model is trained using black-box variational inference. A deep convolutional neural network is used as an inference network for structured variational approximation. When trained on a large-scale speech dataset (LibriSpeech), ConvDMM produces features that significantly outperform multiple self-supervised feature extraction methods on linear phone classification and recognition on the Wall Street Journal dataset. Furthermore, we found that ConvDMM complements self-supervised methods like Wav2Vec and PASE, improving on the results achieved with any of the methods alone. Lastly, we find that ConvDMM features enable learning better phone recognizers than any other features in an extreme low-resource regime with few labelled training examples.
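The generative structure of such a Gaussian state-space model can be sketched with toy non-linearities: a latent chain z_t evolving through a non-linear transition, and observations x_t produced by a non-linear emission. Random matrices with `tanh` stand in for ConvDMM's deep transition and emission networks:

```python
import numpy as np

def sample_deep_markov(T, zdim, xdim, seed=0):
    """Sample a sequence from a toy Gaussian state-space model:

        z_t = tanh(A z_{t-1}) + eps_z   (non-linear transition)
        x_t = tanh(B z_t)     + eps_x   (non-linear emission)

    A, B and tanh are illustrative stand-ins for deep networks.
    Returns a (T, xdim) array of observations.
    """
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(zdim, zdim)) / np.sqrt(zdim)
    B = rng.normal(size=(xdim, zdim)) / np.sqrt(zdim)
    z = np.zeros(zdim)
    xs = []
    for _ in range(T):
        z = np.tanh(A @ z) + 0.1 * rng.normal(size=zdim)
        xs.append(np.tanh(B @ z) + 0.1 * rng.normal(size=xdim))
    return np.stack(xs)
```

Inference in the paper runs in the opposite direction: a convolutional network maps observed speech back to an approximate posterior over the latent chain.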