73 research outputs found
The role of Furry in YAP inactivation and the involvement of 14-3-3 in CEP97 degradation
Tohoku University, Kazumasa Ohashi
Language Modeling with Deep Transformers
We explore deep autoregressive Transformer models in language modeling for
speech recognition. We focus on two aspects. First, we revisit Transformer
model configurations specifically for language modeling. We show that well
configured Transformer models outperform our baseline models based on the
shallow stack of LSTM recurrent neural network layers. We carry out experiments
on the open-source LibriSpeech 960hr task, for both 200K vocabulary word-level
and 10K byte-pair encoding subword-level language modeling. We apply our
word-level models to conventional hybrid speech recognition by lattice
rescoring, and the subword-level models to attention based encoder-decoder
models by shallow fusion. Second, we show that deep Transformer language models
do not require positional encoding. The positional encoding is an essential
augmentation for the self-attention mechanism which is invariant to sequence
ordering. However, in autoregressive setup, as is the case for language
modeling, the amount of information increases along the position dimension,
which is a positional signal by its own. The analysis of attention weights
shows that deep autoregressive self-attention models can automatically make use
of such positional information. We find that removing the positional encoding
even slightly improves the performance of these models. Comment: To appear in the proceedings of INTERSPEECH 2019
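A minimal sketch (PyTorch; not the authors' actual setup or hyperparameters) of the key point above: a decoder-only Transformer stack with no positional encoding at all. The causal mask alone breaks permutation invariance, since position t only attends to positions up to t, so the growing context acts as an implicit positional signal.

import torch
import torch.nn as nn

class CausalBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        T = x.size(1)
        # Boolean causal mask: True = not allowed to attend (future positions).
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h, _ = self.attn(self.ln1(x), self.ln1(x), self.ln1(x), attn_mask=mask)
        x = x + h
        return x + self.ff(self.ln2(x))

class TinyLM(nn.Module):
    def __init__(self, vocab=10000, d_model=512, n_layers=6):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_model)   # note: no positional embedding table
        self.blocks = nn.ModuleList([CausalBlock(d_model) for _ in range(n_layers)])
        self.out = nn.Linear(d_model, vocab)

    def forward(self, tokens):                    # tokens: (batch, time)
        x = self.emb(tokens)
        for blk in self.blocks:
            x = blk(x)
        return self.out(x)                        # next-token logits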
Improved training of end-to-end attention models for speech recognition
Sequence-to-sequence attention-based models on subword units allow simple
open-vocabulary end-to-end speech recognition. In this work, we show that such
models can achieve competitive results on the Switchboard 300h and LibriSpeech
1000h tasks. In particular, we report the state-of-the-art word error rates
(WER) of 3.54% on the dev-clean and 3.82% on the test-clean evaluation subsets
of LibriSpeech. We introduce a new pretraining scheme by starting with a high
time reduction factor and lowering it during training, which is crucial both
for convergence and final performance. In some experiments, we also use an
auxiliary CTC loss function to help the convergence. In addition, we train long
short-term memory (LSTM) language models on subword units. By shallow fusion,
we report up to 27% relative improvements in WER over the attention baseline
without a language model. Comment: Submitted to Interspeech 2018
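The shallow fusion mentioned above can be illustrated with a short sketch (illustrative only; the interpolation weight and interface are assumptions, tuned on development data in practice): at each decoding step the decoder's log-probabilities are combined with the external LM's log-probabilities before beam-search scoring.

import torch

def shallow_fusion_step(decoder_logits, lm_logits, lm_weight=0.3):
    """Combine one decoding step: log p = log p_dec + lm_weight * log p_lm.
    Both tensors have shape (batch, vocab); lm_weight is a tuned scalar."""
    log_p_dec = torch.log_softmax(decoder_logits, dim=-1)
    log_p_lm = torch.log_softmax(lm_logits, dim=-1)
    return log_p_dec + lm_weight * log_p_lm   # fed to the beam-search scorer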
Accelerating Neural Self-Improvement via Bootstrapping
Few-shot learning with sequence-processing neural networks (NNs) has recently
attracted a new wave of attention in the context of large language models. In
the standard N-way K-shot learning setting, an NN is explicitly optimised to
learn to classify unlabelled inputs by observing a sequence of NK labelled
examples. This pressures the NN to learn a learning algorithm that achieves
optimal performance, given the limited number of training examples. Here we
study an auxiliary loss that encourages further acceleration of few-shot
learning, by applying recently proposed bootstrapped meta-learning to NN
few-shot learners: we optimise the K-shot learner to match its own performance
achievable by observing more than NK examples, using only NK examples.
Promising results are obtained on the standard Mini-ImageNet dataset. Our code
is public. Comment: Presented at ICLR 2023 Workshop on Mathematical and Empirical
Understanding of Foundation Models,
https://openreview.net/forum?id=SDwUYcyOCy
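A hedged sketch of a bootstrapped auxiliary term of the kind described above (not the paper's exact objective; the model interface `model(context, query) -> logits` is an assumption): the learner's predictions after the short NK-example context are pushed towards its own detached predictions after a longer context.

import torch
import torch.nn.functional as F

def bootstrap_aux_loss(model, short_ctx, long_ctx, query):
    # Bootstrap target: same model, more examples, gradients stopped.
    with torch.no_grad():
        target = F.softmax(model(long_ctx, query), dim=-1)
    student = F.log_softmax(model(short_ctx, query), dim=-1)
    # KL(target || student), averaged over the batch.
    return F.kl_div(student, target, reduction="batchmean")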
Topological Neural Discrete Representation Learning à la Kohonen
Unsupervised learning of discrete representations from continuous ones in
neural networks (NNs) is the cornerstone of several applications today. Vector
Quantisation (VQ) has become a popular method to achieve such representations,
in particular in the context of generative models such as Variational
Auto-Encoders (VAEs). For example, the exponential moving average-based VQ
(EMA-VQ) algorithm is often used. Here we study an alternative VQ algorithm
based on the learning rule of Kohonen Self-Organising Maps (KSOMs; 1982) of
which EMA-VQ is a special case. In fact, KSOM is a classic VQ algorithm which
is known to offer two potential benefits over the latter: empirically, KSOM is
known to perform faster VQ, and discrete representations learned by KSOM form a
topological structure on the grid whose nodes are the discrete symbols,
resulting in an artificial version of the topographic map in the brain. We
revisit these properties by using KSOM in VQ-VAEs for image processing. In
particular, our experiments show that, while the speed-up compared to
well-configured EMA-VQ is only observable at the beginning of training, KSOM is
generally much more robust than EMA-VQ, e.g., w.r.t. the choice of
initialisation schemes. Our code is public. Comment: Two first authors
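A minimal NumPy sketch of the classic Kohonen-SOM update used here for vector quantisation (illustrative, not the VQ-VAE training code): the winning code and its neighbours on a fixed 2-D grid are pulled towards the input, and as the neighbourhood width shrinks to zero only the winner moves, recovering an EMA-VQ-style update.

import numpy as np

def ksom_update(codebook, grid_xy, x, lr=0.1, sigma=1.0):
    """codebook: (K, d) code vectors; grid_xy: (K, 2) fixed grid coordinates;
    x: (d,) input vector. Returns the updated codebook and the winner index."""
    winner = np.argmin(np.sum((codebook - x) ** 2, axis=1))        # best-matching unit
    grid_dist2 = np.sum((grid_xy - grid_xy[winner]) ** 2, axis=1)  # distance on the grid
    h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))                   # neighbourhood kernel
    codebook += lr * h[:, None] * (x - codebook)                   # pull codes towards x
    return codebook, winner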
The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization
Despite progress across a broad range of applications, Transformers have
limited success in systematic generalization. The situation is especially
frustrating in the case of algorithmic tasks, where they often fail to find
intuitive solutions that route relevant information to the right node/operation
at the right time in the grid represented by Transformer columns. To facilitate
the learning of useful control flow, we propose two modifications to the
Transformer architecture, copy gate and geometric attention. Our novel Neural
Data Router (NDR) achieves 100% length generalization accuracy on the classic
compositional table lookup task, as well as near-perfect accuracy on the simple
arithmetic task and a new variant of ListOps testing for generalization across
computational depths. NDR's attention and gating patterns tend to be
interpretable as an intuitive form of neural routing. Our code is public. Comment: Accepted to ICLR 2022
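A hedged sketch of the copy-gate idea (dimensions and gate placement are illustrative; see the paper for the exact formulation and for geometric attention): each Transformer column either overwrites its state with the new transformed value or copies its previous state forward, controlled by a learned sigmoid gate.

import torch
import torch.nn as nn

class CopyGate(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())

    def forward(self, prev_state, update):   # both: (batch, time, d_model)
        g = self.gate(update)                # g ~ 0 -> copy old state, g ~ 1 -> take update
        return (1.0 - g) * prev_state + g * update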
Practical Computational Power of Linear Transformers and Their Recurrent and Self-Referential Extensions
Recent studies of the computational power of recurrent neural networks (RNNs)
reveal a hierarchy of RNN architectures, given real-time and finite-precision
assumptions. Here we study auto-regressive Transformers with linearised
attention, a.k.a. linear Transformers (LTs) or Fast Weight Programmers (FWPs).
LTs are special in the sense that they are equivalent to RNN-like sequence
processors with a fixed-size state, while they can also be expressed as the
now-popular self-attention networks. We show that many well-known results for
the standard Transformer directly transfer to LTs/FWPs. Our formal language
recognition experiments demonstrate how recently proposed FWP extensions such
as recurrent FWPs and self-referential weight matrices successfully overcome
certain limitations of the LT, e.g., allowing for generalisation on the parity
problem. Our code is public. Comment: Accepted to EMNLP 2023 (short paper)
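The equivalence mentioned above can be made concrete with a small NumPy sketch (the feature map is an illustrative positive choice, not the one from any specific paper): autoregressive linear attention maintains a fixed-size fast-weight matrix that is updated additively at every step, i.e. it runs like an RNN with constant state size.

import numpy as np

def linear_attention_rnn(keys, values, queries, phi=lambda x: np.maximum(x, 0.0) + 1.0):
    """keys, queries: (T, d_k); values: (T, d_v). Returns outputs of shape (T, d_v)."""
    T, d_k = keys.shape
    d_v = values.shape[1]
    S = np.zeros((d_v, d_k))   # fast weight matrix: size independent of sequence length
    z = np.zeros(d_k)          # normaliser state
    outputs = []
    for t in range(T):
        k, v, q = phi(keys[t]), values[t], phi(queries[t])
        S += np.outer(v, k)                      # "program" the fast weights with (k, v)
        z += k
        outputs.append(S @ q / (z @ q + 1e-9))   # normalised linear attention output
    return np.array(outputs)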
Exploring the Promise and Limits of Real-Time Recurrent Learning
Real-time recurrent learning (RTRL) for sequence-processing recurrent neural
networks (RNNs) offers certain conceptual advantages over backpropagation
through time (BPTT). RTRL requires neither caching past activations nor
truncating context, and enables online learning. However, RTRL's time and space
complexity make it impractical. To overcome this problem, most recent work on
RTRL focuses on approximation theories, while experiments are often limited to
diagnostic settings. Here we explore the practical promise of RTRL in more
realistic settings. We study actor-critic methods that combine RTRL and policy
gradients, and test them in several subsets of DMLab-30, ProcGen, and
Atari-2600 environments. On DMLab memory tasks, our system trained on fewer
than 1.2 B environmental frames is competitive with or outperforms well-known
IMPALA and R2D2 baselines trained on 10 B frames. To scale to such challenging
tasks, we focus on certain well-known neural architectures with element-wise
recurrence, allowing for tractable RTRL without approximation. We also discuss
rarely addressed limitations of RTRL in real-world applications, such as its
complexity in the multi-layer case.
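A hedged toy example of why element-wise recurrence makes exact RTRL tractable (not the paper's actor-critic system; the linear cell and squared-error loss are assumptions): for h_t = a * h_{t-1} + W x_t with an element-wise recurrent gain a, the sensitivity dh_t/da is just a vector, so exact online gradients w.r.t. a need only O(d) extra memory.

import numpy as np

def rtrl_elementwise(xs, W, a, targets, lr=1e-2):
    """xs, targets: sequences of input/target vectors; W: (d, d_in); a: (d,)."""
    d = a.shape[0]
    h = np.zeros(d)
    s = np.zeros(d)                 # s[i] = d h[i] / d a[i], carried forward online
    for x, y in zip(xs, targets):
        s = h + a * s               # RTRL sensitivity update (uses the previous h)
        h = a * h + W @ x           # forward recurrence
        err = h - y                 # gradient of the instantaneous squared error
        a -= lr * err * s           # online update of the recurrent gains
    return a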
On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition
In conventional speech recognition, phoneme-based models outperform
grapheme-based models for non-phonetic languages such as English. The
performance gap between the two typically reduces as the amount of training
data is increased. In this work, we examine the impact of the choice of
modeling unit for attention-based encoder-decoder models. We conduct
experiments on the LibriSpeech 100hr, 460hr, and 960hr tasks, using various
target units (phoneme, grapheme, and word-piece); across all tasks, we find
that grapheme or word-piece models consistently outperform phoneme-based
models, even though they are evaluated without a lexicon or an external
language model. We also investigate model complementarity: we find that we can
improve WERs by up to 9% relative by rescoring N-best lists generated from a
strong word-piece based baseline with either the phoneme or the grapheme model.
Rescoring an N-best list generated by the phonemic system, however, provides
limited improvements. Further analysis shows that the word-piece-based models
produce more diverse N-best hypotheses, and thus lower oracle WERs, than
phonemic models. Comment: To appear in the proceedings of INTERSPEECH 2019
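A minimal sketch of the N-best rescoring step described above (weights and interface are illustrative assumptions): each hypothesis from the word-piece baseline is re-scored by the second model, and the list is re-ranked by the interpolated score.

def rescore_nbest(nbest, second_model_score, weight=0.2):
    """nbest: list of (hypothesis, first_pass_score) pairs from the baseline;
    second_model_score(hyp) returns the phoneme or grapheme model's log-score."""
    rescored = [(hyp, s1 + weight * second_model_score(hyp)) for hyp, s1 in nbest]
    return max(rescored, key=lambda pair: pair[1])[0]   # best re-ranked hypothesis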