Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
In deep learning, models typically reuse the same parameters for all inputs.
Mixture of Experts (MoE) defies this and instead selects different parameters
for each incoming example. The result is a sparsely-activated model -- with
outrageous numbers of parameters -- but a constant computational cost. However,
despite several notable successes of MoE, widespread adoption has been hindered
by complexity, communication costs and training instability -- we address these
with the Switch Transformer. We simplify the MoE routing algorithm and design
intuitive improved models with reduced communication and computational costs.
Our proposed training techniques help wrangle the instabilities and we show
large sparse models may be trained, for the first time, with lower precision
(bfloat16) formats. We design models based on T5-Base and T5-Large to obtain
up to 7x increases in pre-training speed with the same computational resources.
These improvements extend into multilingual settings where we measure gains
over the mT5-Base version across all 101 languages. Finally, we advance the
current scale of language models by pre-training up to trillion parameter
models on the "Colossal Clean Crawled Corpus" and achieve a 4x speedup over the
T5-XXL model.
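The simplified routing the abstract refers to is commonly described as top-1 ("switch") routing: each token is dispatched to the single highest-probability expert instead of a weighted mix. The following NumPy sketch illustrates that idea under assumed shapes; it is not the authors' implementation (which also includes capacity limits and load-balancing losses).

```python
import numpy as np

def switch_route(tokens, router_w, experts):
    """Top-1 ("switch") routing sketch: each token goes to exactly one expert.

    tokens:   (n, d) token representations
    router_w: (d, e) router weights
    experts:  list of e callables, each mapping (k, d) -> (k, d)
    """
    logits = tokens @ router_w                       # (n, e) router logits
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)            # softmax over experts
    choice = probs.argmax(-1)                        # top-1 expert per token
    gate = probs[np.arange(len(tokens)), choice]     # gate value scales expert output
    out = np.empty_like(tokens)
    for e, expert in enumerate(experts):
        idx = np.where(choice == e)[0]               # tokens routed to expert e
        if idx.size:
            out[idx] = gate[idx, None] * expert(tokens[idx])
    return out, choice

rng = np.random.default_rng(0)
toks = rng.normal(size=(8, 4))
w = rng.normal(size=(4, 2))
experts = [lambda x: x * 2.0, lambda x: x + 1.0]     # toy stand-ins for expert FFNs
out, choice = switch_route(toks, w, experts)
```

Because each token activates only one expert, compute per token stays constant while total parameter count grows with the number of experts.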
Mortal Computation: A Foundation for Biomimetic Intelligence
This review motivates and synthesizes research efforts in
neuroscience-inspired artificial intelligence and biomimetic computing in terms
of mortal computation. Specifically, we characterize the notion of mortality by
recasting ideas in biophysics, cybernetics, and cognitive science in terms of a
theoretical foundation for sentient behavior. We frame the mortal computation
thesis through the Markov blanket formalism and the circular causality entailed
by inference, learning, and selection. The ensuing framework -- underwritten by
the free energy principle -- could prove useful for guiding the construction of
unconventional connectionist computational systems, neuromorphic intelligence,
and chimeric agents, including sentient organoids, which stand to revolutionize
the long-term future of embodied, enactive artificial intelligence and
cognition research.
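The free-energy-principle framing mentioned above rests on the standard variational free energy, which upper-bounds surprise; as a reminder of the textbook form (stated here for orientation, not as a result of this review):

```latex
F[q] = \mathbb{E}_{q(s)}\big[\ln q(s) - \ln p(o, s)\big]
     = D_{\mathrm{KL}}\big[q(s)\,\|\,p(s \mid o)\big] - \ln p(o)
     \;\ge\; -\ln p(o)
```

Minimizing \(F\) over beliefs \(q(s)\) approximates inference; minimizing it over model parameters approximates learning, which is the circular causality the abstract invokes.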
Exploring Enhanced Motion Modeling Methods for Action Recognition
This thesis aims to address three key issues in action recognition through the enhancement of motion modeling, including handling complex motion variations, improving pseudo label quality in semi-supervised settings, and incorporating explicit motion modeling for transformers.
First, we propose to capture proper motion information, since motion dynamics such as moving tempo and action amplitude may vary considerably across video clips. To this end, we introduce a Motion Diversification and Selection (MoDS) module to generate diversified spatiotemporal motion features and select the most appropriate motion representation for categorizing the input video.
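One way to read the diversify-then-select idea is as a bank of temporal operators at different scales whose outputs are mixed by soft selection weights. The NumPy sketch below is a hypothetical simplification under assumed shapes, not the MoDS module itself.

```python
import numpy as np

def diversify_and_select(feats, branch_fns, select_w):
    """Hypothetical diversify-and-select sketch.

    feats:      (t, d) per-frame features
    branch_fns: list of b callables, each producing a (d,) motion feature
                at a different temporal scale
    select_w:   (d,) scoring vector used to rank branches (toy selection rule)
    """
    branches = np.stack([fn(feats) for fn in branch_fns])   # (b, d) diversified features
    scores = branches @ select_w                            # (b,) selection logits
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                # soft selection over branches
    return weights @ branches                               # (d,) selected motion feature

feats = np.arange(20.0).reshape(5, 4)
branch_fns = [
    lambda x: np.diff(x, axis=0).mean(0),        # short-range frame differences
    lambda x: (x[-1] - x[0]) / (len(x) - 1),     # long-range displacement per step
]
motion = diversify_and_select(feats, branch_fns, np.ones(4))
```

The soft selection keeps the operation differentiable; a hard argmax over branches would be the non-differentiable analogue.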
Second, we propose to improve pseudo label quality in semi-supervised action recognition. Previous methods use only a single network to generate pseudo labels, but a single network is limited in its ability to capture diverse motion patterns. To this end, we advocate jointly training a pair of heterogeneous networks, i.e., a 2D CNN and a 3D CNN, each characterizing different specific motion patterns. Then, we utilize a label propagation strategy within and across these networks to refine the pseudo labels.
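A minimal reading of the cross-network step: the two heterogeneous models predict on unlabeled clips, and their predictions are fused so each benefits from the other's view of motion. The sketch below uses toy probability arrays and a hypothetical confidence threshold; the thesis's actual propagation strategy is more involved.

```python
import numpy as np

def refine_pseudo_labels(p_2d, p_3d, threshold=0.8):
    """Cross-network pseudo-label refinement sketch (hypothetical fusion rule).

    p_2d, p_3d: (n, c) class probabilities from a 2D CNN and a 3D CNN
    Returns (labels, mask): fused argmax labels and a confidence mask.
    """
    fused = 0.5 * (p_2d + p_3d)           # propagate information across networks
    labels = fused.argmax(-1)             # refined pseudo labels
    mask = fused.max(-1) >= threshold     # keep only confident predictions
    return labels, mask

p_2d = np.array([[0.9, 0.1], [0.6, 0.4]])
p_3d = np.array([[0.8, 0.2], [0.3, 0.7]])
labels, mask = refine_pseudo_labels(p_2d, p_3d)
```

Averaging is only one fusion choice; the key point is that disagreement between the two architectures lowers fused confidence, filtering out unreliable pseudo labels.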
Third, we propose to perform explicit motion modeling for transformers. We observe that transformer-based methods underperform on motion-sensitive datasets, indicating their limited capacity for temporal modeling. We also note that the conventional motion representation, namely the cost volume, is structurally similar to the affinity matrix defined in self-attention, yet possesses powerful motion modeling capacity. Motivated by this, we examine the essential properties of the cost volume for effective motion modeling and integrate them into self-attention to enhance motion representation.
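The noted similarity is that both a cost volume and the self-attention affinity matrix are inner-product matching matrices between two sets of features. The toy comparison below (assumed shapes, not the thesis's formulation) shows they coincide up to the usual attention scaling factor.

```python
import numpy as np

def attention_affinity(q, k):
    """Self-attention affinity: scaled dot products between queries and keys."""
    return q @ k.T / np.sqrt(q.shape[-1])   # (n, m)

def cost_volume(f1, f2):
    """Cost volume between two frames' features: dot-product matching costs."""
    return f1 @ f2.T                        # (n, m), one cost per position pair

rng = np.random.default_rng(0)
f1, f2 = rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
# Up to the 1/sqrt(d) scale, the two constructions are identical.
same = np.allclose(attention_affinity(f1, f2) * np.sqrt(8), cost_volume(f1, f2))
```

This structural match is what makes it natural to transfer cost-volume properties (e.g., local matching ranges) into the self-attention affinity.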
We have conducted comprehensive experiments on widely used datasets to confirm the effectiveness of the proposed methods. Our approaches prove superior to other advanced methods across different scenarios.