
    Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers of parameters -- but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability -- we address these with the Switch Transformer. We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques help wrangle the instabilities, and we show that large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based on T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings, where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pre-training up to trillion parameter models on the "Colossal Clean Crawled Corpus" and achieve a 4x speedup over the T5-XXL model.
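
    A minimal sketch of the simplified top-1 ("switch") routing the abstract refers to: each token is dispatched to a single expert, so parameter count grows with the number of experts while per-token compute stays constant. The softmax router, capacity factor, and NumPy implementation below are illustrative assumptions, not the paper's reference code.

    ```python
    import numpy as np

    def switch_route(tokens, router_w, experts, capacity_factor=1.25):
        """tokens: [n, d]; router_w: [d, n_experts]; experts: list of callables."""
        logits = tokens @ router_w                        # [n, n_experts]
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)        # softmax gate
        expert_idx = probs.argmax(axis=-1)                # top-1: one expert per token
        gate = probs[np.arange(len(tokens)), expert_idx]  # scale output by gate value

        # Each expert processes at most `capacity` tokens; overflow is dropped
        # (its output stays zero and passes through the residual connection).
        capacity = int(capacity_factor * len(tokens) / len(experts))
        out = np.zeros_like(tokens)
        for e, expert in enumerate(experts):
            sel = np.flatnonzero(expert_idx == e)[:capacity]
            if sel.size:
                out[sel] = gate[sel, None] * expert(tokens[sel])
        return out

    # Usage: four tiny linear "experts" over random tokens.
    rng = np.random.default_rng(0)
    d, n_exp = 8, 4
    experts = [lambda x, w=rng.normal(size=(d, d)) / np.sqrt(d): x @ w
               for _ in range(n_exp)]
    y = switch_route(rng.normal(size=(16, d)), rng.normal(size=(d, n_exp)), experts)
    ```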

    Mortal Computation: A Foundation for Biomimetic Intelligence

    This review motivates and synthesizes research efforts in neuroscience-inspired artificial intelligence and biomimetic computing in terms of mortal computation. Specifically, we characterize the notion of mortality by recasting ideas in biophysics, cybernetics, and cognitive science in terms of a theoretical foundation for sentient behavior. We frame the mortal computation thesis through the Markov blanket formalism and the circular causality entailed by inference, learning, and selection. The ensuing framework -- underwritten by the free energy principle -- could prove useful for guiding the construction of unconventional connectionist computational systems, neuromorphic intelligence, and chimeric agents, including sentient organoids, which stand to revolutionize the long-term future of embodied, enactive artificial intelligence and cognition research.
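
    As background for the free energy framing, the standard variational free energy identity (general form, not specific to this review) is:

    ```latex
    F[q] \;=\; \mathbb{E}_{q(s)}\big[\ln q(s) - \ln p(s, o)\big]
         \;=\; D_{\mathrm{KL}}\big[q(s)\,\|\,p(s \mid o)\big] \;-\; \ln p(o)
    ```

    Minimizing F over beliefs q(s) about hidden states s drives approximate inference (the KL term) while bounding surprise, -ln p(o), the quantity a self-organizing system must keep low to persist within its Markov blanket.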

    Exploring Enhanced Motion Modeling Methods for Action Recognition

    This thesis aims to address three key issues in action recognition through enhanced motion modeling: handling complex motion variations, improving pseudo-label quality in semi-supervised settings, and incorporating explicit motion modeling into transformers. First, we propose to capture proper motion information, since motion dynamics such as moving tempo and action amplitude can vary considerably across video clips. To this end, we introduce a Motion Diversification and Selection (MoDS) module that generates diversified spatiotemporal motion features and selects the most appropriate motion representation for categorizing the input video. Second, we propose to improve pseudo-label quality in semi-supervised action recognition. Previous methods use only a single network to generate pseudo labels, but a single network is limited in capturing different motion patterns simultaneously. We therefore advocate jointly training a pair of heterogeneous networks, i.e., a 2D CNN and a 3D CNN, to characterize different motion patterns, and we utilize a label propagation strategy within and across these networks to refine the pseudo labels. Third, we propose to perform explicit motion modeling for transformers. We observe that transformer-based methods underperform on motion-sensitive datasets, indicating their limited capacity for temporal modeling. We also note that the conventional motion representation, the cost volume, is quite similar to the affinity matrix defined in self-attention, yet possesses powerful motion modeling capacity. We therefore examine the essential properties of the cost volume for effective motion modeling and integrate them into self-attention to enhance motion representation, as sketched below. We have conducted comprehensive experiments on widely used datasets to confirm the effectiveness of our proposed methods, which prove superior to other advanced methods under different scenarios.
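
    A hedged sketch of the observation behind the third contribution: a cost volume between adjacent frames has the same pairwise-matching form as a self-attention affinity matrix, differing mainly in normalization. Names, shapes, and the PyTorch code are illustrative assumptions, not the thesis's implementation.

    ```python
    import torch

    def affinity(frame_t, frame_t1, d_k):
        # frame_t, frame_t1: [n_patches, d] patch features of two frames.
        # Self-attention affinity: scaled dot product between queries and keys.
        return (frame_t @ frame_t1.T) / d_k ** 0.5       # [n, n]

    def cost_volume(frame_t, frame_t1):
        # Classical cost volume: correlation of L2-normalized features,
        # i.e. the same pairwise matrix under a different normalization.
        ft = torch.nn.functional.normalize(frame_t, dim=-1)
        ft1 = torch.nn.functional.normalize(frame_t1, dim=-1)
        return ft @ ft1.T                                 # [n, n]

    n, d = 49, 64
    x_t, x_t1 = torch.randn(n, d), torch.randn(n, d)
    A = affinity(x_t, x_t1, d)    # attention-style affinity
    C = cost_volume(x_t, x_t1)    # motion-style cost volume
    # Softmax over the matching dimension turns either matrix into a soft
    # correspondence map, which is how motion cues can be injected into
    # self-attention.
    corr = torch.softmax(C, dim=-1)
    ```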