ACMo: Angle-Calibrated Moment Methods for Stochastic Optimization
Due to its simplicity and outstanding ability to generalize, stochastic
gradient descent (SGD) is still the most widely used optimization method
despite its slow convergence. Meanwhile, adaptive methods have attracted
rising attention from the optimization and machine learning communities,
both for leveraging life-long gradient information and for their profound
mathematical foundations. Taking the best of both worlds is one of the most
exciting and challenging questions in optimization for machine learning.
Along this line, we revisit existing adaptive gradient methods from a novel
perspective that refreshes the understanding of second moments. This
perspective allows us to attach the properties of second moments to the
first-moment iteration and to propose a novel first-moment optimizer, the
Angle-Calibrated Moment method (ACMo). Our theoretical results show that
ACMo achieves the same convergence rate as mainstream adaptive methods.
Furthermore, extensive experiments on CV and NLP tasks demonstrate that ACMo
converges comparably to state-of-the-art Adam-type optimizers and achieves
better generalization in most cases.

Comment: 25 pages, 4 figures
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Transfer learning, where a model is first pre-trained on a data-rich task
before being fine-tuned on a downstream task, has emerged as a powerful
technique in natural language processing (NLP). The effectiveness of transfer
learning has given rise to a diversity of approaches, methodology, and
practice. In this paper, we explore the landscape of transfer learning
techniques for NLP by introducing a unified framework that converts all
text-based language problems into a text-to-text format. Our systematic study
compares pre-training objectives, architectures, unlabeled data sets, transfer
approaches, and other factors on dozens of language understanding tasks. By
combining the insights from our exploration with scale and our new "Colossal
Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks
covering summarization, question answering, text classification, and more. To
facilitate future work on transfer learning for NLP, we release our data set,
pre-trained models, and code.

Comment: Final version as published in JMLR
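
To make the text-to-text format concrete: every task, whether translation,
classification, or summarization, is posed as mapping an input string to an
output string, with a short text prefix identifying the task. The snippet
below is a minimal sketch using the Hugging Face transformers port of the
released T5 checkpoints; the library, checkpoint name, and generation
settings are assumptions for illustration, not the paper's own code release.

    # pip install transformers sentencepiece
    from transformers import T5Tokenizer, T5ForConditionalGeneration

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # Tasks are distinguished only by a text prefix on the input string.
    inputs = tokenizer(
        "translate English to German: The house is wonderful.",
        return_tensors="pt",
    )
    outputs = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Changing the prefix (e.g. "summarize:" or "cola sentence:") switches tasks
without changing the model, loss, or decoding procedure, which is what
enables the paper's controlled comparisons across objectives, architectures,
and data sets.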