To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis
Recent research has highlighted the importance of dataset size in scaling
language models. However, large language models (LLMs) are notoriously
token-hungry during pre-training, and high-quality text data on the web is
approaching its scaling limit for LLMs. To further enhance LLMs, a
straightforward approach is to repeat the pre-training data for additional
epochs. In this study, we empirically investigate three key aspects of this
approach. First, we explore the consequences of repeating pre-training data,
revealing that the model is susceptible to overfitting, leading to multi-epoch
degradation. Second, we examine the key factors contributing to multi-epoch
degradation, finding that significant factors include dataset size, model
parameters, and training objectives, while less influential factors consist of
dataset quality and model FLOPs. Finally, we explore whether widely used
regularization can alleviate multi-epoch degradation. Most regularization
techniques do not yield significant improvements, except for dropout, which
demonstrates remarkable effectiveness but requires careful tuning when scaling
up the model size. Additionally, we discover that a mixture-of-experts (MoE)
model can serve as a cost-effective and efficient proxy for tuning the
hyper-parameters of computationally intensive dense LLMs with a comparable
number of trainable parameters, potentially impacting efficient LLM
development on a broader scale.
Comment: Accepted at NeurIPS 2023
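To make the multi-epoch setting concrete, below is a minimal sketch (not the authors' code; the model and data loader are placeholders) of repeating a fixed pre-training corpus for several epochs with dropout enabled, the one regularizer the paper finds effective against multi-epoch degradation:

```python
import torch
import torch.nn as nn

def pretrain(model: nn.Module, loader, epochs: int = 4,
             lr: float = 3e-4, device: str = "cuda"):
    # Repeat the same pre-training data for `epochs` passes -- the setting
    # the paper studies, which tends to overfit without regularization.
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.to(device)
    model.train()  # keeps dropout layers active; per the paper, dropout
                   # counters multi-epoch degradation but needs careful
                   # tuning as the model is scaled up
    for _ in range(epochs):
        for input_ids, labels in loader:
            input_ids, labels = input_ids.to(device), labels.to(device)
            logits = model(input_ids)  # assumes the model returns raw logits
            loss = nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), labels.view(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
```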
Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline
Large language models (LLMs) have revolutionized the field of AI,
demonstrating unprecedented capacity across various tasks. However, the
inference process for LLMs comes with significant computational costs. In this
paper, we propose an efficient LLM inference pipeline that harnesses the power
of LLMs. Our approach begins by exploiting the ability of LLMs to
accurately perceive and predict their own response lengths with minimal
overhead. By
leveraging this information, we introduce an efficient sequence scheduling
technique that groups queries with similar response lengths into micro-batches.
We evaluate our approach on real-world instruction datasets using a
LLaMA-based model, and our results demonstrate an impressive 86% improvement in
inference throughput without compromising effectiveness. Notably, our method is
orthogonal to other inference acceleration techniques, making it a valuable
addition to many existing toolkits (e.g., FlashAttention, Quantization) for LLM
inference.
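To illustrate the scheduling step, here is a minimal sketch, assuming a hypothetical `predict_length` callable that returns the model's own estimate of its response length for a query; queries are sorted by that estimate and grouped into micro-batches so each batch finishes decoding at roughly the same time:

```python
from typing import Callable, List

def schedule(queries: List[str],
             predict_length: Callable[[str], int],
             micro_batch_size: int = 8) -> List[List[str]]:
    # Sort by predicted response length, then slice consecutive runs into
    # micro-batches: queries in the same batch need a similar number of
    # decoding steps, so little compute is wasted on stragglers.
    ordered = sorted(queries, key=predict_length)
    return [ordered[i:i + micro_batch_size]
            for i in range(0, len(ordered), micro_batch_size)]
```

Because this only reorders and groups requests, it composes naturally with kernel-level optimizations such as FlashAttention or quantization.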
CowClip: Reducing CTR Prediction Model Training Time from 12 hours to 10 minutes on 1 GPU
The click-through rate (CTR) prediction task is to predict whether a user
will click on the recommended item. As mind-boggling amounts of data are
produced online daily, accelerating CTR prediction model training is critical
to ensuring an up-to-date model and reducing the training cost. One approach to
increase the training speed is to apply large-batch training. However, as
shown in computer vision and natural language processing tasks, training with
a large batch often suffers from a loss of accuracy. Our experiments show that
previous scaling rules fail in the training of CTR prediction neural networks.
To tackle this problem, we first show theoretically that the differing
frequencies of IDs make it challenging to scale hyperparameters when scaling
the batch size. To stabilize the training process in a large-batch setting, we
develop adaptive Column-wise Clipping (CowClip). It enables an easy and
effective scaling rule for the embeddings, which keeps the learning rate
unchanged and scales the L2 loss. We conduct extensive experiments with four
CTR prediction networks on two real-world datasets and successfully scale the
batch size to 128 times the original without accuracy loss. In particular, for CTR
prediction model DeepFM training on the Criteo dataset, our optimization
framework enlarges the batch size from 1K to 128K with over 0.1% AUC
improvement and reduces training time from 12 hours to 10 minutes on a single
V100 GPU. Our code is available at https://github.com/bytedance/LargeBatchCTR.
Comment: arXiv admin note: text overlap with arXiv:2201.1089
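For intuition, here is a simplified sketch in the spirit of column-wise clipping (the exact rule and constants are given in the paper and repository; `zeta` is an illustrative ratio, not a value from the paper): each ID's embedding gradient is clipped relative to the norm of that ID's current embedding, so rarely seen IDs with small embeddings receive proportionally small updates even at large batch sizes:

```python
import torch

@torch.no_grad()
def cow_clip_like(weight: torch.Tensor, grad: torch.Tensor,
                  zeta: float = 1.0, eps: float = 1e-8) -> torch.Tensor:
    # weight, grad: (num_ids, embed_dim) embedding table and its gradient.
    # Cap each row's gradient norm at zeta * that row's parameter norm.
    g_norm = grad.norm(dim=1, keepdim=True)
    w_norm = weight.norm(dim=1, keepdim=True)
    scale = torch.clamp(zeta * w_norm / (g_norm + eps), max=1.0)
    return grad * scale

# Usage (hypothetical): before optimizer.step(), replace the embedding
# gradient with cow_clip_like(emb.weight, emb.weight.grad).
```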