As the size of large language models (LLMs) continues to grow, model
compression without sacrificing accuracy has become a crucial challenge for
deployment. While some quantization methods, such as GPTQ, have made progress
in achieving acceptable 4-bit weight-only quantization, attempts at lower-bit
quantization often result in severe performance degradation. In this paper, we
introduce a technique called norm tweaking, which can be plugged into existing post-training quantization (PTQ) methods to achieve high precision at low cost. Our
approach is inspired by the observation that rectifying the quantized
activation distribution to match its float counterpart can readily restore
accuracy for LLMs. To achieve this, we carefully design a tweaking strategy
that includes calibration data generation and a channel-wise distance constraint
to update the weights of normalization layers for better generalization. We
conduct extensive experiments on various datasets using several open-source
LLMs. Our method demonstrates significant improvements in both weight-only
quantization and joint quantization of weights and activations, surpassing
existing PTQ methods. On GLM-130B and OPT-66B, our method even achieves the same level of accuracy at 2-bit quantization as their float counterparts. Our simple and effective approach makes low-bit LLM quantization more practical for real-world applications.
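
To make the idea concrete, below is a minimal, hedged sketch of such a tweaking loop in PyTorch. It assumes fake-quantized weights stored as float tensors; the names float_block, quant_block, and calib_inputs are hypothetical placeholders for a float transformer block, its quantized counterpart, and self-generated calibration data. The loss shown, matching per-channel mean and standard deviation, is one plausible form of a channel-wise distance constraint, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn as nn


def channel_wise_distance(q_out: torch.Tensor, f_out: torch.Tensor) -> torch.Tensor:
    """Gap between per-channel statistics of quantized and float outputs.

    Both tensors are assumed to be shaped [batch, seq_len, hidden]; the last
    dimension is treated as the channel dimension.
    """
    mean_gap = (q_out.mean(dim=(0, 1)) - f_out.mean(dim=(0, 1))).abs()
    std_gap = (q_out.std(dim=(0, 1)) - f_out.std(dim=(0, 1))).abs()
    return (mean_gap + std_gap).sum()


def tweak_block_norms(float_block: nn.Module,
                      quant_block: nn.Module,
                      calib_inputs: torch.Tensor,
                      iters: int = 10,
                      lr: float = 1e-5) -> None:
    """Update only the LayerNorm parameters of a quantized block so that its
    output distribution matches the float block's output, channel by channel."""
    # Freeze everything, then re-enable gradients for normalization layers only;
    # the quantized linear weights stay fixed throughout.
    for p in quant_block.parameters():
        p.requires_grad_(False)
    norm_params = [p for m in quant_block.modules()
                   if isinstance(m, nn.LayerNorm)
                   for p in m.parameters()]
    for p in norm_params:
        p.requires_grad_(True)

    optimizer = torch.optim.Adam(norm_params, lr=lr)
    with torch.no_grad():
        target = float_block(calib_inputs)  # float reference activations

    for _ in range(iters):
        optimizer.zero_grad()
        loss = channel_wise_distance(quant_block(calib_inputs), target)
        loss.backward()
        optimizer.step()
```

In practice this loop would be applied block by block after weight quantization, with only a handful of calibration samples and iterations, which is what keeps the procedure inexpensive relative to quantization-aware training.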