Mind's Mirror: Distilling Self-Evaluation Capability and Comprehensive Thinking from Large Language Models
Large language models (LLMs) have achieved remarkable advancements in the
field of natural language processing. However, the sheer scale and
computational demands of these models present formidable challenges when
considering their practical deployment in resource-constrained contexts. While
techniques such as chain-of-thought (CoT) distillation have displayed promise
in distilling LLMs into small language models (SLMs), there is a risk that
distilled SLMs may still carry over flawed reasoning or hallucinations
inherited from their LLM counterparts. To address these issues, we propose a
twofold methodology: First, we introduce a novel method for distilling the
self-evaluation capability inherent in LLMs into SLMs, which aims to mitigate
the adverse effects of erroneous reasoning and reduce hallucinations. Second,
we advocate for a comprehensive distillation process that incorporates multiple
distinct chain-of-thought and self-evaluation paradigms and ensures a more
holistic and robust knowledge transfer into SLMs. Experiments on three NLP
benchmarks demonstrate that our method significantly improves the performance
of distilled SLMs and sheds light on the path towards developing smaller models
closely aligned with human cognition.
Comment: 13 pages, 5 figures
BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation
The upscaling of Large Language Models (LLMs) has yielded impressive advances
in natural language processing, yet it also poses significant deployment
challenges. Weight quantization has emerged as a widely embraced solution to
reduce memory and computational demands. This paper introduces BitDistiller, a
framework that synergizes Quantization-Aware Training (QAT) with Knowledge
Distillation (KD) to boost the performance of LLMs at ultra-low precisions
(sub-4-bit). Specifically, BitDistiller first incorporates a tailored
asymmetric quantization and clipping technique to maximally preserve the
fidelity of quantized weights, and then proposes a novel Confidence-Aware
Kullback-Leibler Divergence (CAKLD) objective, which is employed in a
self-distillation manner to enable faster convergence and superior model
performance. Empirical evaluations demonstrate that BitDistiller significantly
surpasses existing methods in both 3-bit and 2-bit configurations on general
language understanding and complex reasoning benchmarks. Notably, BitDistiller
is shown to be more cost-effective, requiring less data and fewer training
resources. The code is available at https://github.com/DD-DuDa/BitDistiller
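
To make the idea of asymmetric quantization with clipping concrete, here is a minimal sketch in plain Python. It is a generic affine (min/max) quantizer, not BitDistiller's tailored technique, which the abstract does not specify. The function names, the `clip_ratio` parameter, and the choice to shrink the range around its midpoint are all assumptions made for illustration.

```python
def quantize_asymmetric(weights, bits=3, clip_ratio=1.0):
    """Map floats to integer codes in [0, 2**bits - 1] on an affine grid.

    Hypothetical helper: `clip_ratio` < 1 shrinks the [min, max] range
    around its midpoint so that outliers are clipped before rounding.
    """
    levels = (1 << bits) - 1
    w_min, w_max = min(weights), max(weights)
    center = (w_max + w_min) / 2.0
    half = (w_max - w_min) / 2.0 * clip_ratio
    lo, hi = center - half, center + half
    if hi == lo:
        # Degenerate range: every weight maps to code 0.
        return [0] * len(weights), 1.0, lo
    scale = (hi - lo) / levels
    codes = []
    for w in weights:
        clipped = min(max(w, lo), hi)          # clip to the kept range
        code = round((clipped - lo) / scale)   # nearest grid point
        codes.append(min(max(code, 0), levels))
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Reconstruct approximate floats from integer codes."""
    return [lo + c * scale for c in codes]


if __name__ == "__main__":
    w = [-1.0, -0.2, 0.0, 0.3, 2.0]
    codes, scale, lo = quantize_asymmetric(w, bits=2)
    print(codes, dequantize(codes, scale, lo))
```

Because the grid is anchored at the clipped minimum rather than at zero, it can cover a skewed weight range without wasting codes on values that never occur. That is the usual motivation for asymmetric over symmetric quantization at very low bit widths.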
Mind the Trade-off: Debiasing NLU Models without Degrading the In-distribution Performance
Models for natural language understanding (NLU) tasks often rely on the
idiosyncratic biases of the dataset, which make them brittle against test cases
outside the training distribution. Several recently proposed debiasing methods
have been shown to be very effective in improving out-of-distribution performance.
However, their improvements come at the expense of a performance drop when models
are evaluated on the in-distribution data, which contain examples with higher
diversity. This seemingly inevitable trade-off may not tell us much about the
changes in the reasoning and understanding capabilities of the resulting models
on broader types of examples beyond the small subset represented in the
out-of-distribution data. In this paper, we address this trade-off by
introducing a novel debiasing method, called confidence regularization, which
discourages models from exploiting biases while enabling them to receive enough
incentive to learn from all the training examples. We evaluate our method on
three NLU tasks and show that, in contrast to its predecessors, it improves the
performance on out-of-distribution datasets (e.g., a 7pp gain on the HANS dataset)
while maintaining the original in-distribution accuracy.
Comment: to appear at ACL 202
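
The abstract above does not say how confidence regularization is computed, so the following is only one possible reading, sketched in plain Python. It assumes a teacher model's output distribution is flattened in proportion to a per-example bias score, and that a student is then trained against the flattened target. The scaling rule, the `bias_weight` parameter, and both function names are assumptions for illustration.

```python
import math


def soften(teacher_probs, bias_weight):
    """Flatten a probability distribution as `bias_weight` grows.

    Assumed rule: raise each probability to the power (1 - bias_weight)
    and renormalize. A weight of 0 leaves the distribution unchanged and
    a weight of 1 yields a uniform distribution.
    """
    exponent = 1.0 - bias_weight
    scaled = [p ** exponent for p in teacher_probs]
    total = sum(scaled)
    return [s / total for s in scaled]


def soft_cross_entropy(target_probs, student_probs, eps=1e-12):
    """Cross-entropy of the student's distribution against a soft target."""
    return -sum(
        t * math.log(max(s, eps))
        for t, s in zip(target_probs, student_probs)
    )


if __name__ == "__main__":
    teacher = [0.8, 0.1, 0.1]
    student = [0.7, 0.2, 0.1]
    for beta in (0.0, 0.5, 1.0):
        target = soften(teacher, beta)
        print(beta, target, soft_cross_entropy(target, student))
```

Under this reading, every training example still contributes a learning signal. An example that a bias detector flags as easy simply rewards the student less for being overconfident on it.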