Compressing Large Language Models (LLMs) using Knowledge Distillation for Optimizing Inference Time and Model Size

Abstract

Large Language Models (LLMs) contain vast numbers of parameters and are correspondingly large on disk. For instance, the DeepSeek-V3 model comprises approximately 671 billion parameters and has a file size of up to 720 GB. The sheer number of parameters in LLMs reflects their high complexity, which can be both an advantage and a drawback, particularly when the models are deployed in environments with limited computational resources. This study compresses a custom-built lightweight model by applying knowledge distillation from LLMs. The results indicate that the model's parameter count can be reduced by up to 94.18%, its file size by up to 71.00%, and its inference time by up to 1.13%. Notably, despite these reductions, the model remains capable of performing specialized tasks with satisfactory accuracy. This finding underscores the potential of knowledge distillation as an effective method for reducing model size while maintaining operational efficiency, particularly in scenarios where a model's computational demands are mismatched with the available resources. Efficiency in knowledge distillation is achieved through a combination of model size reduction and the alignment of computational capacity with task-specific requirements.
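
For readers unfamiliar with the technique, the sketch below illustrates the standard soft-target distillation loss (in the style of Hinton et al.) that underlies this kind of compression. It is a minimal illustration assuming PyTorch; the tensor names, temperature, and weighting are hypothetical placeholders and do not describe the paper's actual training setup.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target KL term (teacher -> student) with the usual
    hard-label cross-entropy term. Names and defaults are illustrative."""
    # Soften both distributions with the temperature, then compare them.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd_term = F.kl_div(soft_student, soft_teacher, reduction="batchmean")
    kd_term = kd_term * (temperature ** 2)  # rescale to keep gradient magnitudes comparable

    # Standard supervised loss on the ground-truth labels.
    ce_term = F.cross_entropy(student_logits, labels)

    return alpha * kd_term + (1.0 - alpha) * ce_term

# Example usage with random tensors standing in for a batch of token logits.
batch, vocab = 4, 32000
student_logits = torch.randn(batch, vocab)
teacher_logits = torch.randn(batch, vocab)
labels = torch.randint(0, vocab, (batch,))
loss = distillation_loss(student_logits, teacher_logits, labels)

In practice, the teacher logits would come from the large pretrained LLM and the student logits from the smaller custom model being trained, so only the student's parameters are updated by this loss.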

Licence: https://creativecommons.org/licenses/by/4.0