INTERPRETING THE EFFECT OF QUANTIZATION ON LLMS
Date
2024-12-12
Abstract
Recent advancements in large language models (LLMs) have led to unprecedented model sizes, creating challenges for deployment in resource-constrained environments. Quantization offers a promising solution by reducing weight precision, thereby decreasing memory footprint and computational requirements while potentially maintaining model performance. However, the reliable deployment of quantized LLMs requires an understanding of how quantization affects their internal representations and overall behavior.
In this research, we use a range of interpretation techniques to explore the effects of quantization on model and neuron behavior. We investigate the Phi-2 and Llama-2-7b models under 4-bit and 8-bit quantization, using the BoolQ and Jigsaw Toxicity datasets.
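For orientation, a setup of this kind can be reproduced with Hugging Face Transformers and bitsandbytes; the snippet below is a minimal sketch only, and the model identifier and quantization options shown are assumptions rather than the exact configuration used in this work.

# Minimal sketch (assumed setup): load a causal LM in 4-bit precision.
# Swap load_in_4bit for load_in_8bit to obtain the 8-bit variant.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/phi-2"  # or "meta-llama/Llama-2-7b-hf"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # assumed compute dtype
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)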
Our findings reveal several important insights. First, 4-bit quantized models exhibit slightly better calibration than their 8-bit and 16-bit counterparts. Second, our analysis of neuron activations indicates that the number of dead neurons, i.e., those with activation values close to 0 across the dataset, remains consistent regardless of quantization. Regarding salient neurons, we observe that full-precision models have fewer contributing neurons overall. The effect of quantization on neuron redundancy varies across models: Llama-2-7b shows minimal variation in redundancy across quantization levels, whereas Phi-2 exhibits higher redundancy at full precision than in its quantized variants. Finally, our investigation into human-level interpretation demonstrates that the learning pattern of salient neurons remains consistent under various quantization conditions.
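The dead-neuron criterion above (activations staying close to 0 across the dataset) can be operationalized with forward hooks in PyTorch. The sketch below is illustrative rather than the thesis code; the threshold value, the choice of hooked modules, and the helper names are assumptions.

# Minimal sketch (assumptions noted above): count neurons whose activation
# magnitude never exceeds a small threshold over an entire dataset.
import torch

THRESHOLD = 1e-3  # assumed cutoff for "close to 0"

@torch.no_grad()
def count_dead_neurons(model, dataloader, layer_modules):
    """layer_modules: iterable of (name, module) pairs, e.g. MLP activation
    layers selected from model.named_modules()."""
    max_abs = {}   # layer name -> running max |activation| per neuron
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            # output: (batch, seq_len, hidden); reduce over batch and sequence
            cur = output.detach().abs().amax(dim=(0, 1))
            prev = max_abs.get(name)
            max_abs[name] = cur if prev is None else torch.maximum(prev, cur)
        return hook

    for name, module in layer_modules:
        handles.append(module.register_forward_hook(make_hook(name)))

    for batch in dataloader:
        model(**batch)  # batch assumed to be a dict of tensors on the model's device

    for h in handles:
        h.remove()

    # A neuron is "dead" if its activation never exceeded the threshold.
    return {name: int((m < THRESHOLD).sum()) for name, m in max_abs.items()}

Running this on the full-precision and quantized variants of the same model gives per-layer dead-neuron counts that can be compared directly across precision levels.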
These findings suggest that quantization is a viable approach for the efficient and reliable deployment of LLMs in resource-constrained environments.
Keywords
Artificial Intelligence, Deep Learning, LLMs, Quantization