TY - JOUR
T1 - Compressing Large Language Models using Low Rank and Low Precision Decomposition
AU - Saha, Rajarshi
AU - Sagan, Naomi
AU - Srivastava, Varun
AU - Goldsmith, Andrea J.
AU - Pilanci, Mert
N1 - Publisher Copyright:
© 2024 Neural information processing systems foundation. All rights reserved.
PY - 2024
Y1 - 2024
AB - The prohibitive sizes of Large Language Models (LLMs) today make it difficult to deploy them on memory-constrained edge devices. This work introduces CALDERA, a new post-training LLM compression algorithm that harnesses the inherent low-rank structure of a weight matrix W by approximating it via a low-rank, low-precision decomposition as W ≈ Q + LR. Here, L and R are low-rank factors, and the entries of Q, L, and R are quantized. The model is compressed by substituting each layer with its Q + LR decomposition, and the zero-shot performance of the compressed model is evaluated. Additionally, L and R are readily amenable to low-rank adaptation, which further enhances zero-shot performance. CALDERA obtains this decomposition by formulating it as the optimization problem min_{Q, L, R} ∥(Q + LR − W)Xᵀ∥²_F, where X is the calibration data and Q, L, R are constrained to be representable in low-precision formats. Theoretical upper bounds on the approximation error of CALDERA are established using a rank-constrained regression framework, and the tradeoff between compression ratio and model performance is studied by analyzing the impact of target rank and quantization bit budget. Results show that LlaMa-2 7B/13B/70B and LlaMa-3 8B models compressed using CALDERA outperform existing post-training LLM compression techniques in the regime of less than 2.5 bits per parameter. The implementation is available at: https://github.com/pilancilab/caldera.
UR - http://www.scopus.com/inward/record.url?scp=105000491480&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105000491480&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:105000491480
SN - 1049-5258
VL - 37
JO - Advances in Neural Information Processing Systems
JF - Advances in Neural Information Processing Systems
T2 - 38th Conference on Neural Information Processing Systems, NeurIPS 2024
Y2 - 9 December 2024 through 15 December 2024
ER -
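
Below is a minimal, illustrative Python sketch of the Q + LR idea described in the abstract above. It is not CALDERA itself: the paper's algorithm minimizes the data-aware objective min_{Q, L, R} ∥(Q + LR − W)Xᵀ∥²_F against calibration data X, whereas this toy version ignores X, uses a naive uniform quantizer, and picks an arbitrary rank and bit budgets. The function names and parameters (uniform_quantize, q_plus_lr, rank, q_bits, factor_bits) are assumptions made for illustration only.

import numpy as np

def uniform_quantize(M, bits):
    # Naive uniform quantizer: snap the entries of M onto 2**bits evenly
    # spaced levels between M.min() and M.max(). CALDERA uses a stronger,
    # data-aware quantizer; this stand-in only illustrates low precision.
    levels = 2 ** bits - 1
    scale = (M.max() - M.min()) / max(levels, 1)
    if scale == 0:
        return M.copy()
    return np.round((M - M.min()) / scale) * scale + M.min()

def q_plus_lr(W, rank=64, q_bits=2, factor_bits=4, n_iters=5):
    # Alternate between (i) a truncated-SVD low-rank fit of the residual
    # W - Q and (ii) quantizing the remainder W - L @ R, so that in the
    # end W ≈ Q + L @ R with Q, L, R all stored in low precision.
    Q = np.zeros_like(W)
    L = R = None
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(W - Q, full_matrices=False)
        L = uniform_quantize(U[:, :rank] * s[:rank], factor_bits)
        R = uniform_quantize(Vt[:rank, :], factor_bits)
        Q = uniform_quantize(W - L @ R, q_bits)
    return Q, L, R

# Usage: relative Frobenius error of the approximation on a random matrix.
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))
Q, L, R = q_plus_lr(W)
print(np.linalg.norm(Q + L @ R - W) / np.linalg.norm(W))

In the paper, the plain SVD step is replaced by a rank-constrained regression against the calibration data X, which is the framework under which the abstract's approximation-error bounds are derived.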