Compressing Large Language Models using Low Rank and Low Precision Decomposition

Rajarshi Saha, Naomi Sagan, Varun Srivastava, Andrea J. Goldsmith, Mert Pilanci

Research output: Contribution to journal › Conference article › peer-review

Abstract

The prohibitive sizes of Large Language Models (LLMs) today make it difficult to deploy them on memory-constrained edge devices. This work introduces CALDERA, a new post-training LLM compression algorithm that harnesses the inherent low-rank structure of a weight matrix W by approximating it via a low-rank, low-precision decomposition as W ≈ Q + LR. Here, L and R are low rank factors, and the entries of Q, L and R are quantized. The model is compressed by substituting each layer with its Q + LR decomposition, and the zero-shot performance of the compressed model is evaluated. Additionally, L and R are readily amenable to low-rank adaptation, consequently enhancing the zero-shot performance. CALDERA obtains this decomposition by formulating it as an optimization problem $\min_{Q, L, R} \|(Q + LR - W)X^\top\|_F^2$, where X is the calibration data, and Q, L, R are constrained to be representable using low-precision formats. Theoretical upper bounds on the approximation error of CALDERA are established using a rank-constrained regression framework, and the tradeoff between compression ratio and model performance is studied by analyzing the impact of target rank and quantization bit budget. Results illustrate that compressing LlaMa-2 7B/13B/70B and LlaMa-3 8B models using CALDERA outperforms existing post-training LLM compression techniques in the regime of less than 2.5 bits per parameter. The implementation is available at: https://github.com/pilancilab/caldera.
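
To make the W ≈ Q + LR idea concrete, the following is a minimal NumPy sketch, not the authors' implementation (see the repository linked above for that). It assumes a simple min-max uniform quantizer and a naive alternating scheme whose low-rank step is a plain truncated SVD that ignores the calibration data X; the function names (`uniform_quantize`, `objective`, `decompose`, `bits_per_parameter`), the matrix sizes, and the bit budgets are all hypothetical choices made for illustration. The bit count also ignores any storage overhead such as quantization scales.

```python
import numpy as np


def uniform_quantize(A, bits):
    """Round A onto a uniform grid of 2**bits levels spanning [A.min(), A.max()].

    Assumption: a stand-in quantizer, chosen only for simplicity.
    """
    lo, hi = A.min(), A.max()
    if hi == lo:
        return A.copy()
    scale = (hi - lo) / (2 ** bits - 1)
    return np.round((A - lo) / scale) * scale + lo


def objective(W, Q, L, R, X):
    """Calibration-weighted error ||(Q + L R - W) X^T||_F^2 from the abstract.

    W, Q are (n, d); L is (n, k); R is (k, d); X is (m, d) with one sample per row.
    """
    return np.linalg.norm((Q + L @ R - W) @ X.T, "fro") ** 2


def decompose(W, rank, bits_q, bits_lr, iters=10):
    """Alternate between quantizing the residual into Q and refitting L, R.

    Assumption: the low-rank step is an unweighted truncated SVD of W - Q.
    """
    n, d = W.shape
    L = np.zeros((n, rank))
    R = np.zeros((rank, d))
    for _ in range(iters):
        Q = uniform_quantize(W - L @ R, bits_q)
        U, s, Vt = np.linalg.svd(W - Q, full_matrices=False)
        root = np.sqrt(s[:rank])
        L = uniform_quantize(U[:, :rank] * root, bits_lr)
        R = uniform_quantize(root[:, None] * Vt[:rank], bits_lr)
    return Q, L, R


def bits_per_parameter(n, d, rank, bits_q, bits_lr):
    """Average bits per original weight for Q (n x d) plus L (n x k) and R (k x d)."""
    return (n * d * bits_q + rank * (n + d) * bits_lr) / (n * d)


rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))   # hypothetical weight matrix
X = rng.standard_normal((32, 48))   # hypothetical calibration samples
Q, L, R = decompose(W, rank=8, bits_q=2, bits_lr=4)
print("objective:", objective(W, Q, L, R, X))
print("bits/param:", bits_per_parameter(4096, 4096, 128, 2, 4))  # 2.25
```

The last line illustrates the rank versus bit-budget tradeoff mentioned in the abstract: for a hypothetical 4096 × 4096 layer, a 2-bit Q with rank-128 factors at 4 bits costs 2.25 bits per parameter on average, and raising the rank or the factor precision raises that figure.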

Original language: English (US)
Journal: Advances in Neural Information Processing Systems
Volume: 37
State: Published - 2024
Event: 38th Conference on Neural Information Processing Systems, NeurIPS 2024 - Vancouver, Canada
Duration: Dec 9 2024 to Dec 15 2024

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Information Systems
  • Signal Processing
