
Journal of Graphics ›› 2026, Vol. 47 ›› Issue (1): 39-46. DOI: 10.11996/JG.j.2095-302X.2026010039

• Image Processing and Computer Vision •

A mixed-precision quantization method for large language models via memory alignment

LI Zhangming1, GUAN Weifan1, CHANG Zhengwei2, ZHANG Linghao2, HU Qinghao1

  1. The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
  2. State Grid Sichuan Electric Power Company, Chengdu, Sichuan 610041, China
  • Received: 2025-06-10 Accepted: 2025-10-11 Online: 2026-02-28 Published: 2026-03-16
  • Contact: HU Qinghao
  • Supported by:
    Science and Technology Project of State Grid Corporation of China(5700-202426249A-1-1-ZN)

Abstract:

As large models continue to grow in scale, the memory footprint and computational overhead of model inference have become critical challenges. Mixed-precision quantization is an effective approach to reducing resource consumption, but existing methods suffer from insufficient outlier handling, significant quantization accuracy loss, and inefficient memory access. To address these issues, a memory-aligned mixed-precision quantization method for large models was proposed. First, weights were divided into SIMD-aligned groups, and outlier groups were identified via group-wise significance analysis, with high-significance groups quantized to 8 bits and the others to 2 bits. A block-wise compensation strategy was introduced to mitigate the accuracy degradation caused by 2-bit quantization. Furthermore, an efficient packing and storage scheme was designed for the mixed-precision weights, in which a bitmap records the bit width of each data block, enabling random access. Experimental results demonstrated that the proposed method significantly reduced memory usage and improved computational efficiency while maintaining model accuracy. Specifically, on Llama2-7B/13B/70B, the approach achieved perplexity reductions of 8.13/2.84/1.37 on WikiText-2 and 5.80 on C4 relative to state-of-the-art baselines. The quantized 70B model reduced weight storage by approximately 87% compared with BF16. Across seven QA benchmarks, an average accuracy gain of 6.24% was achieved. Overall, these results indicated that a memory-aligned mixed-precision quantization method for large language models can simultaneously improve compression ratio, memory-access efficiency, and overall model performance.
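The pipeline described above (group-wise significance analysis, 8-bit/2-bit assignment, and a per-group bitmap for random access) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the group size (8), the significance proxy (maximum absolute weight per group), the high-precision fraction, and symmetric uniform quantization are all assumptions made here for illustration; the block-wise compensation strategy and the packed storage layout are omitted.

```python
import numpy as np

def quantize_group(g, bits):
    # Symmetric uniform quantization of one weight group to the given bit width.
    qmax = 2 ** (bits - 1) - 1
    amax = np.abs(g).max()
    scale = amax / qmax if amax > 0 else 1.0
    q = np.clip(np.round(g / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def mixed_precision_quantize(w, group_size=8, hi_frac=0.1):
    # Split weights into SIMD-aligned groups, rank groups by a simple
    # significance proxy (max |w| per group, an assumption), keep the top
    # `hi_frac` fraction at 8 bits and quantize the rest to 2 bits.
    # A boolean bitmap records each group's bit width, so any block's
    # precision can be looked up in O(1) (random access).
    assert w.size % group_size == 0
    groups = w.reshape(-1, group_size)
    significance = np.abs(groups).max(axis=1)
    n_hi = max(1, int(hi_frac * len(groups)))
    bitmap = np.zeros(len(groups), dtype=bool)
    bitmap[np.argsort(significance)[-n_hi:]] = True  # True -> 8-bit group
    quantized, scales = [], []
    for keep_hi, g in zip(bitmap, groups):
        q, s = quantize_group(g, 8 if keep_hi else 2)
        quantized.append(q)
        scales.append(s)
    return np.array(quantized), np.array(scales, dtype=np.float32), bitmap

def dequantize(q, scales):
    # Reconstruct approximate weights from integer codes and per-group scales.
    return q.astype(np.float32) * scales[:, None]
```

In a real kernel the 2-bit and 8-bit codes would be bit-packed into aligned blocks rather than stored as `int8`, with the bitmap consulted at decode time to select the unpacking routine.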

Key words: large language model compression, post-training quantization, low-bit quantization, mixed-precision quantization, outlier extraction
