
Journal of Graphics (图学学报) ›› 2026, Vol. 47 ›› Issue (1): 39-46. DOI: 10.11996/JG.j.2095-302X.2026010039

• Image Processing and Computer Vision •

A mixed-precision quantization method for large language models via memory alignment

LI Zhangming1, GUAN Weifan1, CHANG Zhengwei2, ZHANG Linghao2, HU Qinghao1

  1. The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
    2. State Grid Sichuan Electric Power Company, Chengdu 610041, Sichuan, China
  • Received: 2025-06-10  Accepted: 2025-10-11  Published: 2026-02-28  Online: 2026-03-16
  • Corresponding author: HU Qinghao, E-mail: huqinghao2014@ia.ac.cn
  • Supported by:
    Science and Technology Project of State Grid Corporation of China (5700-202426249A-1-1-ZN)

A mixed-precision quantization method for large language models via memory alignment

LI Zhangming1, GUAN Weifan1, CHANG Zhengwei2, ZHANG Linghao2, HU Qinghao1

  1. The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
    2. State Grid Sichuan Electric Power Company, Chengdu 610041, Sichuan, China
  • Received: 2025-06-10  Accepted: 2025-10-11  Published: 2026-02-28  Online: 2026-03-16
  • Supported by:
    Science and Technology Project of State Grid Corporation of China (5700-202426249A-1-1-ZN)

Abstract:

As large language models continue to grow in scale, the memory footprint and computational overhead of model inference have become major challenges. Model quantization is an effective way to reduce resource consumption, but existing weight-quantization methods suffer from insufficient outlier handling, significant quantization accuracy loss, and inefficient memory access. To address these problems, a memory-aligned mixed-precision quantization method for large models is proposed, which represents model parameters with quantized values of different bit widths, reducing model storage while mitigating the accuracy loss caused by quantization. Specifically, weight outliers are identified through group-wise significance analysis: model parameters are partitioned into groups aligned to single-instruction multiple-data (SIMD) units, and each group is quantized to 8-bit or 2-bit according to its significance. To counter the accuracy loss that 2-bit quantization may introduce, a block-wise quantization compensation strategy is adopted. In addition, an efficient packing and storage scheme for mixed-precision weights is designed, in which a bitmap records the bit-width type of each data block and supports random access. Experimental results show that the method significantly reduces memory usage and improves computational efficiency while preserving model accuracy. Validated on Llama2-7B, 13B, and 70B, it lowers perplexity (PPL) on the WikiText2 and C4 datasets by 8.13, 2.84, 1.37, and 5.80 relative to state-of-the-art methods, and the quantized 70B model reduces weight storage by about 87% compared with BF16. In addition, average accuracy on seven QA datasets improves by 6.24%. These results indicate that memory-aligned mixed-precision quantization can simultaneously improve compression ratio, memory-access efficiency, and model performance.
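The grouping-and-assignment step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the group size, the max-|w| significance score, and the fraction of 8-bit groups (`outlier_frac`) are all assumptions made for the example.

```python
import numpy as np

def quantize_mixed_precision(w, group_size=16, outlier_frac=0.1):
    """Sketch: split weights into SIMD-aligned groups, score each group's
    significance, then quantize the most significant groups to 8-bit and
    the rest to 2-bit with a symmetric per-group scale (all assumptions)."""
    assert w.size % group_size == 0, "weights must tile into aligned groups"
    groups = w.reshape(-1, group_size).astype(np.float32)
    # Significance proxy: max absolute value per group (the paper's
    # actual criterion may differ).
    score = np.abs(groups).max(axis=1)
    n_hi = max(1, int(np.ceil(outlier_frac * len(groups))))
    bits = np.full(len(groups), 2)
    bits[np.argsort(score)[-n_hi:]] = 8      # outlier groups get 8-bit
    deq = np.empty_like(groups)
    for i, g in enumerate(groups):
        qmax = 2 ** (bits[i] - 1) - 1        # 127 for 8-bit, 1 for 2-bit
        amax = np.abs(g).max()
        scale = amax / qmax if amax > 0 else 1.0
        q = np.clip(np.round(g / scale), -qmax, qmax)
        deq[i] = q * scale                   # dequantized group
    return deq.reshape(w.shape), bits
```

Because each group's length matches a SIMD unit, a kernel can load one group per vector register; the per-group `bits` array is what the packing scheme later encodes compactly.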

Key words: large language model compression, post-training quantization, low-bit quantization, mixed-precision quantization, outlier partitioning

Abstract:

As large models continue to grow in scale, the memory footprint and computational overhead of model inference have become critical challenges. Model quantization is an effective approach to reducing resource consumption, but existing methods suffer from insufficient outlier handling, significant quantization accuracy loss, and inefficient memory access. To address these issues, a memory-aligned mixed-precision quantization method for large models was proposed. First, weights were divided into SIMD-aligned groups, and outlier groups were identified via group-wise significance analysis, with high-significance groups quantized to 8-bit and the others to 2-bit. A block-wise compensation strategy was introduced to mitigate the accuracy degradation caused by 2-bit quantization. Furthermore, an efficient packing and storage scheme was designed for mixed-precision weights, in which a bitmap recorded the bit width of each data block, enabling random access. Experimental results demonstrated that the proposed method significantly reduced memory usage and improved computational efficiency while maintaining model accuracy. Specifically, on Llama2-7B/13B/70B, the approach achieved perplexity reductions of 8.13/2.84/1.37 on WikiText-2 and 5.80 on C4 relative to state-of-the-art baselines. The quantized 70B model reduced weight storage by approximately 87% compared with BF16, and an average accuracy gain of 6.24% was achieved across seven QA benchmarks. These results indicated that mixed-precision quantization via memory alignment could simultaneously improve compression ratio, memory-access efficiency, and overall model performance.
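The bitmap bookkeeping described above can be illustrated with a small sketch. The block size, the per-block byte counts, and the function names are assumptions for the example; the key idea is that one bit per block records its width, and a partial sum over the bitmap yields any block's byte offset without scanning the packed buffer.

```python
import numpy as np

def pack_blocks(block_bits, hi_bytes=16, lo_bytes=4):
    """Sketch: build the bitmap (True = 8-bit block, False = 2-bit block)
    and the byte offset of every block in the packed buffer. Byte counts
    assume 16-element blocks: 16 B at 8-bit, 4 B at 2-bit (assumption)."""
    bitmap = np.asarray(block_bits) == 8
    bytes_per_block = np.where(bitmap, hi_bytes, lo_bytes)
    offsets = np.concatenate(([0], np.cumsum(bytes_per_block)[:-1]))
    return bitmap, offsets

def block_offset(bitmap, k, hi_bytes=16, lo_bytes=4):
    """Random access: recover block k's offset from the bitmap alone by
    counting how many 8-bit blocks precede it."""
    n_hi = int(bitmap[:k].sum())
    return n_hi * hi_bytes + (k - n_hi) * lo_bytes
```

In a real kernel the prefix count would come from hardware popcount over the bitmap words rather than a Python sum, but the offset arithmetic is the same.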

Key words: large language model compression, post-training quantization, low-bit quantization, mixed-precision quantization, outlier extraction

CLC number: