
Journal of Graphics (图学学报) ›› 2026, Vol. 47 ›› Issue (1): 39-46. DOI: 10.11996/JG.j.2095-302X.2026010039

• Image Processing and Computer Vision •

A mixed-precision quantization method for large language models via memory alignment

LI Zhangming1, GUAN Weifan1, CHANG Zhengwei2, ZHANG Linghao2, HU Qinghao1

  1. The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
    2. State Grid Sichuan Electric Power Company, Chengdu 610041, Sichuan, China
  • Received: 2025-06-10  Accepted: 2025-10-11  Published: 2026-02-28  Online: 2026-03-16
  • Corresponding author: HU Qinghao, E-mail: huqinghao2014@ia.ac.cn
  • Supported by:
    Science and Technology Project of State Grid Corporation of China (5700-202426249A-1-1-ZN)

A mixed-precision quantization method for large language models via memory alignment

LI Zhangming1, GUAN Weifan1, CHANG Zhengwei2, ZHANG Linghao2, HU Qinghao1

  1. The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
    2. State Grid Sichuan Electric Power Company, Chengdu 610041, Sichuan, China
  • Received: 2025-06-10  Accepted: 2025-10-11  Published: 2026-02-28  Online: 2026-03-16
  • Supported by:
    Science and Technology Project of State Grid Corporation of China (5700-202426249A-1-1-ZN)

Abstract:

As large language models continue to grow in scale, the memory footprint and computational overhead of model inference have become major challenges. Model quantization is an effective way to reduce resource consumption, but existing weight-quantization methods suffer from insufficient outlier handling, significant quantization accuracy loss, and inefficient memory access. To address these problems, a memory-aligned mixed-precision quantization method for large models is proposed, which represents model parameters with quantized values of different bit widths, reducing model storage while mitigating the accuracy loss caused by quantization. Specifically, weight outliers are identified through group-wise significance analysis: model parameters are partitioned into groups aligned to single-instruction multiple-data (SIMD) units, and each group is quantized to 8-bit or 2-bit according to its significance. To counter the accuracy loss that 2-bit quantization may introduce, a block-wise quantization compensation strategy is adopted. In addition, an efficient packing and storage scheme for mixed-precision weights is designed, in which a bitmap records the bit-width type of each data block and supports random access. Experimental results show that the method significantly reduces memory usage and improves computational efficiency while preserving model accuracy. Validated on Llama2-7B, 13B, and 70B, it lowers perplexity (PPL) on the WikiText2 and C4 datasets by 8.13, 2.84, 1.37, and 5.80 relative to state-of-the-art methods, and the quantized 70B model reduces weight storage by about 87% compared with BF16. In addition, average accuracy on seven QA datasets improves by 6.24%. These results indicate that memory-aligned mixed-precision quantization can simultaneously improve compression ratio, memory-access efficiency, and model performance.
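The grouping-and-assignment step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the group size, the max-|w| significance score, and the fraction of 8-bit groups (`outlier_frac`) are all assumptions made for the example.

```python
import numpy as np

def quantize_mixed_precision(w, group_size=16, outlier_frac=0.1):
    """Sketch: split weights into SIMD-aligned groups, score each group's
    significance, then quantize the most significant groups to 8-bit and
    the rest to 2-bit with a symmetric per-group scale (all assumptions)."""
    assert w.size % group_size == 0, "weights must tile into aligned groups"
    groups = w.reshape(-1, group_size).astype(np.float32)
    # Significance proxy: max absolute value per group (the paper's
    # actual criterion may differ).
    score = np.abs(groups).max(axis=1)
    n_hi = max(1, int(np.ceil(outlier_frac * len(groups))))
    bits = np.full(len(groups), 2)
    bits[np.argsort(score)[-n_hi:]] = 8      # outlier groups get 8-bit
    deq = np.empty_like(groups)
    for i, g in enumerate(groups):
        qmax = 2 ** (bits[i] - 1) - 1        # 127 for 8-bit, 1 for 2-bit
        amax = np.abs(g).max()
        scale = amax / qmax if amax > 0 else 1.0
        q = np.clip(np.round(g / scale), -qmax, qmax)
        deq[i] = q * scale                   # dequantized group
    return deq.reshape(w.shape), bits
```

Because each group's length matches a SIMD unit, a kernel can load one group per vector register; the per-group `bits` array is what the packing scheme later encodes compactly.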

Key words: large language model compression, post-training quantization, low-bit quantization, mixed-precision quantization, outlier partitioning

Abstract:

As large models continue to grow in scale, the memory footprint and computational overhead of model inference have become critical challenges. Model quantization is an effective approach to reducing resource consumption, but existing methods suffer from insufficient outlier handling, significant quantization accuracy loss, and inefficient memory access. To address these issues, a memory-aligned mixed-precision quantization method for large models was proposed. First, weights were divided into SIMD-aligned groups, and outlier groups were identified via group-wise significance analysis, with high-significance groups quantized to 8-bit and the others to 2-bit. A block-wise compensation strategy was introduced to mitigate the accuracy degradation caused by 2-bit quantization. Furthermore, an efficient packing and storage scheme was designed for mixed-precision weights, in which a bitmap recorded the bit width of each data block, enabling random access. Experimental results demonstrated that the proposed method significantly reduced memory usage and improved computational efficiency while maintaining model accuracy. Specifically, on Llama2-7B/13B/70B, the approach achieved perplexity reductions of 8.13/2.84/1.37 on WikiText-2 and 5.80 on C4 relative to state-of-the-art baselines. The quantized 70B model reduced weight storage by approximately 87% compared with BF16, and an average accuracy gain of 6.24% was achieved across seven QA benchmarks. These results indicated that mixed-precision quantization via memory alignment could simultaneously improve compression ratio, memory-access efficiency, and overall model performance.
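The bitmap bookkeeping described above can be illustrated with a small sketch. The block size, the per-block byte counts, and the function names are assumptions for the example; the key idea is that one bit per block records its width, and a partial sum over the bitmap yields any block's byte offset without scanning the packed buffer.

```python
import numpy as np

def pack_blocks(block_bits, hi_bytes=16, lo_bytes=4):
    """Sketch: build the bitmap (True = 8-bit block, False = 2-bit block)
    and the byte offset of every block in the packed buffer. Byte counts
    assume 16-element blocks: 16 B at 8-bit, 4 B at 2-bit (assumption)."""
    bitmap = np.asarray(block_bits) == 8
    bytes_per_block = np.where(bitmap, hi_bytes, lo_bytes)
    offsets = np.concatenate(([0], np.cumsum(bytes_per_block)[:-1]))
    return bitmap, offsets

def block_offset(bitmap, k, hi_bytes=16, lo_bytes=4):
    """Random access: recover block k's offset from the bitmap alone by
    counting how many 8-bit blocks precede it."""
    n_hi = int(bitmap[:k].sum())
    return n_hi * hi_bytes + (k - n_hi) * lo_bytes
```

In a real kernel the prefix count would come from hardware popcount over the bitmap words rather than a Python sum, but the offset arithmetic is the same.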

Key words: large language model compression, post-training quantization, low-bit quantization, mixed-precision quantization, outlier extraction

CLC number: