
Journal of Graphics ›› 2026, Vol. 47 ›› Issue (1): 39-46. DOI: 10.11996/JG.j.2095-302X.2026010039

• Image Processing and Computer Vision •

A mixed-precision quantization method for large language models via memory alignment

LI Zhangming1, GUAN Weifan1, CHANG Zhengwei2, ZHANG Linghao2, HU Qinghao1

  1. The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
  2. State Grid Sichuan Electric Power Company, Chengdu, Sichuan 610041, China
  • Received: 2025-06-10 Accepted: 2025-10-11 Online: 2026-02-28 Published: 2026-03-16
  • Contact: HU Qinghao
  • Supported by:
    Science and Technology Project of State Grid Corporation of China(5700-202426249A-1-1-ZN)

Abstract:

As large models continue to grow in scale, the memory footprint and computational overhead of model inference have become critical challenges. Mixed-precision quantization is an effective approach to reducing resource consumption, but existing methods suffer from insufficient outlier handling, significant quantization accuracy loss, and inefficient memory access. To address these issues, a memory-aligned mixed-precision quantization method for large models was proposed. First, weights were divided into SIMD-aligned groups, and outlier groups were identified via group-wise significance analysis, with high-significance groups quantized to 8 bits and the others to 2 bits. A block-wise compensation strategy was introduced to mitigate the accuracy degradation caused by 2-bit quantization. Furthermore, an efficient packing and storage scheme was designed for the mixed-precision weights, in which a bitmap records the bit width of each data block, enabling random access. Experimental results demonstrated that the proposed method significantly reduced memory usage and improved computational efficiency while maintaining model accuracy. Specifically, on Llama2-7B/13B/70B, the approach achieved perplexity reductions of 8.13/2.84/1.37 on WikiText-2 and 5.80 on C4 relative to state-of-the-art baselines. The quantized 70B model reduced weight storage by approximately 87% compared with BF16. Across seven QA benchmarks, an average accuracy gain of 6.24% was achieved. Overall, these results indicated that a memory-aligned mixed-precision quantization method for large language models can simultaneously improve compression ratio, memory-access efficiency, and overall model performance.
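The pipeline described above (group-wise significance analysis, 8-bit/2-bit assignment, and a per-group bitmap for random access) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the group size (8), the significance proxy (maximum absolute weight per group), the high-precision fraction, and symmetric uniform quantization are all assumptions made here for illustration; the block-wise compensation strategy and the packed storage layout are omitted.

```python
import numpy as np

def quantize_group(g, bits):
    # Symmetric uniform quantization of one weight group to the given bit width.
    qmax = 2 ** (bits - 1) - 1
    amax = np.abs(g).max()
    scale = amax / qmax if amax > 0 else 1.0
    q = np.clip(np.round(g / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def mixed_precision_quantize(w, group_size=8, hi_frac=0.1):
    # Split weights into SIMD-aligned groups, rank groups by a simple
    # significance proxy (max |w| per group, an assumption), keep the top
    # `hi_frac` fraction at 8 bits and quantize the rest to 2 bits.
    # A boolean bitmap records each group's bit width, so any block's
    # precision can be looked up in O(1) (random access).
    assert w.size % group_size == 0
    groups = w.reshape(-1, group_size)
    significance = np.abs(groups).max(axis=1)
    n_hi = max(1, int(hi_frac * len(groups)))
    bitmap = np.zeros(len(groups), dtype=bool)
    bitmap[np.argsort(significance)[-n_hi:]] = True  # True -> 8-bit group
    quantized, scales = [], []
    for keep_hi, g in zip(bitmap, groups):
        q, s = quantize_group(g, 8 if keep_hi else 2)
        quantized.append(q)
        scales.append(s)
    return np.array(quantized), np.array(scales, dtype=np.float32), bitmap

def dequantize(q, scales):
    # Reconstruct approximate weights from integer codes and per-group scales.
    return q.astype(np.float32) * scales[:, None]
```

In a real kernel the 2-bit and 8-bit codes would be bit-packed into aligned blocks rather than stored as `int8`, with the bitmap consulted at decode time to select the unpacking routine.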

Key words: large language model compression, post-training quantization, low-bit quantization, mixed-precision quantization, outlier extraction
