[1] GUO C, TANG J M, HU W M, et al. OliVe: accelerating large language models via hardware-friendly outlier-victim pair quantization[C]// The 50th Annual International Symposium on Computer Architecture. New York: ACM, 2023: 3.
[2] SHANG Y Z, YUAN Z H, WU Q, et al. PB-LLM: partially binarized large language models[EB/OL]. [2025-04-10]. https://arxiv.org/abs/2310.00034.
[3] HUANG W, LIU Y D, QIN H T, et al. BiLLM: pushing the limit of post-training quantization for LLMs[EB/OL]. [2025-04-10]. https://arxiv.org/abs/2402.04291.
[4] FRANTAR E, ALISTARH D. SparseGPT: massive language models can be accurately pruned in one-shot[EB/OL]. [2025-04-10]. https://arxiv.org/abs/2301.00774.
[5] SUN M J, LIU Z, BAIR A, et al. A simple and effective pruning approach for large language models[EB/OL]. [2025-04-10]. https://arxiv.org/abs/2306.11695.
[6] GU Y X, DONG L, WEI F R, et al. MiniLLM: knowledge distillation of large language models[EB/OL]. [2025-04-10]. https://arxiv.org/abs/2306.08543.
[7] QIU J T, WANG J, YAO S, et al. Going deeper with embedded FPGA platform for convolutional neural network[C]// 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. New York: ACM, 2016: 26-35.
[8] LIN D D, TALATHI S S, ANNAPUREDDY V S. Fixed point quantization of deep convolutional networks[EB/OL]. [2025-04-10]. https://arxiv.org/abs/1511.06393.
[9] COURBARIAUX M, BENGIO Y, DAVID J P. BinaryConnect: training deep neural networks with binary weights during propagations[EB/OL]. [2025-04-10]. https://arxiv.org/abs/1511.00363.
[10] LIU Z C, OGUZ B, ZHAO C S, et al. LLM-QAT: data-free quantization aware training for large language models[C]// Findings of the Association for Computational Linguistics. New York: ACL, 2024: 467-484.
[11] FRANTAR E, ASHKBOOS S, HOEFLER T, et al. GPTQ: accurate post-training quantization for generative pre-trained transformers[EB/OL]. [2025-04-10]. https://arxiv.org/abs/2210.17323.
[12] LEE J H, KIM J, YANG J Y, et al. LRQ: optimizing post-training quantization for large language models by learning low-rank weight-scaling matrices[C]// 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies. New York: ACL, 2025: 7708-7743.
[13] KIM J, EL HALABI M, PARK W, et al. GuidedQuant: large language model quantization via exploiting end loss guidance[EB/OL]. [2025-04-10]. https://arxiv.org/abs/2505.07004.
[14] DETTMERS T, LEWIS M, BELKADA Y, et al. LLM.int8(): 8-bit matrix multiplication for transformers at scale[EB/OL]. [2025-04-10]. https://arxiv.org/abs/2208.07339.
[15] YAO Z W, AMINABADI R Y, ZHANG M J, et al. ZeroQuant: efficient and affordable post-training quantization for large-scale transformers[EB/OL]. [2025-04-10]. https://arxiv.org/abs/2206.01861.
[16] XIAO G X, LIN J, SEZNEC M, et al. SmoothQuant: accurate and efficient post-training quantization for large language models[EB/OL]. [2025-04-10]. https://arxiv.org/abs/2211.10438.
[17] SHAO W Q, CHEN M Z, ZHANG Z Y, et al. OmniQuant: omnidirectionally calibrated quantization for large language models[EB/OL]. [2025-04-10]. https://arxiv.org/abs/2308.13137.
[18] YAO Z W, WU X X, LI C, et al. ZeroQuant-V2: exploring post-training quantization in LLMs from comprehensive study to low rank compensation[EB/OL]. [2025-04-10]. https://arxiv.org/abs/2303.08302.
[19] LIN J, TANG J M, TANG H T, et al. AWQ: activation-aware weight quantization for LLM compression and acceleration[EB/OL]. [2025-04-10]. https://arxiv.org/abs/2306.00978.
[20] DETTMERS T, SVIRSCHEVSKI R, EGIAZARIAN V, et al. SpQR: a sparse-quantized representation for near-lossless LLM weight compression[EB/OL]. [2025-04-10]. https://arxiv.org/abs/2306.03078.
[21] CHEE J, CAI Y H, KULESHOV V, et al. QuIP: 2-bit quantization of large language models with guarantees[EB/OL]. [2025-04-10]. https://arxiv.org/abs/2307.13304.
[22] LI L, LI Q Y, ZHANG B, et al. Norm tweaking: high-performance low-bit quantization of large language models[C]// The 38th AAAI Conference on Artificial Intelligence. Philadelphia: AAAI, 2024: 18536-18544.
[23] BEHDIN K, ACHARYA A, GUPTA A, et al. QuantEase: optimization-based quantization for language models[EB/OL]. [2025-04-10]. https://arxiv.org/abs/2309.01885.
[24] YUAN Z H, NIU L, LIU J W, et al. RPTQ: reorder-based post-training quantization for large language models[EB/OL]. [2025-04-10]. https://arxiv.org/abs/2304.01089.
[25] MERITY S, XIONG C M, BRADBURY J, et al. Pointer sentinel mixture models[EB/OL]. [2025-04-10]. https://arxiv.org/abs/1609.07843.
[26] RAFFEL C, SHAZEER N, ROBERTS A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer[J]. The Journal of Machine Learning Research, 2020, 21(1): 140.
[27] BISK Y, ZELLERS R, LE BRAS R, et al. PIQA: reasoning about physical commonsense in natural language[C]// The 34th AAAI Conference on Artificial Intelligence. Philadelphia: AAAI, 2020: 7432-7439.
[28] CLARK C, LEE K, CHANG M W, et al. BoolQ: exploring the surprising difficulty of natural yes/no questions[C]// 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. New York: ACL, 2019: 2924-2936.
[29] MIHAYLOV T, CLARK P, KHOT T, et al. Can a suit of armor conduct electricity? A new dataset for open book question answering[C]// 2018 Conference on Empirical Methods in Natural Language Processing. New York: ACL, 2018: 2381-2391.
[30] SAKAGUCHI K, LE BRAS R, BHAGAVATULA C, et al. WinoGrande: an adversarial winograd schema challenge at scale[J]. Communications of the ACM, 2021, 64(9): 99-106.
[31] ZELLERS R, HOLTZMAN A, BISK Y, et al. HellaSwag: can a machine really finish your sentence?[C]// The 57th Annual Meeting of the Association for Computational Linguistics. New York: ACL, 2019: 4791-4800.
[32] CLARK P, COWHEY I, ETZIONI O, et al. Think you have solved question answering? Try ARC, the AI2 reasoning challenge[EB/OL]. [2025-04-10]. https://arxiv.org/abs/1803.05457.
[33] KRISHNAMOORTHI R. Quantizing deep convolutional networks for efficient inference: a whitepaper[EB/OL]. [2025-04-10]. https://arxiv.org/pdf/1806.08342.
[34] LI Z T, YAN X L, ZHANG T N, et al. ARB-LLM: alternating refined binarizations for large language models[EB/OL]. [2025-04-10]. https://arxiv.org/abs/2410.03129.
[35] HUANG W, QIN H T, LIU Y D, et al. SliM-LLM: salience-driven mixed-precision quantization for large language models[EB/OL]. [2025-04-10]. https://arxiv.org/abs/2405.14917.