
QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources (2310.07147v2)

Published 11 Oct 2023 in cs.CL and cs.LG

Abstract: LLMs have showcased remarkable impacts across a wide spectrum of natural language processing tasks. Fine-tuning these pretrained models on downstream datasets provides further significant performance gains; however, this process typically requires a large number of expensive, high-end GPUs. Although there have been efforts focused on parameter-efficient fine-tuning, they cannot fully unlock the powerful potential of full-parameter fine-tuning. In this paper, we propose QFT, a Quantized Full-parameter Tuning framework for LLMs that quantizes and stores all training states, including weights, gradients, and optimizer states, in INT8 format to reduce training memory, thereby enabling full-parameter fine-tuning on existing GPUs at an affordable cost. To ensure training performance, we make two key efforts: i) for quantized gradients and optimizer states, we theoretically prove that the Lion optimizer, with its property of consistent update magnitudes, is highly robust to quantization; and ii) for quantized weights, we employ the hybrid feature quantizer, which identifies and protects a small subset of sparse critical features while quantizing the remaining dense features, thus ensuring accurate weight updates without FP32 backups. Moreover, to support backpropagation in the integer context, we develop a stack-based gradient flow scheme with O(1) complexity, forming a unified integer training pipeline. As a result, QFT reduces the model state memory to 21% of the standard solution while achieving comparable performance, e.g., tuning a LLaMA-7B model requires only <30GB of memory, making it feasible on a single A6000 GPU.
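The abstract leans on two mechanisms: Lion's sign-based update, which tolerates INT8 optimizer states because every parameter always moves by exactly ±lr, and a dense-and-sparse split that protects a few critical weight features from quantization error. The NumPy sketch below illustrates only these two ideas; it is not the paper's implementation, and the function names, the per-tensor symmetric quantizer, and the top-k outlier selection are assumptions made for illustration.

```python
# Minimal sketch, assuming a per-tensor symmetric INT8 quantizer and
# top-k magnitude outlier selection. Not the paper's actual code.
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization; returns (int8 codes, FP32 scale)."""
    scale = float(np.abs(x).max()) / 127.0 + 1e-12
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

def lion_step_int8_state(w, m_q, m_scale, grad, lr=1e-4, beta1=0.9, beta2=0.99):
    """One Lion update with the momentum state stored in INT8.

    Lion moves every parameter by exactly +/- lr via sign(beta1*m + (1-beta1)*g),
    so small quantization error in the stored momentum rarely flips the sign --
    the consistent-update-magnitude property the abstract appeals to.
    """
    m = dequantize_int8(m_q, m_scale)
    w = w - lr * np.sign(beta1 * m + (1.0 - beta1) * grad)
    m = beta2 * m + (1.0 - beta2) * grad
    m_q, m_scale = quantize_int8(m)            # write the updated state back in INT8
    return w, m_q, m_scale

def hybrid_quantize(w, outlier_frac=0.001):
    """Dense-and-sparse split in the spirit of the hybrid feature quantizer:
    keep a tiny fraction of large-magnitude entries in FP32 and store the
    dense remainder in INT8."""
    k = max(1, int(outlier_frac * w.size))
    flat = w.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # k largest-magnitude entries
    sparse_vals = flat[idx].copy()                 # protected FP32 "critical" features
    dense = flat.copy()
    dense[idx] = 0.0                               # only the dense part is quantized
    dense_q, scale = quantize_int8(dense)
    return dense_q, scale, idx, sparse_vals

def hybrid_dequantize(dense_q, scale, idx, sparse_vals, shape):
    w = dequantize_int8(dense_q, scale)
    w[idx] = sparse_vals
    return w.reshape(shape)
```

In the full framework the weights themselves also live in INT8 and gradients flow through an integer pipeline with the stack-based scheme mentioned above; the sketch only isolates the two robustness arguments stated in the abstract.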

