
Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models

(2401.00788)
Published Jan 1, 2024 in cs.CL, cs.AI, and cs.SE

Abstract

The high cost of full-parameter fine-tuning (FFT) of LLMs has led to a series of parameter-efficient fine-tuning (PEFT) methods. However, it remains unclear which methods provide the best cost-performance trade-off at different model scales. We introduce Astraios, a suite of 28 instruction-tuned OctoCoder models using 7 tuning methods and 4 model sizes up to 16 billion parameters. Through investigations across 5 tasks and 8 different datasets encompassing both code comprehension and code generation tasks, we find that FFT generally leads to the best downstream performance across all scales, and PEFT methods differ significantly in their efficacy based on the model scale. LoRA usually offers the most favorable trade-off between cost and performance. Further investigation into the effects of these methods on both model robustness and code security reveals that larger models tend to demonstrate reduced robustness and less security. Finally, we explore the relationships among updated parameters, cross-entropy loss, and task performance. We find that the tuning effectiveness observed in small models generalizes well to larger models, and that the validation loss in instruction tuning can be a reliable indicator of overall downstream performance.

Figure: Mean task performance of Astraios models across multiple tasks and datasets, plotted against the number of parameters updated by each PEFT method.

Overview

  • The paper introduces Astraios, a framework for evaluating Parameter-Efficient Fine-Tuning (PEFT) methods on instruction-tuned Code LLMs across various scales.

  • Astraios comprises 28 instruction-tuned OctoCoder models, covering 7 tuning methods and 4 sizes up to 16 billion parameters, evaluated across multiple coding tasks to compare performance.

  • Findings indicate that Full Fine-Tuning (FFT) generally outperforms PEFT across all model scales, though PEFT efficacy varies with model size, and LoRA typically offers the best cost-performance trade-off.

  • Larger models deliver stronger code generation but show no comparable gains in code comprehension, and exhibit reduced robustness and security against adversarial inputs.

  • The paper highlights the importance of understanding the trade-offs between model size, cost, performance, robustness, and security in the development of Code LLMs.

Introduction to Parameter-Efficient Tuning of LLMs

The evolution of LLMs in software engineering has enhanced performance on tasks such as code comprehension and code generation. Current advancements point toward instruction-tuned Code LLMs that are tailored to follow human instructions and perform across a variety of tasks without task-specific fine-tuning. However, as models grow larger, full-parameter fine-tuning (FFT) becomes prohibitively costly, pushing the field toward more efficient strategies, namely Parameter-Efficient Fine-Tuning (PEFT) methods. This study evaluates these PEFT methods across different model scales to determine their impact on model performance, robustness, and security.
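
As a concrete illustration, the sketch below shows how a PEFT method such as LoRA can be attached to a code LLM so that only a small adapter is trained while the base weights stay frozen. It uses the Hugging Face peft library; the checkpoint name and hyperparameters are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal sketch: attaching LoRA adapters to a code LLM with Hugging Face `peft`.
# The checkpoint name and hyperparameters are illustrative assumptions, not the
# exact Astraios configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "bigcode/starcoderbase-1b"  # assumed small StarCoder-family base model
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# LoRA adds small low-rank matrices next to the attention projections;
# only these adapter weights receive gradients, the base weights stay frozen.
lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update
    lora_alpha=16,              # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["c_attn"],  # assumed fused QKV projection name in StarCoder-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # prints trainable vs. total parameter counts
```

Because only the low-rank adapter matrices receive gradients, the optimizer state and gradient memory shrink accordingly, which is where most of the cost savings over FFT come from.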

Analyzing the PEFT Methods

Researchers developed Astraios, a suite of 28 instruction-tuned models based on OctoCoder with up to 16 billion parameters, covering 7 tuning methods across 4 model sizes. The models were evaluated on 5 tasks spanning 8 datasets that include both code comprehension and code generation. The findings indicate that FFT generally yields the best downstream performance across all scales, while PEFT efficacy varies with model size, with LoRA most often striking the best balance between cost and effectiveness.
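
Since the cost side of this trade-off is largely determined by how many parameters each method updates, a rough comparison can be made by instantiating several peft configurations and counting trainable parameters, as in the sketch below. The module names and hyperparameters are assumptions for a StarCoder-style architecture, not the paper's exact settings.

```python
# Sketch: comparing the fraction of trainable parameters under a few PEFT
# configurations. Module names and hyperparameters are assumptions for a
# StarCoder-style architecture, not the paper's exact settings.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, IA3Config, PromptTuningConfig, get_peft_model

base_id = "bigcode/starcoderbase-1b"  # assumed base checkpoint

configs = {
    "LoRA": LoraConfig(r=8, target_modules=["c_attn"], task_type="CAUSAL_LM"),
    "(IA)^3": IA3Config(target_modules=["c_attn", "c_fc"],
                        feedforward_modules=["c_fc"], task_type="CAUSAL_LM"),
    "Prompt tuning": PromptTuningConfig(num_virtual_tokens=32, task_type="CAUSAL_LM"),
}

for name, cfg in configs.items():
    model = AutoModelForCausalLM.from_pretrained(base_id)
    peft_model = get_peft_model(model, cfg)
    trainable = sum(p.numel() for p in peft_model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in peft_model.parameters())
    print(f"{name:>13}: {trainable:,} trainable ({100 * trainable / total:.3f}% of {total:,})")
```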

Model Scaling and Fine-Tuning Impact

Interestingly, larger models excel at code generation but do not show the same gains in code comprehension. Moreover, these larger models exhibit decreased robustness and heightened security vulnerabilities, suggesting that larger instruction-tuned Code LLMs trade generating higher-quality code against staying secure and reliable under adversarial inputs. The researchers also observed a strong correlation between tuning validation loss and downstream performance, indicating that validation loss can serve as a proxy for a model's broader capabilities.
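
The reported link between tuning loss and downstream results can be checked with a simple rank correlation across a suite of tuned models, as sketched below; the loss and score values are placeholders, not numbers from the paper.

```python
# Sketch: checking whether instruction-tuning validation loss tracks downstream
# performance. The values below are placeholders, not results from the paper.
from scipy.stats import spearmanr

# One entry per tuned model: validation loss and mean downstream task score.
val_loss = [1.42, 1.35, 1.31, 1.28, 1.25, 1.21, 1.18]
task_score = [22.4, 24.1, 25.0, 25.8, 26.9, 28.3, 29.6]

rho, p_value = spearmanr(val_loss, task_score)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.4f})")
# A strongly negative rho indicates that lower validation loss goes hand in
# hand with higher downstream scores.
```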

Model Robustness and Security

Beyond task performance, the study underscores the significance of model robustness and security. Evaluation on perturbed data and security-focused benchmarks revealed that models with fewer updated parameters can sometimes be more robust, whereas increasing model size correlates with diminished robustness and a greater tendency to generate insecure code.
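
One way to quantify such a robustness gap, in the spirit of ReCode-style evaluations, is to compare pass rates on original versus perturbed prompts; the helper function and the scores in the sketch below are hypothetical.

```python
# Sketch: quantifying a robustness gap as the relative drop in pass@1 between
# original and perturbed prompts, in the spirit of ReCode-style evaluations.
# The helper and the scores below are hypothetical placeholders.

def robustness_drop(pass_at_1_original: float, pass_at_1_perturbed: float) -> float:
    """Relative performance drop under perturbation; 0.0 means fully robust."""
    if pass_at_1_original == 0:
        return 0.0
    return (pass_at_1_original - pass_at_1_perturbed) / pass_at_1_original

# Hypothetical pass@1 scores for two model sizes tuned with the same method:
for name, orig, pert in [("1B model", 0.18, 0.15), ("16B model", 0.32, 0.24)]:
    print(f"{name}: relative drop = {robustness_drop(orig, pert):.1%}")
```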

Concluding Thoughts

The paper's systematic study of model fine-tuning emphasizes the intricate relationships among model size, cost, performance, robustness, and security. With a comprehensive model suite, Astraios enables an in-depth understanding of these dynamics and provides critical insights for developing more capable and reliable Code LLMs.

Acknowledgements and Contributions

The research benefited from contributions and support from numerous institutions, individuals, and the broader community. These collaborations span academia and industry, reflecting the collective effort behind advancing AI and machine learning in software engineering.

