The high cost of full-parameter fine-tuning (FFT) of LLMs has led to a series of parameter-efficient fine-tuning (PEFT) methods. However, it remains unclear which methods provide the best cost-performance trade-off at different model scales. We introduce Astraios, a suite of 28 instruction-tuned OctoCoder models using 7 tuning methods and 4 model sizes up to 16 billion parameters. Through investigations across 5 tasks and 8 different datasets encompassing both code comprehension and code generation, we find that FFT generally leads to the best downstream performance across all scales, and PEFT methods differ significantly in their efficacy based on the model scale. LoRA usually offers the most favorable trade-off between cost and performance. Further investigation into the effects of these methods on both model robustness and code security reveals that larger models tend to demonstrate reduced robustness and less security. Finally, we explore the relationships among updated parameters, cross-entropy loss, and task performance. We find that the tuning effectiveness observed in small models generalizes well to larger models, and the validation loss in instruction tuning can be a reliable indicator of overall downstream performance.
The paper introduces Astraios, a framework for evaluating Parameter-Efficient Fine-Tuning (PEFT) methods on instruction-tuned Code LLMs across various scales.
Astraios evaluates 28 models, instruction-tuned variants of OctoCoder with up to 16 billion parameters, across multiple coding tasks to compare performance.
Findings indicate Full Fine-Tuning (FFT) generally outperforms PEFT as models scale, though PEFT efficacy varies with model size; LoRA typically offers the best cost-performance trade-off.
Larger models show superior code generation abilities but decreased code comprehension, robustness, and security against adversarial inputs.
The paper highlights the importance of understanding the trade-offs between model size, cost, performance, robustness, and security in the development of Code LLMs.
The evolution of LLMs in software engineering has led to enhanced performance in tasks such as code comprehension and code generation. Current advancements point towards instruction-tuned Code LLMs that are tailored to understand human instructions and perform across a variety of tasks without task-specific fine-tuning. However, as models become larger, full-parameter fine-tuning (FFT) becomes prohibitively costly, pushing the field towards more efficient strategies, namely Parameter-Efficient Fine-Tuning (PEFT) methods. This study evaluates these PEFT methods across different model scales to determine their impact on model performance, robustness, and security.
Researchers developed Astraios, a suite of 28 instruction-tuned models based on OctoCoder with up to 16 billion parameters, covering 7 different tuning methods. Several tasks, including code generation and code comprehension, were evaluated on multiple datasets. The findings indicate that FFT tends to outperform PEFT at scale, yet efficacy varies by model size, with LoRA often presenting the optimal balance between cost and effectiveness.
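To make the cost argument concrete, the following is a minimal numpy sketch of the low-rank update at the heart of LoRA: a frozen weight matrix is augmented with two small trainable factors, so far fewer parameters are updated than in FFT. The dimensions, scaling factor, and initialization here are illustrative assumptions, not the paper's actual training configuration.

```python
import numpy as np

def lora_forward(W, A, B, x, alpha=16):
    """Output of a linear layer with a LoRA adapter.

    W is the frozen pretrained weight (d_out x d_in); only the
    low-rank factors A (r x d_in) and B (d_out x r) are trained.
    """
    r = A.shape[0]
    return W @ x + (alpha / r) * (B @ (A @ x))

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4  # illustrative sizes, not the paper's
W = rng.normal(size=(d_out, d_in))
A = rng.normal(scale=0.01, size=(r, d_in))
B = np.zeros((d_out, r))  # zero init: adapter starts as a no-op
x = rng.normal(size=d_in)

# Before training, the adapted layer matches the frozen base layer.
assert np.allclose(lora_forward(W, A, B, x), W @ x)

# Trainable parameters: 2*r*d for LoRA vs d_out*d_in for FFT.
print(W.size, A.size + B.size)
```

The parameter savings shown here (4096 vs 512) grow with layer width, which is why the trade-off studied in the paper becomes more pronounced at larger model scales.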
Interestingly, larger models excel in code generation tasks but do not extend the same pattern to code comprehension. Moreover, these sizable models are prone to decreased robustness and heightened security vulnerabilities, which suggests larger instruction-tuned Code LLMs face a trade-off between generating high-quality code and staying secure and reliable against adversarial inputs. The researchers also observed a strong correlation between tuning validation loss and downstream performance, indicating that tuning loss can serve as a proxy for the model's broader capabilities.
Beyond task execution efficiency, the study underscores the significance of model robustness and security. Evaluation with perturbed data and security-focused benchmarks revealed that models with fewer updated parameters can sometimes offer greater robustness. However, an increase in model size correlates with diminishing robustness and a tendency to generate insecure code more frequently.
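As a concrete illustration of what "perturbed data" can mean in such robustness evaluations, the sketch below applies one simple semantics-preserving perturbation, renaming a variable, to a code prompt; a robust model should behave the same on both versions. This is a hypothetical example in the spirit of the evaluation described above, not the benchmark's implementation.

```python
import re

def rename_identifier(code: str, old: str, new: str) -> str:
    """Semantics-preserving perturbation: rename one identifier.

    Word boundaries keep substrings of longer names untouched.
    """
    return re.sub(rf"\b{re.escape(old)}\b", new, code)

src = "def add(total, x):\n    total = total + x\n    return total"
perturbed = rename_identifier(src, "total", "acc")
print(perturbed)
```

A robustness score can then be defined as the fraction of prompts on which the model's output remains functionally correct after such perturbations.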
The paper's exploration of model fine-tuning highlights the intricate relationships among model size, cost, performance, robustness, and security. With a comprehensive model suite, Astraios enables an in-depth understanding of these dynamics and provides critical insights into developing more capable and reliable Code LLMs.
The research benefited from contributions and support from numerous institutions, individuals, and the community, reflecting collaboration across academia and industry in advancing AI and machine learning for software engineering.