
Abstract

LLMs have become the go-to solution for many NLP tasks thanks to their ability to tackle diverse problems and produce high-quality results. In particular, they are increasingly used to generate code automatically, easing the burden on developers by handling repetitive tasks. However, these quality gains come with high computational and memory demands, putting LLMs out of reach for users with limited resources. In this paper, we focus on Central Processing Unit (CPU)-compatible models and conduct a thorough semi-manual evaluation of their strengths and weaknesses in generating Python code. We improve their performance by introducing a Chain-of-Thought prompt that guides the model through problem-solving. We also propose a dataset of 60 programming problems of varying difficulty for evaluation purposes. Our assessment further includes testing these models on two state-of-the-art datasets: HumanEval and EvalPlus. We commit to sharing our dataset and experimental results publicly to ensure transparency.

Figure: Overview of the Python code generation process using the described system.

Overview

  • The paper evaluates the performance of CPU-compatible, open-source models for Python code generation, highlighting their accessibility compared to more resource-intensive models.

  • Assessment involves various quantized models like LLaMA, Mistral, Dolphin, and OpenHermes using a custom dataset and established datasets like HumanEval and EvalPlus.

  • Results show that performance varies across models, both in adherence to the expected output format and in computational efficiency on typical desktop hardware.

  • The paper identifies challenges such as output-format compliance and resource requirements, and suggests enhanced training and broader task coverage as directions for future research.

Evaluation of Low-Cost CPU-Compatible Models for Python Code Generation

Introduction to CPU-Compatible Models in Python Code Generation

In the NLP landscape, Python code generation has emerged as an essential task, fueled by the widespread use of the language and the need to automate coding tasks. LLMs have played a pivotal role in these advances; however, their resource-intensive nature often limits their accessibility. This paper contributes to the field by evaluating the performance of various CPU-compatible, open-source models specifically in the context of Python code generation.

Experiment Setup and Models Evaluated

The exploration of CPU-compatible models is conducted with a selection of quantized models from the llama.cpp project, which is optimized for CPU inference. The models examined include versions of LLaMA and Mistral, along with derivatives such as Dolphin and OpenHermes, quantized at levels ranging from 2 to 8 bits. The study leverages a custom dataset of sixty diverse Python coding problems, alongside the established HumanEval and EvalPlus datasets, to gauge the models' code-synthesis capabilities.
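As a concrete illustration of this setup, the sketch below loads a quantized GGUF model on the CPU through the llama-cpp-python bindings and issues a Chain-of-Thought-style prompt in the spirit of the paper's approach. The model file name, prompt wording, and sampling settings are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch: CPU inference with a quantized GGUF model via the
# llama-cpp-python bindings. The model path, prompt, and sampling
# settings are placeholders, not values reported in the paper.
from llama_cpp import Llama

llm = Llama(
    model_path="mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical local file
    n_ctx=2048,   # context window
    n_threads=8,  # CPU threads to use
)

# A Chain-of-Thought-style prompt: ask the model to reason about the
# problem before emitting the final Python function.
prompt = (
    "You are a Python programming assistant.\n"
    "First, briefly list the steps needed to solve the problem.\n"
    "Then write a single Python function implementing the solution.\n\n"
    "Problem: return the second largest element of a list of integers."
)

out = llm(prompt, max_tokens=512, temperature=0.2)
print(out["choices"][0]["text"])
```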

Key Outcomes and Model Comparisons

Performance Across Datasets

  • On the custom dataset, models generally struggled to produce correctly formatted output in addition to correct coding solutions. Notably:
    • Mistral variants showed robust problem comprehension and adherence to the required output format.
    • Dolphin and OpenHermes models excelled at code generation but often failed to align their outputs with the expected format.
  • On HumanEval and EvalPlus, Dolphin models clearly surpassed the others, showing strength in raw code synthesis when no format constraints apply (a simplified correctness check is sketched below).
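Evaluation on HumanEval and EvalPlus reduces to functional correctness: a completion counts as solved if the benchmark's unit tests run without failures. The snippet below is a simplified stand-in for such a check, not the paper's harness; real harnesses sandbox the execution of untrusted generated code.

```python
# Simplified illustration of a HumanEval-style correctness check:
# run the generated function together with the benchmark's tests and
# report whether the asserts pass. Real harnesses sandbox this step;
# never exec untrusted generated code directly.
candidate = """
def add(a, b):
    return a + b
"""

tests = """
assert add(2, 3) == 5
assert add(-1, 1) == 0
"""

def passes(candidate_src: str, test_src: str) -> bool:
    env: dict = {}
    try:
        exec(candidate_src + "\n" + test_src, env)
        return True
    except Exception:
        return False

print(passes(candidate, tests))  # True for this toy example
```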

Computational Efficiency

The study also assesses operational feasibility on standard CPUs, examining the models' storage footprint, RAM requirements, and inference times:

  • Models like Mistral and LLaMA demonstrated a balance between performance and computational demands.
  • The smallest models required less than 6 GB of space and around 5 GB of RAM, manageable within regular desktop environments.
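These footprints line up with a back-of-the-envelope estimate (not a figure from the paper): a quantized model's file size is roughly its parameter count times the bits stored per weight.

```python
# Back-of-the-envelope size estimate for a quantized model:
# parameters * bits per weight, ignoring block scales and metadata.
def approx_size_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(approx_size_gb(7, 4))  # ~3.5 GB for a 7B model at 4-bit
print(approx_size_gb(7, 8))  # ~7.0 GB at 8-bit
```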

Challenges and Limitations

While CPU-compatible models offer an accessible alternative to GPU-dependent ones, they encounter specific challenges:

  • Output Format Compliance: Some models, though effective at raw code generation, struggle to adhere strictly to the required output format, incurring penalties in structured evaluations (a hypothetical format check is sketched after this list).
  • Resource Requirements: Despite optimizations, the most powerful configurations of models like Mixtral still demand resources beyond typical CPU capacities, limiting their practical utility.
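Format compliance is typically verified by parsing the model's reply for the expected structure. The hypothetical helper below illustrates one way such a check could look, extracting a fenced Python block from the reply; it is an assumption for illustration, not the paper's evaluation script.

```python
import re

# Hypothetical helper: pull the first fenced Python block out of a model
# reply. Replies without such a block would count as a format violation.
FENCE = "`" * 3  # built dynamically so this snippet nests cleanly in docs

def extract_python_block(reply: str):
    pattern = FENCE + r"(?:python)?\s*\n(.*?)" + FENCE
    match = re.search(pattern, reply, re.DOTALL)
    return match.group(1).strip() if match else None

reply = f"Here is the solution:\n{FENCE}python\ndef square(x):\n    return x * x\n{FENCE}"
print(extract_python_block(reply))  # prints the extracted function body
```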

Future Research Directions

The continual evolution of CPU-friendly LLMs for coding tasks suggests several trajectories for future work:

  • Enhanced Model Training: Further refining model architectures and training paradigms to balance performance with resource efficiency.
  • Expanded Task Coverage: Investigating models' capabilities across a broader spectrum of coding-related tasks, such as code summarization, bug-fixing, or even cross-language translation.

Conclusion

This investigation underscores the significant potential of CPU-compatible models to democratize Python code generation, making it more accessible across varied computational environments. By highlighting specific strengths and weaknesses across different models and tasks, this research provides valuable insights that pave the way for future enhancements in the domain of AI-powered coding assistance.
