Emergent Mind


Recent research has explored the creation of questions from code submitted by students. These Questions about Learners' Code (QLCs) are created through program analysis, exploring execution paths, and then creating code comprehension questions from these paths and the broader code structure. Responding to the questions requires reading and tracing the code, which is known to support students' learning. At the same time, computing education researchers have witnessed the emergence of LLMs that have taken the community by storm. Researchers have demonstrated the applicability of these models especially in the introductory programming context, outlining their performance in solving introductory programming problems and their utility in creating new learning resources. In this work, we explore the capability of the state-of-the-art LLMs (GPT-3.5 and GPT-4) in answering QLCs that are generated from code that the LLMs have created. Our results show that although the state-of-the-art LLMs can create programs and trace program execution when prompted, they easily succumb to similar errors that have previously been recorded for novice programmers. These results demonstrate the fallibility of these models and perhaps dampen the expectations fueled by the recent LLM hype. At the same time, we also highlight future research possibilities such as using LLMs to mimic students as their behavior can indeed be similar for some specific tasks.

Methodology using LLM to create and analyze multiple program solutions and quality control checks.


  • Researchers analyzed GPT-3.5 and GPT-4's ability to comprehend and respond to self-generated code comprehension questions.

  • In the experiment, the LLMs generated code, QLCs were automatically produced from that code, and the LLMs then attempted to answer them; the responses were manually evaluated for correctness.

  • GPT-4 outperformed GPT-3.5 on most question types, but both models struggled with dynamic aspects of code such as loop behavior and detailed tracing tasks.

  • The study points to possible improvements in LLM training and suggests using LLMs to create educational content and to compare their errors with those of human learners.

Exploring ChatGPT's Capacity to Answer Program Comprehension Questions from Self-Generated Code


Researchers at Aalto University analyzed how well state-of-the-art LLMs, specifically GPT-3.5 and GPT-4, answer Questions about Learners' Code (QLCs). The QLCs were generated from code that the LLMs themselves had written, serving a dual purpose: assessing the models' comprehension of the programs they create and identifying common error patterns in their responses.

Experiment Design

The experiment followed a structured sequence:

  1. The LLMs were tasked with generating program code based on provided exercise descriptions.
  2. From these generated programs, QLCs were automatically produced using the QLCpy library.
  3. The LLMs subsequently attempted to answer these QLCs.
  4. Finally, the researchers manually analyzed the correctness of the LLM responses and categorized errors.
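
The four steps above can be sketched as a small pipeline. This is a minimal illustration, not the authors' implementation: `run_pipeline`, `QLC`, `Record`, and the three callables passed in are hypothetical names, and the stubs stand in for real LLM calls and for the QLCpy library, whose API is not shown here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QLC:
    """A question about a program together with its known correct answer."""
    qlc_type: str
    question: str
    correct_answer: str


@dataclass
class Record:
    """One question/answer pair awaiting manual review (step 4)."""
    program: str
    qlc: QLC
    llm_answer: str


def run_pipeline(
    exercise: str,
    generate_code: Callable[[str], str],         # step 1: LLM writes a program
    make_questions: Callable[[str], List[QLC]],  # step 2: questions from the program
    answer_question: Callable[[str, QLC], str],  # step 3: LLM answers a question
) -> List[Record]:
    program = generate_code(exercise)
    records = []
    for qlc in make_questions(program):
        records.append(Record(program, qlc, answer_question(program, qlc)))
    # Step 4 (manual correctness checks and error coding) happens offline.
    return records


# Stubs standing in for the LLM and the question generator (assumptions).
def fake_generate(exercise: str) -> str:
    return "def double(x):\n    return 2 * x\n"


def fake_questions(program: str) -> List[QLC]:
    return [QLC("ParameterName", "Which name is a parameter of the function?", "x")]


def fake_answer(program: str, qlc: QLC) -> str:
    return "x"


records = run_pipeline(
    "Write a function that doubles a number.",
    fake_generate, fake_questions, fake_answer,
)
```

Passing the three stages in as callables keeps the sketch independent of any particular model API or question generator.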

The QLCs aimed to test various aspects of program comprehension, such as variable roles, loop behaviors, and line-specific purposes, reflecting different cognitive levels in program understanding.
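
To make the idea of a QLC concrete, the following sketch derives two simple questions from a program using Python's standard `ast` module. It does not use QLCpy; the function name `simple_qlcs` and the question wording are illustrative assumptions.

```python
import ast


def simple_qlcs(source: str) -> list:
    """Return (question, answer) pairs derived from static analysis of `source`."""
    tree = ast.parse(source)
    qlcs = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            params = [a.arg for a in node.args.args]
            qlcs.append((
                f"Which names are parameters of the function '{node.name}'?",
                params,
            ))
        elif isinstance(node, (ast.For, ast.While)):
            # lineno is 1-based; end_lineno is available since Python 3.8.
            qlcs.append((
                f"On which line does the loop that starts on line {node.lineno} end?",
                node.end_lineno,
            ))
    return qlcs


program = """def total(values):
    result = 0
    for v in values:
        result += v
    return result
"""

qlcs = simple_qlcs(program)
for question, answer in qlcs:
    print(question, "->", answer)
```

Because the answers come from program analysis rather than from a model, they can serve as ground truth when an LLM's response is checked.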

Findings and Observations

Performance Summary

Overall, GPT-4 outperformed GPT-3.5 across most QLC types, consistent with incremental improvements in newer LLM generations. The success rate varied significantly depending on the QLC type: both models performed well at identifying function parameters and variable names but struggled with more dynamic aspects such as loop behavior and tracing.
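
Per-type success rates like those summarized above can be computed from manually labeled responses. The sketch below assumes each response has been coded as a (model, QLC type, correct) triple; `success_rates` is a hypothetical helper, and the rows are placeholder values for illustration only, not results from the study.

```python
from collections import defaultdict


def success_rates(labels):
    """Map (model, qlc_type) -> fraction of responses labeled correct."""
    counts = defaultdict(lambda: [0, 0])  # [correct, total]
    for model, qlc_type, correct in labels:
        counts[(model, qlc_type)][0] += int(correct)
        counts[(model, qlc_type)][1] += 1
    return {key: c / n for key, (c, n) in counts.items()}


# Placeholder labels (NOT data from the paper).
labels = [
    ("model-a", "ParameterName", True),
    ("model-a", "ParameterName", True),
    ("model-a", "LoopCount", True),
    ("model-a", "LoopCount", False),
]

rates = success_rates(labels)
```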

Error Analysis

A detailed error analysis highlighted both models' pitfalls:

  • Logical Errors: Both models occasionally produced illogical steps in code execution or misunderstood code semantics, issues also common among novice programmers.
  • Line Numbering Issues: Both models sometimes misinterpreted line references within the code, which suggests room for improvement in how LLMs map the physical structure of code during generation and comprehension tasks.
  • Response Inconsistencies: Particularly in GPT-3.5, inconsistencies in answer justification revealed a lack of coherence, where valid logical deductions were followed by incorrect final answers, or vice versa.
  • Hallucination in Justifications: GPT-4 occasionally adhered to an initially incorrect answer, fabricating justifications to support it, a phenomenon less observed in human cognition.
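
Trace-style and line-reference questions have a mechanically checkable ground truth, which is what makes errors of this kind identifiable. As an illustration (not the study's tooling), the sketch below uses Python's `sys.settrace` to count how many times each line of a function runs; `line_counts` and the sample function `total` are hypothetical.

```python
import sys
from collections import Counter


def line_counts(func, *args):
    """Run func(*args) and count executions of each of its source lines.

    Line numbers are reported relative to the function's `def` line (def = 1).
    """
    code = func.__code__
    counts = Counter()

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace other functions
        if event == "line":
            counts[frame.f_lineno - code.co_firstlineno + 1] += 1
        return tracer

    old = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(old)  # always restore the previous trace function
    return counts


def total(values):
    result = 0
    for v in values:
        result += v
    return result


counts = line_counts(total, [1, 2, 3])
# counts[4] is how many times the loop body (relative line 4) ran.
```

A model's answer to a question such as "how many times does the loop body execute?" can then be compared against the recorded count instead of being judged by reading its justification alone.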

Implications and Future Opportunities

This research illuminates several pathways and considerations:

  • Model Training and Fine-Tuning: Enhancing training regimes to better encompass and distinguish between syntactic and semantic elements of code could improve LLM performance in both generating and comprehending code.
  • Educational Tools Development: LLMs could be integrated into educational platforms not just for solving problems but for generating pedagogical content, such as automated question generation and answer explanation models.
  • Comparative Studies with Human Learners: Similarities in error patterns between LLMs and students invite further studies that compare their misconceptions, potentially using LLMs to mimic student behavior on specific tasks in educational research.


While the LLMs were often able to answer code comprehension questions about programs they had generated themselves, their evident limitations call for cautious optimism. The errors encountered, especially in logical reasoning and in interpreting code structure, show that LLMs remain well short of reliable, human-like code comprehension. Future LLM developments and applications, particularly in educational contexts, must take these limitations into account to make use of the models' strengths and mitigate their shortcomings.
