As AI systems like language models are increasingly integrated into decision-making processes affecting people's lives, it's critical to ensure that these systems have sound moral reasoning. To test whether they do, we need to develop systematic evaluations. We provide a framework that uses a language model to translate causal graphs that capture key aspects of moral dilemmas into prompt templates. With this framework, we procedurally generated a large and diverse set of moral dilemmas -- the OffTheRails benchmark -- consisting of 50 scenarios and 400 unique test items. We collected moral permissibility and intention judgments from human participants for a subset of our items and compared these judgments to those from two language models (GPT-4 and Claude-2) across eight conditions. We find that moral dilemmas in which the harm is a necessary means (as compared to a side effect) resulted in lower permissibility and higher intention ratings for both participants and language models. The same pattern was observed for evitable versus inevitable harmful outcomes. However, there was no clear effect of whether the harm resulted from an agent's action versus from having omitted to act. We discuss limitations of our prompt generation pipeline and opportunities for improving scenarios to increase the strength of experimental effects.

  • The paper introduces a framework that uses causal graphs to procedurally generate moral dilemmas for evaluating LLMs; the resulting set of dilemmas is the OffTheRails benchmark.

  • The generated dilemmas vary three factors: causal structure (harm as a means vs. a side effect), evitability of the harm, and action vs. omission.

  • Experiments show that both humans and models respond to changes in causal structure and evitability, revealing consistent patterns in moral judgments.

  • The findings could inform efforts to improve the moral sensitivity of LLMs, with implications for their integration in applications like autonomous vehicles and personalized healthcare.

Evaluating Moral Reasoning in Language Models Using Procedurally Generated Dilemmas


As LLMs are integrated into decision-making processes, it becomes important that they reason soundly about moral questions. This paper evaluates the moral reasoning of LLMs systematically, using a framework that turns causal graphs into moral dilemmas. The resulting set of dilemmas is the OffTheRails benchmark.

Methodology Overview

The methodology translates abstract causal graphs into prompt templates, which a language model then fills in and expands to create a diverse set of moral dilemmas. The study manipulates three key variables:

  • Causal Structure: whether harm is a means to an end or a side effect.
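  • Evitability: whether the harm would occur regardless of what the agent does (inevitable) or only as a result of the agent's behavior (evitable).
  • Action: whether the harm results from something the agent does (action) or from the agent's failure to act (omission).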
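The three variables above can be read as a 2 × 2 × 2 factorial design, which yields the eight conditions mentioned in the abstract. The following is a minimal sketch of that enumeration; the identifier names (`FACTORS`, `enumerate_conditions`) and the level labels are illustrative assumptions, not taken from the paper's code.

```python
from itertools import product

# Factor names and levels follow the three variables listed above.
# The identifiers are illustrative assumptions, not the paper's own.
FACTORS = {
    "causal_structure": ("means", "side_effect"),
    "evitability": ("evitable", "inevitable"),
    "action": ("action", "omission"),
}


def enumerate_conditions(factors=FACTORS):
    """Return one dict per cell of the full factorial design."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


conditions = enumerate_conditions()
for condition in conditions:
    print(condition)
```

Each printed dictionary corresponds to one version of a scenario, so a single scenario contributes eight test items.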

Using language models for generation makes the process scalable and produces moral scenarios that are both controlled and varied. This avoids the rigidity of hand-written experimental vignettes as well as the lack of control in crowdsourced narratives.

Benchmark Creation

The OffTheRails benchmark includes 50 scenarios and 400 unique test items (eight conditions per scenario), with GPT-4 used for item generation. Each scenario starts from a generated causal structure, from which variations are derived for the different combinations of the key variables. Because LLMs are inconsistent at keeping complex causal relationships distinct, the generation process enforces strict adherence to the template.
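To make the item counts concrete, here is a hedged sketch of template-based item construction. In the paper a language model (GPT-4) writes the scenario text; in this sketch fixed strings stand in for that step. The template, slot names, and sentence wording are all invented for illustration, as are the names `TEMPLATE`, `SLOTS`, and `build_items`.

```python
from itertools import product
from string import Template

# Hypothetical template: slot names and wording are invented for this sketch.
TEMPLATE = Template("$context $harm_link $evitability_clause $agent_behavior")

# One sentence variant per level of each manipulated variable (placeholders
# for text that a language model would generate in the actual pipeline).
SLOTS = {
    "harm_link": {
        "means": "The harm is required in order to bring about the good outcome.",
        "side_effect": "The harm is a by-product of bringing about the good outcome.",
    },
    "evitability_clause": {
        "evitable": "Without the agent's involvement, the harm would not occur.",
        "inevitable": "The harm would occur anyway for an unrelated reason.",
    },
    "agent_behavior": {
        "action": "The agent acts, and the harm follows.",
        "omission": "The agent does not intervene, and the harm follows.",
    },
}


def build_items(scenario_id, context):
    """Fill the template once per combination of slot variants."""
    items = []
    # Iterating over each inner dict yields its level names (the keys).
    for link, evit, act in product(*SLOTS.values()):
        text = TEMPLATE.substitute(
            context=context,
            harm_link=SLOTS["harm_link"][link],
            evitability_clause=SLOTS["evitability_clause"][evit],
            agent_behavior=SLOTS["agent_behavior"][act],
        )
        items.append({"scenario": scenario_id, "condition": (link, evit, act), "text": text})
    return items


# 50 placeholder scenarios x 8 conditions = 400 items, matching the reported counts.
benchmark = [
    item
    for i in range(50)
    for item in build_items(i, f"[context for scenario {i}]")
]
print(len(benchmark))  # 400
```

Keeping the manipulated sentences in fixed slots is one way to ensure that items within a scenario differ only in the intended variables.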

Experiments and Findings

The investigation involves two key experiments:

  1. Balancing Moral Scenarios: Ratings from human participants were used to pair each harm with a good of comparable magnitude, so that a large imbalance between harm and benefit would not overshadow the effects of the manipulated variables.
  2. Evaluating Moral Judgments: Human participants (on a subset of items) and two language models (GPT-4 and Claude-2) gave permissibility and intention judgments across the eight conditions. Both humans and models were sensitive to changes in causal structure and evitability, but there was no clear effect of whether the harm resulted from an action or an omission.

Notably, the pattern was consistent across humans and models: scenarios in which the harm was a necessary means (rather than a side effect), or was evitable (rather than inevitable), received lower permissibility ratings and higher intention ratings, in line with established psychological findings.
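One simple way to summarize such a pattern is to compare mean ratings between the two levels of a variable. The sketch below is an assumption about how that comparison could be computed, not the paper's analysis; the function name `main_effect`, the row format, and the rating values are hypothetical. The numbers are placeholders invented only to exercise the function and are not data from the study.

```python
from statistics import mean


def main_effect(rows, factor, level_a, level_b, measure):
    """Mean of `measure` at level_a minus its mean at level_b.

    Averages over all other factors; a descriptive summary only,
    not a significance test.
    """
    a = [r[measure] for r in rows if r[factor] == level_a]
    b = [r[measure] for r in rows if r[factor] == level_b]
    return mean(a) - mean(b)


# Placeholder ratings, invented for illustration; NOT data from the study.
rows = [
    {"causal_structure": "means", "permissibility": 2, "intention": 4},
    {"causal_structure": "means", "permissibility": 3, "intention": 5},
    {"causal_structure": "side_effect", "permissibility": 4, "intention": 2},
    {"causal_structure": "side_effect", "permissibility": 5, "intention": 3},
]

perm_diff = main_effect(rows, "causal_structure", "means", "side_effect", "permissibility")
int_diff = main_effect(rows, "causal_structure", "means", "side_effect", "intention")
print(perm_diff, int_diff)
```

A negative permissibility difference together with a positive intention difference would match the direction of the effect described above.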

Implications and Future Directions

The results have both practical and theoretical value for AI ethics, particularly for assessing how sensitive LLMs are to morally relevant features of a situation. Procedural generation offers a scalable way to evaluate moral reasoning systematically, and potentially to improve it. This matters for the integration of LLMs in sensitive applications, from autonomous vehicles to personalized AI in healthcare.

The generation pipeline also has limitations. Producing scenarios that cleanly separate means from side effects proved difficult, which points to weaknesses in how LLMs handle complex causal distinctions. Future work could refine the templating process or manipulate the scenario variables at a finer grain, both to strengthen the experimental effects and to better understand the nuances of models' moral judgments.


The study establishes a foundational approach for systematically evaluating and improving the moral reasoning of language models. By demonstrating the feasibility and effectiveness of using structured, procedurally generated dilemmas, it sets the stage for further research into the ethical capabilities of AI systems, aiming for models that more accurately reflect nuanced human moral judgments.

