Beyond Concept Bottleneck Models: How to Make Black Boxes Intervenable?

(2401.13544)
Published Jan 24, 2024 in cs.LG and stat.ML

Abstract

Recently, interpretable machine learning has re-explored concept bottleneck models (CBM), which sequentially predict high-level concepts from the raw features and the target variable from the predicted concepts. A compelling advantage of this model class is the user's ability to intervene on the predicted concept values, affecting the model's downstream output. In this work, we introduce a method to perform such concept-based interventions on already-trained neural networks, which are not interpretable by design, given an annotated validation set. Furthermore, we formalise the model's intervenability as a measure of the effectiveness of concept-based interventions and leverage this definition to fine-tune black-box models. Empirically, we explore the intervenability of black-box classifiers on synthetic tabular and natural image benchmarks. We demonstrate that fine-tuning improves intervention effectiveness and often yields better-calibrated predictions. To showcase the practical utility of the proposed techniques, we apply them to deep chest X-ray classifiers and show that fine-tuned black boxes can be as intervenable as, and more performant than, CBMs.

Overview

  • The paper presents a technique to enable concept-based human intervention in pre-trained black-box neural networks without requiring concept annotations during training.

  • Intervenability is introduced as a measure of how amenable a model is to concept-based interventions and serves as the objective for fine-tuning black-box models.

  • A three-step intervention procedure is proposed, involving (1) training a probing function, (2) editing intermediate representations, and (3) updating the final model output (see the sketch after this list).

  • Comparative studies demonstrate the method's effectiveness over established baselines in terms of intervention capability and calibration.

  • This research advances interpretable machine learning by allowing black-box models to become intervenable while preserving their architecture and initial learned representations.
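
To make the intervention procedure concrete, the following is a minimal sketch of how the three steps could be implemented, assuming the black box can be split into a feature extractor (`backbone`) and a prediction head (`head`), a linear probe, and a small validation loader yielding (input, concept, label) triples. The names and hyperparameters are illustrative rather than taken from the authors' code; the representation edit is done here by gradient descent that pulls the probe's output toward the desired concepts while staying close to the original representation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_probe(backbone, val_loader, n_concepts, repr_dim, epochs=20, lr=1e-2):
    """Step 1: fit a linear probe that maps representations z to concept logits."""
    probe = nn.Linear(repr_dim, n_concepts)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        for x, c, _ in val_loader:          # (input, binary concept labels, target)
            with torch.no_grad():
                z = backbone(x)             # frozen black-box representation
            loss = F.binary_cross_entropy_with_logits(probe(z), c.float())
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe

def intervene(backbone, head, probe, x, c_desired, lam=0.1, steps=100, lr=0.1):
    """Steps 2 and 3: edit the representation so the probe matches the desired
    concepts, then recompute the model output from the edited representation."""
    with torch.no_grad():
        z0 = backbone(x)
    z = z0.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        concept_loss = F.binary_cross_entropy_with_logits(probe(z), c_desired.float())
        proximity = ((z - z0) ** 2).mean()  # keep the edit close to the original z
        loss = concept_loss + lam * proximity
        opt.zero_grad()
        loss.backward()
        opt.step()
    return head(z.detach())                 # updated downstream prediction
```

In this sketch, `intervene` returns the updated downstream prediction, and `lam` controls how far the edited representation may drift from the original one.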

Introduction

Interpretable machine learning has recently directed significant attention toward Concept Bottleneck Models (CBMs), which facilitate human intervention at the level of high-level attributes, or concepts. This is particularly advantageous as it enables users to directly influence model predictions by editing the predicted concept values. Nevertheless, a critical obstacle for CBMs is that they require concept knowledge and annotations at training time, which can be impractical or unattainable in many real-world scenarios.

Beyond Concept Bottleneck Models

A recent scholarly contribution addresses this challenge by presenting a technique to facilitate concept-based interventions in non-interpretable, pre-trained neural networks, all without requiring concept annotations during the initial training. The work is grounded in the idea of intervenability as a new measure: it quantifies a model's amenability to concept-based interventions and serves as an effective objective for fine-tuning black-box models to respond better to such interventions. A key premise is that the original model's architecture and learned representations are preserved, which is critical for knowledge transfer and for maintaining performance across diverse tasks.
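
The summary does not reproduce the formal definition, but a measure in this spirit can be sketched as the expected reduction in the target loss when predictions are recomputed from intervened representations; the notation below is illustrative and may differ from the paper's exact formulation.

```latex
% Illustrative sketch, not the paper's verbatim definition.
% f = h \circ g is the black box with representation z = g(x), q is a concept probe,
% c' are the desired (intervened) concept values, and \ell_y, \ell_c are losses.
\mathcal{I}(f) = \mathbb{E}_{(x,\,c',\,y)}\!\left[ \ell_y\bigl(y,\, h(g(x))\bigr) - \ell_y\bigl(y,\, h(z')\bigr) \right],
\quad\text{where}\quad
z' = \operatorname*{arg\,min}_{\tilde{z}} \; \ell_c\bigl(q(\tilde{z}),\, c'\bigr) + \lambda \,\bigl\lVert \tilde{z} - g(x) \bigr\rVert_2^2 .
```

Under this reading, a higher value means interventions help more, and fine-tuning for intervenability amounts to maximising this quantity, i.e. minimising the post-intervention loss.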

Methods and Contributions

The approach involves a three-step intervention procedure: first, training a probing function to map intermediate representations to concept values; second, editing these representations to reflect the desired concept interventions; and third, updating the final model output based on the edited representations. Notably, this approach requires only a small annotated validation set for probing. By leveraging the formalised notion of intervenability, the authors introduce a fine-tuning procedure that does not alter the model's architecture, making the strategy applicable to diverse pre-trained neural networks.
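
A minimal sketch of such an intervenability-driven fine-tuning loop is shown below, reusing the `backbone`/`head`/`probe` split from the earlier sketch. The differentiable one-step representation edit and the weighting of the loss terms are simplifying assumptions, not the authors' exact objective.

```python
def finetune_for_intervenability(backbone, head, probe, loader,
                                 epochs=5, lr=1e-4, alpha=1.0, step_size=0.1):
    """Hedged sketch: update the existing weights (architecture unchanged) so that
    predictions recomputed from intervened representations incur a low target loss."""
    for p in probe.parameters():            # keep the probe fixed during fine-tuning
        p.requires_grad_(False)
    params = list(backbone.parameters()) + list(head.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for x, c, y in loader:              # ground-truth concepts stand in for interventions
            z = backbone(x)
            # Differentiable one-step approximation of the representation edit:
            # move z in the direction that makes the probe agree with the concepts c.
            concept_loss = F.binary_cross_entropy_with_logits(probe(z), c.float())
            grad_z = torch.autograd.grad(concept_loss, z, create_graph=True)[0]
            z_edit = z - step_size * grad_z
            # Penalise the target loss both after and before the intervention.
            loss = F.cross_entropy(head(z_edit), y) + alpha * F.cross_entropy(head(z), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return backbone, head
```

In this sketch, the `alpha` term retains a standard supervised loss on the original predictions so that performance without interventions is not sacrificed.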

The work examines various fine-tuning paradigms and contrasts them with the proposed intervenability-driven method. These comparative studies support the validity of the new approach, demonstrating improved intervention effectiveness and model calibration over common baselines.

Empirical Evaluation

Extensive experiments on synthetic and real-world datasets, including chest X-ray classification, illustrate the practical implications of the proposed method. While CBMs show their expected strength in scenarios where the data-generating process depends heavily on the concepts, the newly introduced fine-tuning strategy rivals or even surpasses CBMs in more complex setups, including cases where the concepts are not sufficient to fully capture the relationship between inputs and outputs.

Conclusion

This work represents a significant step in interpretable machine learning, offering a compelling solution for enhancing the intervention capacities of opaque neural network models. The methods developed extend the practicality of intervenability to real-world applications, providing a mechanism to mediate between interpretability and performance while allowing existing black-box models to benefit from human-expert interaction. The paper sets the stage for further exploration of optimal intervention strategies, the integration of automated concept discovery, and the evaluation and refinement of large pre-trained models.
