Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation (2311.17216v2)

Published 28 Nov 2023 in cs.CV

Abstract: Diffusion-based models have gained significant popularity for text-to-image generation due to their exceptional image-generation capabilities. A risk with these models is the potential generation of inappropriate content, such as biased or harmful images. However, the underlying reasons for generating such undesired content from the perspective of the diffusion model's internal representation remain unclear. Previous work interprets vectors in an interpretable latent space of diffusion models as semantic concepts. However, existing approaches cannot discover directions for arbitrary concepts, such as those related to inappropriate concepts. In this work, we propose a novel self-supervised approach to find interpretable latent directions for a given concept. With the discovered vectors, we further propose a simple approach to mitigate inappropriate generation. Extensive experiments have been conducted to verify the effectiveness of our mitigation approach, namely, for fair generation, safe generation, and responsible text-enhancing generation. Project page: https://interpretdiffusion.github.io

Summary

The paper introduces a novel self-supervised method to identify semantic latent directions in the diffusion model’s h-space for ethical image synthesis.
It demonstrates that manipulating these latent directions with concept and anti-concept vectors significantly reduces gender and racial biases while suppressing unsafe content.
The approach enables fine-grained compositional control, interpolation, and cross-domain generalization, suggesting a scalable solution for responsible text-to-image generation.

Introduction and Motivation

The paper addresses the challenge of controlling and interpreting the internal representations of text-to-image diffusion models, with a focus on responsible generation: mitigating biases and unsafe content. While diffusion models such as Stable Diffusion have demonstrated state-of-the-art performance in image synthesis, their tendency to generate inappropriate or biased content remains a significant concern. Prior work has attempted to filter prompts, fine-tune models, or use external classifiers, but these approaches lack generality, require extensive human annotation, or degrade model performance. This work proposes a self-supervised method to discover interpretable latent directions in the semantic bottleneck ($h$-space) of the U-Net architecture, enabling direct manipulation of ethical and semantic concepts without external supervision.

Methodology: Self-Discovery of Semantic Latent Directions

The core contribution is an optimization framework that identifies a concept vector in the $h$-space corresponding to any user-defined attribute (e.g., gender, safety, age). The process involves:

Data Synthesis: Generate images using a prompt containing the target concept (e.g., "a female face").
Concept Vector Optimization: Freeze the pretrained diffusion model and iteratively optimize a latent vector $c$ in $h$-space to minimize the reconstruction loss when generating the same image from a prompt with the concept removed (e.g., "a face"). The only learnable parameter is $c$, which is forced to encode the missing semantic information.
Figure 1: Optimization framework for discovering a semantic vector for a given concept in the $h$-space of Stable Diffusion.
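
In standard noise-prediction notation, the objective described in these steps can be sketched as follows. The symbols are assumptions introduced here for illustration: $x_t^{+}$ is the noised version of an image generated from the prompt containing the concept, $y^{-}$ is the prompt with the concept removed, $\epsilon_\theta$ is the frozen denoiser, and $c$ is the vector added to the $h$-space activation.

$$
c^{*} = \arg\min_{c} \; \mathbb{E}_{x^{+},\, \epsilon,\, t} \left\| \epsilon - \epsilon_\theta\!\left(x_t^{+},\, t,\, y^{-},\, c\right) \right\|_2^2
$$

Because $\theta$ is frozen, gradients flow only into $c$, so whatever the concept-free prompt fails to explain has to be absorbed by the vector.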

This approach is agnostic to the concept and does not require labeled data or external classifiers. The learned vector generalizes across prompts and images, and can be linearly composed or interpolated for nuanced control.
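
The idea of training only the vector while the model stays frozen can be illustrated with a toy sketch. This is not the paper's implementation: the 2x2 matrix `W` is a hypothetical stand-in for the frozen layers after the bottleneck, `TRUE_SHIFT` plays the role of the information the concept prompt would contribute, and plain gradient descent replaces whatever optimizer the authors use.

```python
import random

# Hypothetical frozen weights (stand-in for the U-Net layers after the bottleneck).
W = [[1.0, 0.5],
     [0.0, 1.0]]
# In this toy, the concept prompt shifts the bottleneck activation by this amount.
TRUE_SHIFT = [0.8, -0.3]


def decode(h):
    """Frozen linear map from a bottleneck activation to a prediction."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def make_sample(rng):
    """Pair a concept-free activation with the target produced by the concept prompt."""
    h_minus = [rng.uniform(-1, 1) for _ in range(2)]
    target = decode([a + b for a, b in zip(h_minus, TRUE_SHIFT)])
    return h_minus, target


def learn_concept_vector(samples, lr=0.05, steps=2000, seed=0):
    """Gradient descent on c only; W is never updated."""
    rng = random.Random(seed)
    c = [0.0, 0.0]  # the only learnable parameter
    for _ in range(steps):
        h_minus, target = rng.choice(samples)
        pred = decode([a + b for a, b in zip(h_minus, c)])
        residual = [p - t for p, t in zip(pred, target)]
        # Gradient of ||W(h + c) - target||^2 with respect to c is 2 * W^T residual.
        grad = [2 * sum(W[i][j] * residual[i] for i in range(2)) for j in range(2)]
        c = [cj - lr * g for cj, g in zip(c, grad)]
    return c


data_rng = random.Random(1)
samples = [make_sample(data_rng) for _ in range(100)]
learned = learn_concept_vector(samples)
print([round(v, 4) for v in learned])
```

In this linear toy the learned vector converges to `TRUE_SHIFT` regardless of which sample is drawn, which mirrors the claim that one vector can serve many images. A real diffusion model is nonlinear, so the same guarantee does not carry over automatically.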

Applications: Fair, Safe, and Responsible Generation

Fair Generation

The method enables fair image synthesis by sampling concept vectors (e.g., male/female) with equal probability during inference, ensuring balanced representation across societal groups for ambiguous prompts (e.g., "doctor").

Figure 2: Fair generation—balancing gender representation for the prompt "doctor" by sampling male/female concept vectors.

Empirical results on the Winobias benchmark demonstrate substantial reduction in gender and racial bias compared to both vanilla Stable Diffusion and state-of-the-art debiasing methods. The approach is robust to prompt variations and does not require retraining for new professions or attributes.
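
A minimal sketch of the equal-probability sampling step, assuming the concept vectors have already been learned. The two-dimensional vectors are placeholders, and `deviation_from_uniform` is an illustrative imbalance measure chosen for this example, not necessarily the metric reported in the paper.

```python
import random
from collections import Counter

# Placeholder concept vectors; real ones would be learned h-space directions.
CONCEPT_VECTORS = {
    "female": [0.8, -0.3],
    "male": [-0.8, 0.3],
}


def sample_concept(rng):
    """Pick one group uniformly at random and return its name and vector."""
    name = rng.choice(sorted(CONCEPT_VECTORS))
    return name, CONCEPT_VECTORS[name]


def deviation_from_uniform(counts, num_groups):
    """0.0 means perfectly balanced; 1.0 means one group accounts for every sample."""
    total = sum(counts.values())
    uniform = 1.0 / num_groups
    worst = max(abs(counts.get(g, 0) / total - uniform) for g in CONCEPT_VECTORS)
    return worst / (1.0 - uniform)


rng = random.Random(0)
counts = Counter(sample_concept(rng)[0] for _ in range(10_000))
balance = deviation_from_uniform(counts, len(CONCEPT_VECTORS))
print(counts, round(balance, 3))
```

Each generation call would use the sampled vector for that one image, so balance emerges over many generations rather than within a single output.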

Safe Generation

For prompts with implicit or explicit references to unsafe content (e.g., nudity, violence), the method learns "anti-concept" vectors (e.g., anti-sexual, anti-violence) using negative prompts. These vectors are added during inference to suppress inappropriate content while maintaining prompt fidelity.

Figure 3: Safe generation—using an anti-sexual concept vector to suppress nudity in images generated from ambiguous prompts.

Quantitative evaluation on the I2P benchmark shows that combining these safety vectors with existing safety mechanisms (e.g., SLD, ESD) yields further reductions in inappropriate content, with up to 40% relative improvement in nudity suppression.

Responsible Text-Enhancing Generation

The approach also enhances the model's ability to follow responsible instructions in prompts (e.g., "no violence"). By extracting relevant concepts from the prompt and activating the corresponding vectors during generation, the model more faithfully adheres to ethical constraints.

Figure 4: Responsible text-enhancing generation—activating safety concepts from the prompt to improve adherence to responsible instructions.
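
The extract-then-activate flow can be sketched with a simple phrase lookup. The summary does not specify how concepts are extracted from the prompt, so the keyword table, the substring matching rule, and the three-dimensional vectors below are all hypothetical.

```python
# Hypothetical mapping from instruction phrases to names of learned safety vectors.
INSTRUCTION_TO_CONCEPT = {
    "no violence": "anti-violence",
    "no nudity": "anti-sexual",
}

# Placeholder vectors; real ones would be learned h-space directions.
SAFETY_VECTORS = {
    "anti-violence": [0.0, -1.0, 0.2],
    "anti-sexual": [-0.5, 0.0, 0.4],
}


def concepts_in_prompt(prompt):
    """Return the safety concepts whose instruction phrase appears in the prompt."""
    text = prompt.lower()
    return [c for phrase, c in INSTRUCTION_TO_CONCEPT.items() if phrase in text]


def activate(h, prompt):
    """Add every safety vector triggered by the prompt to the activation h."""
    out = list(h)
    for name in concepts_in_prompt(prompt):
        out = [a + b for a, b in zip(out, SAFETY_VECTORS[name])]
    return out


h = [1.0, 1.0, 1.0]
prompt = "A street fight scene, no violence"
print(concepts_in_prompt(prompt))
print(activate(h, prompt))
```

A prompt without any listed instruction leaves the activation untouched, so ordinary prompts are generated as before.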

Semantic Properties: Interpolation, Composition, and Generalization

Interpolation

Concept vectors can be scaled to interpolate the strength of an attribute in the generated image, enabling fine-grained control over semantic features.

Figure 5: Concept interpolation—gradually increasing the strength of a concept vector modifies the image semantics smoothly.
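
Assuming the learned vector is simply added to the bottleneck activation $h_t$ at each denoising step, scaling can be written with a scalar strength $\lambda$ (notation introduced here for illustration, not taken from the paper):

$$
h_t' = h_t + \lambda\, c
$$

Setting $\lambda = 0$ leaves generation unchanged, and increasing $\lambda$ moves the output further along the attribute.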

Composition

Multiple concept vectors (e.g., gender, age, race) can be linearly combined to synthesize images with composite attributes, demonstrating the disentanglement and compositionality of the $h$-space.

Figure 6: Multiple concepts composition—linearly adding vectors for gender, age, and race yields images with corresponding semantics.
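
Linear composition can be sketched as a weighted sum of vectors added to the activation. The concept names, the three-dimensional placeholder vectors, and the `compose` helper are assumptions made for this example.

```python
def compose(h, concept_vectors, strengths):
    """Return h shifted by the weighted sum of the selected concept vectors."""
    shifted = list(h)
    for name, strength in strengths.items():
        vec = concept_vectors[name]
        shifted = [a + strength * b for a, b in zip(shifted, vec)]
    return shifted


# Placeholder 3-dimensional vectors; real h-space activations are far larger.
CONCEPTS = {
    "female": [1.0, 0.0, 0.0],
    "old": [0.0, 1.0, 0.0],
    "glasses": [0.0, 0.0, 1.0],
}

h = [0.5, 0.5, 0.5]
both = compose(h, CONCEPTS, {"female": 1.0, "old": 1.0})
reordered = compose(h, CONCEPTS, {"old": 1.0, "female": 1.0})
print(both, both == reordered)
```

The placeholder vectors are orthogonal, so each concept changes a separate coordinate. Learned vectors need not be orthogonal, which is one reason stacking many of them can interfere with image quality.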

Generalization

Concept vectors learned from one domain (e.g., "running" from dog images) generalize to other domains (e.g., cats, humans), indicating that the discovered directions capture universal semantic properties.

Figure 7: General semantic concepts—vectors learned for "running" and "glasses" generalize across objects and prompts.

Implementation Details and Trade-offs

Computational Requirements: The optimization is performed with the diffusion model frozen, requiring only gradient updates to the concept vector. Training typically converges within 10K steps on 1K synthesized images per concept.
Scalability: The method is scalable to arbitrary concepts and can be applied to realistic datasets (e.g., CelebA) or synthetic data.
Limitations: Composing many safety-related vectors can degrade image fidelity and semantic alignment. The approach is relatively insensitive to the number of training samples and to prompt diversity, but extreme extrapolation of concept vectors may yield unintended artifacts.
Integration: The method is orthogonal to existing safety and debiasing techniques and can be combined for enhanced responsible generation.

Theoretical and Practical Implications

The findings provide evidence that ethical and semantic concepts are encoded in the internal representations of diffusion models and can be manipulated directly in the latent space. This opens avenues for interpretable, controllable, and responsible generative modeling without retraining or external supervision. The approach facilitates post-hoc editing, fairness interventions, and safety enforcement in deployed generative systems.

Future Directions

Potential extensions include:

Automated extraction of concept vectors for large-scale attribute libraries.
Dynamic adjustment of concept strengths based on user or regulatory requirements.
Application to multimodal and cross-domain generative models.
Further analysis of the geometry and disentanglement properties of the $h$-space.

Conclusion

This work introduces a self-supervised framework for discovering and manipulating interpretable latent directions in diffusion models, enabling responsible text-to-image generation. The approach achieves strong empirical results in fairness, safety, and prompt adherence, with minimal computational overhead and no reliance on external annotation. The demonstrated generalization and compositionality of concept vectors suggest promising directions for interpretable and ethical generative AI.