Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation (2311.17216v2)

Published 28 Nov 2023 in cs.CV

Abstract: Diffusion-based models have gained significant popularity for text-to-image generation due to their exceptional image-generation capabilities. A risk with these models is the potential generation of inappropriate content, such as biased or harmful images. However, the underlying reasons for generating such undesired content from the perspective of the diffusion model's internal representation remain unclear. Previous work interprets vectors in an interpretable latent space of diffusion models as semantic concepts. However, existing approaches cannot discover directions for arbitrary concepts, such as those related to inappropriate concepts. In this work, we propose a novel self-supervised approach to find interpretable latent directions for a given concept. With the discovered vectors, we further propose a simple approach to mitigate inappropriate generation. Extensive experiments have been conducted to verify the effectiveness of our mitigation approach, namely, for fair generation, safe generation, and responsible text-enhancing generation. Project page: \url{https://interpretdiffusion.github.io}.


Summary

  • The paper introduces a novel self-supervised method to identify semantic latent directions in the diffusion model’s h-space for ethical image synthesis.
  • It demonstrates that manipulating these latent directions with concept and anti-concept vectors significantly reduces gender and racial biases while suppressing unsafe content.
  • The approach enables fine-grained compositional control, interpolation, and cross-domain generalization, suggesting a scalable solution for responsible text-to-image generation.

Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation

Introduction and Motivation

The paper addresses the challenge of controlling and interpreting the internal representations of text-to-image diffusion models, specifically focusing on responsible generation: mitigating biases and unsafe content. While diffusion models such as Stable Diffusion have demonstrated state-of-the-art performance in image synthesis, their tendency to generate inappropriate or biased content remains a significant concern. Prior work has attempted to filter prompts, fine-tune models, or use external classifiers, but these approaches either lack generality, require extensive human annotation, or degrade model performance. This work proposes a self-supervised method to discover interpretable latent directions in the semantic bottleneck (h-space) of the U-Net architecture, enabling direct manipulation of ethical and semantic concepts without external supervision.

Methodology: Self-Discovery of Semantic Latent Directions

The core contribution is an optimization framework that identifies a concept vector in the h-space corresponding to any user-defined attribute (e.g., gender, safety, age). The process involves:

  1. Data Synthesis: Generate images using a prompt containing the target concept (e.g., "a female face").
  2. Concept Vector Optimization: Freeze the pretrained diffusion model and iteratively optimize a latent vector c in h-space to minimize the reconstruction loss when generating the same image from a prompt with the concept removed (e.g., "a face"). The only learnable parameter is c, which is forced to encode the missing semantic information (Figure 1).

    Figure 1: Optimization framework for discovering a semantic vector for a given concept in the h-space of Stable Diffusion.

This approach is agnostic to the concept and does not require labeled data or external classifiers. The learned vector generalizes across prompts and images, and can be linearly composed or interpolated for nuanced control.
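
A minimal PyTorch-style sketch of this optimization loop is given below. It assumes a frozen Stable Diffusion U-Net wrapped so that an offset can be added to its bottleneck (h-space) activation; unet, encode_prompt, scheduler, and latents_with_concept are illustrative placeholders rather than the authors' code.

```python
import torch
import torch.nn.functional as F

H_DIM = 1280                    # channel width of SD's U-Net bottleneck (assumption)
NUM_TRAIN_TIMESTEPS = 1000

# Assumed to exist (not part of the paper's released code):
#   unet(x_t, t, text_emb, h_offset)  -> predicted noise, with h_offset added in h-space
#   encode_prompt(str)                -> text embeddings from the frozen text encoder
#   scheduler.add_noise(x0, noise, t) -> noised latent x_t (standard DDPM forward process)
#   latents_with_concept              -> clean latents of images generated from "a female face"

c = torch.zeros(1, H_DIM, requires_grad=True)   # the only learnable parameter
optimizer = torch.optim.Adam([c], lr=1e-3)
emb_wo_concept = encode_prompt("a face")        # same prompt with the concept word removed

for x0 in latents_with_concept:
    t = torch.randint(0, NUM_TRAIN_TIMESTEPS, (1,))
    noise = torch.randn_like(x0)
    x_t = scheduler.add_noise(x0, noise, t)

    # The frozen U-Net must still reconstruct the noise even though "female" is
    # absent from the text, so c is pushed to encode the missing semantics.
    noise_pred = unet(x_t, t, emb_wo_concept, h_offset=c)
    loss = F.mse_loss(noise_pred, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```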

Applications: Fair, Safe, and Responsible Generation

Fair Generation

The method enables fair image synthesis by sampling concept vectors (e.g., male/female) with equal probability during inference, ensuring balanced representation across societal groups for ambiguous prompts (e.g., "doctor"; Figure 2).

Figure 2: Fair generation—balancing gender representation for the prompt "doctor" by sampling male/female concept vectors.

Empirical results on the Winobias benchmark demonstrate substantial reduction in gender and racial bias compared to both vanilla Stable Diffusion and state-of-the-art debiasing methods. The approach is robust to prompt variations and does not require retraining for new professions or attributes.
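
A hedged sketch of how this balanced sampling could be applied at inference; the generate helper and the discovered vectors c_male and c_female are assumptions, not the authors' implementation.

```python
import random

concept_vectors = {"male": c_male, "female": c_female}   # discovered as described above

def fair_generate(prompt, n_images):
    # Sample a concept vector uniformly for each image so societal groups
    # appear with equal probability for ambiguous prompts.
    images = []
    for _ in range(n_images):
        attr = random.choice(list(concept_vectors))
        images.append(generate(prompt, h_offset=concept_vectors[attr]))
    return images

# e.g. fair_generate("a photo of a doctor", 8) should yield a roughly balanced set
```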

Safe Generation

For prompts with implicit or explicit references to unsafe content (e.g., nudity, violence), the method learns "anti-concept" vectors (e.g., anti-sexual, anti-violence) using negative prompts. These vectors are added during inference to suppress inappropriate content while maintaining prompt fidelity (Figure 3).

Figure 3: Safe generation—using an anti-sexual concept vector to suppress nudity in images generated from ambiguous prompts.

Quantitative evaluation on the I2P benchmark shows that combining these safety vectors with existing safety mechanisms (e.g., SLD, ESD) yields further reductions in inappropriate content, with up to 40% relative improvement in nudity suppression.
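
A minimal sketch of applying anti-concept vectors at inference, reusing the hypothetical generate helper from the fair-generation example; all vector names are illustrative.

```python
# Add learned anti-concept vectors in h-space during sampling; this can be
# stacked with existing mechanisms such as SLD or ESD, which act elsewhere.
safety_offset = c_anti_sexual + c_anti_violence

def safe_generate(prompt, strength=1.0):
    return generate(prompt, h_offset=strength * safety_offset)
```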

Responsible Text-Enhancing Generation

The approach also enhances the model's ability to follow responsible instructions in prompts (e.g., "no violence"). By extracting relevant concepts from the prompt and activating the corresponding vectors during generation, the model more faithfully adheres to ethical constraints (Figure 4).

Figure 4: Responsible text-enhancing generation—activating safety concepts from the prompt to improve adherence to responsible instructions.
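
A simplified sketch of this behavior; plain keyword matching stands in for the paper's concept extraction, and all names are illustrative placeholders.

```python
KEYWORD_TO_VECTOR = {"no violence": c_anti_violence, "no nudity": c_anti_sexual}

def responsible_generate(prompt):
    # Activate the vectors whose responsible instruction appears in the prompt.
    active = [v for kw, v in KEYWORD_TO_VECTOR.items() if kw in prompt.lower()]
    offset = sum(active) if active else None
    return generate(prompt, h_offset=offset)
```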

Semantic Properties: Interpolation, Composition, and Generalization

Interpolation

Concept vectors can be scaled to interpolate the strength of an attribute in the generated image, enabling fine-grained control over semantic features (Figure 5).

Figure 5: Concept interpolation—gradually increasing the strength of a concept vector modifies the image semantics smoothly.
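
A one-line illustration of interpolation under the same assumed generate helper (c_smile is a hypothetical discovered vector).

```python
# Scaling the vector controls how strongly the attribute appears.
for alpha in (0.0, 0.5, 1.0, 1.5):
    image = generate("a photo of a face", h_offset=alpha * c_smile)
```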

Composition

Multiple concept vectors (e.g., gender, age, race) can be linearly combined to synthesize images with composite attributes, demonstrating the disentanglement and compositionality of the h-space (Figure 6).

Figure 6: Multiple concepts composition—linearly adding vectors for gender, age, and race yields images with corresponding semantics.
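
Composition then reduces to adding vectors before injection; a hedged example with hypothetical vectors and the same assumed helper.

```python
# Linear combination of independently discovered vectors yields composite attributes.
image = generate("a photo of a face", h_offset=c_female + c_old)
```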

Generalization

Concept vectors learned from one domain (e.g., "running" from dog images) generalize to other domains (e.g., cats, humans), indicating that the discovered directions capture universal semantic properties (Figure 7).

Figure 7: General semantic concepts—vectors learned for "running" and "glasses" generalize across objects and prompts.

Implementation Details and Trade-offs

  • Computational Requirements: The optimization is performed with the diffusion model frozen, requiring only gradient updates to the concept vector. Training typically converges within 10K steps on 1K synthesized images per concept.
  • Scalability: The method is scalable to arbitrary concepts and can be applied to realistic datasets (e.g., CelebA) or synthetic data.
  • Limitations: Composing many safety-related vectors can degrade image fidelity and semantic alignment. The method is relatively insensitive to the number of training samples and to prompt diversity, but extreme extrapolation of concept vectors may yield unintended artifacts.
  • Integration: The method is orthogonal to existing safety and debiasing techniques and can be combined for enhanced responsible generation.

Theoretical and Practical Implications

The findings provide evidence that ethical and semantic concepts are encoded in the internal representations of diffusion models and can be manipulated directly in the latent space. This opens avenues for interpretable, controllable, and responsible generative modeling without retraining or external supervision. The approach facilitates post-hoc editing, fairness interventions, and safety enforcement in deployed generative systems.

Future Directions

Potential extensions include:

  • Automated extraction of concept vectors for large-scale attribute libraries.
  • Dynamic adjustment of concept strengths based on user or regulatory requirements.
  • Application to multimodal and cross-domain generative models.
  • Further analysis of the geometry and disentanglement properties of the h-space.

Conclusion

This work introduces a self-supervised framework for discovering and manipulating interpretable latent directions in diffusion models, enabling responsible text-to-image generation. The approach achieves strong empirical results in fairness, safety, and prompt adherence, with minimal computational overhead and no reliance on external annotation. The demonstrated generalization and compositionality of concept vectors suggest promising directions for interpretable and ethical generative AI.
