
Reinforcement Learning from Diffusion Feedback: Q* for Image Search (2311.15648v1)

Published 27 Nov 2023 in cs.CV, cs.AI, cs.CL, cs.LG, and cs.RO

Abstract: Large vision-language models are steadily gaining personalization capabilities at the cost of fine-tuning or data augmentation. We present two models for image generation using model-agnostic learning that align semantic priors with generative capabilities. RLDF, or Reinforcement Learning from Diffusion Feedback, is a singular approach for visual imitation through prior-preserving reward function guidance. This employs Q-learning (with standard Q*) for generation and follows a semantic-rewarded trajectory for image search through finite encoding-tailored actions. The second proposed method, noisy diffusion gradient, is optimization driven. At the root of both methods is a special CFG encoding that we propose for continual semantic guidance. Using only a single input image and no text input, RLDF generates high-quality images over varied domains including retail, sports and agriculture showcasing class-consistency and strong visual diversity. Project website is available at https://infernolia.github.io/RLDF.

Authors (1)
  1. Aboli Marathe (7 papers)

Summary

  • The paper introduces RLDF, a reinforcement learning approach that uses diffusion feedback to generate semantically rich images without relying on text input.
  • It employs a novel semantic encoding via Context-Free Grammar and multi-type reward functions, achieving competitive FID and KID scores on an ImageNet clone.
  • The results demonstrate robust generalization and versatility across various domains, paving the way for efficient, adaptive image generation systems.

Reinforcement Learning from Diffusion Feedback: Semantic-driven Image Generation

In this essay, we analyze the paper "Reinforcement Learning from Diffusion Feedback: Q* for Image Search" authored by Aboli Marathe. This work addresses the challenge of generating semantically rich images by introducing novel approaches that leverage reinforcement learning (RL) in combination with model-agnostic learning paradigms. Specifically, the paper presents two methods: RLDF (Reinforcement Learning from Diffusion Feedback) and noisy diffusion gradient.

Introduction and Motivation

Recent advances in text-to-image models, especially vision-language models (VLMs), have significantly improved the quality of image generation. However, these models often require extensive fine-tuning or human intervention to personalize the generated outputs. The paper tackles this limitation by presenting RLDF, which aims to generate diverse, semantically consistent images using only a single input image, without any text guidance or additional data augmentation.

Methodology

The RLDF approach formulates the image generation task as a Markov Decision Process (MDP) where the agent navigates through an n-dimensional gridworld representing the semantic encoding space of images. This method employs Q-learning to maximize a reward function that aligns the generated images with the target semantics. The key components of RLDF are:

  1. Semantic Encoding: The paper introduces a novel encoding mechanism based on Context-Free Grammar (CFG) to compress the semantic elements of an image into a single vector. This encoding lets the RL agent navigate the semantic space effectively.
  2. Reward Functions: RLDF employs three types of reward functions to guide the agent:
    • Multi-Semantic Reward: High rewards for matching semantic elements with the ground truth.
    • Partial-Semantic Reward: Rewards focused on matching the scene semantics.
    • CLIP Reward: Rewards based on CLIP embedding similarity between generated and ground truth images.
  3. Trajectory Learning: The agent begins in a random noise state and receives rewards based on the semantic alignment of the generated image with the target. The agent's actions lead to new semantic states, iteratively refining the generation process.
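
The components above can be sketched as tabular Q-learning over a small gridworld of semantic encodings. This is a minimal illustration, not the paper's implementation: the grid sizes, the hyperparameters, the exact reward values, and the helper names (multi_semantic_reward, cosine_reward, make_env, q_learning, greedy_rollout) are all assumptions made for the example. In particular, the reward here penalizes each mismatched slot and adds a bonus on a full match, and the CLIP-style reward operates on plain lists standing in for real CLIP embeddings.

```python
import math
import random

def multi_semantic_reward(state, target):
    """Hypothetical reward: penalize each mismatched slot, add a bonus on a full match."""
    matches = sum(s == t for s, t in zip(state, target))
    bonus = 10.0 if matches == len(target) else 0.0
    return float(matches - len(target)) + bonus

def cosine_reward(generated_emb, target_emb):
    """CLIP-style reward: cosine similarity of two embedding vectors (plain lists here)."""
    dot = sum(a * b for a, b in zip(generated_emb, target_emb))
    norm = math.sqrt(sum(a * a for a in generated_emb)) * math.sqrt(sum(b * b for b in target_emb))
    return dot / norm if norm else 0.0

def make_env(sizes):
    """Actions move one step up or down along a single encoding dimension."""
    actions = [(dim, delta) for dim in range(len(sizes)) for delta in (-1, 1)]

    def step(state, action):
        dim, delta = action
        nxt = list(state)
        nxt[dim] = min(max(nxt[dim] + delta, 0), sizes[dim] - 1)  # clamp to the grid
        return tuple(nxt)

    return actions, step

def greedy_action(q, state, n_actions):
    """Index of the action with the highest current Q estimate (unseen pairs count as 0)."""
    return max(range(n_actions), key=lambda i: q.get((state, i), 0.0))

def q_learning(target, sizes, episodes=2000, max_steps=50,
               alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    actions, step = make_env(sizes)
    q = {}  # maps (state, action_index) to an estimated return
    for _ in range(episodes):
        state = tuple(rng.randrange(s) for s in sizes)  # random start state
        for _ in range(max_steps):
            if state == target:
                break
            if rng.random() < epsilon:
                a = rng.randrange(len(actions))  # explore
            else:
                a = greedy_action(q, state, len(actions))  # exploit
            nxt = step(state, actions[a])
            reward = multi_semantic_reward(nxt, target)
            future = 0.0 if nxt == target else max(
                q.get((nxt, i), 0.0) for i in range(len(actions)))
            old = q.get((state, a), 0.0)
            # Standard Q-learning update toward reward + discounted best next value.
            q[(state, a)] = old + alpha * (reward + gamma * future - old)
            state = nxt
    return q

def greedy_rollout(q, start, target, sizes, max_steps=50):
    """Follow the learned greedy policy from start and return the visited states."""
    actions, step = make_env(sizes)
    path = [start]
    while path[-1] != target and len(path) <= max_steps:
        a = greedy_action(q, path[-1], len(actions))
        path.append(step(path[-1], actions[a]))
    return path

if __name__ == "__main__":
    q_table = q_learning(target=(2, 1), sizes=(3, 3))
    print(greedy_rollout(q_table, (0, 0), (2, 1), (3, 3)))
```

In a full system, each state would be decoded into a prompt or conditioning signal for a diffusion model, and the reward would be computed from the generated image rather than from the encoding directly; that decoding step is omitted here.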

Additionally, the noisy diffusion gradient method computes gradients directly on the semantic encodings to optimize image generation. Though it lacks guaranteed convergence and may struggle under noisy signals, it offers an alternative optimization pathway.
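
To make the idea of optimizing an encoding under a noisy signal concrete, the following sketch runs gradient ascent with finite-difference gradients estimated from a noisy reward. It is an assumption-laden stand-in, not the paper's noisy diffusion gradient: the quadratic reward, the central-difference estimator, and the names noisy_reward and noisy_gradient_ascent are all hypothetical choices for illustration.

```python
import random

def noisy_reward(encoding, target, rng, noise):
    """Hypothetical reward: negative squared distance to the target plus Gaussian noise."""
    clean = -sum((e - t) ** 2 for e, t in zip(encoding, target))
    return clean + rng.gauss(0.0, noise)

def noisy_gradient_ascent(start, target, steps=200, lr=0.1, eps=0.5, noise=0.1, seed=0):
    """Climb the noisy reward using central finite differences on each coordinate."""
    rng = random.Random(seed)
    x = [float(v) for v in start]
    for _ in range(steps):
        grad = []
        for i in range(len(x)):
            up, down = x[:], x[:]
            up[i] += eps
            down[i] -= eps
            diff = noisy_reward(up, target, rng, noise) - noisy_reward(down, target, rng, noise)
            grad.append(diff / (2 * eps))
        # Ascent step: move each coordinate along its estimated gradient.
        x = [xi + lr * g for xi, g in zip(x, grad)]
    return x

if __name__ == "__main__":
    print(noisy_gradient_ascent([0.0, 5.0], [2.0, 1.0]))
```

With noise set to 0.0 the iterates contract onto the target; with larger noise they only hover around it, which mirrors the convergence caveat noted above.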

Results

The paper evaluates the RLDF model extensively across various domains, demonstrating its versatility and robustness. Notable results include:

  • ImageNet Cloning: RLDF generated a synthetic ImageNet clone with approximately 1.5 million images across 1000 classes, achieving high semantic fidelity.
  • Generalization: The model showed strong generalization capabilities across different object classes and action spaces, producing semantically diverse and photorealistic images.
  • Evaluation Metrics: The RLDF-generated ImageNet clone achieved competitive FID and KID scores, indicating its effectiveness compared to existing baselines.
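
For readers unfamiliar with KID, it is commonly defined as an unbiased estimate of the squared maximum mean discrepancy between feature sets, using the polynomial kernel k(x, y) = (x·y/d + 1)^3. The sketch below computes that estimate on small synthetic feature vectors. In practice the features come from a pretrained Inception network; the Gaussian samples and the helper names (poly_kernel, kid, sample) are assumptions for illustration, and this is not the paper's evaluation code.

```python
import random

def poly_kernel(x, y):
    """Cubic polynomial kernel used by KID, with d the feature dimension."""
    d = len(x)
    return (sum(a * b for a, b in zip(x, y)) / d + 1.0) ** 3

def kid(real, fake):
    """Unbiased squared-MMD estimate between two lists of feature vectors."""
    m, n = len(real), len(fake)
    k_rr = sum(poly_kernel(real[i], real[j]) for i in range(m) for j in range(m) if i != j)
    k_ff = sum(poly_kernel(fake[i], fake[j]) for i in range(n) for j in range(n) if i != j)
    k_rf = sum(poly_kernel(r, f) for r in real for f in fake)
    return k_rr / (m * (m - 1)) + k_ff / (n * (n - 1)) - 2.0 * k_rf / (m * n)

def sample(mean, count, dim, rng):
    """Synthetic stand-in for Inception features: unit-variance Gaussian vectors."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(count)]

if __name__ == "__main__":
    rng = random.Random(0)
    real = sample(0.0, 40, 4, rng)
    similar = sample(0.0, 40, 4, rng)
    shifted = sample(3.0, 40, 4, rng)
    print("same distribution:", kid(real, similar))
    print("shifted distribution:", kid(real, shifted))
```

Lower values indicate that the generated features are closer in distribution to the real ones, so the shifted set should score far higher than the matched set.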

Implications and Future Work

The RLDF approach has implications for both practical applications and theory in AI-driven image generation. By removing the need for text input and fine-tuning, it opens avenues for more efficient and adaptable image generation systems. Future work could explore:

  • Integration with Advanced Text-to-Image (TTI) Models: Enhancing RLDF with more advanced text-to-image models to further improve generation quality.
  • Computational Efficiency: Addressing the computational costs associated with larger environments and exploring optimization techniques to reduce training time.
  • Subject Consistency: Investigating methods to enhance subject consistency while maintaining class-consistency.

Conclusion

The paper "Reinforcement Learning from Diffusion Feedback: Q* for Image Search" introduces a novel and effective approach for semantic-driven image generation. By leveraging reinforcement learning and diffusion feedback, RLDF generates high-quality, diverse images while mitigating traditional dependencies on text guidance and fine-tuning. This work signifies a meaningful contribution to the field, providing a foundation for future research and practical advancements in AI-based image generation technologies.
