Reinforcement Learning from Diffusion Feedback: Q* for Image Search (2311.15648v1)
Abstract: Large vision-language models are steadily gaining personalization capabilities at the cost of fine-tuning or data augmentation. We present two models for image generation using model-agnostic learning that align semantic priors with generative capabilities. RLDF, or Reinforcement Learning from Diffusion Feedback, is a singular approach for visual imitation through prior-preserving reward function guidance. It employs Q-learning (with standard Q*) for generation and follows a semantic-rewarded trajectory for image search through finite encoding-tailored actions. The second proposed method, noisy diffusion gradient, is optimization-driven. At the root of both methods is a special CFG encoding that we propose for continual semantic guidance. Using only a single input image and no text input, RLDF generates high-quality images over varied domains including retail, sports, and agriculture, showcasing class consistency and strong visual diversity. The project website is available at https://infernolia.github.io/RLDF.
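For readers unfamiliar with the Q-learning component mentioned in the abstract, the following is a minimal sketch of standard tabular Q-learning over a finite action set. It is not the RLDF implementation: the one-dimensional chain environment, the goal-reaching reward (a stand-in for a semantic reward), and all names and hyperparameters (`q_learning_chain`, `greedy_path`, `alpha`, `gamma`, `epsilon`) are assumptions chosen purely for illustration.

```python
import random

# Hypothetical toy setup: states 0..n_states-1 on a line, two actions
# (step left, step right), reward 1.0 only when the target state is reached.
MOVES = (-1, +1)


def q_learning_chain(n_states=6, target=5, episodes=500, alpha=0.5,
                     gamma=0.9, epsilon=0.2, max_steps=50, seed=0):
    """Learn a Q-table with the standard one-step Q-learning update."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            # Epsilon-greedy action choice; ties are broken at random.
            if rng.random() < epsilon or q[s][0] == q[s][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            # Deterministic transition, clipped to the ends of the chain.
            s_next = min(max(s + MOVES[a], 0), n_states - 1)
            done = s_next == target
            reward = 1.0 if done else 0.0
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            bootstrap = 0.0 if done else gamma * max(q[s_next])
            q[s][a] += alpha * (reward + bootstrap - q[s][a])
            s = s_next
            if done:
                break
    return q


def greedy_path(q, start=0, target=5, max_steps=20):
    """Follow the greedy policy induced by the learned Q-table."""
    path = [start]
    s = start
    while s != target and len(path) <= max_steps:
        a = 0 if q[s][0] > q[s][1] else 1
        s = min(max(s + MOVES[a], 0), len(q) - 1)
        path.append(s)
    return path


q_table = q_learning_chain()
print(greedy_path(q_table))
```

In this toy setting the greedy policy read off the learned table traces a reward-seeking trajectory through a finite action space, which is the general pattern the abstract describes; how RLDF defines its states, actions, and semantic reward is specified in the paper itself.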