Reinforcement Learning from Diffusion Feedback: Q* for Image Search (2311.15648v1)
Abstract: Large vision-language models are steadily gaining personalization capabilities at the cost of fine-tuning or data augmentation. We present two models for image generation using model-agnostic learning that align semantic priors with generative capabilities. RLDF, or Reinforcement Learning from Diffusion Feedback, is a singular approach for visual imitation through prior-preserving reward function guidance. It employs Q-learning (with standard Q*) for generation and follows a semantic-rewarded trajectory for image search through finite encoding-tailored actions. The second proposed method, noisy diffusion gradient, is optimization driven. At the root of both methods is a special CFG encoding that we propose for continual semantic guidance. Using only a single input image and no text input, RLDF generates high-quality images over varied domains including retail, sports, and agriculture, showcasing class-consistency and strong visual diversity. The project website is available at https://infernolia.github.io/RLDF.
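The abstract describes Q-learning over a finite set of encoding-tailored actions, following a semantically rewarded trajectory. As a rough illustration only, the sketch below runs tabular Q-learning on a toy problem: the state is a short tuple of bits standing in for a finite semantic encoding, each action flips one bit, and the reward is the change in agreement with a target encoding. The bit-vector encoding, the Hamming-style reward, the hyperparameters, and the names `q_learning` and `greedy_search` are all assumptions made for this example; the reward in RLDF comes from diffusion feedback, which is not modeled here.

```python
import random


def flip(state, i):
    """Return a copy of `state` with bit i flipped."""
    return state[:i] + (1 - state[i],) + state[i + 1:]


def q_learning(target, episodes=500, steps=10, alpha=0.5, gamma=0.5,
               eps=0.2, seed=0):
    """Tabular Q-learning on a toy encoding-search problem (illustrative).

    Update rule: Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).
    Unvisited (state, action) pairs are treated as having value 0.
    """
    rng = random.Random(seed)
    target = tuple(target)
    n = len(target)
    q = {}  # maps (state, action) -> estimated value

    def similarity(state):
        # Toy stand-in for a semantic reward: number of matching bits.
        return sum(s == t for s, t in zip(state, target))

    def best(state):
        # Greedy action under the current Q estimates.
        return max(range(n), key=lambda a: q.get((state, a), 0.0))

    for _ in range(episodes):
        state = tuple(rng.randint(0, 1) for _ in range(n))
        for _ in range(steps):
            if state == target:
                break
            # Epsilon-greedy exploration over the finite action set.
            action = rng.randrange(n) if rng.random() < eps else best(state)
            nxt = flip(state, action)
            reward = similarity(nxt) - similarity(state)
            future = 0.0 if nxt == target else max(
                q.get((nxt, a), 0.0) for a in range(n))
            old = q.get((state, action), 0.0)
            q[(state, action)] = old + alpha * (reward + gamma * future - old)
            state = nxt
    return q, best


def greedy_search(start, target, best, max_steps=20):
    """Follow the learned greedy policy from `start` toward `target`."""
    state, target = tuple(start), tuple(target)
    path = [state]
    for _ in range(max_steps):
        if state == target:
            break
        state = flip(state, best(state))
        path.append(state)
    return path


if __name__ == "__main__":
    goal = (1, 0, 1, 1)
    _, policy = q_learning(goal)
    for step in greedy_search((0, 0, 0, 0), goal, policy):
        print(step)
```

In this toy setting the greedy rollout plays the role of the "search" phase: once the table is learned, each step picks the action with the highest estimated value until the target encoding is reached.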
Author: Aboli Marathe