The Journey, Not the Destination: How Data Guides Diffusion Models (2312.06205v1)

Published 11 Dec 2023 in cs.CV and cs.LG

Abstract: Diffusion models trained on large datasets can synthesize photo-realistic images of remarkable quality and diversity. However, attributing these images back to the training data (that is, identifying the specific training examples that caused an image to be generated) remains a challenge. In this paper, we propose a framework that: (i) provides a formal notion of data attribution in the context of diffusion models, and (ii) allows us to counterfactually validate such attributions. Then, we provide a method for computing these attributions efficiently. Finally, we apply our method to find (and evaluate) such attributions for denoising diffusion probabilistic models trained on CIFAR-10 and latent diffusion models trained on MS COCO. We provide code at https://github.com/MadryLab/journey-TRAK.
