The Role of ImageNet Classes in Fréchet Inception Distance (2203.06026v3)

Published 11 Mar 2022 in cs.CV, cs.AI, cs.LG, cs.NE, and stat.ML

Abstract: Fréchet Inception Distance (FID) is the primary metric for ranking models in data-driven generative modeling. While remarkably successful, the metric is known to sometimes disagree with human judgement. We investigate a root cause of these discrepancies, and visualize what FID "looks at" in generated images. We show that the feature space that FID is (typically) computed in is so close to the ImageNet classifications that aligning the histograms of Top-N classifications between sets of generated and real images can reduce FID substantially -- without actually improving the quality of results. Thus, we conclude that FID is prone to intentional or accidental distortions. As a practical example of an accidental distortion, we discuss a case where an ImageNet pre-trained FastGAN achieves a FID comparable to StyleGAN2, while being worse in terms of human evaluation.

Citations (174)

Summary

  • The paper demonstrates that FID's sensitivity to ImageNet classes can mislead evaluations by reflecting changes in class distribution rather than genuine image quality improvements.
  • It employs Grad-CAM visualizations and histogram matching to reveal how strongly FID outcomes depend on Inception-V3 features that align with ImageNet classes.
  • The study proposes resampling generated data to match real-world class distributions, urging researchers to adopt more robust metrics alongside FID.

Analyzing the Impact of ImageNet Classes on the Fréchet Inception Distance (FID) Metric

The paper investigates the susceptibility of the Fréchet Inception Distance (FID), a widely used metric for evaluating generative models, to variations in ImageNet class distributions. It asks whether reported FID improvements reflect genuine gains in image quality or merely shifts in class distribution that are unrelated to perceptual quality.
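
For context, FID fits a multivariate Gaussian to the Inception-V3 pool features of each image set and computes the Fréchet distance between the two Gaussians. Below is a minimal sketch, assuming 2048-dimensional pool features have already been extracted (function and variable names are illustrative, not the paper's code):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to two feature sets.

    FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2}),
    where feats_* are (N, 2048) Inception-V3 pool features.
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)

    # Matrix square root of the covariance product; numerical error can
    # leave small imaginary components, which we drop.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    covmean = covmean.real

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```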

Key Findings and Insights

The authors first illustrate the sensitivity of FID to ImageNet classifications using Grad-CAM. They demonstrate that FID's feature space, derived from Inception-V3, closely aligns with ImageNet classifications, so FID can be heavily influenced by shifts in the distribution of ImageNet classes in generated datasets rather than by changes in image quality alone.
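
The paper adapts this visualization machinery to FID itself, backpropagating through the metric. As a hedged illustration of the underlying technique only, the sketch below shows standard Grad-CAM on an Inception-V3 class logit (module names follow torchvision's Inception-V3; this is not the authors' exact setup):

```python
import torch
from torchvision.models import inception_v3, Inception_V3_Weights

weights = Inception_V3_Weights.DEFAULT
model = inception_v3(weights=weights).eval()  # expects 299x299 inputs,
# preprocessed e.g. with weights.transforms()

activations, gradients = {}, {}

def fwd_hook(module, args, output):
    activations["feat"] = output

def bwd_hook(module, grad_input, grad_output):
    gradients["feat"] = grad_output[0]

# Hook the last mixed block before global average pooling.
model.Mixed_7c.register_forward_hook(fwd_hook)
model.Mixed_7c.register_full_backward_hook(bwd_hook)

def grad_cam(image_batch, class_idx):
    """Coarse heatmap of where the evidence for class_idx lives."""
    logits = model(image_batch)           # eval mode: plain logits
    logits[:, class_idx].sum().backward()
    acts, grads = activations["feat"], gradients["feat"]
    channel_w = grads.mean(dim=(2, 3), keepdim=True)  # channel importance
    cam = torch.relu((channel_w * acts).sum(dim=1))   # (N, H, W)
    return cam / cam.amax(dim=(1, 2), keepdim=True).clamp(min=1e-8)
```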

Significant observations include:

  • Localization of Attention: Grad-CAM visualizations show that FID is often more responsive to image regions associated with the top ImageNet classifications than to the perceived quality or fidelity of the generated image as a whole.
  • Histogram Matching Impact: A controlled experiment that matches the top-1 ImageNet class distribution between generated and real datasets yields a substantial reduction in FID, showing that the metric can be lowered without any improvement in perceptual quality.
  • Resampling Strategy: The authors propose a resampling technique that adjusts the distribution of generated samples to align more closely with the ImageNet class distribution of the real dataset (see the sketch after this list). This leads to drastic reductions in FID, although other perceptual measures show no corresponding improvement, revealing a significant perceptual null space in FID.
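
To make the resampling idea concrete, here is a minimal sketch, assuming top-1 class labels have already been obtained from an ImageNet classifier (the function name and details are hypothetical, not the authors' code):

```python
import numpy as np

def resample_indices(gen_top1, real_top1, seed=0):
    """Pick indices into the generated set so that its top-1 ImageNet
    class histogram matches the real set's (illustrative sketch; the
    paper's exact procedure may differ in detail).

    gen_top1, real_top1: integer arrays of top-1 class ids.
    """
    rng = np.random.default_rng(seed)
    n = len(gen_top1)
    classes, counts = np.unique(real_top1, return_counts=True)
    quotas = np.rint(counts / counts.sum() * n).astype(int)

    chosen = []
    for cls, quota in zip(classes, quotas):
        pool = np.flatnonzero(gen_top1 == cls)
        if quota == 0 or len(pool) == 0:
            continue  # class missing from one of the two sets
        # Sample with replacement when the generator under-produces it.
        chosen.append(rng.choice(pool, size=quota, replace=len(pool) < quota))
    return np.concatenate(chosen)
```

Evaluating FID on the resampled subset rather than the full generated set is what produces the drop in score without any change to the images themselves.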

Implications for AI and Generative Models

The practical implications are significant: FID should be used cautiously, particularly when comparing models with different architectures or when ImageNet pre-trained components are involved. Because FID depends on ImageNet class distributions, generative models can be optimized toward lower FID scores without actual improvements in image quality.

Specifically, ImageNet pre-trained components in GANs might artificially improve (i.e., lower) FID by better replicating ImageNet-like class statistics rather than by achieving true visual realism. This complicates the use of FID as a sole measure of performance and motivates the incorporation of alternative metrics such as CLIP-based distances, which are less biased towards ImageNet features.
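
One common variant computes the same Fréchet distance in a CLIP embedding space (often called CLIP-FID). A sketch assuming the open_clip package, with the backbone and whether to L2-normalize the embeddings as example choices rather than anything prescribed by the paper:

```python
import torch
import open_clip

# Example CLIP backbone; the model/pretraining tag is illustrative.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
model.eval()

@torch.no_grad()
def clip_features(pil_images):
    """Encode PIL images into L2-normalized CLIP image embeddings."""
    batch = torch.stack([preprocess(img) for img in pil_images])
    feats = model.encode_image(batch)
    return torch.nn.functional.normalize(feats, dim=-1).cpu().numpy()

# The frechet_distance() sketch above can then be evaluated on
# clip_features(real_images) vs. clip_features(generated_images).
```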

Future Research Directions

The paper opens opportunities for future research on developing more robust and comprehensive metrics that mitigate the dependency on predefined class distributions. Such metrics should ideally align closely with human perceptual evaluations and be resilient to both deliberate and inherent biases introduced by particular feature spaces.

Furthermore, exploring alternative architectures and datasets for comparative analysis may reveal new insights into both metric reliability and model performance under varied conditions.

In conclusion, by highlighting the critical role of embedded class distributions in determining FID scores, this paper offers valuable guidance for researchers to enhance their evaluation frameworks for generative models, ensuring more reliable measurement of genuine model improvements.
