
HPBench: Human & HPO Benchmarks

Updated 9 March 2026
  • HPBench is a benchmark suite comprising two main axes: human perception of AI-generated images and large-scale, reproducible black-box hyperparameter optimization (HPO) evaluations.
  • In the human perception study, controlled evaluations demonstrate that humans achieve around 61.3% accuracy while automated models reach up to 87% accuracy, highlighting critical detection challenges.
  • The HPO-B variant provides a reproducible framework with millions of evaluations across 176 algorithms and 196 datasets, facilitating rigorous performance comparisons in HPO research.

HPBench is a term used for distinct benchmark suites in the machine learning literature, each tailored to fundamentally different problems but unified by rigorous evaluation protocols and focus on human or algorithmic performance. The two principal applications of the HPBench name are: (1) human perception of AI-generated images, and (2) large-scale reproducible benchmarks for black-box hyperparameter optimization (HPO), the latter referred to as "HPO-B" in some sources. The following exposition focuses on these two axes, their construction, protocols, evaluation measures, and major findings, referencing foundational works as appropriate.

1. Human Perception Benchmark: Foundations and Objectives

HPBench as introduced in "Seeing is not always believing: Benchmarking Human and Model Perception of AI-Generated Images" (Lu et al., 2023) is established to quantify the capability of human subjects to discriminate state-of-the-art AI-generated images from natural photographs. This arises from a critical need to empirically monitor human vulnerability as generative models (GANs, diffusion, autoregressive) approach photorealism and thus pose risks for misinformation and authenticity crises.

Complementing the model-centric MPBench, HPBench is specifically structured for controlled, large-scale human evaluation, replacing legacy small-scale or ad hoc protocols that lack statistical reliability for modern photorealistic AIGC.

2. Dataset Construction and Sampling

The image corpus underpinning HPBench is a subset curated from the larger Fake2M dataset, which comprises approximately 2 million AI-generated images, each generated by either text-to-image diffusion models (e.g., Stable Diffusion v1.5 Realistic Vision V2.0, IF v1.0), or GANs (e.g., StyleGAN3 on FFHQ, MetFaces, AFHQv2), as well as real-photo counterparts from the Conceptual Captions (CC3M) and other domain-matched datasets.

For the human evaluation, strict filtering was applied:

  • Only high-quality, near-photorealistic AI-generated samples passed expert review (obvious artifacts discarded).
  • Real and AI images were matched by prompt (e.g., "portrait of a woman, 8 K HDR, photographic, very detailed") to enforce class-conditional balance.
  • Eight semantic categories were created: Multiperson, Landscape, Man, Woman, Record, Plant, Animal, Object.
  • The HPBench set comprised 151 AI images and 244 real images, with per-category balancing (see Table 1).
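
Table 1. Number of AI-generated and real images per semantic category in HPBench.

| Category | AI Images | Real Images |
|---|---|---|
| Multiperson | 10 | 12 |
| Landscape | 27 | 26 |
| Man | 17 | 44 |
| Woman | 30 | 49 |
| Record | 15 | 21 |
| Plant | 13 | 18 |
| Animal | 29 | 53 |
| Object | 10 | 21 |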

3. Experimental Protocol and Evaluation Metrics

Human Evaluation Procedure

  • Fifty participants (diverse in age and generative model familiarity) viewed 100 randomly interleaved images (50 real, 50 AI) from the curated set.
  • For each image, subjects indicated "Real" vs. "AI-Generated". Upon "AI" choice, they attributed detection to one or more defect classes: Detail, Smooth, Blur, Color, Shadow/Light, Daub (smearing), Rationality, Intuition.
  • Trials were proctored and no time limit was imposed, with average response time ≈18 s/image.
  • Participants were blinded to the true class proportion.
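
The trial construction described above can be sketched in a few lines of Python. This is only an illustration: the function name `build_trial_sequence`, the `(category, label, index)` image identifiers, and the choice of uniform sampling without replacement across categories are assumptions, since the source does not say whether draws were stratified by category.

```python
import random

# Per-category pool sizes taken from Table 1: (AI images, real images).
POOL = {
    "Multiperson": (10, 12), "Landscape": (27, 26), "Man": (17, 44),
    "Woman": (30, 49), "Record": (15, 21), "Plant": (13, 18),
    "Animal": (29, 53), "Object": (10, 21),
}

def build_trial_sequence(seed, n_per_class=50):
    """Draw n_per_class AI and n_per_class real images, then interleave them at random."""
    rng = random.Random(seed)
    # Hypothetical image identifiers of the form (category, label, index).
    ai = [(c, "ai", i) for c, (n_ai, _) in POOL.items() for i in range(n_ai)]
    real = [(c, "real", i) for c, (_, n_real) in POOL.items() for i in range(n_real)]
    # Assumption: uniform sampling without replacement, not stratified by category.
    trials = rng.sample(ai, n_per_class) + rng.sample(real, n_per_class)
    rng.shuffle(trials)  # random interleaving hides the class order from the participant
    return trials

trials = build_trial_sequence(seed=0)
print(len(trials), trials[:3])
```

Seeding the generator per participant keeps each sequence reproducible while still differing between participants.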

Quantitative Metrics

Let $TP$ = true positives (AI correctly labeled), $TN$ = true negatives (real correctly labeled), $FP$ = false positives (real labeled as AI), $FN$ = false negatives (AI labeled as real). For $N$ images:

  • Accuracy: $\mathrm{Acc} = \dfrac{TP + TN}{N}$
  • Misclassification Rate: $\mathrm{MisRate} = 1 - \mathrm{Acc} = \dfrac{FP + FN}{N}$
  • Precision: $P = \dfrac{TP}{TP + FP}$
  • Recall: $R = \dfrac{TP}{TP + FN}$
  • False Omission Rate (FOR): $\mathrm{FOR} = \dfrac{FN}{FN + TN}$

Metrics are also computed per semantic category.
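
A minimal sketch of these metrics, treating "AI-generated" as the positive class as in the definitions above. The function name `detection_metrics` and the example counts are hypothetical and are not taken from the study.

```python
def detection_metrics(tp, tn, fp, fn):
    """Compute the metrics defined above from confusion counts (AI = positive class)."""
    n = tp + tn + fp + fn
    nan = float("nan")  # returned when a denominator is zero
    return {
        "accuracy": (tp + tn) / n,
        "mis_rate": (fp + fn) / n,
        "precision": tp / (tp + fp) if tp + fp else nan,
        "recall": tp / (tp + fn) if tp + fn else nan,
        "for": fn / (fn + tn) if fn + tn else nan,
    }

# Hypothetical counts for one participant who saw 50 AI and 50 real images.
print(detection_metrics(tp=28, tn=33, fp=17, fn=22))
```

Per-category figures follow by calling the same function on the counts restricted to one semantic category.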

4. Empirical Results and Analysis

Human Performance

  • Global accuracy: 61.34% (38.66% misclassification).
  • Real images: 66.9% correct. AI images: 55.8% correct.
  • Category-wise accuracies: Multiperson 67.5%, Landscape 56.5%, Object 50.8%.
  • AI-generation experience confers only a modest boost (+3.7 percentage points in AI image detection).
  • When correct on "AI", the most cited detectable artifacts are: Detail (28%), Smoothness (17%), Blur (12%), and Intuition (14%).
  • Portraits (human faces and multiperson scenes) are detected significantly above random, while inanimate objects cause greater confusion.
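
As a quick consistency check on these figures, and assuming every participant saw the balanced split of 50 AI and 50 real images described in the protocol, the global accuracy is the unweighted mean of the two per-class accuracies:

$$
\mathrm{Acc} = \frac{TP + TN}{N} = \frac{50 \times 0.558 + 50 \times 0.669}{100} = 0.6135
$$

This agrees with the reported 61.34% up to rounding of the per-class values.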

Model versus Human Detection

  • State-of-the-art automated models (best: ConvNext-S) achieve 87% accuracy (i.e., 13% error) on the same HPBench set, far outperforming humans.
  • Model performance varies with architecture (ConvNet vs. CLIP), training data, and data augmentations.
  • Automated detection is substantially more reliable than unaided human inspection under these conditions, but nontrivial error still persists on previously unseen generator settings.

5. Implications, Limitations, and Directions

The HPBench study concludes that, as of 2023, high-quality AI-generated images lead humans to misclassify at a rate of approximately 40%, well above desirable thresholds for robust authenticity assessment. Automated models reduce the error to about 13% on the same set, roughly a third of the human rate, but remain imperfect, especially when generalizing to new generative techniques or distributions.

Analysis suggests that gross visible artifacts (e.g., anatomical oddities, over-smoothing, logical inconsistencies) are critical for detection but as models improve, these cues diminish, further challenging human scrutiny.

Recommendations include:

  • Developing generator-agnostic, robust detection models—possibly by freezing representation layers or combining model predictions with human-in-the-loop pipelines.
  • Expanding both the dataset scope (covering more generators, domains) and evaluation cohorts (cross-disciplinary, cross-cultural).
  • Exploring AIGC to synthetically create adversarial edge cases for detector training.

This body of work provides a critical quantitative baseline for both the evaluation of human visual reliability in the context of AIGC and for the design and assessment of automated detection systems (Lu et al., 2023).


6. HPBench (HPO-B): Large-Scale Black-Box Hyperparameter Optimization Benchmark

The term "HPBench" also appears as an alternative label for "HPO-B", a large-scale and reproducible testbed for black-box HPO, as discussed in "HPO-B: A Large-Scale Reproducible Benchmark for Black-Box HPO based on OpenML" (Arango et al., 2021).

Benchmark Construction

  • Constituents: $m$ search spaces (algorithms) over hyperparameter domains $\Theta_i \subset \mathbb{R}^{d_i}$, each paired with $n$ OpenML datasets.
  • Raw runs: For each $(A_i, D_j)$, evaluated configurations $(\theta, y)$ are sampled and recorded, yielding a sparse evaluation matrix $M \in \mathbb{R}^{m \times n}$, where $M_{i,j}$ stores the best observed accuracy.
  • Scale: $176$ algorithms $\times$ $196$ datasets, totaling approximately $6.4$ million evaluations.

Data Preprocessing

  • Tasks with fewer than 5 runs, and duplicate tasks, are removed.
  • Hyperparameters are canonicalized: categoricals are one-hot encoded, missing dimensions are zero-imputed (with an added missingness indicator), and constant features are dropped.
  • Log-scale variables (e.g., learning rates) are transformed by $\log_{10}$.
  • Features are finally rescaled to $[0,1]$.
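
The preprocessing steps above can be illustrated with a small pure-Python sketch. It is not the benchmark's actual pipeline: the function name `preprocess`, its arguments, and the example hyperparameters are hypothetical, and the order of operations (log transform, then zero-imputation, then dropping constants, then rescaling) is an assumption.

```python
import math

def preprocess(configs, numeric_keys, log_keys, categorical):
    """Turn a list of hyperparameter dicts into rows of features scaled to [0, 1].

    configs:      non-empty list of dicts; a missing key means the dimension is absent
    numeric_keys: ordered names of numeric hyperparameters
    log_keys:     subset of numeric_keys transformed with log10
    categorical:  dict mapping a name to its list of allowed levels
    """
    columns = []
    for key in numeric_keys:
        values, missing = [], []
        for cfg in configs:
            v = cfg.get(key)
            missing.append(1.0 if v is None else 0.0)   # missingness indicator
            if v is None:
                values.append(0.0)                      # zero-imputation
            elif key in log_keys:
                values.append(math.log10(v))            # log-scale variable
            else:
                values.append(float(v))
        columns += [values, missing]
    for key, levels in categorical.items():             # one-hot encoding
        for level in levels:
            columns.append([1.0 if cfg.get(key) == level else 0.0 for cfg in configs])
    # Drop constant features, then min-max rescale each remaining column.
    columns = [col for col in columns if max(col) > min(col)]
    scaled = []
    for col in columns:
        lo, hi = min(col), max(col)
        scaled.append([(v - lo) / (hi - lo) for v in col])
    return [list(row) for row in zip(*scaled)]

# Hypothetical configurations; the third one has no learning rate recorded.
configs = [
    {"lr": 0.001, "depth": 3, "kernel": "rbf"},
    {"lr": 0.1, "depth": 3, "kernel": "linear"},
    {"depth": 3, "kernel": "rbf"},
]
X = preprocess(configs, ["lr", "depth"], {"lr"}, {"kernel": ["rbf", "linear"]})
print(X)
```

In this example `depth` and its indicator are constant across the three configurations, so both columns are dropped before rescaling.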

Benchmark Variants

  • HPO-B-v1: full corpus for heterogeneous transfer.
  • HPO-B-v2: $16$ frequent search spaces on $101$ datasets for non-transfer HPO.
  • HPO-B-v3: v2 subset with fixed train/validation/test splits per algorithm for transfer HPO, with warm-start seeds for reproducibility.

Evaluation Protocol

  • For each test task $(A_i, D_j)$, algorithms are provided with five warm-start seeds (each comprising five initial points).
  • Evaluation strictly queries the stored $y_k$ for the nearest pre-computed $x_k$ in the finite run set.
  • A continuous surrogate (XGBoost model) enables interpolation for arbitrary $\theta$.
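
The nearest-query rule can be sketched as a lookup over the finite set of stored runs. The function name `query_stored`, the use of Euclidean distance, and the example values are assumptions made for illustration; the source does not specify the distance measure.

```python
import math

def query_stored(theta, stored_configs, stored_values):
    """Return (y_k, k) for the stored configuration x_k nearest to theta.

    Assumption: Euclidean distance on the preprocessed [0, 1] features.
    """
    k = min(range(len(stored_configs)),
            key=lambda i: math.dist(theta, stored_configs[i]))
    return stored_values[k], k

# Hypothetical stored runs for a single (algorithm, dataset) task.
stored_configs = [(0.0, 0.0), (1.0, 1.0), (0.5, 0.2)]
stored_values = [0.70, 0.90, 0.80]

y, k = query_stored((0.9, 0.8), stored_configs, stored_values)
print(y, k)
```

Because only recorded responses are ever returned, two optimizers that propose the same configuration always observe the same value, which is what makes runs comparable across papers.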

Metrics

  • Simple regret: $r_T^{(i,j)} = f(A_i, \theta^*_T; D_j) - f^*_{i,j}$ (where $f^*_{i,j}$ is the best recorded value and $f^\dagger_{i,j}$ the worst).
  • Normalized regret: $\tilde r_T^{(i,j)} = \dfrac{f(A_i, \theta^*_T; D_j) - f^*_{i,j}}{f^\dagger_{i,j} - f^*_{i,j}}$
  • Aggregation over test tasks and seeds uses either mean regret or average rank.
  • For transfer HPO, transfer gain: $\Delta_{\mathrm{transfer}}(T)$ expresses simple regret improvement over from-scratch baselines.
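
The regret definitions above translate directly into code. The sketch below uses hypothetical function names and made-up values. The sign of the simple regret depends on whether $f$ is treated as a loss or as a score, whereas the normalized form is nonnegative under either convention because numerator and denominator change sign together.

```python
def simple_regret(f_incumbent, f_best):
    """r_T = f(theta*_T) - f*, exactly as written in the definition above."""
    return f_incumbent - f_best

def normalized_regret(f_incumbent, f_best, f_worst):
    """Simple regret divided by the range between worst and best recorded values."""
    return (f_incumbent - f_best) / (f_worst - f_best)

def mean_regret(regrets):
    """Aggregate per-task (or per-seed) regrets by their arithmetic mean."""
    return sum(regrets) / len(regrets)

# Hypothetical task where f is an accuracy: best 0.95, worst 0.55, incumbent 0.85.
as_score = normalized_regret(0.85, f_best=0.95, f_worst=0.55)
# The same task expressed as an error rate (1 - accuracy).
as_loss = normalized_regret(0.15, f_best=0.05, f_worst=0.45)
print(as_score, as_loss, mean_regret([as_score, as_loss]))
```

Both calls describe the same incumbent and yield the same normalized regret, which is why the normalized form is convenient for averaging across tasks with different scales.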

Summary of Design Choices

  • Scale and diversity: comprehensive coverage of algorithm/dataset pairs for classical tabular supervised learning.
  • Sparsity: most $(A_i, D_j)$ pairs are under-sampled, mirroring real-world HPO data limitations.
  • Full reproducibility: datasets, seeds, splits, and metrics are fixed and openly released.
  • Extensibility: while focused on classical machine learning, surrogates enable extensions to continuous optimization; later work may target multi-fidelity HPO and deep learning (Arango et al., 2021).

7. Synoptic Comparison

| Aspect | Human Perception HPBench (Lu et al., 2023) | Black-Box HPO HPBench/HPO-B (Arango et al., 2021) |
|---|---|---|
| Domain | Human detection of AIGC images | Hyperparameter optimization |
| Scale | 395 images (151 fake, 244 real) × 50 subjects | 176 algorithms × 196 datasets, 6.4M evaluations |
| Data structure | Human choices, per-image judgments | $(\theta, y)$ runs per (algorithm, dataset) pair |
| Metric examples | Accuracy, misclassification, FOR, category stats | Simple regret, normalized regret, transfer gain |
| Protocol | Controlled lab, blinded, explained judgments | Fixed seeds, splits, nearest-query protocol |
| Main finding | 61.3% human accuracy, 87% SOTA model accuracy | Enables fair benchmarking/reproducibility |

8. Conclusion

HPBench is a name associated with rigorous, scalable, and reproducible benchmark suites in machine learning research. In human perception, it exposes the limits of human ability to visually authenticate images as AIGC progresses. In HPO, it operationalizes community standards for evaluating and comparing HPO algorithms with explicit protocols, metrics, and data organization. In both contexts, HPBench strengthens the empirical foundations on which new algorithms and defense mechanisms can be built, and it motivates further meta-research into human and algorithmic performance (Lu et al., 2023; Arango et al., 2021).
