HPBench: Human & HPO Benchmarks

Updated 9 March 2026

HPBench is a benchmark suite comprising two main axes: human perception of AI-generated images and large-scale, reproducible black-box hyperparameter optimization (HPO) evaluations.
In the human perception study, controlled evaluations demonstrate that humans achieve around 61.3% accuracy while automated models reach up to 87% accuracy, highlighting critical detection challenges.
The HPO-B variant provides a reproducible framework with millions of evaluations across 176 algorithms and 196 datasets, facilitating rigorous performance comparisons in HPO research.

HPBench is a term used for distinct benchmark suites in the machine learning literature, each tailored to fundamentally different problems but unified by rigorous evaluation protocols and focus on human or algorithmic performance. The two principal applications of the HPBench name are: (1) human perception of AI-generated images, and (2) large-scale reproducible benchmarks for black-box hyperparameter optimization (HPO), the latter referred to as "HPO-B" in some sources. The following exposition focuses on these two axes, their construction, protocols, evaluation measures, and major findings, referencing foundational works as appropriate.

1. Human Perception Benchmark: Foundations and Objectives

HPBench, introduced in "Seeing is not always believing: Benchmarking Human and Model Perception of AI-Generated Images" (Lu et al., 2023), quantifies how well human subjects can distinguish state-of-the-art AI-generated images from natural photographs. The motivation is the need to monitor human vulnerability empirically as generative models (GANs, diffusion models, autoregressive models) approach photorealism and thereby create risks of misinformation and loss of trust in image authenticity.

Complementing the model-centric MPBench, HPBench is specifically structured for controlled, large-scale human evaluation, replacing legacy small-scale or ad hoc protocols that lack statistical reliability for modern photorealistic AIGC.

2. Dataset Construction and Sampling

The image corpus underpinning HPBench is a subset curated from the larger Fake2M dataset. Fake2M comprises approximately 2 million AI-generated images, produced either by text-to-image diffusion models (e.g., Stable Diffusion v1.5 Realistic Vision V2.0, IF v1.0) or by GANs (e.g., StyleGAN3 on FFHQ, MetFaces, AFHQv2), together with real-photo counterparts from Conceptual Captions (CC3M) and other domain-matched datasets.

For the human evaluation, strict filtering was applied:

Only high-quality, near-photorealistic AI-generated samples passed expert review (obvious artifacts discarded).
Real and AI images were matched by prompt (e.g., "portrait of a woman, 8 K HDR, photographic, very detailed") to enforce class-conditional balance.
Eight semantic categories were created: Multiperson, Landscape, Man, Woman, Record, Plant, Animal, Object.
The HPBench set comprised 151 AI images and 244 real images (395 in total), distributed across the eight categories as listed in Table 1.

Table 1. Number of AI-generated and real images per semantic category in the HPBench set.
Category	AI Images	Real Images
Multiperson	10	12
Landscape	27	26
Man	17	44
Woman	30	49
Record	15	21
Plant	13	18
Animal	29	53
Object	10	21

3. Experimental Protocol and Evaluation Metrics

Human Evaluation Procedure

Fifty participants (diverse in age and generative model familiarity) viewed 100 randomly interleaved images (50 real, 50 AI) from the curated set.
For each image, subjects chose between "Real" and "AI-Generated". When choosing "AI-Generated", they attributed the decision to one or more defect classes: Detail, Smooth, Blur, Color, Shadow/Light, Daub (smearing), Rationality, Intuition.
Trials were proctored with no time limit; the average response time was ≈18 s per image.
Participants were blinded to the true class proportion.
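
The sampling step of this protocol can be sketched as follows. This is a minimal illustration, assuming each participant's 100 images are drawn without replacement from the 151 AI and 244 real images and then shuffled. The function name `draw_trial_list`, the image identifiers, and the per-participant seeding are hypothetical and are not taken from the original study.

```python
import random

N_AI_POOL, N_REAL_POOL = 151, 244   # pool sizes reported for the HPBench set
N_PER_CLASS = 50                    # images of each class shown to one participant


def draw_trial_list(participant_seed):
    """Return a shuffled list of (image_id, true_label) pairs for one participant.

    Assumption: images are sampled without replacement within each class.
    """
    rng = random.Random(participant_seed)
    ai = [(f"ai_{i}", "AI") for i in rng.sample(range(N_AI_POOL), N_PER_CLASS)]
    real = [(f"real_{i}", "Real") for i in rng.sample(range(N_REAL_POOL), N_PER_CLASS)]
    trials = ai + real
    rng.shuffle(trials)  # interleave classes so presentation order carries no label information
    return trials


trials = draw_trial_list(participant_seed=0)
```

Seeding the generator per participant keeps each trial list reproducible, which is one simple way to make such a study auditable after the fact.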

Quantitative Metrics

Let $TP$ = true positives (AI correctly labeled), $TN$ = true negatives (real correctly labeled), $FP$ = false positives (real labeled as AI), $FN$ = false negatives (AI labeled as real). For $N = TP + TN + FP + FN$ judged images:

Accuracy: $\mathrm{Acc} = \dfrac{TP + TN}{N}$
Misclassification Rate: $\mathrm{MisRate} = 1 - \mathrm{Acc} = \dfrac{FP + FN}{N}$
Precision: $P = \dfrac{TP}{TP + FP}$
Recall: $R = \dfrac{TP}{TP + FN}$
False Omission Rate (FOR): $\mathrm{FOR} = \dfrac{FN}{FN + TN}$

Metrics are also computed per semantic category.
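
The definitions above translate directly into code. The sketch below treats "AI" as the positive class; the function name `detection_metrics` and the example counts are hypothetical and chosen only for illustration, not taken from the study.

```python
def detection_metrics(tp, tn, fp, fn):
    """Compute the metrics defined above from confusion counts (AI = positive class).

    Assumes every denominator is non-zero; a real analysis would guard against that.
    """
    n = tp + tn + fp + fn
    acc = (tp + tn) / n
    return {
        "accuracy": acc,
        "mis_rate": 1 - acc,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "false_omission_rate": fn / (fn + tn),
    }


# Hypothetical counts for one participant who judged 50 AI and 50 real images.
m = detection_metrics(tp=28, tn=33, fp=17, fn=22)
```

Per-category metrics follow by calling the same function on the counts restricted to one semantic category.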

4. Empirical Results and Analysis

Human Performance

Global accuracy: $61.34\,\%$ ($38.66\,\%$ misclassification).
Real images: $66.9\,\%$ correct. AI images: $55.8\,\%$ correct.
Category-wise accuracies (selected): Multiperson $67.5\,\%$, Landscape $56.5\,\%$, Object $50.8\,\%$.
Prior experience with AI image generation confers only a modest boost (+3.7 percentage points in AI image detection).
When participants correctly flagged an AI image, the most frequently cited cues were Detail ($28\%$), Smoothness ($17\%$), Intuition ($14\%$), and Blur ($12\%$).
Portraits (human faces and multiperson scenes) are detected significantly above chance, whereas inanimate objects cause greater confusion (Object accuracy is close to the $50\,\%$ chance level).
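
As a consistency check on these figures, and assuming every participant judged exactly 50 real and 50 AI images as the protocol describes, the global accuracy is the unweighted mean of the two per-class rates:

$$\mathrm{Acc} = \tfrac{1}{2}\left(66.9\,\% + 55.8\,\%\right) = 61.35\,\%,$$

which agrees with the reported $61.34\,\%$ up to rounding of the per-class values.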

Model versus Human Detection

State-of-the-art automated models (best: ConvNeXt-S) achieve $87\,\%$ accuracy (i.e., $13\,\%$ error) on the same HPBench set, far outperforming humans.
Model performance varies with architecture (ConvNet vs. CLIP), training data, and data augmentations.
Automated detection is substantially more reliable than unaided human inspection under these conditions, but nontrivial error persists, particularly on previously unseen generator settings.

5. Implications, Limitations, and Directions

The HPBench study concludes that, as of 2023, high-quality AI-generated images deceive humans at an error rate of approximately $40\%$, well above desirable thresholds for robust authenticity assessment. The best model cuts this error to about $13\%$, roughly one third of the human rate, but remains imperfect, especially when generalizing to new generative techniques or distributions.

Analysis suggests that gross visible artifacts (e.g., anatomical oddities, over-smoothing, logical inconsistencies) are critical for detection, but these cues diminish as generators improve, making human scrutiny progressively harder.

Recommendations include:

Developing generator-agnostic, robust detection models, possibly by freezing representation layers or combining model predictions with human-in-the-loop pipelines.
Expanding both the dataset scope (covering more generators, domains) and evaluation cohorts (cross-disciplinary, cross-cultural).
Exploring AIGC to synthetically create adversarial edge cases for detector training.

This body of work provides a critical quantitative baseline for both the evaluation of human visual reliability in the context of AIGC and for the design and assessment of automated detection systems (Lu et al., 2023).

6. HPBench (HPO-B): Large-Scale Black-Box Hyperparameter Optimization Benchmark

The term "HPBench" also appears as an alternative label for "HPO-B", a large-scale and reproducible testbed for black-box HPO, as discussed in "HPO-B: A Large-Scale Reproducible Benchmark for Black-Box HPO based on OpenML" (Arango et al., 2021).

Benchmark Construction

Constituents: $m$ search spaces (algorithms) over hyperparameter domains $\Theta_i\subset\mathbb{R}^{d_i}$, evaluated on a collection of $n$ OpenML datasets.
Raw runs: For each pair $(A_i,D_j)$, the evaluated configurations and their responses $(\theta,y)$ are recorded. This yields a sparse evaluation matrix $M\in\mathbb{R}^{m\times n}$, where $M_{i,j}$ stores the best observed accuracy for the pair.
Scale: $176$ algorithms $\times$ $196$ datasets, totaling approximately $6.4$ million evaluations.
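
One way to picture this layout is a mapping from (search space, dataset) pairs to lists of recorded evaluations, from which the sparse matrix $M$ is derived. The sketch below uses hypothetical toy records and identifiers (`"svm"`, `"rf"`, `"d1"`, `"d2"`); it does not reproduce the benchmark's actual file format.

```python
from collections import defaultdict

# Hypothetical toy records: (search_space_id, dataset_id, config_vector, accuracy).
runs = [
    ("svm", "d1", (0.1, 0.5), 0.81),
    ("svm", "d1", (0.9, 0.2), 0.86),
    ("svm", "d2", (0.3, 0.3), 0.74),
    ("rf", "d1", (0.6,), 0.90),
]

# Group the (theta, y) evaluations by (search space, dataset) pair.
evaluations = defaultdict(list)
for space, dataset, theta, y in runs:
    evaluations[(space, dataset)].append((theta, y))

# Sparse "matrix" M: only pairs that were actually evaluated get an entry,
# and each entry stores the best observed accuracy for that pair.
best_observed = {
    pair: max(y for _, y in evals) for pair, evals in evaluations.items()
}
```

Storing only observed pairs in a dictionary mirrors the sparsity noted above: the pair `("rf", "d2")` simply has no entry.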

Data Preprocessing

Tasks with fewer than $5$ runs are removed, as are duplicates.
Hyperparameters are canonicalized: categorical values are one-hot encoded, missing dimensions are zero-imputed (with an added indicator feature), and constant features are dropped.
Log-scale variables (e.g., learning rates) are transformed by $\log_{10}$.
Features are finally rescaled to $[0,1]$.
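
The preprocessing steps above can be sketched in a few lines of plain Python. This is an illustrative approximation only: the function names, the argument layout, the example hyperparameters (`lr`, `depth`, `kernel`), and the exact ordering of the steps are assumptions, not the benchmark's released code.

```python
import math


def encode_config(config, categories, log_keys, numeric_keys):
    """Encode one raw configuration dict as a flat feature list.

    categories:   {name: [allowed values]} for one-hot encoding
    log_keys:     numeric hyperparameters transformed by log10
    numeric_keys: ordered list of all numeric hyperparameter names
    """
    features = []
    for name in numeric_keys:
        value = config.get(name)
        missing = value is None
        if missing:
            value = 0.0                    # zero-impute the missing dimension
        elif name in log_keys:
            value = math.log10(value)      # log-scale variables
        features.extend([value, 1.0 if missing else 0.0])  # value + missing indicator
    for name, allowed in categories.items():
        features.extend(1.0 if config.get(name) == a else 0.0 for a in allowed)
    return features


def min_max_rescale(rows):
    """Drop constant columns, then rescale every remaining column to [0, 1]."""
    cols = list(zip(*rows))
    kept = [c for c in cols if max(c) > min(c)]
    scaled = [[(v - min(c)) / (max(c) - min(c)) for v in c] for c in kept]
    return [list(r) for r in zip(*scaled)]


# Hypothetical raw configurations; the second one lacks "depth".
raw = [
    {"lr": 0.001, "depth": 3, "kernel": "rbf"},
    {"lr": 0.1, "kernel": "linear"},
    {"lr": 0.01, "depth": 7, "kernel": "rbf"},
]
X = [encode_config(c, {"kernel": ["rbf", "linear"]}, {"lr"}, ["lr", "depth"]) for c in raw]
X = min_max_rescale(X)
```

In this toy example the missing-indicator column for `lr` is constant (never missing) and is therefore dropped, leaving five features per configuration.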

Benchmark Variants

HPO-B-v1: full corpus for heterogeneous transfer.
HPO-B-v2: $16$ frequent search spaces on $101$ datasets for non-transfer HPO.
HPO-B-v3: v2 subset with fixed train/validation/test splits per algorithm for transfer HPO, with warm-start seeds for reproducibility.

Evaluation Protocol

For each test task $(A_i, D_j)$, algorithms are provided with five warm-start seeds (each consisting of five initial points).
In the discrete setting, evaluation strictly returns the stored $y_k$ of the nearest pre-computed configuration $\theta_k$ in the finite run set.
In the continuous setting, a surrogate (an XGBoost model) interpolates the response for arbitrary $\theta$.
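
The nearest-query lookup can be illustrated as below. This is a minimal sketch, assuming Euclidean distance on the rescaled features; the distance measure, the function name `query_nearest`, and the toy run set are assumptions made for illustration.

```python
import math


def query_nearest(theta, stored_runs):
    """Return the stored response whose configuration is closest to theta.

    stored_runs: list of (config_vector, response) pairs for one (A_i, D_j) task.
    Assumption: Euclidean distance on features already rescaled to [0, 1].
    """
    _, y = min(stored_runs, key=lambda run: math.dist(theta, run[0]))
    return y


# Hypothetical finite run set for a single task.
stored = [((0.0, 0.0), 0.70), ((0.5, 0.5), 0.82), ((1.0, 0.2), 0.77)]
y_hat = query_nearest((0.6, 0.4), stored)
```

Because the optimizer only ever receives values that were actually recorded, no new training run is needed during benchmarking, which is what makes the protocol cheap to reproduce.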

Metrics

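Simple regret: $r_T^{(i,j)} = f(A_i, \theta^*_T; D_j) - f^*_{i,j}$, where $\theta^*_T$ is the best configuration found after $T$ evaluations, $f^*_{i,j}$ is the best recorded value, and $f^\dagger_{i,j}$ the worst. As written, the difference is non-negative when $f$ is a quantity to be minimized (e.g., error); for accuracy-valued responses the sign is flipped.
Normalized regret: $\tilde r_T^{(i,j)} = \dfrac{f(A_i, \theta^*_T; D_j) - f^*_{i,j}}{f^\dagger_{i,j} - f^*_{i,j}}$, which lies in $[0,1]$ whenever the incumbent value falls between the best and worst recorded values.
Aggregation over test tasks and seeds uses either mean regret or average rank.
For transfer HPO, the transfer gain $\Delta_{\mathrm{transfer}}(T)$ expresses the improvement in simple regret over from-scratch baselines.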
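
The normalized regret and the average-rank aggregation can be sketched as follows. The function names, the tie-handling rule (tied methods share the mean rank), and the method labels `"random"` and `"bo"` are assumptions for illustration; the numbers are hypothetical.

```python
def normalized_regret(f_incumbent, f_best, f_worst):
    """Normalized regret as in the formula above: 0 at the best recorded value, 1 at the worst."""
    return (f_incumbent - f_best) / (f_worst - f_best)


def average_ranks(regrets_by_method):
    """Average rank per method over tasks (rank 1 = lowest regret).

    regrets_by_method: {method_name: [regret on task 0, regret on task 1, ...]}
    Assumption: tied methods share the mean of the ranks they occupy.
    """
    methods = list(regrets_by_method)
    n_tasks = len(next(iter(regrets_by_method.values())))
    totals = {m: 0.0 for m in methods}
    for t in range(n_tasks):
        values = {m: regrets_by_method[m][t] for m in methods}
        for m in methods:
            lower = sum(v < values[m] for v in values.values())
            equal = sum(v == values[m] for v in values.values())  # includes m itself
            totals[m] += lower + (equal + 1) / 2  # mean rank within a tie group
    return {m: totals[m] / n_tasks for m in methods}


# Accuracy-valued example: best recorded 0.90, worst 0.50, incumbent 0.80.
r = normalized_regret(f_incumbent=0.80, f_best=0.90, f_worst=0.50)

# Hypothetical regrets of two optimizers on two tasks.
ranks = average_ranks({"random": [0.30, 0.20], "bo": [0.10, 0.20]})
```

Because numerator and denominator change sign together, the normalized form gives the same value whether $f$ is an accuracy or an error, which is one reason it is convenient for aggregating across heterogeneous tasks.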

Summary of Design Choices

Scale and diversity: comprehensive coverage of algorithm/dataset pairs for classical tabular supervised learning.
Sparsity: most $(A_i,D_j)$ pairs are under-sampled, mirroring real-world HPO data limitations.
Full reproducibility: datasets, seeds, splits, and metrics are fixed and openly released.
Extensibility: while focused on classical machine learning, surrogates enable extensions to continuous optimization; later work may target multi-fidelity HPO and deep learning (Arango et al., 2021).

7. Synoptic Comparison

Aspect	Human Perception HPBench (Lu et al., 2023)	Black-Box HPO HPBench/HPO-B (Arango et al., 2021)
Domain	Human detection of AIGC images	Hyperparameter optimization
Scale	395 images (151 fake, 244 real) × 50 subjects	176 algorithms × 196 datasets, 6.4M evals
Data Structure	Human choices, per-image judgments	(θ,y) runs per (algo, dataset)
Metric examples	Accuracy, misclassification, FOR, category stats	Simple regret, normalized regret, transfer gain
Protocol	Controlled lab, blinded, explained judgments	Fixed seeds, splits, nearest-query protocol
Main finding	61.3% human accuracy, 87% SOTA model accuracy	Enables fair benchmarking/reproducibility

8. Conclusion

The name HPBench is associated with rigorous, scalable, and reproducible benchmark suites in machine learning research. In human perception, it exposes the limits of human ability to visually authenticate images as AIGC progresses. In HPO, it operationalizes community standards for evaluating and comparing HPO algorithms with explicit protocols, metrics, and data organization. In both contexts, HPBench strengthens the empirical foundations on which new algorithms and defense mechanisms can be built, and encourages further meta-research into human and algorithmic performance as machine learning advances (Lu et al., 2023; Arango et al., 2021).

References

Lu et al. (2023). Seeing is not always believing: Benchmarking Human and Model Perception of AI-Generated Images.

Arango et al. (2021). HPO-B: A Large-Scale Reproducible Benchmark for Black-Box HPO based on OpenML.
