HPBench: Human & HPO Benchmarks
- HPBench is a benchmark suite comprising two main axes: human perception of AI-generated images and large-scale, reproducible black-box hyperparameter optimization (HPO) evaluations.
- In the human perception study, controlled evaluations demonstrate that humans achieve around 61.3% accuracy while automated models reach up to 87% accuracy, highlighting critical detection challenges.
- The HPO-B variant provides a reproducible framework with millions of evaluations across 176 algorithms and 196 datasets, facilitating rigorous performance comparisons in HPO research.
HPBench is a term used for distinct benchmark suites in the machine learning literature, each tailored to fundamentally different problems but unified by rigorous evaluation protocols and focus on human or algorithmic performance. The two principal applications of the HPBench name are: (1) human perception of AI-generated images, and (2) large-scale reproducible benchmarks for black-box hyperparameter optimization (HPO), the latter referred to as "HPO-B" in some sources. The following exposition focuses on these two axes, their construction, protocols, evaluation measures, and major findings, referencing foundational works as appropriate.
1. Human Perception Benchmark: Foundations and Objectives
HPBench as introduced in "Seeing is not always believing: Benchmarking Human and Model Perception of AI-Generated Images" (Lu et al., 2023) is established to quantify the capability of human subjects to discriminate state-of-the-art AI-generated images from natural photographs. This arises from a critical need to empirically monitor human vulnerability as generative models (GANs, diffusion, autoregressive) approach photorealism and thus pose risks for misinformation and authenticity crises.
Complementing the model-centric MPBench, HPBench is specifically structured for controlled, large-scale human evaluation, replacing legacy small-scale or ad hoc protocols that lack statistical reliability for modern photorealistic AIGC.
2. Dataset Construction and Sampling
The image corpus underpinning HPBench is a subset curated from the larger Fake2M dataset, which comprises approximately 2 million AI-generated images produced by text-to-image diffusion models (e.g., Stable Diffusion v1.5 Realistic Vision V2.0, IF v1.0) or GANs (e.g., StyleGAN3 on FFHQ, MetFaces, AFHQv2), together with real-photo counterparts from Conceptual Captions (CC3M) and other domain-matched datasets.
For the human evaluation, strict filtering was applied:
- Only high-quality, near-photorealistic AI-generated samples passed expert review (obvious artifacts discarded).
- Real and AI images were matched by prompt (e.g., "portrait of a woman, 8K HDR, photographic, very detailed") to enforce class-conditional balance.
- Eight semantic categories were created: Multiperson, Landscape, Man, Woman, Record, Plant, Animal, Object.
- The HPBench set comprised 151 AI images and 244 real images, with per-category balancing (see Table 1).
| Category | AI Images | Real Images |
|---|---|---|
| Multiperson | 10 | 12 |
| Landscape | 27 | 26 |
| Man | 17 | 44 |
| Woman | 30 | 49 |
| Record | 15 | 21 |
| Plant | 13 | 18 |
| Animal | 29 | 53 |
| Object | 10 | 21 |
3. Experimental Protocol and Evaluation Metrics
Human Evaluation Procedure
- Fifty participants (diverse in age and generative model familiarity) viewed 100 randomly interleaved images (50 real, 50 AI) from the curated set.
- For each image, subjects indicated "Real" vs. "AI-Generated". Upon "AI" choice, they attributed detection to one or more defect classes: Detail, Smooth, Blur, Color, Shadow/Light, Daub (smearing), Rationality, Intuition.
- Trials were proctored and no time limit was imposed, with average response time ≈18 s/image.
- Participants were blinded to the true class proportion.
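The sampling step of this procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the image identifiers, the function name `build_trial_sequence`, and the fixed seed are hypothetical, and the source does not say how the authors actually drew or ordered the images.

```python
import random


def build_trial_sequence(ai_ids, real_ids, n_per_class=50, seed=0):
    """Draw n_per_class images from each pool and interleave them randomly.

    Returns a list of (image_id, true_label) pairs. The true label is kept
    only for scoring and would not be shown to the participant.
    """
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    trials = [(img, "ai") for img in rng.sample(ai_ids, n_per_class)]
    trials += [(img, "real") for img in rng.sample(real_ids, n_per_class)]
    rng.shuffle(trials)  # random interleaving of the two classes
    return trials


# Hypothetical identifiers; pool sizes follow the curated set (151 AI, 244 real).
ai_pool = [f"ai_{i:03d}" for i in range(151)]
real_pool = [f"real_{i:03d}" for i in range(244)]
sequence = build_trial_sequence(ai_pool, real_pool)
```

Sampling without replacement guarantees that no participant sees the same image twice within a session.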
Quantitative Metrics
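Let $TP$ = true positives (AI correctly labeled), $TN$ = true negatives (real correctly labeled), $FP$ = false positives (real labeled as AI), and $FN$ = false negatives (AI labeled as real). For $N = TP + TN + FP + FN$ images:
- Accuracy: $\frac{TP + TN}{N}$
- Misclassification Rate: $\frac{FP + FN}{N}$
- Precision: $\frac{TP}{TP + FP}$
- Recall: $\frac{TP}{TP + FN}$
- False Omission Rate (FOR): $\frac{FN}{FN + TN}$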
Metrics are also computed per semantic category.
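As a worked illustration, the five metrics can be computed from confusion counts using their standard definitions, with "AI-generated" as the positive class. The function name `detection_metrics` and the counts in the example are hypothetical and are not results from the study.

```python
def detection_metrics(tp, tn, fp, fn):
    """Standard confusion-matrix metrics, with 'AI-generated' as the positive class."""
    n = tp + tn + fp + fn

    def ratio(num, den):
        # Return NaN instead of raising when a denominator is empty.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, n),
        "misclassification": ratio(fp + fn, n),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_omission_rate": ratio(fn, fn + tn),
    }


# Hypothetical counts for one participant who judged 50 AI and 50 real images.
metrics = detection_metrics(tp=30, tn=35, fp=15, fn=20)
```

The same function yields per-category metrics when it is applied to counts restricted to a single semantic category.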
4. Empirical Results and Analysis
Human Performance
- Global accuracy: 61.3% (38.7% misclassification).
- Real images: correct. AI images: correct.
- Category-wise accuracies: Multiperson , Landscape , Object .
- AI-generation experience confers only a modest boost (+3.7 percentage points in AI image detection).
- When correct on "AI", most cited detectable artifacts are: Detail (), Smoothness (), Blur (), and Intuition ().
- Portraits (human faces and multiperson scenes) are detected significantly above random, while inanimate objects cause greater confusion.
Model versus Human Detection
- State-of-the-art automated models (best: ConvNext-S) achieve roughly 87% accuracy (i.e., about 13% error) on the same HPBench set, far outperforming humans.
- Model performance varies with architecture (ConvNet vs. CLIP), training data, and data augmentations.
- Automated detection is substantially more reliable than unaided human inspection under these conditions, but nontrivial error still persists on previously unseen generator settings.
5. Implications, Limitations, and Directions
The HPBench study concludes that, as of 2023, high-quality AI-generated images deceive humans at an error rate of approximately 38.7%, well above desirable thresholds for robust authenticity assessment. Models reduce this error substantially (to roughly 13% for the best detector) but remain imperfect, especially when generalizing to new generative techniques or distributions.
Analysis suggests that gross visible artifacts (e.g., anatomical oddities, over-smoothing, logical inconsistencies) are critical for detection but as models improve, these cues diminish, further challenging human scrutiny.
Recommendations include:
- Developing generator-agnostic, robust detection models, possibly by freezing representation layers or combining model predictions with human-in-the-loop pipelines.
- Expanding both the dataset scope (covering more generators, domains) and evaluation cohorts (cross-disciplinary, cross-cultural).
- Exploring AIGC to synthetically create adversarial edge cases for detector training.
This body of work provides a critical quantitative baseline for both the evaluation of human visual reliability in the context of AIGC and for the design and assessment of automated detection systems (Lu et al., 2023).
6. HPBench (HPO-B): Large-Scale Black-Box Hyperparameter Optimization Benchmark
The term "HPBench" also appears as an alternative label for "HPO-B", a large-scale and reproducible testbed for black-box HPO, as discussed in "HPO-B: A Large-Scale Reproducible Benchmark for Black-Box HPO based on OpenML" (Arango et al., 2021).
Benchmark Construction
- Constituents: a collection of search spaces (algorithms), each defined over its own hyperparameter domain and paired with OpenML datasets.
- Raw runs: for each (search space, dataset) pair, the evaluated configurations $\theta$ and their responses $y$ are recorded, yielding a sparse evaluation matrix over search spaces and datasets in which each entry stores the best observed accuracy.
- Scale: $176$ algorithms $\times$ $196$ datasets, totaling approximately $6.4$ million evaluations.
Data Preprocessing
- Tasks with too few runs (below a minimum-count threshold) and duplicate runs are removed.
- Hyperparameters are canonicalized: categoricals are one-hot encoded, missing dimensions are zero-imputed (with an added missingness indicator), and constant features are dropped.
- Log-scale variables (e.g., learning rates) are transformed by $x \mapsto \log x$.
- Features are finally rescaled to $[0, 1]$.
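The preprocessing steps listed above can be sketched as follows. This is a minimal illustration and not the benchmark's actual pipeline: the function `encode_configs`, its arguments, the example hyperparameters, the order of operations, and the $[0, 1]$ target range for min-max scaling are all assumptions.

```python
import math


def encode_configs(configs, numeric, log_scale, categorical):
    """Encode hyperparameter dicts as numeric rows (illustrative sketch only).

    configs:     list of dicts mapping hyperparameter name -> value
    numeric:     list of numeric hyperparameter names
    log_scale:   set of numeric names to log-transform
    categorical: dict mapping name -> list of allowed levels
    """
    rows = []
    for cfg in configs:
        row = []
        for key in numeric:
            value = cfg.get(key)
            missing = value is None
            if missing:
                x = 0.0  # zero-imputation for a missing dimension
            elif key in log_scale:
                x = math.log(value)
            else:
                x = float(value)
            row += [x, 1.0 if missing else 0.0]  # value plus missingness indicator
        for key, levels in categorical.items():
            row += [1.0 if cfg.get(key) == level else 0.0 for level in levels]  # one-hot
        rows.append(row)

    columns = [list(col) for col in zip(*rows)]
    kept = [col for col in columns if max(col) > min(col)]  # drop constant features
    scaled = [
        [(v - min(col)) / (max(col) - min(col)) for v in col]  # min-max to [0, 1]
        for col in kept
    ]
    return [list(row) for row in zip(*scaled)]


# Hypothetical search space with one log-scale, one linear, and one categorical dimension.
configs = [
    {"lr": 0.001, "depth": 3, "kernel": "rbf"},
    {"lr": 0.1, "depth": 5, "kernel": "linear"},
    {"lr": 0.01, "kernel": "rbf"},  # depth is missing in this run
]
encoded = encode_configs(
    configs,
    numeric=["lr", "depth"],
    log_scale={"lr"},
    categorical={"kernel": ["rbf", "linear"]},
)
```

In this example the missingness indicator for `lr` is constant (the value is never missing) and is therefore dropped, leaving five features per configuration.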
Benchmark Variants
- HPO-B-v1: full corpus for heterogeneous transfer.
- HPO-B-v2: $16$ frequent search spaces on $101$ datasets for non-transfer HPO.
- HPO-B-v3: v2 subset with fixed train/validation/test splits per algorithm for transfer HPO, with warm-start seeds for reproducibility.
Evaluation Protocol
- For each test task, algorithms are provided with five warm-start seeds (each consisting of five initial points).
- Evaluation strictly queries the stored response $y$ for the nearest pre-computed configuration $\theta$ in the finite run set.
- A continuous surrogate (XGBoost model) enables interpolation for arbitrary configurations $\theta$.
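The discrete lookup in this protocol amounts to a nearest-neighbor query over the stored runs. A minimal sketch, assuming Euclidean distance on already preprocessed configuration vectors; the distance metric, the function name `nearest_lookup`, and the example runs are assumptions made for illustration.

```python
import math


def nearest_lookup(theta, runs):
    """Return the stored (configuration, response) pair closest to theta.

    runs: list of (configuration_vector, response) pairs from the finite run set.
    Euclidean distance is an assumption made for this sketch.
    """
    return min(runs, key=lambda run: math.dist(run[0], theta))


# Hypothetical stored runs in a two-dimensional, rescaled search space.
runs = [
    ((0.0, 0.0), 0.71),
    ((0.5, 0.5), 0.83),
    ((1.0, 0.2), 0.78),
]
config, response = nearest_lookup((0.6, 0.4), runs)
```

Because every query maps back to a recorded run, an optimizer never observes a response that was not actually measured. The continuous variant relaxes this by predicting responses with a fitted surrogate instead.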
Metrics
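- Simple regret: $r_t = y_{\max} - \max_{i \le t} y_i$ (where $y_{\max}$ is the best recorded response and $y_{\min}$ the worst).
- Normalized regret: $\tilde{r}_t = \frac{y_{\max} - \max_{i \le t} y_i}{y_{\max} - y_{\min}}$
- Results are aggregated over test tasks and seeds, using either mean regret or average rank.
- For transfer HPO, transfer gain expresses the improvement in simple regret over from-scratch baselines.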
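A short sketch of how such metrics could be computed. It assumes the common definition of normalized regret as (best recorded response minus best response found so far), divided by (best recorded minus worst recorded), and it ranks methods with ties sharing the average rank. The function names and example values are hypothetical.

```python
def normalized_regret(observed, y_best, y_worst):
    """Normalized-regret trajectory for one run on one task.

    observed: responses in the order the optimizer queried them
    y_best / y_worst: best and worst recorded responses for the task
    """
    span = y_best - y_worst
    trajectory, incumbent = [], float("-inf")
    for y in observed:
        incumbent = max(incumbent, y)  # best response found so far
        trajectory.append((y_best - incumbent) / span)
    return trajectory


def ranks(values):
    """1-based ranks where a lower value is better; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the tied 1-based positions
        for k in range(pos, end + 1):
            result[order[k]] = shared
        pos = end + 1
    return result


# Hypothetical accuracies observed over four optimization steps.
trajectory = normalized_regret([0.70, 0.80, 0.75, 0.90], y_best=0.90, y_worst=0.50)
# Hypothetical final regrets of three methods on one task.
method_ranks = ranks([0.2, 0.1, 0.2])
```

Averaging `trajectory` values across tasks and seeds gives mean regret, and averaging `ranks` outputs across tasks gives average rank. Ranks are less sensitive to differences in regret scale between tasks.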
Summary of Design Choices
- Scale and diversity: comprehensive coverage of algorithm/dataset pairs for classical tabular supervised learning.
- Sparsity: most pairs are under-sampled, mirroring real-world HPO data limitations.
- Full reproducibility: datasets, seeds, splits, and metrics are fixed and openly released.
- Extensibility: while focused on classical machine learning, surrogates enable extensions to continuous optimization; later work may target multi-fidelity HPO and deep learning (Arango et al., 2021).
7. Synoptic Comparison
| Aspect | Human Perception HPBench (Lu et al., 2023) | Black-Box HPO HPBench/HPO-B (Arango et al., 2021) |
|---|---|---|
| Domain | Human detection of AIGC images | Hyperparameter optimization |
| Scale | 395 images (151 fake, 244 real) × 50 subjects | 176 algorithms × 196 datasets, 6.4M evals |
| Data Structure | Human choices, per-image judgments | (θ,y) runs per (algo, dataset) |
| Metric examples | Accuracy, misclassification, FOR, category stats | Simple regret, normalized regret, transfer gain |
| Protocol | Controlled lab, blinded, explained judgments | Fixed seeds, splits, nearest-query protocol |
| Main finding | 61.3% human accuracy, 87% SOTA model accuracy | Enables fair benchmarking/reproducibility |
8. Conclusion
HPBench is a nomenclature associated with rigorous, scalable, and reproducible benchmark suites in machine learning research. In human perception, it exposes the limits of human ability to visually authenticate images in the face of progressing AIGC. In HPO, it operationalizes community standards for evaluating and comparing HPO algorithms with explicit protocols, metrics, and data organization. In both contexts, HPBench advances the empirical foundations upon which new algorithms and defensibility mechanisms can be built, and catalyzes further meta-research into human and algorithmic performance under evolving machine learning frontiers (Lu et al., 2023, Arango et al., 2021).