Measuring Intelligence through Games
Abstract: Artificial general intelligence (AGI) refers to research aimed at tackling the full problem of artificial intelligence, that is, to create truly intelligent agents. This sets it apart from most AI research, which aims at solving problems in relatively narrow domains, such as character recognition, motion planning, or increasing player satisfaction in games. But how do we know when an agent is truly intelligent? A common point of reference in the AGI community is Legg and Hutter's formal definition of universal intelligence, which has the appeal of simplicity and generality but is unfortunately incomputable. Games of various kinds are commonly used as benchmarks for "narrow" AI research, as they are considered to have many important properties. We argue that many of these properties carry over to the testing of general intelligence as well. We then sketch how such testing could practically be carried out. The central part of this sketch is an extension of universal intelligence to deal with finite time, and the use of sampling of the space of games expressed in a suitably biased game description language.
Knowledge gaps, limitations, and open questions
Below is a single, concrete list of gaps and unresolved issues that future researchers could address to make the proposed game-based intelligence measure rigorous, fair, and practical.
- Precise operationalization of computable environment complexity: how to measure description length l(μ) when the GDL requires an interpreter; whether interpreter size counts; how to include adversary code; and how to standardize runtime τ(μ) when it depends on agent actions and hardware.
- Hardware invariance of “time” measurements: what exact unit (e.g., CPU instructions, instruction-retired counters) is used; how to normalize across languages, JITs, GPUs, and distributed setups to prevent unfair speed differences.
- Sampling games proportional to their complexity weights: practical algorithms to enumerate and sample valid GDL strings, compute their weights (e.g., a Levin-style function of l(μ) and τ(μ)), and avoid intractable rejection rates due to invalid or degenerate descriptions.
- Formal criteria and automated tests for “meaningfulness” of sampled games: a measurable “skill differentiation” metric, thresholds for trivial/unsolvable games, and pre-testing protocols to filter games before inclusion.
- Reward normalization across heterogeneous games: a principled, universal mapping of game-specific scores and bounded rewards to a common scale that is robust to differences in episode length, scoring granularity, and stochasticity.
- Treatment of stochastic games and reward variance: standard replication counts per game, confidence intervals for agent performance, and variance-normalization schemes to ensure reliable comparisons.
- Two-phase (learning vs evaluation) switching policy: precise rules to prevent exploitation (e.g., sandbagging), whether switching can be automated by the environment, and how to handle agents that never switch or switch too early.
- Time-budget scheduling: how to choose time budgets, the distribution over budgets, and per-game time caps to balance planning against learning agents without biasing results toward either class of algorithm.
- Agent–environment I/O standardization: concrete specifications for observation/action encodings (symbolic vs pixel/continuous), latency handling in asynchronous environments, and protocols for “pass” actions to penalize slow agents fairly.
- Multi-agent evaluation design: methods to incorporate adversary complexity into l(μ) without collapsing the game's weight, protocols for opponent selection (fixed, adaptive, human, self-play), and handling non-transitive agent matchups across games.
- Cross-game ranking for multi-player settings: how to build a unified ladder (e.g., Elo, TrueSkill) when games differ in variance, scoring, and player counts; calibration procedures to avoid misleading cross-game comparisons.
- GDL design specifics: concrete grammar and interpreter choices to achieve expressiveness (continuous states, noise, partial observability, physics), compactness, and high validity rate; decidability of validity checking and error reporting for malformed strings.
- Inclusion of high-fidelity engines (e.g., Unity/Unreal): how to account for engine complexity in l(μ), manage asset dependencies, guarantee reproducibility, and avoid skewing the sampling toward a few large interpreters.
- Memory and other resource constraints: the measure only integrates time; clear methods are needed to account for memory, model size, energy, and storage, and to study trade-offs between resources and performance.
- Statistical reliability of the Monte Carlo “anytime” estimate: target sample sizes per agent, stopping rules, error bounds, convergence guarantees, and sensitivity analysis to the weighting and GDL biases.
- Robustness to exploit strategies beyond "hyperactivity": defenses against agents that learn the sampling distribution and overfit to it, introspect the GDL to shortcut learning, cache game-specific policies, or exploit switching mechanics.
- Guaranteeing finiteness of episodes: explicit mechanisms to enforce termination (max steps, timeouts), policies when agents stall or environments loop, and how such safeguards affect τ(μ) and sampling weights.
- Decision on whether agents receive the game specification: clearly defined tracks (with/without GDL provided), and fairness implications for reasoning-heavy vs learning-heavy agents.
- Empirical validation plan: concrete benchmarks, baselines (e.g., random, tabular RL, model-based RL, planning), and human calibration to assess whether the measure correlates with intuitive general intelligence.
- Coverage metrics for cognitive diversity: tools to quantify which competencies (e.g., perception, planning, memory, social reasoning, language) are being tested by the sampled games, and mechanisms to correct under-represented areas.
- Complexity weighting trade-offs: justification for the chosen Levin-like trade-off between l(μ) and τ(μ); comparison with alternatives (e.g., Speed Prior) and empirical impact on sampling and rankings.
- Game reward pathologies: handling deceptive/poorly shaped rewards, multi-objective scoring, negative rewards, and games where maximizing score is misaligned with “intelligent” play; criteria to exclude or reweight such games.
- Reproducibility and licensing: legal and logistical constraints of distributing game interpreters, assets, and engines; standardized containers; and data/versioning protocols for consistent re-evaluation.
- Parallelization and scheduling: concrete designs for distributing evaluation across compute; load balancing for heterogeneous τ(μ); and ensuring identical conditions for agents in shared clusters.
- Non-game generality limits: explicit assessment of what cognitive faculties remain untested (e.g., natural language, long-horizon social interaction) and plans to extend or hybridize the framework to cover them.
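Several of the gaps above concern concrete machinery. As a first illustration, the Levin-like trade-off between description length l(μ) and runtime τ(μ) could be sketched as follows. This is a minimal, hypothetical sketch, not the paper's actual formula: the catalog of games, the lengths, and the runtimes are invented, and the weight 2^(-l)/τ is one plausible instantiation of a Levin-style weighting.

```python
import random

# Hypothetical catalog of sampled GDL strings: (name, description length l, runtime tau).
games = [
    ("game_a", 10, 100),
    ("game_b", 12, 50),
    ("game_c", 20, 1000),
]

def levin_weight(l: int, tau: int) -> float:
    """Levin-style weight 2^-(l + log2(tau)) = 2^-l / tau (one plausible choice)."""
    return 2.0 ** (-l) / tau

weights = [levin_weight(l, tau) for _, l, tau in games]
total = sum(weights)
probs = [w / total for w in weights]

def sample_game(rng: random.Random):
    """Sample a game proportionally to its complexity weight."""
    return rng.choices(games, weights=probs, k=1)[0]
```

Note how the weight penalizes both long descriptions and slow interpreters, so a short but computationally expensive game can weigh less than a longer, fast one.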
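The reward-normalization gap could be made concrete with a simple anchored min-max mapping. The anchors below (a random-agent baseline at 0 and a best-known score at 1) are assumptions for illustration, not a scheme proposed in the paper, and the degenerate-game check echoes the "skill differentiation" criterion above.

```python
def normalize_reward(raw: float, random_baseline: float, max_score: float) -> float:
    """Map a game-specific score to [0, 1].

    Anchors 0 at a random-agent baseline and 1 at the best known score
    (both hypothetical anchors); clips scores outside the anchor range.
    """
    if max_score <= random_baseline:
        return 0.0  # degenerate game: no skill differentiation, exclude or zero out
    scaled = (raw - random_baseline) / (max_score - random_baseline)
    return min(1.0, max(0.0, scaled))
```

A mapping like this is still vulnerable to the episode-length and stochasticity issues noted above, which is why replication counts and variance normalization remain open questions.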
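Finally, the cross-game ladder question mentions Elo as a candidate. A single standard Elo update is easy to state; the open problem is calibrating it across games with different variance and player counts. The K-factor of 32 and the 400-point scale below are the conventional chess defaults, used here purely as an assumed starting point.

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One standard Elo update; score_a is 1 for a win, 0.5 for a draw, 0 for a loss."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new
```

The update conserves total rating, which is convenient within one game but is exactly what breaks down when results from games with different score variances are pooled into one ladder.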