Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity
This lightning talk explores a breakthrough methodology for estimating the size of closed-source language models using only API access. The research introduces Incompressible Knowledge Probes, a suite of 1,400 factual questions stratified by difficulty, which exploit a fundamental insight: certain facts must be explicitly stored in model parameters and cannot be compressed or inferred. By measuring factual recall across seven tiers of knowledge rarity, the researchers demonstrate a remarkably linear relationship between recall accuracy and the logarithm of parameter count, yielding parameter estimates for frontier models that are accurate to within a factor of 2 to 3, and falsifying the assumption that factual knowledge improves over time at fixed scale.
Script
How do you measure what you cannot see inside? Closed-source language models hide their parameter counts behind API walls, but the researchers behind this work discovered that factual knowledge leaves an information-theoretic fingerprint that cannot be erased.
The Incompressible Knowledge Probe benchmark consists of 1,400 factual questions stratified into seven tiers by empirical difficulty. The bottom tiers test common knowledge that every model knows, while tier 7 probes facts so rare that even trillion-parameter models fail completely.
When the authors calibrated this benchmark on 89 open-weight models spanning four orders of magnitude in size, they found a remarkably tight linear relationship between accuracy and the logarithm of parameter count, with an R-squared of 0.917. In leave-one-out validation, 68 percent of estimates landed within a factor of 2 of the true parameter count.
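To make the calibration step concrete, here is a minimal sketch of fitting accuracy against the logarithm of parameter count, inverting the fit to estimate size, and scoring it with leave-one-out validation. It is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the data are synthetic, and the slope, intercept, and noise level are made up for the example.

```python
import numpy as np

def fit_log_linear(params, accuracy):
    """Least-squares fit of accuracy = a + b * log10(params)."""
    x = np.log10(np.asarray(params, dtype=float))
    y = np.asarray(accuracy, dtype=float)
    b, a = np.polyfit(x, y, 1)  # polyfit returns slope first, then intercept
    return a, b

def estimate_params(accuracy, a, b):
    """Invert the fitted line to turn an accuracy into a parameter estimate."""
    return 10.0 ** ((accuracy - a) / b)

def leave_one_out_within_factor(params, accuracy, factor=2.0):
    """Fraction of held-out models whose estimate is within `factor` of truth."""
    params = np.asarray(params, dtype=float)
    accuracy = np.asarray(accuracy, dtype=float)
    hits = 0
    for i in range(len(params)):
        keep = np.arange(len(params)) != i  # drop model i, refit on the rest
        a, b = fit_log_linear(params[keep], accuracy[keep])
        est = estimate_params(accuracy[i], a, b)
        ratio = max(est / params[i], params[i] / est)
        hits += int(ratio <= factor)
    return hits / len(params)

# Synthetic stand-in for a calibration set (NOT real benchmark data):
# 89 models between 10^8 and 10^12 parameters, with an assumed slope,
# intercept, and noise level chosen only for illustration.
rng = np.random.default_rng(0)
true_params = 10.0 ** rng.uniform(8, 12, size=89)
acc = 0.10 * np.log10(true_params) - 0.70 + rng.normal(0, 0.03, size=89)

a, b = fit_log_linear(true_params, acc)
loo_fraction = leave_one_out_within_factor(true_params, acc)
print(f"intercept={a:.3f} slope={b:.3f} within 2x: {loo_fraction:.0%}")
```

Because the fit is in log space, a small error in measured accuracy becomes a multiplicative error in the size estimate, which is why the validation is reported as "within a factor of 2" rather than as an absolute error.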
Here is where the incompressibility claim becomes concrete. Standard reasoning benchmarks show steady improvement over time at fixed parameter count, consistent with the Densing Law. But factual recall shows zero drift. The time coefficient is statistically indistinguishable from zero, and the drift that the Densing Law would predict is rejected with a p-value less than 10 to the negative 15.
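One way to picture the drift test is a regression of accuracy on both log parameter count and release date, then checking whether the time coefficient differs from zero. The sketch below is an assumption about the general shape of such a test, not the authors' exact specification: the function name, the choice of ordinary least squares, and all numbers in the synthetic data are hypothetical.

```python
import numpy as np

def fit_with_time(log_params, years_since_ref, accuracy):
    """OLS fit of accuracy = c0 + c1 * log10(params) + c2 * years.

    Returns the coefficients, their standard errors, and t-statistics.
    """
    log_params = np.asarray(log_params, dtype=float)
    years = np.asarray(years_since_ref, dtype=float)
    y = np.asarray(accuracy, dtype=float)
    X = np.column_stack([np.ones_like(log_params), log_params, years])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    dof = len(y) - X.shape[1]              # residual degrees of freedom
    sigma2 = float(resid @ resid) / dof    # unbiased noise variance estimate
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return coef, se, coef / se

# Synthetic data (NOT real measurements): one series with no time drift,
# one with an assumed drift of 0.05 accuracy points per year.
rng = np.random.default_rng(1)
n = 200
log_n = rng.uniform(8, 12, size=n)
years = rng.uniform(0, 4, size=n)          # years since a reference date
noise = rng.normal(0, 0.01, size=n)
flat = -0.70 + 0.10 * log_n + noise
drift = flat + 0.05 * years

coef_flat, se_flat, t_flat = fit_with_time(log_n, years, flat)
coef_drift, se_drift, t_drift = fit_with_time(log_n, years, drift)
print("no-drift time coefficient:", coef_flat[2], "t =", t_flat[2])
print("drift time coefficient:   ", coef_drift[2], "t =", t_drift[2])
```

In this toy setup the drifting series produces a large t-statistic on the time term while the flat series does not, which is the contrast the talk draws between reasoning benchmarks and factual recall.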
Beyond parameter estimation, the tier 5 and tier 6 answer sets create a knowledge fingerprint unique to each model. Hallucination similarity separates model pairs into three regimes: shared base models, post-trained lineage, and independent retrains, enabling lineage detection even when vendors do not disclose training provenance.
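The talk does not define how hallucination similarity is computed, so the following is only one plausible sketch: among the questions that both models get wrong, count how often they give the same wrong answer. The function name, the normalization step, and the toy answers are all hypothetical.

```python
def normalize(answer):
    """Crude normalization so trivial formatting differences do not count."""
    return " ".join(answer.strip().lower().split())

def hallucination_similarity(answers_a, answers_b, gold):
    """Share of jointly-missed questions where both models give the same wrong answer.

    Each argument maps a question id to an answer string. Returns 0.0 when
    there is no question that both models answered incorrectly.
    """
    both_wrong = [
        q for q in gold
        if q in answers_a and q in answers_b
        and normalize(answers_a[q]) != normalize(gold[q])
        and normalize(answers_b[q]) != normalize(gold[q])
    ]
    if not both_wrong:
        return 0.0
    same = sum(normalize(answers_a[q]) == normalize(answers_b[q]) for q in both_wrong)
    return same / len(both_wrong)

# Toy example with made-up questions and answers.
gold = {"q1": "1887", "q2": "Oslo", "q3": "Kim", "q4": "blue"}
model_a = {"q1": "1887", "q2": "Bergen", "q3": "Lee", "q4": "Red"}
model_b = {"q1": "1890", "q2": "bergen", "q3": "Park", "q4": "red"}

score = hallucination_similarity(model_a, model_b, gold)
print(f"hallucination similarity: {score:.2f}")
```

Two models that share a base would be expected to score high on a measure like this, since they tend to invent the same wrong answers, while independently trained models would tend to score low. Where the boundaries between the three regimes fall is not stated in the talk and is not assumed here.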
The tier 7 ceiling anchors the current knowledge frontier and ensures the benchmark will not saturate soon. Even the largest frontier models achieve near-zero accuracy at tier 7, leaving a long tail of rare facts that future scaling must earn, not shortcut. You can explore the full benchmark, code, and parameter estimates at EmergentMind.com, where you can also create videos like this one to dive deeper into the research.