
The Limits of Assumption-free Tests for Algorithm Performance (2402.07388v3)

Published 12 Feb 2024 in math.ST, cs.LG, stat.ML, and stat.TH

Abstract: Algorithm evaluation and comparison are fundamental questions in machine learning and statistics: how well does an algorithm perform at a given modeling task, and which algorithm performs best? Many methods have been developed to assess algorithm performance, often based on cross-validation-type strategies that retrain the algorithm of interest on different subsets of the data and assess its performance on the held-out data points. Despite the broad use of such procedures, their theoretical properties are not yet fully understood. In this work, we explore some fundamental limits on answering these questions with limited amounts of data. In particular, we distinguish between two questions: how good is an algorithm $A$ at the problem of learning from a training set of size $n$, versus how good is a particular fitted model produced by running $A$ on a particular training data set of size $n$? Our main results prove that, for any test that treats the algorithm $A$ as a "black box" (i.e., we can only study the behavior of $A$ empirically), there is a fundamental limit on our ability to carry out inference on the performance of $A$ unless the number of available data points $N$ is many times larger than the sample size $n$ of interest. (On the other hand, evaluating the performance of a particular fitted model is easy as long as a holdout data set is available, that is, as long as $N - n$ is not too small.) We also ask whether an assumption of algorithmic stability might be sufficient to circumvent this hardness result. Surprisingly, we find that this is not the case: the same hardness result still holds for the problem of evaluating the performance of $A$, aside from a high-stability regime where fitted models are essentially nonrandom. Finally, we establish similar hardness results for the problem of comparing multiple algorithms.
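To make the abstract's central distinction concrete, here is a minimal sketch, not taken from the paper: the synthetic data, the choice of ridge regression as the algorithm $A$, and all parameter values are illustrative assumptions. It contrasts a cross-validation estimate of algorithm-level performance with a holdout estimate of one fitted model's performance.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Synthetic data: N available points in total; the "sample size of interest" is n.
N, n, d = 500, 100, 10
beta = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = X @ beta + rng.normal(size=N)

# Question 1 (algorithm-level): how good is algorithm A at learning from n points?
# Cross-validation retrains A on subsets of the data and scores it on held-out folds.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
cv_errors = []
for train_idx, test_idx in kf.split(X[:n]):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    cv_errors.append(np.mean((model.predict(X[test_idx]) - y[test_idx]) ** 2))
print("CV estimate of algorithm-level risk:", np.mean(cv_errors))

# Question 2 (model-level): how good is this one particular fitted model?
# Fit once on n points and evaluate on the N - n holdout points; per the abstract,
# this is the "easy" problem as long as N - n is not too small.
fitted = Ridge(alpha=1.0).fit(X[:n], y[:n])
holdout_mse = np.mean((fitted.predict(X[n:]) - y[n:]) ** 2)
print("Holdout estimate of the fitted model's risk:", holdout_mse)
```

Note that each cross-validation fold retrains $A$ on fewer than $n$ points (here $4n/5$), so even this familiar procedure does not directly target performance at sample size $n$, which is the regime where the paper's black-box hardness results apply.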
