
Theoretical guarantees on the best-of-n alignment policy (2401.01879v3)

Published 3 Jan 2024 in cs.LG, cs.CL, cs.IT, and math.IT

Abstract: A simple and effective method for inference-time alignment and test-time compute scaling of generative models is best-of-$n$ sampling, where $n$ samples are drawn from a reference policy, ranked by a reward function, and the highest-ranking one is selected. A commonly used analytical expression in the literature claims that the KL divergence between the best-of-$n$ policy and the reference policy is equal to $\log (n) - (n-1)/n.$ We disprove this claim and show that the expression is an upper bound on the actual KL divergence. We explore the tightness of this upper bound in different regimes, propose a new estimator for the KL divergence, and empirically show that it provides a tight approximation. We also show that the win rate of the best-of-$n$ policy against the reference policy is upper bounded by $n/(n+1)$ and derive bounds on the tightness of this characterization. We conclude by analyzing the tradeoffs between win rate and KL divergence of the best-of-$n$ alignment policy, showing that very good tradeoffs are achievable with $n < 1000$.
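
To make the quantities in the abstract concrete, here is a minimal Python sketch of best-of-$n$ sampling alongside the two closed-form bounds it discusses: the commonly cited expression $\log(n) - (n-1)/n$ (shown in the paper to be an upper bound on the KL divergence, not an equality) and the win-rate bound $n/(n+1)$. The function names (`best_of_n_sample`, `reference_sampler`, `reward_fn`) and the toy integer-valued reference policy are illustrative assumptions, not part of the paper.

```python
import math
import random


def best_of_n_sample(reference_sampler, reward_fn, n):
    """Draw n candidates from the reference policy and return the highest-reward one."""
    candidates = [reference_sampler() for _ in range(n)]
    return max(candidates, key=reward_fn)


def kl_upper_bound(n):
    """Commonly cited analytical expression log(n) - (n-1)/n; per the paper, this is
    an upper bound on KL(best-of-n || reference), not the exact value."""
    return math.log(n) - (n - 1) / n


def win_rate_upper_bound(n):
    """Upper bound n/(n+1) on the win rate of the best-of-n policy against the reference."""
    return n / (n + 1)


if __name__ == "__main__":
    # Toy illustration: a hypothetical reference policy over integers 0..9,
    # with reward equal to the sampled value (both are assumptions for the demo).
    ref = lambda: random.randint(0, 9)
    reward = lambda x: x
    for n in (1, 4, 16, 64):
        sample = best_of_n_sample(ref, reward, n)
        print(f"n={n:3d}  sample={sample}  "
              f"KL upper bound={kl_upper_bound(n):.3f}  "
              f"win-rate upper bound={win_rate_upper_bound(n):.3f}")
```

In an actual alignment setting, `reference_sampler` would draw full generations from a reference language model and `reward_fn` would score them with a learned reward model; the selection step itself is unchanged.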

