
Theoretical guarantees on the best-of-n alignment policy

(arXiv:2401.01879)
Published Jan 3, 2024 in cs.LG, cs.CL, cs.IT, and math.IT

Abstract

A simple and effective method for the alignment of generative models is the best-of-$n$ policy, where $n$ samples are drawn from a base policy, ranked according to a reward function, and the highest-ranking one is selected. A commonly used analytical expression in the literature claims that the KL divergence between the best-of-$n$ policy and the base policy is equal to $\log(n) - (n-1)/n$. We disprove the validity of this claim and show that it is an upper bound on the actual KL divergence. We also explore the tightness of this upper bound in different regimes. Finally, we propose a new estimator for the KL divergence and empirically show, through a few examples, that it provides a tight approximation.
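To illustrate the abstract's point, here is a minimal sketch (not taken from the paper) that compares the exact KL divergence between a best-of-$n$ policy and its base policy with the commonly cited expression $\log(n) - (n-1)/n$. It assumes a toy discrete base distribution with distinct rewards; the probabilities, function names, and outcome ordering are illustrative assumptions.

```python
import numpy as np

def best_of_n_distribution(p, n):
    """Exact pmf of the best-of-n policy over a discrete base policy `p`,
    assuming outcomes are indexed in increasing order of reward with no
    ties, so the selected sample is the maximum of n i.i.d. draws."""
    cdf = np.cumsum(p)
    cdf_prev = np.concatenate(([0.0], cdf[:-1]))
    return cdf ** n - cdf_prev ** n

def kl_divergence(q, p):
    """KL(q || p) for discrete distributions on the same support."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

# Toy base policy over five outcomes, sorted by increasing reward
# (the probabilities are illustrative, not taken from the paper).
base = np.array([0.4, 0.3, 0.15, 0.1, 0.05])

for n in (2, 4, 8, 16):
    bon = best_of_n_distribution(base, n)
    exact_kl = kl_divergence(bon, base)
    claimed = np.log(n) - (n - 1) / n  # the commonly cited expression
    print(f"n={n:2d}  KL(best-of-n || base) = {exact_kl:.4f}  "
          f"log(n) - (n-1)/n = {claimed:.4f}")
```

In this toy setting the exact KL divergence stays below $\log(n) - (n-1)/n$ for every $n$, consistent with the abstract's claim that the analytical expression is an upper bound rather than an equality.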

