Zeroth-Order Supervised Policy Improvement (2006.06600v2)

Published 11 Jun 2020 in cs.LG, cs.AI, and stat.ML

Abstract: Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL). However, PG algorithms only exploit the learned value function locally through first-order updates, which limits their sample efficiency. In this work, we propose an alternative method called Zeroth-Order Supervised Policy Improvement (ZOSPI). ZOSPI exploits the estimated value function $Q$ globally while preserving the local exploitation of PG methods, based on zeroth-order policy optimization. This learning paradigm follows Q-learning but overcomes the difficulty of efficiently performing the argmax operation in a continuous action space: it finds the max-valued action within a small number of samples. Policy learning in ZOSPI has two steps: first, it samples actions and evaluates them with a learned value estimator; then, it learns to perform the action with the highest value through supervised learning. We further demonstrate that such a supervised learning framework can learn multi-modal policies. Experiments show that ZOSPI achieves competitive results on continuous control benchmarks with remarkable sample efficiency.
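A minimal sketch of the two-step policy update described in the abstract, assuming a learned Q-network `q_net` and a deterministic policy network `policy_net` (the function name, candidate-sampling scheme, and hyperparameters here are illustrative assumptions, not the authors' exact implementation):

```python
import torch

def zospi_policy_update(policy_net, q_net, states, n_samples=50,
                        action_dim=6, action_low=-1.0, action_high=1.0):
    """One ZOSPI-style policy improvement step (illustrative sketch).

    1) Sample candidate actions for each state and score them with the
       learned Q estimator (zeroth-order, global exploitation of Q).
    2) Regress the policy toward the highest-valued candidate
       (supervised learning step).
    """
    batch_size = states.shape[0]

    # Step 1: sample candidate actions uniformly inside the action box.
    # (The paper also considers perturbations around the current policy.)
    candidates = torch.empty(batch_size, n_samples, action_dim).uniform_(
        action_low, action_high)

    with torch.no_grad():
        # Evaluate Q(s, a) for every candidate action.
        expanded_states = states.unsqueeze(1).expand(-1, n_samples, -1)
        q_values = q_net(expanded_states.reshape(-1, states.shape[-1]),
                         candidates.reshape(-1, action_dim))
        q_values = q_values.reshape(batch_size, n_samples)

        # Pick the best-scoring candidate per state.
        best_idx = q_values.argmax(dim=1)
        target_actions = candidates[torch.arange(batch_size), best_idx]

    # Step 2: supervised regression of the policy onto the best actions.
    predicted_actions = policy_net(states)
    loss = torch.nn.functional.mse_loss(predicted_actions, target_actions)
    return loss  # caller backpropagates and steps the policy optimizer
```

In a training loop, this loss would be computed on a minibatch drawn from the replay buffer and minimized alongside a standard Q-learning update for `q_net`.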

Authors (7)
  1. Hao Sun (383 papers)
  2. Ziping Xu (15 papers)
  3. Yuhang Song (36 papers)
  4. Meng Fang (100 papers)
  5. Jiechao Xiong (21 papers)
  6. Bo Dai (245 papers)
  7. Bolei Zhou (134 papers)
Citations (9)
