
Discerning Temporal Difference Learning (2310.08091v2)

Published 12 Oct 2023 in cs.LG and cs.AI

Abstract: Temporal difference learning (TD) is a foundational concept in reinforcement learning (RL), aimed at efficiently assessing a policy's value function. TD($\lambda$), a potent variant, incorporates a memory trace to distribute the prediction error into the historical context. However, this approach often neglects the significance of historical states and the relative importance of propagating the TD error, influenced by challenges such as visitation imbalance or outcome noise. To address this, we propose a novel TD algorithm named discerning TD learning (DTD), which allows flexible emphasis functions, predetermined or adapted during training, to allocate learning effort effectively across states. We establish the convergence properties of our method within a specific class of emphasis functions and showcase its promising potential for adaptation to deep RL contexts. Empirical results underscore that employing a judicious emphasis function not only improves value estimation but also expedites learning across diverse scenarios.
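
To make the idea in the abstract concrete, the sketch below shows a tabular TD(lambda) update on a small random-walk task in which a per-state emphasis weight scales how strongly each visited state's eligibility trace accumulates. The environment, the function names, and this particular way of applying the emphasis weights are illustrative assumptions for exposition only, not the exact DTD update rule derived in the paper.

import numpy as np

rng = np.random.default_rng(0)

N = 7                       # states 0..6; states 0 and 6 are terminal
TERMINAL = {0, N - 1}

def step(s):
    # Random-walk transition: move left or right with equal probability.
    # Reward is +1 on reaching the right terminal state, 0 otherwise.
    s_next = s + rng.choice([-1, 1])
    reward = 1.0 if s_next == N - 1 else 0.0
    return s_next, reward, s_next in TERMINAL

def emphasized_td_lambda(emphasis, gamma=1.0, lam=0.9, alpha=0.05, episodes=2000):
    # Tabular TD(lambda) with accumulating traces, where a per-state
    # `emphasis` weight scales how strongly each visited state's trace grows.
    # With emphasis = np.ones(N) this reduces to ordinary TD(lambda); the
    # emphasis weighting here is an assumption made for illustration,
    # not the paper's exact DTD algorithm.
    V = np.zeros(N)
    for _ in range(episodes):
        s = N // 2                      # start in the middle state
        e = np.zeros(N)                 # eligibility trace
        done = False
        while not done:
            s_next, r, done = step(s)
            target = r + (0.0 if done else gamma * V[s_next])
            delta = target - V[s]       # TD error at the current state
            e *= gamma * lam            # decay all traces
            e[s] += emphasis[s]         # accumulate, scaled by emphasis
            V += alpha * delta * e      # distribute the error along the trace
            s = s_next
    return V

# Uniform emphasis recovers plain TD(lambda); non-uniform weights shift
# learning effort toward selected states.
print(emphasized_td_lambda(np.ones(N)))
print(emphasized_td_lambda(np.linspace(0.5, 1.5, N)))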

Authors (1)
  1. Jianfei Ma
