
Directly Attention Loss Adjusted Prioritized Experience Replay (2311.14390v1)

Published 24 Nov 2023 in cs.LG and cs.AI

Abstract: Prioritized Experience Replay (PER) enables the model to learn more from relatively important samples by artificially changing how often they are accessed. However, this non-uniform sampling shifts the state-action distribution originally used to estimate Q-value functions, which introduces estimation deviation. In this article, a novel off-policy reinforcement learning training framework called Directly Attention Loss Adjusted Prioritized Experience Replay (DALAP) is proposed, which directly quantifies the extent of the distribution shift through a Parallel Self-Attention network, so as to accurately compensate for the error. In addition, a Priority-Encouragement mechanism is designed to optimize the sample screening criterion and further improve training efficiency. To verify the effectiveness and generality of DALAP, we integrate it with value-function-based, policy-gradient-based, and multi-agent reinforcement learning algorithms, respectively. Multiple groups of comparative experiments show that DALAP offers the significant advantages of both improving the convergence rate and reducing the training variance.


Summary

  • The paper introduces the DALAP framework, which directly quantifies distribution shifts to adjust the importance sampling parameter in PER.
  • The paper integrates a Parallel Self-Attention Network and a Priority-Encouragement mechanism to balance exploration and exploitation effectively.
  • The paper’s experiments demonstrate DALAP's superior convergence speed and reduced variance compared to conventional PER methods in diverse RL settings.

Directly Attention Loss Adjusted Prioritized Experience Replay

This essay provides an analysis of the paper titled "Directly Attention Loss Adjusted Prioritized Experience Replay" (2311.14390), focusing on the theoretical foundation and practical applications of the proposed DALAP algorithm in reinforcement learning environments.

Introduction to Prioritized Experience Replay (PER)

Prioritized Experience Replay (PER) alters the frequency of sampled experience transitions to focus learning on more significant samples, using the temporal-difference (TD) error as the prioritization criterion. This method can, however, shift the state-action distribution, leading to estimation bias in learned value functions. PER adjusts for this bias with an importance-sampling weight governed by the hyperparameter $\beta$, but this adjustment can introduce additional errors because $\beta$ follows a predefined linear schedule.
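For context, standard PER corrects for its non-uniform sampling with importance-sampling weights and anneals $\beta$ on a fixed linear schedule that is independent of how far the distribution has actually shifted. Below is a minimal sketch of that baseline behavior; the function names and the $\beta_0 = 0.4$ default are illustrative, following common PER implementations rather than this paper.

```python
import numpy as np

def per_weights(priorities, beta, alpha=0.6):
    """Importance-sampling weights for standard PER (Schaul et al., 2016)."""
    probs = priorities ** alpha
    probs = probs / probs.sum()          # sampling distribution P(i)
    n = len(priorities)
    weights = (n * probs) ** (-beta)     # correct for non-uniform sampling
    return weights / weights.max()       # normalize by the max weight for stability

def linear_beta(step, total_steps, beta0=0.4):
    """Predefined linear schedule: beta0 -> 1.0 over the course of training."""
    return min(1.0, beta0 + (1.0 - beta0) * step / total_steps)
```

DALAP's criticism is precisely that this schedule is predefined rather than driven by the measured distribution shift.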

The DALAP Framework

Overview

The Directly Attention Loss Adjusted Prioritized Experience Replay (DALAP) framework advances existing PER methodologies by directly quantifying the distribution shift in order to adjust $\beta$ more accurately. DALAP utilizes a Parallel Self-Attention Network (PSAN) and introduces a Priority-Encouragement mechanism to improve performance in reinforcement learning.

Theoretical Foundation

DALAP begins by establishing a theoretical relationship between the estimation error and the PER-induced distribution shift. It posits that the estimation error is positively correlated with the hyperparameter $\beta$, which is responsible for error correction. The importance-sampling weight function $f(\beta)$ is shown to be critical in reducing errors due to the shift:

$$f(\beta) = \left(\frac{\min_{j} P(j)}{P(i)}\right)^{\beta}$$

This equation emphasizes the need to calibrate $\beta$ in response to the distribution shift's effect on learning dynamics.
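As a small numeric illustration of this correction (a sketch based on the formula above; the example distribution and names are ours), the weight assigned to an over-sampled, high-priority transition shrinks toward full correction as $\beta$ approaches 1:

```python
import numpy as np

def f_beta(P, i, beta):
    """f(beta) = (min_j P(j) / P(i)) ** beta for sampling distribution P."""
    return (P.min() / P[i]) ** beta

P = np.array([0.5, 0.3, 0.15, 0.05])     # example priority-based sampling distribution
for beta in (0.0, 0.4, 1.0):
    print(beta, f_beta(P, 0, beta))      # weight of the most-sampled transition
# beta = 0.0 -> 1.00  (no correction)
# beta = 0.4 -> ~0.40
# beta = 1.0 -> 0.10  (full correction down-weights the over-sampled transition)
```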

Parallel Self-Attention Network

The PSAN is an innovative architectural component that assesses the distribution impact of PER by comparing randomized uniform sampling (RUS) and priority-based sampling (PS). It consists of two self-attention networks running in parallel that compute the similarity of state-action distributions processed through the two sampling methods.

Figure 1: Parallel Self-Attention Network.

This network quantifies the Similarity-Increment ($\Delta_i$) caused by priority-based sampling:

$$\Delta_i = I_p - I_t$$

where $I_p$ is the distribution similarity after priority sampling and $I_t$ is the inherent similarity under uniform sampling. The result, $\Delta_i$, provides a refined measure for setting $\beta$, since it represents the true shift in distribution caused by PER.
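This summary does not give the network details, but the idea can be sketched as two attention encoders run in parallel, one over uniformly sampled batches and one over priority-sampled batches, with a similarity read-out between their pooled embeddings. Everything below (layer sizes, mean pooling, the cosine-similarity read-out, and the interpretation of $I_p$ and $I_t$) is our assumption, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionEncoder(nn.Module):
    """One branch of the parallel network: self-attention over a batch of
    (state, action) vectors, mean-pooled into a single batch embedding."""
    def __init__(self, sa_dim, embed_dim=64, heads=4):
        super().__init__()
        self.proj = nn.Linear(sa_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, heads, batch_first=True)

    def forward(self, sa_batch):                 # sa_batch: (B, sa_dim)
        x = self.proj(sa_batch).unsqueeze(0)     # treat the batch as one sequence
        out, _ = self.attn(x, x, x)
        return out.mean(dim=1).squeeze(0)        # pooled embedding, shape (embed_dim,)

def similarity_increment(enc_uniform, enc_priority, uniform_a, uniform_b, priority_batch):
    """Delta_i = I_p - I_t under one possible reading of the summary:
    I_t is the inherent similarity between two uniformly sampled batches,
    I_p is the similarity between a priority-sampled batch and a uniform one."""
    I_t = F.cosine_similarity(enc_uniform(uniform_a), enc_uniform(uniform_b), dim=0)
    I_p = F.cosine_similarity(enc_priority(priority_batch), enc_uniform(uniform_a), dim=0)
    return I_p - I_t    # feeds the beta adjustment; the exact mapping is not specified here
```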

Priority-Encouragement Mechanism

The Priority-Encouragement (PE) mechanism addresses the diversity limitations of existing PER variants by increasing the probability of sampling transitions adjacent to a high-priority one, with a boost that decays with distance. The decay is embedded in PE to ensure computational feasibility and maintain performance efficiency:

$$p_{n-i} = \min\left(p_n \rho^{i} + p_{n-i},\; p_n\right)$$

Here, $\rho$ adapts over time, diminishing the priority encouragement as learning progresses and thereby balancing exploration and exploitation effectively.
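A minimal sketch of how this update could be applied inside a replay buffer follows; the window size, the loop direction, and the parameter names are our assumptions, since the summary only gives the update rule itself.

```python
import numpy as np

def encourage_priorities(priorities, n, rho, window=5):
    """Boost the priorities of the `window` transitions preceding a high-priority
    transition at index n, decaying by rho per step and capped at p_n."""
    p = priorities.copy()
    p_n = p[n]
    for i in range(1, min(window, n) + 1):
        p[n - i] = min(p_n * rho ** i + p[n - i], p_n)
    return p

# Example: neighbours of a high-priority transition receive a decaying boost.
pri = np.array([0.1, 0.1, 0.1, 0.1, 2.0])
print(encourage_priorities(pri, n=4, rho=0.5))
# -> [0.225, 0.35, 0.6, 1.1, 2.0]
```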

Experimental Validation

Integration with Learning Algorithms

DALAP has been evaluated with reinforcement learning algorithms including DQN, DDPG, and MADDPG across environments such as CartPole-v0 and multi-agent domains. Notably, it surpasses alternatives such as ALAP, LAP, and conventional PER in both convergence speed and training variance.
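The integration point is the same as for other PER variants: the correction weights scale the per-sample TD loss of the host algorithm. Below is a hedged sketch of a DQN-style update, with `dalap_beta` standing in for the $\beta$ value produced by the PSAN; all names and the choice of Huber loss are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def weighted_td_loss(q_net, target_net, batch, probs, dalap_beta, gamma=0.99):
    """Per-sample TD loss scaled by f(beta) = (min_j P(j) / P(i)) ** beta.
    `probs` holds the sampling probabilities P(i) of the drawn batch, so the
    minimum over the batch approximates the minimum over the buffer."""
    s, a, r, s_next, done = batch
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    weights = (probs.min() / probs) ** dalap_beta        # DALAP-style correction weights
    td_error = F.smooth_l1_loss(q, target, reduction="none")
    return (weights * td_error).mean()
```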

Results

Experiments reveal DALAP's rapid convergence and stable steady-state performance, with consistently lower variance:

Figure 2: Mean rewards of different training frameworks integrated with DQN.

Figure 3: Mean rewards of different training frameworks integrated with DDPG.

Figure 4: Mean rewards of different training frameworks integrated with MADDPG.

These figures illustrate DALAP's faster convergence and greater learning stability, underscoring its applicability across multiple learning paradigms.

Conclusion

DALAP introduces significant enhancements to the PER algorithm by accurately compensating for distribution-induced estimation errors and enriching the sample prioritization strategy. It applies across various reinforcement learning architectures, promising improvements in experimental environments by leveraging a novel attention-based mechanism. Future work could explore further optimization of $\beta$ and integration into more complex, real-world settings.
