- The paper presents data-efficient deep RL methods that reduce the training data needed for vehicle trajectory control by more than an order of magnitude compared with a soft actor-critic (SAC) baseline.
- It shows that REDQ and MBPO, despite being model-free and model-based respectively, achieve similar learning speeds and final performance.
- PETS-MPPI attains stable driving behavior faster than SAC, though with somewhat lower final performance.
Introduction to Data-Efficient Deep Reinforcement Learning
Deep reinforcement learning (RL) has become an important tool for autonomous systems, particularly for the vehicle control tasks that underpin autonomous driving. Unlike optimization-based control methods or imitation-learning approaches that rely on large datasets, RL develops control strategies through interaction with the environment. However, traditional model-free RL algorithms such as the soft actor-critic (SAC) require extensive amounts of training data, which makes them poorly suited to real-world applications. To address this, research has turned toward more data-efficient deep RL methods for vehicle trajectory control.
Novel Approaches to Vehicle Control
Researchers applied three relatively recent data-efficient deep RL methods to vehicle trajectory control (a minimal sketch of REDQ's critic update follows the list):
- Randomized Ensemble Double Q-learning (REDQ)
- Probabilistic Ensembles with Trajectory Sampling and Model Predictive Path Integral optimizer (PETS-MPPI)
- Model-Based Policy Optimization (MBPO)
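The core of REDQ's data efficiency is a large ensemble of critics, a target computed from the minimum over a small random subset of that ensemble, and many gradient updates per environment step. Below is a minimal, hedged sketch of that critic update in PyTorch; the names `policy`, `q_nets`, `q_targets`, and the hyperparameter values are illustrative assumptions, not the paper's implementation.

```python
import random
import torch

N_ENSEMBLE = 10   # number of Q-networks in the ensemble
M_SUBSET = 2      # size of the random subset used for the target
GAMMA = 0.99
UTD_RATIO = 20    # in training, this update runs UTD_RATIO times per environment step

def redq_critic_update(batch, q_nets, q_targets, q_optims, policy, alpha):
    s, a, r, s_next, done = batch
    with torch.no_grad():
        a_next, logp_next = policy.sample(s_next)          # SAC-style action + log-prob
        idx = random.sample(range(N_ENSEMBLE), M_SUBSET)   # random subset of target critics
        q_next = torch.min(
            torch.stack([q_targets[i](s_next, a_next) for i in idx]), dim=0
        ).values
        target = r + GAMMA * (1.0 - done) * (q_next - alpha * logp_next)
    # every critic in the ensemble regresses toward the same subset-min target
    for q, opt in zip(q_nets, q_optims):
        loss = torch.nn.functional.mse_loss(q(s, a), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
```

The high update-to-data ratio is what lets REDQ extract far more learning signal from each environment interaction than standard SAC.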
The standard model-based RL formulation typically used in these approaches proved ill-suited to the specifics of trajectory control. The authors therefore propose a novel model-based prediction scheme: instead of learning a complete state transition model, only the vehicle dynamics are learned, while trajectory deviations are computed from prior knowledge. This split simplifies the learning task and makes the model's predictions more reliable.
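To make the split concrete, the sketch below separates a learned vehicle-dynamics prediction from a purely geometric computation of the trajectory deviations. The state layout, `dyn_model.predict`, and the error definitions are illustrative assumptions, not the authors' code.

```python
import numpy as np

def predict_next_observation(dyn_model, vehicle_state, action, reference_path):
    """vehicle_state: (x, y, yaw, v); reference_path: array of (x, y, yaw) waypoints."""
    # learned part: a dynamics model predicts only the vehicle's own motion
    next_state = dyn_model.predict(vehicle_state, action)

    # known part: deviations from the reference trajectory follow from geometry
    x, y, yaw, v = next_state
    dists = np.hypot(reference_path[:, 0] - x, reference_path[:, 1] - y)
    ref_x, ref_y, ref_yaw = reference_path[np.argmin(dists)]
    # signed lateral offset of the vehicle from the path
    lateral_error = np.cos(ref_yaw) * (y - ref_y) - np.sin(ref_yaw) * (x - ref_x)
    # heading misalignment, wrapped to [-pi, pi]
    heading_error = (yaw - ref_yaw + np.pi) % (2 * np.pi) - np.pi
    return next_state, np.array([lateral_error, heading_error, v])
```

Because the geometric part is exact by construction, prediction errors can only come from the much smaller learned dynamics model.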
Empirical Insights from Simulation
The evaluation of these RL methods in the CARLA simulator, a realistic urban driving environment, showed that they perform on par with or better than SAC. More importantly, they require significantly less interaction data, in some cases more than an order of magnitude less. The key findings include:
- PETS-MPPI achieved stable driving behavior more quickly than SAC, albeit with lower final performance (a rough sketch of the MPPI planning step follows this list).
- REDQ and MBPO matched SAC's final performance while using considerably less data.
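For context on how PETS-MPPI acts without a learned policy, here is a rough sketch of one MPPI planning step on top of a learned dynamics ensemble. The function names, cost definition, and hyperparameters are assumptions for illustration only.

```python
import numpy as np

def mppi_plan(ensemble_step, cost_fn, state, nominal_controls,
              n_samples=256, horizon=20, noise_std=0.3, temperature=1.0):
    """Sample perturbed control sequences, roll them out through the learned
    model, and return a cost-weighted average sequence (MPC-style)."""
    noise = np.random.normal(0.0, noise_std,
                             size=(n_samples, horizon, nominal_controls.shape[-1]))
    controls = nominal_controls[None] + noise            # [n_samples, horizon, act_dim]
    costs = np.zeros(n_samples)
    states = np.repeat(state[None], n_samples, axis=0)
    for t in range(horizon):
        states = ensemble_step(states, controls[:, t])   # learned probabilistic dynamics
        costs += cost_fn(states, controls[:, t])         # e.g. trajectory deviation + comfort
    weights = np.exp(-(costs - costs.min()) / temperature)
    weights /= weights.sum()
    return np.einsum('n,nha->ha', weights, controls)     # only the first action is executed
```

Because planning only needs a usable dynamics model rather than a converged policy, this kind of controller can behave sensibly early in training, which is consistent with the fast onset of stable driving reported above.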
Model-Free and Model-Based Dichotomy
Interestingly, REDQ and MBPO displayed similar learning speeds and asymptotic performance despite their different underlying frameworks: REDQ is model-free, whereas MBPO is model-based and augments the replay data with short synthetic rollouts from a learned dynamics model. SAC eventually reached a comparable control policy, but it needed substantially more training data to get there.
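As a hedged illustration of how MBPO augments the data, the sketch below branches short synthetic rollouts from real states and stores them alongside the real transitions. The buffer and model APIs (`sample_states`, `predict`, `add`) are assumptions, not MBPO's reference implementation.

```python
import numpy as np

def generate_model_rollouts(model, policy, env_buffer, model_buffer,
                            n_branches=400, rollout_len=5):
    states = env_buffer.sample_states(n_branches)      # branch from real, observed states
    for _ in range(rollout_len):                       # short rollouts limit compounding model error
        actions = policy.act(states)
        next_states, rewards, dones = model.predict(states, actions)  # learned ensemble model
        model_buffer.add(states, actions, rewards, next_states, dones)
        if np.all(dones):
            break
        states = next_states[~dones]                   # continue only the non-terminated branches
    # the SAC learner then trains on a mix of env_buffer and model_buffer transitions
```

This augmentation lets MBPO run many more policy and critic updates per real environment step, which mirrors REDQ's high update-to-data ratio on the model-free side.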
Concluding Remarks
This paper highlights the potential of data-efficient RL for automotive control, substantially reducing the amount of training data needed without compromising performance. These findings can promote the use of RL in settings where data collection is expensive or risky. The work moves us a step closer to learned control for autonomous driving and points toward data-efficient learning in other areas of engineering and technology.