Emergent Mind

NAVIX: Scaling MiniGrid Environments with JAX

(arXiv:2407.19396)
Published Jul 28, 2024 in cs.LG and cs.AI

Abstract

As Deep Reinforcement Learning (Deep RL) research moves towards solving large-scale worlds, efficient environment simulations become crucial for rapid experimentation. However, most existing environments struggle to scale to high throughput, holding back meaningful progress. Environment interactions are typically computed on the CPU, which limits training speed and throughput due to slower computation and the communication overhead of distributing the task across multiple machines. Ultimately, Deep RL training is CPU-bound, and developing batched, fast, and scalable environments has become a frontier for progress. Among the most widely used Reinforcement Learning (RL) environments, MiniGrid is at the foundation of several studies on exploration, curriculum learning, representation learning, diversity, meta-learning, credit assignment, and language-conditioned RL, yet it still suffers from the limitations described above. In this work, we introduce NAVIX, a re-implementation of MiniGrid in JAX. NAVIX achieves over 200,000x speed improvements in batch mode, supporting up to 2048 agents in parallel on a single Nvidia A100 80 GB. This reduces experiment times from one week to 15 minutes, promoting faster design iterations and more scalable RL model development.

Figure: Speedup of NAVIX vs. the original MiniGrid across various environments.

Overview

  • The paper introduces NAVIX, a reimplementation of the MiniGrid environment suite using JAX, which significantly enhances scalability and efficiency for Deep Reinforcement Learning (DRL) research.

  • NAVIX achieves over 200,000x speed improvements in batch mode, allowing up to 2048 agents to train in parallel on a single GPU and drastically reducing experiment times.

  • The design of NAVIX is based on the Entity-Component-System Model (ECSM), ensuring modularity and extensibility, while also providing a comprehensive set of baselines and benchmarks for DRL algorithm evaluation.

Overview of "NAVIX: Scaling MiniGrid Environments with JAX"

The paper "NAVIX: Scaling MiniGrid Environments with JAX" introduces NAVIX, a reimplementation of the popular MiniGrid environment suite using JAX. The motivation for this work stems from the need for more scalable and efficient environment simulations as the field of Deep Reinforcement Learning (DRL) progresses towards solving larger and more complex environments.

Key Contributions

  1. Implementation Efficiency: NAVIX is designed to take full advantage of JAX's capabilities, including its just-in-time (JIT) compilation and vectorized computation (vmap), resulting in over 45x speed improvements on average compared to the original MiniGrid implementation. In batch mode, NAVIX can achieve over 200,000x speed improvements, dramatically reducing experiment times.
  2. Scalability: On a single Nvidia A100 GPU with 80GB of memory, NAVIX can support up to 2048 parallel agents. This scalability is a significant step forward, allowing researchers to train thousands of agents in parallel, facilitating rapid experimentation and development of more robust DRL models.
  3. Reproducibility: NAVIX environments mirror the original MiniGrid in terms of observation spaces, state transitions, rewards, and actions, making it a drop-in replacement. This ensures that historical results obtained with MiniGrid can be reproduced while benefiting from the enhanced performance of NAVIX.
  4. Flexible Design: The paper details the design philosophy based on the Entity-Component-System Model (ECSM), allowing for modular and extensible environment configurations. This design makes NAVIX not only efficient but also adaptable to a wide variety of research needs in DRL.
  5. Baselines and Benchmarks: The authors provide a comprehensive set of baselines using popular DRL algorithms like PPO, DDQN, and SAC, demonstrating the effectiveness of NAVIX across different environments and configurations. They also provide a benchmark suite to facilitate the comparison of new algorithms against state-of-the-art performance.
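The jit + vmap pattern behind these contributions can be sketched with a toy grid world. This is a minimal illustration of how JAX batches environment steps, not NAVIX's actual API: the environment, grid size, and reward rule below are invented for the example.

```python
import jax
import jax.numpy as jnp

# Toy stand-in environment: an agent on a 5x5 grid moving toward a goal
# in the bottom-right corner. Only the jit+vmap pattern is the point here.

GRID_SIZE = 5
MOVES = jnp.array([[0, 1], [0, -1], [1, 0], [-1, 0]])  # right, left, down, up

def step(pos, action):
    """Advance one environment: move the agent, clipping at the walls."""
    new_pos = jnp.clip(pos + MOVES[action], 0, GRID_SIZE - 1)
    reward = jnp.where(jnp.all(new_pos == GRID_SIZE - 1), 1.0, 0.0)
    return new_pos, reward

# vmap turns the single-environment step into a batched one; jit compiles
# the whole batch into fused accelerator kernels instead of a Python loop.
batched_step = jax.jit(jax.vmap(step))

num_envs = 2048
positions = jnp.zeros((num_envs, 2), dtype=jnp.int32)
actions = jnp.zeros((num_envs,), dtype=jnp.int32)  # every agent moves right

positions, rewards = batched_step(positions, actions)
print(positions.shape, rewards.shape)  # (2048, 2) (2048,)
```

All 2048 environments advance in a single call, which is the mechanism that removes the CPU bottleneck described in the abstract.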
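The Entity-Component-System idea maps naturally onto JAX's struct-of-arrays style. The sketch below is a hypothetical layout (the `Entities` tuple, its component names, and the `consume_at` system are invented for illustration, not NAVIX's actual data structures): components are flat arrays, an entity is an index into them, and a "system" is a jitted function over the whole pytree.

```python
import jax
import jax.numpy as jnp
from typing import NamedTuple

class Entities(NamedTuple):
    """Hypothetical component storage: one array per component."""
    position: jnp.ndarray  # (n_entities, 2) grid coordinates
    walkable: jnp.ndarray  # (n_entities,) can the agent step on it?
    consumed: jnp.ndarray  # (n_entities,) e.g. a picked-up key

@jax.jit
def consume_at(entities: Entities, pos: jnp.ndarray) -> Entities:
    """A 'system': mark every entity located at `pos` as consumed."""
    hit = jnp.all(entities.position == pos, axis=1)
    return entities._replace(consumed=entities.consumed | hit)

world = Entities(
    position=jnp.array([[1, 1], [2, 3]]),
    walkable=jnp.array([True, True]),
    consumed=jnp.array([False, False]),
)
world = consume_at(world, jnp.array([2, 3]))
print(world.consumed)  # [False  True]
```

Because `NamedTuple`s are pytrees, systems like this compose freely with jit and vmap, which is what makes an ECSM-style design a good fit for batched simulation.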

Numerical Results and Claims

  • NAVIX achieves speedups of 45-128x over the original MiniGrid implementation in a variety of environments.
  • NAVIX can train up to 2048 agents in parallel on a single GPU, each interacting with its own batch of environments, showcasing the system's scalability.
  • Training 2048 PPO agents on NAVIX-Empty-5x5-v0 takes less than 50 seconds for 1 million steps, compared to the original MiniGrid's 240 seconds for just one agent, demonstrating over 200,000x speed improvement in batch mode.
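The parallel-agent figures rest on the same vmap trick applied to agent parameters rather than environments. A minimal sketch with a hypothetical linear "policy head" and squared-error loss (not the paper's PPO training code) looks like this:

```python
import jax
import jax.numpy as jnp

def loss(params, obs, target):
    """Stand-in objective: a linear head with a squared-error loss."""
    pred = obs @ params
    return jnp.mean((pred - target) ** 2)

def update(params, obs, target, lr=0.1):
    """One gradient step for a single agent."""
    grads = jax.grad(loss)(params, obs, target)
    return params - lr * grads

# vmap over the leading axis runs one optimisation step for every agent
# at once, each with its own parameters and its own batch of data.
parallel_update = jax.jit(jax.vmap(update))

num_agents, obs_dim = 2048, 8
key = jax.random.PRNGKey(0)
params = jax.random.normal(key, (num_agents, obs_dim))
obs = jax.random.normal(key, (num_agents, 16, obs_dim))
targets = jnp.zeros((num_agents, 16))

params = parallel_update(params, obs, targets)
print(params.shape)  # (2048, 8)
```

One jitted call updates all 2048 agents, which is how a week of sequential single-agent experiments compresses into minutes on one accelerator.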

Theoretical and Practical Implications

The introduction of NAVIX has several theoretical and practical implications for the field of DRL:

  1. Large-Scale Experimentation: The ability to run large-scale experiments on a single GPU can significantly accelerate the development and testing of new DRL algorithms. Faster design iterations and reduced experiment times promote more thorough exploration of the algorithmic design space.
  2. High-Throughput Training: The vastly improved throughput allows researchers to perform extensive hyperparameter tuning and investigate meta-learning, curriculum learning, and transfer learning more effectively.
  3. Reproducibility and Standardization: By matching the original MiniGrid environment suite, NAVIX provides a standardized and reproducible platform for benchmarking DRL algorithms, ensuring that advancements in the field are built on a consistent foundation.

Future Developments

Given the impressive results and flexibility of NAVIX, several future developments and research directions can be envisioned:

  1. Expansion of Environment Suite: Extending NAVIX to include a wider variety of environments from other benchmark suites can further enhance its utility.
  2. Integration with Other Libraries: Integrating NAVIX with other popular DRL libraries and frameworks can facilitate its adoption and make it easier for researchers to leverage its capabilities.
  3. Enhanced Observability and Debugging: Developing tools and techniques for enhanced observability and debugging within NAVIX environments can help researchers better understand the behavior of their DRL agents.
  4. Advanced Meta-RL and Multi-Agent Training: The scalability of NAVIX opens new avenues for research in meta-RL and multi-agent RL, where training thousands of agents in parallel can lead to new insights and more effective algorithms.

In conclusion, NAVIX represents a significant advancement in the toolset available to DRL researchers, offering substantial performance improvements, scalability, and flexibility. Its ability to leverage modern accelerators fully makes it a critical resource for pushing the boundaries of what is achievable in DRL research.
