Emergent Mind

Diffusion Models Are Real-Time Game Engines

(2408.14837)
Published Aug 27, 2024 in cs.LG , cs.AI , and cs.CV

Abstract

We present GameNGen, the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality. GameNGen can interactively simulate the classic game DOOM at over 20 frames per second on a single TPU. Next frame prediction achieves a PSNR of 29.4, comparable to lossy JPEG compression. Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation. GameNGen is trained in two phases: (1) an RL-agent learns to play the game and the training sessions are recorded, and (2) a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions. Conditioning augmentations enable stable auto-regressive generation over long trajectories.

GameNGen versus previous state-of-the-art DOOM simulations.

Overview

  • The paper introduces GameNGen, an innovative real-time game simulation engine using an augmented diffusion model based on Stable Diffusion v1.4, capable of simulating the game DOOM at over 20 FPS on a single TPU.

  • GameNGen employs a two-phase training process combining reinforcement learning for agent training and generative diffusion model training, achieving high visual fidelity as evidenced by metrics like PSNR and human evaluations.

  • The research highlights significant implications for game development, suggesting potential reductions in development costs and enhanced interactivity, and paving the way for future applications in simulations and training environments.

Diffusion Models Are Real-Time Game Engines

The paper "Diffusion Models Are Real-Time Game Engines" presents GameNGen, an innovative application of neural models for real-time game simulation. The study demonstrates that diffusion models, specifically an augmented variant of Stable Diffusion v1.4, can simulate the classic game DOOM at a rate of over 20 frames per second on a single TPU. This work shows promising results in using neural models to handle complex, interactive virtual environments, a domain traditionally dominated by manually crafted software systems.

Summary of Core Contributions

GameNGen Architecture: The GameNGen system is built upon a pre-trained Stable Diffusion v1.4 model, which has been adapted for interactive world simulation. The model operates in two training phases: an agent is first trained to play the game using reinforcement learning (RL), and then the generative diffusion model is trained on the accumulated data from the agent’s gameplay. The model is conditioned on sequences of past frames and actions, enabling autoregressive generation of game frames.
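The autoregressive loop described above can be sketched as follows. This is an illustrative skeleton, not the paper's implementation: `denoise` stands in for the conditioned diffusion model, and `CONTEXT_LEN`, `rollout`, and the dummy policy are hypothetical names chosen for the example.

```python
import numpy as np

CONTEXT_LEN = 4  # number of past frames/actions fed to the model (illustrative)

def denoise(frames, actions, rng):
    # Placeholder for the diffusion model's next-frame prediction,
    # conditioned on recent frames and actions. Here it just returns
    # a slightly perturbed copy of the last frame.
    return frames[-1] + rng.normal(0.0, 0.01, frames[-1].shape)

def rollout(initial_frames, policy, steps, rng):
    """Autoregressively generate `steps` new frames from an initial context."""
    frames = list(initial_frames)
    actions = []
    for _ in range(steps):
        actions.append(policy(frames[-1]))          # player/agent picks an action
        next_frame = denoise(frames[-CONTEXT_LEN:], # model sees a sliding window
                             actions[-CONTEXT_LEN:], rng)
        frames.append(next_frame)                   # prediction becomes new context
    return frames
```

The key point is that each generated frame is fed back as context for the next prediction, which is why drift accumulation (addressed below via noise augmentation) matters.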

Performance Metrics and Model Efficacy:

  • Frame Rate: GameNGen achieves real-time performance at over 20 FPS, showing the computational efficiency of the approach.
  • Next Frame Prediction: It achieves a Peak Signal-to-Noise Ratio (PSNR) of 29.4, which is comparable to common lossy JPEG compression levels.
  • Human Evaluation: Human raters could barely distinguish between short clips of the real game and the simulation, indicating high visual fidelity.
  • Noise Augmentation: The model incorporates a noise augmentation technique to mitigate autoregressive drift, crucial for maintaining visual quality over long trajectories.

Simulation Quality and Evaluation

GameNGen is evaluated using various metrics to ensure high simulation quality:

  • Image Quality: When evaluated in a teacher-forcing setup for single-frame prediction, the model achieves a PSNR of 29.43 and an LPIPS of 0.249, metrics indicative of high visual fidelity.
  • Video Quality: Evaluated in an autoregressive context, the model delivers an FVD of 114.02 for 16-frame sequences and 186.23 for 32-frame sequences.
  • Human Evaluation: When tasked with distinguishing between simulated and real game clips, human evaluators did so only slightly better than random chance, underscoring the model's ability to produce visually convincing outputs.
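For reference, PSNR is a standard pixel-level fidelity metric. A minimal implementation (the paper presumably uses a standard library version; this sketch is for clarity only):

```python
import numpy as np

def psnr(ref, pred, max_val=255.0):
    """Peak Signal-to-Noise Ratio between a reference and a predicted frame."""
    mse = np.mean((ref.astype(np.float64) - pred.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

A PSNR around 29-30 dB corresponds roughly to the quality of moderately compressed JPEG images, which is the comparison the authors draw.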

Methodological Details

Data Collection: The agent, trained via PPO (Proximal Policy Optimization) using a simple CNN architecture, generates the training dataset by playing the game in a variety of scenarios. The collected trajectories include diverse gameplay situations, ensuring the training data is rich and varied.
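The data-collection loop can be summarized as: a trained policy plays episodes, and each episode's (frame, action) pairs are logged for diffusion-model training. The sketch below uses a toy environment and illustrative names (`ToyEnv`, `collect_trajectories`); it is not the paper's actual VizDoom/PPO setup.

```python
class ToyEnv:
    """Stand-in for the game environment; observations mimic frames."""
    def reset(self):
        self.t = 0
        return [0.0] * 4  # dummy observation ("frame")

    def step(self, action):
        self.t += 1
        obs = [float(self.t)] * 4
        done = self.t >= 10  # fixed-length episodes for illustration
        return obs, 0.0, done

def collect_trajectories(env, policy, episodes):
    """Log (frames, actions) per episode, as training data for the generator."""
    dataset = []
    for _ in range(episodes):
        frames, actions = [env.reset()], []
        done = False
        while not done:
            a = policy(frames[-1])       # trained agent picks an action
            obs, _, done = env.step(a)
            actions.append(a)
            frames.append(obs)
        dataset.append((frames, actions))
    return dataset
```

Note that each trajectory stores one more frame than actions, since the initial observation precedes the first action.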

Training Procedure:

  • The generative model is re-purposed from Stable Diffusion v1.4, removing text conditioning and introducing embeddings for past actions.
  • The latent decoder of the original diffusion model is fine-tuned to reduce artifacts in fine image details, such as the HUD.
  • DDIM sampling and Classifier-Free Guidance are used during inference to balance quality and computational efficiency.
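Classifier-free guidance combines a conditioned and an unconditioned noise prediction at each denoising step. A minimal sketch of that combination step (in the real model, `eps_cond` and `eps_uncond` would come from the U-Net with and without action conditioning; here they are plain arrays):

```python
import numpy as np

def cfg_combine(eps_cond, eps_uncond, guidance_weight):
    # Guided noise estimate: extrapolate from the unconditional prediction
    # toward the conditional one by the guidance weight.
    return eps_uncond + guidance_weight * (eps_cond - eps_uncond)
```

With weight 1 the guidance is a no-op (pure conditional prediction); weights above 1 push samples to follow the conditioning more strongly at some cost in diversity.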

Mitigation of Autoregressive Drift: By adding Gaussian noise to context frames during training, the model learns to correct inaccuracies over time, which is critical for long-term stability in autoregressive scenarios.
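The idea can be sketched in a few lines: corrupt the context frames with Gaussian noise at a randomly sampled level during training, and pass that level to the model as conditioning so it learns to correct degraded history. The function name and the noise range below are illustrative, not taken from the paper.

```python
import numpy as np

def augment_context(context_frames, rng, max_noise=0.7):
    """Add Gaussian noise at a sampled level to the context frames."""
    level = rng.uniform(0.0, max_noise)  # sampled noise magnitude
    noisy = context_frames + level * rng.normal(size=context_frames.shape)
    # The sampled level would also be fed to the model as conditioning,
    # so it knows how corrupted its context is.
    return noisy, level
```

At inference time, mildly noising the model's own previous outputs in the same way keeps small prediction errors from compounding across the rollout.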

Ablations and Analysis

The authors conduct comprehensive ablations to analyze the contributions of various components:

  • Context Length: Increasing the number of history frames improves the model’s performance, though gains diminish beyond a certain point.
  • Noise Augmentation: Demonstrably enhances autoregressive stability, which is critical for maintaining visual quality over long frame sequences.
  • Agent Play: Comparing training data generated by an agent versus random policy highlights the agent's importance in producing a robust and diverse dataset.

Implications and Future Directions

Practical Implications:

  • The success of GameNGen implies potential reductions in game development costs by automating game environment creation.
  • Enhanced interactivity: Such models can enable novel ways for user interaction within virtual environments, offering adaptability that static, rule-based engines cannot match.

Theoretical Implications:

  • This research points towards a new paradigm in game engine design where neural models supplant manually written code.
  • The method paves the way for further exploration into using generative models for interactive applications beyond gaming, such as simulation and training environments.

Future Work: The paper outlines several future avenues:

  • Extending GameNGen to other games or interactive applications.
  • Addressing the model’s memory constraints by experimenting with architectural modifications to support longer conditioning contexts.
  • Further optimizing model performance for higher frame rates and deployment on consumer hardware.

Conclusion

"Diffusion Models Are Real-Time Game Engines" marks a significant step in applying neural models to a traditionally hand-crafted domain. By demonstrating the feasibility and potential of GameNGen, this research lays the groundwork for an automated, neural-network-driven future in game engine design, potentially transforming both the development and user experience of interactive virtual environments.
