
PolaRiS: Policy Eval & Env Reconstruction

Updated 25 December 2025
  • PolaRiS is a scalable framework that reconstructs interactive simulation environments from minimal real-world data using neural methods and co-training.
  • The system bridges sim-to-real gaps with high fidelity, enabling reproducible, large-scale policy evaluation for both single-agent and multi-agent systems.
  • It integrates techniques like 2D Gaussian Splatting, TSDF fusion, and video world modeling to ensure robust policy benchmarking across diverse applications.

Policy Evaluation and Environment Reconstruction in Simulation (PolaRiS) is a scalable framework designed for high-fidelity evaluation and rapid environment creation in the context of robotic manipulation and multi-agent system policy benchmarking. PolaRiS leverages neural reconstruction techniques and simulation co-training to bridge sim-to-real domain gaps, enabling large-scale, reproducible, and automated policy assessment. The paradigm encompasses both single-agent robotic policy evaluation and broader agent-based simulation contexts, as characterized by its deployments in robotic generalist policy benchmarking and socio-technical systems.

1. Conceptual Foundations and Motivation

Policy evaluation in robotics and multi-agent systems traditionally relies on costly, time-intensive real-world rollouts. The stochasticity and irreproducibility of physical interactions, coupled with the engineering burden of manual simulation environment creation, have hindered scalable benchmarking, especially for generalist policies requiring diverse task coverage. PolaRiS addresses these challenges by reconstructing interactive simulation environments from minimal real-world data—typically short monocular video scans—while integrating simulation co-training protocols to minimize visual and physical domain mismatch. The framework aims to democratize environment creation and facilitate distributed evaluation for foundation models and complex multi-agent systems (Jain et al., 18 Dec 2025).

2. Environment Reconstruction Pipeline

The PolaRiS pipeline begins with rapid capture of real-world scenes. Handheld monocular videos (2–5 min per sequence) are recorded for both static backgrounds and articulated robots. ChArUco boards are used to anchor global scale and orientation. Camera pose estimation is performed using COLMAP, aligning all image frames to a canonical reference.

Neural reconstruction is executed via 2D Gaussian Splatting (2DGS), where each scene element, including the background and each robotic link, is modeled as planar Gaussian disks with center $\mu_j \in \mathbb{R}^3$, orientation $R_j \in SO(3)$, and per-disk covariance $\Sigma_j$. Differentiable rendering projects these disks into each frame with alpha compositing, optimizing the photometric loss:

$$\mathcal{L}(\{g_j\}) = \sum_{i=1}^N \| \hat I_i - I_i \|_2^2 + \lambda_{\text{dist}}\,\mathcal{R}_{\text{dist}} + \lambda_{\text{norm}}\,\mathcal{R}_{\text{norm}}$$
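
The photometric objective above can be sketched directly in NumPy. This is a minimal illustration of the loss computation only: the distortion and normal-consistency regularizers are passed in as precomputed scalars, and the weights `lam_dist` and `lam_norm` are illustrative placeholders, not values from the paper.

```python
import numpy as np

def photometric_loss(rendered, targets, r_dist, r_norm,
                     lam_dist=0.01, lam_norm=0.05):
    """Sum of per-frame squared photometric errors plus the two
    2DGS regularizers (distortion and normal consistency).

    rendered, targets: arrays of shape (N, H, W, 3)
    r_dist, r_norm: scalar regularizer values, computed elsewhere
    lam_dist, lam_norm: illustrative weights (assumptions, not
    the paper's settings).
    """
    residual = rendered - targets
    photo = np.sum(residual ** 2)   # sum_i ||I_hat_i - I_i||_2^2
    return photo + lam_dist * r_dist + lam_norm * r_norm

# Toy check: identical images leave only the regularizer terms.
imgs = np.zeros((2, 4, 4, 3))
loss = photometric_loss(imgs, imgs, r_dist=1.0, r_norm=2.0)
```

In the full pipeline this loss would be minimized by gradient descent over the disk parameters $\{\mu_j, R_j, \Sigma_j\}$ through the differentiable renderer.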

Mesh extraction uses TSDF fusion and Marching Cubes to produce watertight physical meshes suitable for rigid-body dynamics. Miscellaneous object assets are generated using TRELLIS, an image-to-3D generator that yields both Gaussian splat and textured mesh representations. Scene composition is performed in a GUI, where assets are arranged in physical space, parameters are set, and USD scenes are exported for simulation (Jain et al., 18 Dec 2025).
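
The core of TSDF fusion is a running weighted average of truncated signed distances per voxel. A minimal NumPy sketch of that update rule follows; the actual pipeline would additionally project voxels into each depth frame and run Marching Cubes on the fused volume, both omitted here.

```python
import numpy as np

def tsdf_update(tsdf, weights, sdf_obs, trunc=0.05):
    """Fuse one observation into a TSDF volume.

    tsdf, weights: per-voxel running TSDF values and weights
    sdf_obs: signed distance of each voxel to the observed surface
    trunc: truncation band in meters (illustrative value)

    Standard running-average update: distances are clipped to
    [-1, 1] after normalizing by the truncation band, and voxels
    far behind the surface (sdf_obs <= -trunc) are left untouched.
    """
    d = np.clip(sdf_obs / trunc, -1.0, 1.0)
    valid = sdf_obs > -trunc
    w_new = weights + valid.astype(float)    # unit weight per obs
    num = tsdf * weights + d * valid
    fused = np.where(valid, num / np.maximum(w_new, 1e-9), tsdf)
    return fused, w_new
```

Iterating this update over all frames, then thresholding the fused field at zero, yields the implicit surface that Marching Cubes converts into a watertight mesh.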

The resulting environments are fully interactive IsaacSim worlds supporting batch policy rollouts, rendering from both static and moving camera perspectives. In the case of deformable-object tasks, PolaRiS integrates dense spring-mass system modeling (PhysTwin) and PhysTwin-driven system identification for real-world dynamics (Zhang et al., 6 Nov 2025).
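
A dense spring-mass system of the kind PhysTwin-style models use can be sketched as a single explicit-Euler integration step. All constants below (stiffness, damping, mass, timestep) are illustrative placeholders; PhysTwin identifies the actual parameters from real-world observations.

```python
import numpy as np

def spring_mass_step(x, v, edges, rest_len, k=100.0, mass=0.01,
                     damping=0.98, dt=1e-3, gravity=(0.0, 0.0, -9.81)):
    """One explicit-Euler step of a spring-mass system.

    x, v: (P, 3) particle positions and velocities
    edges: (E, 2) integer array of spring endpoints
    rest_len: (E,) spring rest lengths
    """
    f = np.tile(np.asarray(gravity) * mass, (len(x), 1))
    i, j = edges[:, 0], edges[:, 1]
    d = x[j] - x[i]
    length = np.linalg.norm(d, axis=1, keepdims=True)
    dirn = d / np.maximum(length, 1e-9)
    fs = k * (length - rest_len[:, None]) * dirn   # Hooke's law
    np.add.at(f, i, fs)    # spring pulls i toward j when stretched
    np.add.at(f, j, -fs)   # equal and opposite on j
    v = damping * (v + dt * f / mass)
    return x + dt * v, v
```

Real deformable-object simulators add collision handling and use stabler integrators; this sketch only shows the force model whose parameters system identification must recover.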

3. Simulation Co-Training and Domain Adaptation

To address residual domain gaps, PolaRiS employs a simulation co-training protocol. A small out-of-domain simulated demonstration dataset is generated:

$$\mathcal{D}_\mathcal{S} = \{(o_t^\mathcal{S},\, a_t^\mathcal{S},\, p_t^\mathcal{S})\}_{t=1}^M$$

where $o_t^\mathcal{S}$ are rendered observations, $a_t^\mathcal{S}$ are teleoperated actions, and $p_t^\mathcal{S}$ are proprioceptive states. Generalist robot policies (e.g., Vision-Language-Action models) are fine-tuned by mixing real pretraining data $\mathcal{D}_{\text{pre}}$ and simulated data $\mathcal{D}_\mathcal{S}$:

$$\theta \leftarrow \theta - \eta\, \nabla_\theta\, \mathbb{E}_{(o,a,p)\sim (1-\lambda)\,\mathcal{D}_{\text{pre}} + \lambda\,\mathcal{D}_\mathcal{S}} \left[ \mathcal{L}_{\text{BC}}(\pi_\theta(o,p),\, a) \right]$$

with $\lambda \approx 0.1$ (Jain et al., 18 Dec 2025). This co-training step aligns visual and physical features between domains, enabling zero-shot generalization to unseen scenes without environment-specific adjustment.
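
The mixture sampling in the co-training update above can be sketched as follows: each example in a behavior-cloning batch is drawn from the simulated dataset with probability $\lambda$ and from the real pretraining dataset otherwise. The dataset representation (lists of observation/action/proprioception tuples) is a simplifying assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_cotraining_batch(d_pre, d_sim, batch_size=8, lam=0.1):
    """Draw a behavior-cloning batch from the lambda-mixture of
    real pretraining data and simulated demonstrations.

    d_pre, d_sim: lists of (obs, action, proprio) tuples
    lam: probability mass assigned to the simulated dataset
    """
    batch = []
    for _ in range(batch_size):
        source = d_sim if rng.random() < lam else d_pre
        batch.append(source[rng.integers(len(source))])
    return batch
```

Each batch then feeds a standard BC gradient step on $\mathcal{L}_{\text{BC}}$; with $\lambda \approx 0.1$, roughly one example in ten comes from simulation.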

4. Policy Evaluation Methodologies

In PolaRiS, policy evaluation comprises large-scale simulated rollouts and statistical metrics quantifying sim-to-real fidelity. Each scene and policy configuration is subject to $N$ repeat rollouts, and success is scored on dense rubrics (a $[0,1]$ range, or binary for completion). Two primary statistical measures are used:

  • Pearson Correlation Coefficient ($r$): Quantifies the linear relationship between real-world performance $R_i$ and simulated performance $R_{\mathcal{S},i}$ over candidate policies:

$$r(R, R_\mathcal{S}) = \frac{\sum_i (R_i - \bar R)(R_{\mathcal{S},i} - \bar R_\mathcal{S})}{\sqrt{\sum_i (R_i - \bar R)^2}\,\sqrt{\sum_i (R_{\mathcal{S},i} - \bar R_\mathcal{S})^2}}$$

  • Mean Maximum Rank Violation (MMRV): Assesses the severity of policy mis-ranking induced by simulation, as the maximal margin of misordered pairs:

$$\mathrm{MMRV}(R, R_\mathcal{S}) = \frac{1}{N} \sum_{i=1}^N \max_{j}\, |R_i - R_j|\, \mathbf{1}\!\left[ (R_{\mathcal{S},i} < R_{\mathcal{S},j}) \neq (R_i < R_j) \right]$$

Empirical results demonstrate that PolaRiS achieves $r = 0.81$–$0.95$ per scene, outperforming Libero-Score and Ctrl-World metrics by substantial margins (Jain et al., 18 Dec 2025), with stringent MMRV ($< 0.05$ in typical robotic manipulation benchmarks). For deformable-object evaluation, the system achieves $r > 0.90$ correlations and detailed agreement in policy behavioral patterns (Zhang et al., 6 Nov 2025).
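
Both metrics follow directly from their definitions; a self-contained NumPy sketch:

```python
import numpy as np

def pearson_r(real, sim):
    """Pearson correlation between real and simulated returns."""
    real, sim = np.asarray(real, float), np.asarray(sim, float)
    dr, ds = real - real.mean(), sim - sim.mean()
    return float((dr * ds).sum() /
                 np.sqrt((dr ** 2).sum() * (ds ** 2).sum()))

def mmrv(real, sim):
    """Mean Maximum Rank Violation: for each policy i, take the
    largest real-performance gap |R_i - R_j| among partners j whose
    ordering the simulator gets wrong, then average over policies."""
    real, sim = np.asarray(real, float), np.asarray(sim, float)
    total = 0.0
    for i in range(len(real)):
        gaps = np.abs(real[i] - real)
        wrong = (sim[i] < sim) != (real[i] < real)  # misordered pairs
        total += (gaps * wrong).max()
    return total / len(real)
```

Low MMRV means the simulator may misrank policies only when their real-world performance is nearly tied, which is the property that makes simulated evaluation a trustworthy proxy for policy selection.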

5. Integration with Video World Models and Agent-Based Simulation

PolaRiS can directly interoperate with action-conditioned video world models for policy evaluation. Diffusion-transformer architectures, such as Cosmos-Predict2-Video2World-2B, serve as scalable learned simulators by predicting action-conditional frame distributions $p(o_{t+1} \mid o_{1:t}, a_{1:t})$. Policy rollouts in such world models are scored by frozen vision-language models (VLMs), which assign binary task-success labels yielding a cumulative simulated return $R_{\mathcal{S},i}$; these are then compared to ground-truth returns using Pearson correlation and MMRV. This approach bypasses manual environment asset creation and offers rapid, flexible policy assessment without hardware (Tseng et al., 14 Nov 2025).
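
The evaluation loop this describes can be sketched generically. Here `world_model` and `vlm_success` are hypothetical interfaces standing in for the action-conditioned video model and the frozen VLM success classifier; the toy stand-ins below only exercise the control flow.

```python
def evaluate_in_world_model(policy, world_model, vlm_success,
                            init_obs, horizon=16):
    """Roll a policy out inside a learned video world model and
    score the episode with a VLM-based success detector.

    world_model(obs_history, action_history) samples the next
    observation, i.e. a draw from p(o_{t+1} | o_{1:t}, a_{1:t});
    vlm_success(obs_history) returns a binary success score.
    Both are hypothetical stand-in interfaces.
    """
    obs, actions = [init_obs], []
    for _ in range(horizon):
        a = policy(obs[-1])                    # act on current frame
        actions.append(a)
        obs.append(world_model(obs, actions))  # predicted next frame
    return vlm_success(obs)

# Toy stand-ins: a scalar "frame" the policy increments, and a
# success detector that checks whether a threshold was reached.
score = evaluate_in_world_model(
    policy=lambda o: 1.0,
    world_model=lambda obs, acts: obs[-1] + acts[-1],
    vlm_success=lambda obs: float(obs[-1] >= 10.0),
    init_obs=0.0)
```

Averaging such scores over repeated rollouts gives the simulated return $R_{\mathcal{S},i}$ that enters the Pearson and MMRV comparisons.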

In broader agent-based domains, such as urban mobility, the POLARIS framework reconstructs synthetic populations, networks, and multimodal flows, integrating empirical socio-demographic data to generate time-dependent equilibria via dynamic traffic and transit assignment. Policy interventions are encoded as parameter modifications and outcomes are aggregated at the system level for policy analysis (Auld et al., 2024, Verbas et al., 2024).

6. Application Domains and Key Results

PolaRiS has demonstrated utility across multiple domains:

  • Robotics: Enables distributed benchmarking of generalist policies, competitive with advanced simulation baselines. Rapid (under 1 hr) environment generation facilitates large-scale, reproducible evaluation studies. Real-to-sim performance correlations approach $r = 0.98$ against real-world DROID rankings, with a substantial reduction in manual tuning and asset preparation.
  • Deformable-Object Manipulation: Gaussian Splatting and PhysTwin integration enables photorealistic, physics-consistent policy evaluation for complex tasks (e.g., rope routing), achieving $r > 0.90$ on success and behavioral metrics (Zhang et al., 6 Nov 2025).
  • Urban System Policy Analysis: POLARIS agent-based simulation quantifies interaction effects among mobility, equity, energy, and emissions policy levers in metropolitan contexts (Auld et al., 2024, Verbas et al., 2024).
  • Subjective-Objective Policy Selection: SOPMA-style coupling leverages subjective resident value regressions and objective MABS indices for optimized policy recommendation from large candidate sets (Owa et al., 2023).

7. Limitations, Extensions, and Future Prospects

Current PolaRiS limitations include incomplete sim-to-real bridging for highly nonlinear or non-rigid dynamics (cloth, fluids), modest visual fidelity gap relative to high-end diffusion renderers, and reliance on nominal mechanical parameters for simulation accuracy. Prospective extensions involve integrating diffusion-based or NeRF-style rendering with Gaussian Splatting, automated system identification for friction/mass tuning, and hybrid learned simulation rollouts. The framework’s rapid environment creation and OOD co-training model offer scalability for automated, cloud-based benchmarking and distributed scene libraries (Jain et al., 18 Dec 2025).

A plausible implication is that large-scale, zero-shot policy evaluation—previously unattainable—now becomes feasible via PolaRiS, supporting open, collaborative benchmarking and iterative policy improvement for robotic foundation models and agent-based systems.
