AffordPoseNet: Geometric 3D Pose Estimation
- AffordPoseNet is a convolutional neural network that estimates 3D human poses from a single image using multi-layer depth maps and geometric priors.
- It enforces geometric consistency by integrating scene constraints through dedicated encoding and a differentiable no-penetration loss.
- Empirical evaluations on the GPA dataset demonstrate significant MPJPE improvements, especially under occlusion and close-to-geometry conditions.
AffordPoseNet is a convolutional neural network architecture for monocular 3D human pose estimation in scenes with clutter and occlusion, designed to integrate strong priors on scene geometry via multi-layer depth representations and geometric consistency constraints. AffordPoseNet leverages a dedicated dataset capturing real human-scene interaction in richly structured, multi-camera environments, and applies novel mechanisms for fusing geometric context at both the input and loss-function levels. Its construction and evaluation are detailed in "Geometric Pose Affordance: 3D Human Pose with Scene Constraints" (Wang et al., 2019).
1. Dataset Construction and Scene Representation
AffordPoseNet builds upon the Geometric Pose Affordance (GPA) dataset, facilitating comprehensive evaluation of geometry-aware pose estimation in structured environments. The dataset comprises 13 human subjects (9 male, 4 female, heights 1.55–1.90 m) engaging in three scenario types: Action Set (semantic actions like “Greeting,” “Walking Dog”), Motion Set (dynamic running, jumping), and Interaction Set (close-contact situations such as “Sitting,” “Touching,” or “Standing On” objects, corresponding to sit-able, walkable, or reachable affordances). Actors are recorded in six static mocap studio arrangements containing nine cuboid boxes and, in some cases, a chair or stair platform, deliberately engineered to induce heavy occlusion and substantial human-scene contact.
Synchronized data capture uses two RGB cameras (1920×1080, 30 fps), three RGBD Kinects (640×480 depth, 1920×1080 color, 30 fps), and a VICON system (28 markers, 120 fps), yielding ground-truth 3D skeletons (34 joints) and controlled camera calibration. Scene geometry is meticulously constructed from manual measurement, Kinect mesh scans, and mocap geometry markers, producing precise, co-registered 3D meshes for each studio layout.
After temporal subsampling based on the 75th-percentile L2-joint movement and a 55th-percentile activity threshold, the finalized GPA dataset comprises 304,900 RGB frames, of which 82,400 are held out as test images. Several test partitions target challenging generalization regimes: Action, Motion, Interaction, Cross-Subject, Cross-Action, Occlusion (≥10 joints occluded), and Close-to-Geometry (≥8 joints within 175 mm of a surface) subsets.
Scene geometry is encoded using the multi-layer depth map (MDM) representation. For each calibrated camera view and every pixel (x, y), a unit ray is cast and all mesh intersections are computed, truncated at K layers. The resulting H×W×K tensor records the ordered entry/exit depth values d₁ ≤ d₂ ≤ … ≤ dₖ for each surface hit along the ray, zero-padded when a ray has fewer than K intersections. The first layer encodes the standard depth map, while higher-index layers capture subsequent entries and exits for complex, potentially nested geometry.
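As an illustration, the padding-and-truncation scheme for assembling an MDM from precomputed ray–mesh intersections can be sketched in a few lines; the function name and input layout here are illustrative assumptions, not from the paper:

```python
import numpy as np

def build_mdm(hit_depths, H, W, K):
    """Assemble a multi-layer depth map from per-pixel ray/mesh intersections.

    hit_depths: dict mapping (y, x) -> list of intersection depths along that
                pixel's ray (alternating surface entries and exits).
    Returns an (H, W, K) tensor: layer 0 is the ordinary depth map, higher
    layers hold subsequent entries/exits, zero-padded when a ray has fewer
    than K intersections and truncated when it has more.
    """
    mdm = np.zeros((H, W, K), dtype=np.float32)
    for (y, x), depths in hit_depths.items():
        d = np.asarray(sorted(depths), dtype=np.float32)[:K]
        mdm[y, x, : len(d)] = d
    return mdm
```

Pixels whose ray misses all geometry keep the zero padding, which downstream code can treat as "no surface".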
2. Network Architecture and Geometric Conditioning
All AffordPoseNet model variants employ a ResNet-50 backbone to extract spatial features from 256×256 RGB person crops, yielding a 64×64×C per-pixel feature map (C feature channels). The architecture bifurcates into two heads:
- 2D Heatmap Head: A 1×1 convolution predicts depth-aggregated heatmaps for all skeleton joints, trained by squared (L2) distance to Gaussian ground-truth heatmaps with a fixed standard deviation in pixels.
- Depth Regression Head: Several convolutional layers and a fully connected network regress the z-coordinates (depths) of all target joints.
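The heatmap head's training target can be sketched as follows; one unnormalized Gaussian per joint is the standard construction, with the function name here an assumption and the σ value left unspecified:

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma):
    """Ground-truth heatmap: unnormalized Gaussian centered on the 2D joint
    location (cx, cy), with standard deviation sigma in pixels."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
```

The heatmap loss is then the mean squared difference between this target and the predicted map, per joint.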
Two design mechanisms integrate geometry:
- Geometry Encoding (ResNet-E): Pre-rendered MDMs, cropped and rescaled to match the person-centered crop at feature resolution, are optionally root-depth normalized and concatenated to the ResNet feature map at the penultimate layer. Simple channel-wise concatenation is used in practice; a 1×1 gated fusion is possible but showed no empirical advantage.
- Differentiable Geometric-Consistency Loss (ResNet-C): During training, each predicted joint is projected onto its ray in the MDM. Its predicted depth must lie in one of the “free space” intervals (0, d₁) ∪ (d₂, d₃) ∪ ..., outside the “occupied” intervals (d₁, d₂), (d₃, d₄), ... of the mesh. A geometric penalty is accumulated for every joint whose predicted depth falls inside an occupied interval.
Summed over all joints, the total loss is differentiable almost everywhere, enforcing plausible body-part locations with respect to the 3D scene.
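A minimal sketch of the per-joint penalty, assuming a hinge on the distance to the nearest interval boundary (one plausible form of the no-penetration loss; the exact functional form in the paper may differ):

```python
import numpy as np

def geometry_penalty(z, layer_depths):
    """Penalty for a predicted joint depth z that falls inside scene geometry.

    layer_depths: sorted entry/exit depths d1 <= d2 <= ... along the joint's
    ray; occupied intervals are (d1, d2), (d3, d4), ...  Free space incurs
    zero loss; inside an occupied interval the loss is the distance to the
    nearest interval boundary, a hinge that pushes the joint out of the mesh.
    """
    d = np.asarray(layer_depths, dtype=np.float64)
    for i in range(0, len(d) - 1, 2):
        entry, exit_ = d[i], d[i + 1]
        if entry < z < exit_:
            return min(z - entry, exit_ - z)
    return 0.0
```

Being piecewise linear in z, such a penalty is differentiable almost everywhere, so it can be backpropagated through the depth regression head.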
A combined model (ResNet-F) applies both mechanisms jointly.
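The ResNet-E fusion step reduces to channel-wise concatenation of the cropped MDM onto the backbone features; a sketch, with array layout and function name as assumptions:

```python
import numpy as np

def fuse_geometry(features, mdm_crop, root_depth=None):
    """ResNet-E style fusion: concatenate the person-cropped, rescaled MDM
    onto the backbone feature map along the channel axis.

    features: (C, H, W) backbone activations at the penultimate layer.
    mdm_crop: (K, H, W) multi-layer depth map, already cropped and resampled
              to the feature resolution.
    If root_depth is given, the MDM is root-depth normalized first.
    """
    mdm = np.asarray(mdm_crop, dtype=np.float32)
    if root_depth is not None:
        mdm = mdm - root_depth
    return np.concatenate([np.asarray(features, dtype=np.float32), mdm], axis=0)
```

A learned 1×1 gated fusion could replace the plain concatenation, but as noted above it showed no empirical advantage.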
3. Training Strategy and Optimization
AffordPoseNet training proceeds in three phases:
- 2D Pose Pretraining: Initialize the backbone and 2D heatmap head on a large-scale 2D dataset (MPII, 25,000 images) with the heatmap loss only.
- 3D Depth Head Training: Add the depth head; train end-to-end on the GPA train split (~220,000 images), minimizing the sum of the heatmap loss and a depth term, the latter being the smooth-L1 (Huber) loss between predicted and ground-truth joint depths.
- Geometry-Aware Fine-Tuning: Activate both the geometry encoding and the geometric-consistency loss, and train end-to-end on the combined objective of heatmap, depth, and geometry terms (with a fixed weight on the geometry term).
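For reference, a sketch of the smooth-L1 (Huber) depth loss used in stage two; the transition point `beta` is an assumption, not a value from the paper:

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 (Huber) loss: quadratic for residuals smaller than beta,
    linear beyond, averaged over all joints."""
    r = np.abs(np.asarray(pred, dtype=np.float64) -
               np.asarray(target, dtype=np.float64))
    return float(np.where(r < beta, 0.5 * r ** 2 / beta, r - 0.5 * beta).mean())
```

The linear tail makes the depth head robust to occasional large errors (e.g. under heavy occlusion) compared with a plain squared loss.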
Optimization employs Adam (β₁=0.9, β₂=0.999) with a learning rate of 1e-4 for initial 3D stages, decayed to 5e-5 for geometry fine-tuning. Batch size is 64; data augmentation includes random scaling (±30%), rotation (±30°), and horizontal flipping (50%). All depths are root-relative, divided by 2000 mm, then scaled to [0, 1] before entering the network.
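The depth normalization above can be written out explicitly; the exact rescaling convention (root-relative, divided by 2000 mm, then mapped to [0, 1]) is sketched here under the assumption of an affine map from [-1, 1], with hypothetical function names:

```python
import numpy as np

def normalize_depths(joint_z_mm, root_z_mm, max_offset_mm=2000.0):
    """Map absolute joint depths (mm) to the network's [0, 1] target range:
    root-relative, divided by max_offset_mm (roughly [-1, 1]), then
    affinely rescaled to [0, 1]."""
    rel = (np.asarray(joint_z_mm, dtype=np.float64) - root_z_mm) / max_offset_mm
    return 0.5 * (rel + 1.0)

def denormalize_depths(z_unit, root_z_mm, max_offset_mm=2000.0):
    """Inverse mapping from network output back to absolute depths in mm."""
    return (2.0 * np.asarray(z_unit, dtype=np.float64) - 1.0) * max_offset_mm + root_z_mm
```

At inference, predicted depths are passed through the inverse mapping (plus the estimated root depth) to recover metric joint positions.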
4. Empirical Performance and Ablation Results
Quantitative evaluation on GPA demonstrates the efficacy of geometric conditioning:
| Model Variant | MPJPE (mm) | Relative Gain (mm) |
|---|---|---|
| ResNet-baseline | 96.6 | — |
| ResNet-E | 94.6 | –2.0 |
| ResNet-C | 95.4 | –1.2 |
| ResNet-F | 94.1 | –2.5 |
On challenging test subsets, the full model reduces MPJPE by 5.4 mm (Occlusion: 120.5 → 115.1) and by 6.6 mm on Close-to-Geometry samples (118.1 → 111.5).
For 2D→3D “lifting” MLP baselines (denoted SIM-G), fusing both geometry features and loss decreases error from 68.2 to 64.6 mm.
Ablations indicate both geometry-encoding (“E”) and consistency-loss (“C”) yield independent improvements, with maximum gains for joints subject to occlusion (wrist, ankle, knee). Masking the RGB background increases baseline MPJPE on Close-to-Geometry from 70.7 to 78.7 mm, suggesting that networks can otherwise implicitly exploit RGB-inferred scene layout.
5. Implementation Recipe and Practical Use
A concise procedural summary for constructing an AffordPoseNet system:
- Precompute per-camera multi-layer depth maps for each static scene.
- Construct a ResNet-50 backbone to extract spatial features from the 256×256 person crop.
- Fork into two heads:
  - Heatmap head (2D joint localization)
  - Depth regression head (per-joint depth)
- (Recommended) Concatenate the person-cropped, rescaled MDM at matching resolution onto the spatial feature map.
- At training, for each predicted joint, look up the corresponding MDM ray and accumulate the geometric-consistency loss.
- Train in three stages using the Adam optimizer.
- At inference, the MDM is required if geometry encoding is used; models trained only with the geometric loss need no additional input.
Adaptation to other datasets, or more complex gating/fusion schemes for MDM features, is possible.
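To tie the recipe together, a sketch of decoding the two heads' outputs into a per-joint 3D prediction; hard argmax decoding is an illustrative simplification (a trained system may use a soft variant), and the function name is an assumption:

```python
import numpy as np

def decode_pose(heatmaps, joint_depths):
    """Combine the two heads' outputs into per-joint (x, y, z) predictions:
    2D location from each joint's heatmap argmax, depth from the regression
    head (still in the normalized range used during training).

    heatmaps: (J, H, W); joint_depths: (J,). Returns a (J, 3) array.
    """
    J, H, W = heatmaps.shape
    flat = heatmaps.reshape(J, -1).argmax(axis=1)
    ys, xs = np.divmod(flat, W)          # flat index = y * W + x
    return np.stack([xs, ys, np.asarray(joint_depths)], axis=1)
```

The (x, y) coordinates live in heatmap resolution and must be rescaled to image coordinates; z is denormalized with the root depth as described in the training section.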
6. Significance and Research Context
AffordPoseNet provides a systematic framework to enforce plausible human pose estimation in cluttered, structured environments, leveraging explicit geometric priors beyond standard RGB-based inference. It operationalizes the use of multi-layer depth maps for both direct feature-level conditioning and geometric supervision via a differentiable no-penetration loss. The approach yields robust improvements, especially in occluded or “affordance-critical” scenarios, and highlights the value of curated datasets aligning human activity with precisely-measured 3D scenes.
Results confirm that geometry-informed constraints can complement generic deep pose estimation pipelines, setting a precedent for future work in affordance-aware perception, embodied reasoning, and complex scene understanding from vision (Wang et al., 2019).