AffordPoseNet: Geometric 3D Pose Estimation
- AffordPoseNet is a convolutional neural network that estimates 3D human poses from a single image using multi-layer depth maps and geometric priors.
- It enforces geometric consistency by integrating scene constraints through dedicated encoding and a differentiable no-penetration loss.
- Empirical evaluations on the GPA dataset demonstrate significant MPJPE improvements, especially under occlusion and close-to-geometry conditions.
AffordPoseNet is a convolutional neural network architecture for monocular 3D human pose estimation in scenes with clutter and occlusion, designed to integrate strong priors on scene geometry via multi-layer depth representations and geometric consistency constraints. AffordPoseNet leverages a dedicated dataset capturing real human-scene interaction in richly structured, multi-camera environments, and applies novel mechanisms for fusing geometric context at both the input and loss-function levels. Its construction and evaluation are detailed in "Geometric Pose Affordance: 3D Human Pose with Scene Constraints" (Wang et al., 2019).
1. Dataset Construction and Scene Representation
AffordPoseNet builds upon the Geometric Pose Affordance (GPA) dataset, facilitating comprehensive evaluation of geometry-aware pose estimation in structured environments. The dataset comprises 13 human subjects (9 male, 4 female, heights 1.55–1.90 m) engaging in three scenario types: Action Set (semantic actions like “Greeting,” “Walking Dog”), Motion Set (dynamic running, jumping), and Interaction Set (close-contact situations such as “Sitting,” “Touching,” or “Standing On” objects, corresponding to sit-able, walkable, or reachable affordances). Actors are recorded in six static mocap studio arrangements containing nine cuboid boxes and, in some cases, a chair or stair platform, deliberately engineered to induce heavy occlusion and substantial human-scene contact.
Synchronized data capture uses two RGB cameras (1920×1080, 30 fps), three RGBD Kinects (640×480 depth, 1920×1080 color, 30 fps), and a VICON system (28 markers, 120 fps), yielding ground-truth 3D skeletons (34 joints) and controlled camera calibration. Scene geometry is meticulously constructed from manual measurement, Kinect mesh scans, and mocap geometry markers, producing precise, co-registered 3D meshes for each studio layout.
After temporal subsampling based on the 75th-percentile L2-joint movement and a 55th-percentile activity threshold, the finalized GPA dataset comprises 304,900 RGB frames, of which 82,400 are held out as test images. Several test partitions target challenging generalization regimes: Action, Motion, Interaction, Cross-Subject, Cross-Action, Occlusion (≥10 joints occluded), and Close-to-Geometry (≥8 joints within 175 mm of a surface) subsets.
Scene geometry is encoded using the multi-layer depth map (MDM) representation. For each calibrated camera view and every pixel (x, y), a unit ray is cast and all mesh intersections are computed, truncated at K layers. The resulting H×W×K tensor records the ordered entry/exit depth values d₁ ≤ d₂ ≤ … ≤ dₖ for each surface hit along the ray, zero-padded when a ray has fewer than K intersections. The first layer encodes the standard depth map, while higher-index layers capture subsequent entries and exits for complex, potentially nested geometry.
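As an illustration, the padding-and-truncation scheme for assembling an MDM from precomputed ray–mesh intersections can be sketched in a few lines; the function name and input layout here are illustrative assumptions, not from the paper:

```python
import numpy as np

def build_mdm(hit_depths, H, W, K):
    """Assemble a multi-layer depth map from per-pixel ray/mesh intersections.

    hit_depths: dict mapping (y, x) -> list of intersection depths along that
                pixel's ray (alternating surface entries and exits).
    Returns an (H, W, K) tensor: layer 0 is the ordinary depth map, higher
    layers hold subsequent entries/exits, zero-padded when a ray has fewer
    than K intersections and truncated when it has more.
    """
    mdm = np.zeros((H, W, K), dtype=np.float32)
    for (y, x), depths in hit_depths.items():
        d = np.asarray(sorted(depths), dtype=np.float32)[:K]
        mdm[y, x, : len(d)] = d
    return mdm
```

Pixels whose ray misses all geometry keep the zero padding, which downstream code can treat as "no surface".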
2. Network Architecture and Geometric Conditioning
All AffordPoseNet model variants employ a ResNet-50 backbone to extract spatial features from 256×256 RGB person crops, yielding a 64×64×C per-pixel feature map (C feature channels). The architecture bifurcates into two heads:
- 2D Heatmap Head: A 1×1 convolution predicts depth-aggregated heatmaps for all skeleton joints, trained by squared (L2) distance to Gaussian ground-truth heatmaps with a fixed standard deviation in pixels.
- Depth Regression Head: Several convolutional layers and a fully connected network regress the z-coordinates (depths) of all target joints.
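The heatmap head's training target can be sketched as follows; one unnormalized Gaussian per joint is the standard construction, with the function name here an assumption and the σ value left unspecified:

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma):
    """Ground-truth heatmap: unnormalized Gaussian centered on the 2D joint
    location (cx, cy), with standard deviation sigma in pixels."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
```

The heatmap loss is then the mean squared difference between this target and the predicted map, per joint.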
Two design mechanisms integrate geometry:
- Geometry Encoding (ResNet-E): Pre-rendered MDMs, cropped and rescaled to match the person-centered crop at feature resolution, are optionally root-depth normalized and concatenated to the ResNet feature map at the penultimate layer. Simple channel-wise concatenation is used in practice; a 1×1 gated fusion is possible but showed no empirical advantage.
- Differentiable Geometric-Consistency Loss (ResNet-C): During training, each predicted joint is projected onto its ray in the MDM. Its predicted depth must lie in one of the “free space” intervals (0, d₁) ∪ (d₂, d₃) ∪ ..., outside the “occupied” intervals (d₁, d₂), (d₃, d₄), ... of the mesh. A geometric penalty is accumulated for every joint whose predicted depth falls inside an occupied interval.
Summed over all joints, the total loss is differentiable almost everywhere, enforcing plausible body-part locations with respect to the 3D scene.
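A minimal sketch of the per-joint penalty, assuming a hinge on the distance to the nearest interval boundary (one plausible form of the no-penetration loss; the exact functional form in the paper may differ):

```python
import numpy as np

def geometry_penalty(z, layer_depths):
    """Penalty for a predicted joint depth z that falls inside scene geometry.

    layer_depths: sorted entry/exit depths d1 <= d2 <= ... along the joint's
    ray; occupied intervals are (d1, d2), (d3, d4), ...  Free space incurs
    zero loss; inside an occupied interval the loss is the distance to the
    nearest interval boundary, a hinge that pushes the joint out of the mesh.
    """
    d = np.asarray(layer_depths, dtype=np.float64)
    for i in range(0, len(d) - 1, 2):
        entry, exit_ = d[i], d[i + 1]
        if entry < z < exit_:
            return min(z - entry, exit_ - z)
    return 0.0
```

Being piecewise linear in z, such a penalty is differentiable almost everywhere, so it can be backpropagated through the depth regression head.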
A combined model (ResNet-F) applies both mechanisms jointly.
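The ResNet-E fusion step reduces to channel-wise concatenation of the cropped MDM onto the backbone features; a sketch, with array layout and function name as assumptions:

```python
import numpy as np

def fuse_geometry(features, mdm_crop, root_depth=None):
    """ResNet-E style fusion: concatenate the person-cropped, rescaled MDM
    onto the backbone feature map along the channel axis.

    features: (C, H, W) backbone activations at the penultimate layer.
    mdm_crop: (K, H, W) multi-layer depth map, already cropped and resampled
              to the feature resolution.
    If root_depth is given, the MDM is root-depth normalized first.
    """
    mdm = np.asarray(mdm_crop, dtype=np.float32)
    if root_depth is not None:
        mdm = mdm - root_depth
    return np.concatenate([np.asarray(features, dtype=np.float32), mdm], axis=0)
```

A learned 1×1 gated fusion could replace the plain concatenation, but as noted above it showed no empirical advantage.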
3. Training Strategy and Optimization
AffordPoseNet training proceeds in three phases:
- 2D Pose Pretraining: Initialize the backbone and 2D heatmap head on a large-scale 2D dataset (MPII, 25,000 images) with the heatmap loss only.
- 3D Depth Head Training: Add the depth head; train end-to-end on the GPA train split (~220,000 images), minimizing the sum of the heatmap loss and a depth term, the latter being the smooth-L1 (Huber) loss between predicted and ground-truth joint depths.
- Geometry-Aware Fine-Tuning: Activate both the geometry encoding and the geometric-consistency loss, and train end-to-end on the combined objective of heatmap, depth, and geometry terms (with a fixed weight on the geometry term).
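For reference, a sketch of the smooth-L1 (Huber) depth loss used in stage two; the transition point `beta` is an assumption, not a value from the paper:

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 (Huber) loss: quadratic for residuals smaller than beta,
    linear beyond, averaged over all joints."""
    r = np.abs(np.asarray(pred, dtype=np.float64) -
               np.asarray(target, dtype=np.float64))
    return float(np.where(r < beta, 0.5 * r ** 2 / beta, r - 0.5 * beta).mean())
```

The linear tail makes the depth head robust to occasional large errors (e.g. under heavy occlusion) compared with a plain squared loss.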
Optimization employs Adam (β₁=0.9, β₂=0.999) with a learning rate of 1e-4 for initial 3D stages, decayed to 5e-5 for geometry fine-tuning. Batch size is 64; data augmentation includes random scaling (±30%), rotation (±30°), and horizontal flipping (50%). All depths are root-relative, divided by 2000 mm, then scaled to [0, 1] before entering the network.
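The depth normalization above can be written out explicitly; the exact rescaling convention (root-relative, divided by 2000 mm, then mapped to [0, 1]) is sketched here under the assumption of an affine map from [-1, 1], with hypothetical function names:

```python
import numpy as np

def normalize_depths(joint_z_mm, root_z_mm, max_offset_mm=2000.0):
    """Map absolute joint depths (mm) to the network's [0, 1] target range:
    root-relative, divided by max_offset_mm (roughly [-1, 1]), then
    affinely rescaled to [0, 1]."""
    rel = (np.asarray(joint_z_mm, dtype=np.float64) - root_z_mm) / max_offset_mm
    return 0.5 * (rel + 1.0)

def denormalize_depths(z_unit, root_z_mm, max_offset_mm=2000.0):
    """Inverse mapping from network output back to absolute depths in mm."""
    return (2.0 * np.asarray(z_unit, dtype=np.float64) - 1.0) * max_offset_mm + root_z_mm
```

At inference, predicted depths are passed through the inverse mapping (plus the estimated root depth) to recover metric joint positions.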
4. Empirical Performance and Ablation Results
Quantitative evaluation on GPA demonstrates the efficacy of geometric conditioning:
| Model Variant | MPJPE (mm) | Relative Gain (mm) |
|---|---|---|
| ResNet-baseline | 96.6 | — |
| ResNet-E | 94.6 | –2.0 |
| ResNet-C | 95.4 | –1.2 |
| ResNet-F | 94.1 | –2.5 |
On challenging test subsets, the full model reduces MPJPE by 5.4 mm (Occlusion: 120.5 → 115.1) and by 6.6 mm on Close-to-Geometry samples (118.1 → 111.5).
For 2D→3D “lifting” MLP baselines (denoted SIM-G), fusing both geometry features and loss decreases error from 68.2 to 64.6 mm.
Ablations indicate both geometry-encoding (“E”) and consistency-loss (“C”) yield independent improvements, with maximum gains for joints subject to occlusion (wrist, ankle, knee). Masking the RGB background increases baseline MPJPE on Close-to-Geometry from 70.7 to 78.7 mm, suggesting that networks can otherwise implicitly exploit RGB-inferred scene layout.
5. Implementation Recipe and Practical Use
A concise procedural summary for constructing an AffordPoseNet system:
- Precompute per-camera multi-layer depth maps for each static scene.
- Construct a ResNet-50 backbone to extract spatial features from the 256×256 person crop.
- Fork into two heads:
  - Heatmap head (2D joint localization)
  - Depth regression head (per-joint depth)
- (Recommended) Concatenate the person-cropped, rescaled MDM at matching resolution onto the spatial feature map.
- At training, for each predicted joint, look up the corresponding MDM ray and accumulate the geometric-consistency loss.
- Train in three stages using the Adam optimizer.
- At inference, the MDM is required if geometry encoding is used; models trained only with the geometric loss need no additional input.
Adaptation to other datasets, or more complex gating/fusion schemes for MDM features, is possible.
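To tie the recipe together, a sketch of decoding the two heads' outputs into a per-joint 3D prediction; hard argmax decoding is an illustrative simplification (a trained system may use a soft variant), and the function name is an assumption:

```python
import numpy as np

def decode_pose(heatmaps, joint_depths):
    """Combine the two heads' outputs into per-joint (x, y, z) predictions:
    2D location from each joint's heatmap argmax, depth from the regression
    head (still in the normalized range used during training).

    heatmaps: (J, H, W); joint_depths: (J,). Returns a (J, 3) array.
    """
    J, H, W = heatmaps.shape
    flat = heatmaps.reshape(J, -1).argmax(axis=1)
    ys, xs = np.divmod(flat, W)          # flat index = y * W + x
    return np.stack([xs, ys, np.asarray(joint_depths)], axis=1)
```

The (x, y) coordinates live in heatmap resolution and must be rescaled to image coordinates; z is denormalized with the root depth as described in the training section.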
6. Significance and Research Context
AffordPoseNet provides a systematic framework to enforce plausible human pose estimation in cluttered, structured environments, leveraging explicit geometric priors beyond standard RGB-based inference. It operationalizes the use of multi-layer depth maps for both direct feature-level conditioning and geometric supervision via a differentiable no-penetration loss. The approach yields robust improvements, especially in occluded or “affordance-critical” scenarios, and highlights the value of curated datasets aligning human activity with precisely-measured 3D scenes.
Results confirm that geometry-informed constraints can complement generic deep pose estimation pipelines, setting a precedent for future work in affordance-aware perception, embodied reasoning, and complex scene understanding from vision (Wang et al., 2019).