
AffordPoseNet: Geometric 3D Pose Estimation

Updated 22 March 2026
  • AffordPoseNet is a convolutional neural network that estimates 3D human poses from a single image using multi-layer depth maps and geometric priors.
  • It enforces geometric consistency by integrating scene constraints through dedicated encoding and a differentiable no-penetration loss.
  • Empirical evaluations on the GPA dataset demonstrate significant MPJPE improvements, especially under occlusion and close-to-geometry conditions.

AffordPoseNet is a convolutional neural network architecture for monocular 3D human pose estimation in scenes with clutter and occlusion, designed to integrate strong priors on scene geometry via multi-layer depth representations and geometric consistency constraints. AffordPoseNet leverages a dedicated dataset capturing real human-scene interaction in richly structured, multi-camera environments, and applies novel mechanisms for fusing geometric context at both the input and loss-function levels. Its construction and evaluation are detailed in "Geometric Pose Affordance: 3D Human Pose with Scene Constraints" (Wang et al., 2019).

1. Dataset Construction and Scene Representation

AffordPoseNet builds upon the Geometric Pose Affordance (GPA) dataset, facilitating comprehensive evaluation of geometry-aware pose estimation in structured environments. The dataset comprises 13 human subjects (9 male, 4 female, heights 1.55–1.90 m) engaging in three scenario types: Action Set (semantic actions like “Greeting,” “Walking Dog”), Motion Set (dynamic running, jumping), and Interaction Set (close-contact situations such as “Sitting,” “Touching,” or “Standing On” objects, corresponding to sit-able, walkable, or reachable affordances). Actors are recorded in six static mocap studio arrangements containing nine cuboid boxes and, in some cases, a chair or stair platform, deliberately engineered to induce heavy occlusion and substantial human-scene contact.

Synchronized data capture uses two RGB cameras (1920×1080, 30 fps), three RGBD Kinects (640×480 depth × 1920×1080 color, 30 fps), and a VICON system (28 markers, 120fps), yielding ground-truth 3D skeletons (34 joints) and controlled camera calibration. Scene geometry is meticulously constructed from manual measurement, Kinect mesh scans, and mocap geometry markers, producing precise, co-registered 3D meshes for each studio layout.

After temporal subsampling (retaining frames by a 75th-percentile L2 joint-movement criterion and a 55th-percentile activity threshold), the finalized GPA dataset comprises 304,900 RGB frames, split into a training set (~222,500 images) and 82,400 held-out test images. Several test partitions target challenging generalization regimes: Action, Motion, Interaction, Cross-Subject, Cross-Action, Occlusion (≥10 joints occluded), and Close-to-Geometry (≥8 joints within 175 mm of a surface) subsets.

Scene geometry is encoded using the multi-layer depth map (MDM) representation. For each calibrated camera view and every pixel (x, y), a unit ray is cast and all mesh intersections t_1 < t_2 < ... < t_k are computed (truncated at L = 15 layers). The resulting tensor D(x, y), of dimension H×W×L, records the ordered entry/exit depth values for each surface hit along the ray, padded as necessary. The first layer D_1 encodes the standard depth map, while higher-index layers capture subsequent entries/exits for complex, potentially nested geometry.
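Assembling the MDM tensor from ray-cast results can be sketched as follows; this is a minimal numpy illustration that assumes a ray caster has already produced the list of intersection depths per pixel (the `hits` dictionary and `pad_value` choice are assumptions, not from the paper):

```python
import numpy as np

def build_mdm(hits, H, W, L=15, pad_value=np.inf):
    """Build a multi-layer depth map of shape (H, W, L) from per-pixel ray hits.

    `hits` maps (y, x) -> list of ray-mesh intersection depths along that
    pixel's ray. Depths are sorted, truncated to L layers, and padded with
    `pad_value` so that empty layers read as "infinitely far" (always free).
    """
    D = np.full((H, W, L), pad_value, dtype=np.float64)
    for (y, x), depths in hits.items():
        d = np.sort(np.asarray(depths, dtype=np.float64))[:L]
        D[y, x, :len(d)] = d
    return D
```

Padding with infinity rather than zero keeps downstream free-space tests well defined for rays that hit no geometry.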

2. Network Architecture and Geometric Conditioning

All AffordPoseNet model variants employ a ResNet-50 backbone to extract spatial features from 256×256 RGB person crops, yielding per-pixel feature maps of size 64×64×F (F ≈ 256). The architecture bifurcates into two heads:

  • 2D Heatmap Head: A 1×1 convolution predicts depth-aggregated heatmaps Ĥ ∈ ℝ^{64×64×J} for J skeleton joints, trained with a squared-distance loss against Gaussian ground-truth heatmaps with σ = 3 px.
  • Depth Regression Head: Several convolutional layers followed by a fully connected network regress the Z-coordinates P̂_Z ∈ ℝ^J of all target joints.
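The Gaussian ground-truth heatmaps used to supervise the 2D head can be rendered as below; a minimal numpy sketch, assuming unnormalized Gaussians peaking at 1.0 (the function name and exact normalization are illustrative):

```python
import numpy as np

def gaussian_heatmaps(joints_2d, size=64, sigma=3.0):
    """Render ground-truth heatmaps of shape (size, size, J) for 2D joints.

    `joints_2d` is a (J, 2) array of (x, y) joint positions in heatmap
    pixels. Each channel is an unnormalized Gaussian with the stated sigma,
    peaking at 1.0 on the joint location.
    """
    J = len(joints_2d)
    ys, xs = np.mgrid[0:size, 0:size]
    H = np.zeros((size, size, J))
    for j, (x, y) in enumerate(joints_2d):
        H[:, :, j] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return H
```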

Two design mechanisms integrate geometry:

  1. Geometry Encoding (ResNet-E): Pre-rendered MDMs, cropped and rescaled to 64×64×L for the person-centered crop, are optionally root-depth normalized and concatenated channel-wise to the ResNet feature map at the penultimate layer. Simple concatenation is used in practice; a 1×1 gated fusion is possible but showed no empirical advantage.
  2. Differentiable Geometric-Consistency Loss (ResNet-C): During training, each predicted joint P̂^j = (x^j, y^j, z^j) is projected onto its ray in the MDM. The predicted depth z^j must lie in one of the "free space" intervals (0, D_1) ∪ (D_2, D_3) ∪ ..., outside the "occupied" volumes of the mesh. A geometric penalty is accumulated per joint:

ℓ_G(P̂^j | D) = max_{i ∈ {1,3,5,...}} min[ max(0, z^j − D_i(x^j, y^j)),  max(0, D_{i+1}(x^j, y^j) − z^j) ]

Summed over all joints j, the total loss is differentiable almost everywhere, enforcing plausible body-part locations with respect to the 3D scene.
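The per-joint penalty is a simple interval test: it is zero in free space and grows with penetration depth inside any occupied interval. A numpy sketch, assuming the intersection depths at the joint's pixel are sorted and padded with infinity so that consecutive pairs bound the occupied volumes:

```python
import numpy as np

def geometric_consistency_loss(z, D_ray):
    """No-penetration penalty for one joint along its MDM ray.

    z     : predicted joint depth (scalar).
    D_ray : sorted intersection depths D_1..D_k at the joint's pixel;
            consecutive pairs (D_1, D_2), (D_3, D_4), ... bound the
            occupied (inside-mesh) intervals.
    Returns 0 when z lies in free space, and the distance to the nearest
    interval boundary when z falls inside an occupied volume.
    """
    penalty = 0.0
    for i in range(0, len(D_ray) - 1, 2):  # entry/exit pairs
        enter, leave = D_ray[i], D_ray[i + 1]
        # positive only if enter < z < leave (i.e. inside the mesh)
        p = min(max(0.0, z - enter), max(0.0, leave - z))
        penalty = max(penalty, p)
    return penalty
```

In a real training loop the same min/max composition would be expressed with a deep-learning framework's tensor ops so gradients flow to z.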

A combined model (ResNet-F) applies both mechanisms jointly.

3. Training Strategy and Optimization

AffordPoseNet training proceeds in three phases:

  1. 2D Pose Pretraining: Initialize the 2D heatmap head on a large-scale 2D dataset (MPII, 25,000 images) with the ℓ_2D loss only.
  2. 3D Depth Head Training: Add the depth head; train end-to-end on the GPA train split (~220,000 images), minimizing ℓ_2D + ℓ_1s, where ℓ_1s is the smooth-ℓ_1 (Huber) loss between predicted and ground-truth joint depths.
  3. Geometry-Aware Fine-Tuning: Activate both the geometry encoding and the geometric-consistency loss, and train end-to-end with the total objective ℓ = ℓ_2D + ℓ_1s + α·ℓ_G (with α = 1.0).

Optimization employs Adam (β₁ = 0.9, β₂ = 0.999) with a learning rate of 1e-4 for the initial 3D stages, decayed to 5e-5 for geometry fine-tuning. Batch size is 64; data augmentation includes random scaling (±30%), rotation (±30°), and horizontal flipping (50%). All joint depths are made root-relative, divided by 2000 mm, and mapped to [0, 1] before entering the network.
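The depth normalization and the smooth-ℓ_1 depth loss can be sketched as follows; the exact mapping of root-relative depths to [0, 1] (shifting the ±1 range) and the Huber threshold `beta` are assumptions for illustration:

```python
import numpy as np

def normalize_depths(z_mm, root_mm, scale_mm=2000.0):
    """Map absolute joint depths (mm) to the network's [0, 1] target range.

    Root-relative depths are assumed to lie within ±scale_mm of the root,
    so (z - root) / scale_mm falls in [-1, 1] and (d + 1) / 2 in [0, 1].
    """
    d = (z_mm - root_mm) / scale_mm
    return (d + 1.0) / 2.0

def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 (Huber) depth loss: quadratic below `beta`, linear above."""
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta).mean()
```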

4. Empirical Performance and Ablation Results

Quantitative evaluation on GPA demonstrates the efficacy of geometric conditioning:

Model variant     | MPJPE (mm) | Relative gain (mm)
ResNet-baseline   | 96.6       | —
ResNet-E          | 94.6       | −2.0
ResNet-C          | 95.4       | −1.2
ResNet-F          | 94.1       | −2.5

On challenging test subsets, the full model reduces MPJPE by 5.4 mm (Occlusion: 120.5 → 115.1) and by 6.6 mm on Close-to-Geometry samples (118.1 → 111.5).
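MPJPE, the metric behind these comparisons, is simply the mean Euclidean distance between predicted and ground-truth joints; a minimal numpy implementation:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: average Euclidean distance (in mm)
    between predicted and ground-truth 3D joints.

    pred, gt : (N, J, 3) arrays of joint positions in millimetres.
    """
    return np.linalg.norm(pred - gt, axis=-1).mean()
```

GPA results are typically reported root-relative, so both arrays would be expressed relative to the pelvis/root joint before calling this.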

For 2D→3D “lifting” MLP baselines (denoted SIM-G), fusing both geometry features and loss decreases error from 68.2 to 64.6 mm.

Ablations indicate both geometry-encoding (“E”) and consistency-loss (“C”) yield independent improvements, with maximum gains for joints subject to occlusion (wrist, ankle, knee). Masking the RGB background increases baseline MPJPE on Close-to-Geometry from 70.7 to 78.7 mm, suggesting that networks can otherwise implicitly exploit RGB-inferred scene layout.

5. Implementation Recipe and Practical Use

A concise procedural summary for constructing an AffordPoseNet system:

  1. Precompute per-camera multi-layer depth maps D ∈ ℝ^{H×W×L} for each static scene.
  2. Construct a ResNet-50 backbone to extract features F ∈ ℝ^{64×64×C}.
  3. Fork into two heads:
    • Heatmap head (loss ℓ_2D)
    • Depth regression head (loss ℓ_1s)
  4. (Recommended) Concatenate the person-cropped, rescaled MDM D at matching resolution to the spatial feature map.
  5. During training, for each predicted joint (x^j, y^j, z^j), look up the corresponding MDM ray D_i(x^j, y^j) and accumulate the consistency loss ℓ_G.
  6. Train in three stages using the Adam optimizer.
  7. At inference, the MDM is required if geometry encoding is used; models trained only with the geometric loss need no additional input.
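At inference, the heatmap peak and the regressed depth can be combined into a camera-space 3D joint via standard pinhole back-projection. A minimal numpy sketch under assumed conditions: illustrative intrinsics (fx, fy, cx, cy), a crop aligned with the image origin, and a simple argmax peak (the paper does not prescribe this exact decoding):

```python
import numpy as np

def backproject_joint(heatmap, z_mm, fx, fy, cx, cy, crop_to_image=4.0):
    """Recover a camera-space 3D joint from a 2D heatmap and its depth.

    heatmap       : (64, 64) predicted heatmap for one joint.
    z_mm          : joint depth in millimetres.
    crop_to_image : scale from heatmap pixels to image pixels
                    (256 / 64 = 4 for the crops described above).
    Assumes the crop's origin coincides with the image origin; a real
    pipeline would also add the crop's offset before back-projecting.
    """
    v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)  # peak (y, x)
    u_img, v_img = u * crop_to_image, v * crop_to_image
    X = (u_img - cx) * z_mm / fx   # pinhole camera model
    Y = (v_img - cy) * z_mm / fy
    return np.array([X, Y, z_mm])
```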

Adaptation to other datasets, or more complex gating/fusion schemes for MDM features, is possible.

6. Significance and Research Context

AffordPoseNet provides a systematic framework to enforce plausible human pose estimation in cluttered, structured environments, leveraging explicit geometric priors beyond standard RGB-based inference. It operationalizes the use of multi-layer depth maps for both direct feature-level conditioning and geometric supervision via a differentiable no-penetration loss. The approach yields robust improvements, especially in occluded or “affordance-critical” scenarios, and highlights the value of curated datasets aligning human activity with precisely-measured 3D scenes.

Results confirm that geometry-informed constraints can complement generic deep pose estimation pipelines, setting a precedent for future work in affordance-aware perception, embodied reasoning, and complex scene understanding from vision (Wang et al., 2019).
