💁‍♂️HumanAnything: Spatially-Aligned Multi-Modal Video Diffusion for Human-Centric Generation

Anonymous Authors

HumanAnything jointly generates RGB, depth, normal, segmentation, albedo, and roughness videos, conditioned on a single input image and driving keypoints.

Interactive 4D Results: Unprojecting Generated Depths

Given a single image and driving keypoints, our method can synthesize 4D point maps by unprojecting the generated depth maps. We demonstrate results on various in-the-wild input images.
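
A minimal sketch of the unprojection step, assuming a pinhole camera with hypothetical intrinsics (fx, fy, cx, cy); the actual camera model used by HumanAnything may differ.

```python
import numpy as np

def unproject_depth(depth: np.ndarray, fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Convert an (H, W) depth map into an (H, W, 3) point map in camera coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinate grid
    x = (u - cx) / fx * depth                       # back-project along x
    y = (v - cy) / fy * depth                       # back-project along y
    return np.stack([x, y, depth], axis=-1)

# Stacking per-frame point maps over time yields the 4D (space + time) result:
# points_4d = np.stack([unproject_depth(d, fx, fy, cx, cy) for d in depth_video])
```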

Application to Relighting

We demonstrate that the generated human-centric modalities can be readily applied to a relighting renderer such as DiffusionRenderer.
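
A minimal sketch of how the spatially-aligned modalities could be bundled into per-frame G-buffers for a relighting renderer. The `relight` call is a hypothetical placeholder, not the actual DiffusionRenderer API.

```python
import numpy as np

def make_gbuffer(albedo, normal, depth, roughness):
    """Bundle spatially-aligned modality maps (H, W, C) into one G-buffer dict."""
    return {
        "albedo": albedo,        # base color, typically in [0, 1]
        "normal": normal,        # per-pixel surface normals
        "depth": depth,          # generated depth map
        "roughness": roughness,  # per-pixel scalar roughness
    }

# One G-buffer per generated frame:
# gbuffers = [make_gbuffer(a, n, d, r) for a, n, d, r in zip(albedos, normals, depths, roughs)]
# relit_video = relight(gbuffers, env_map)  # hypothetical relighting call with a target environment map
```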

Comparison with Estimators Applied to Our Generated RGB Videos

We evaluate our approach by running state-of-the-art estimation models directly on the RGB videos generated by HumanAnything. Across all modalities, our method produces substantially higher-fidelity human-centric outputs than existing video estimators, and delivers more temporally consistent predictions than specialized human-centric foundation models such as Sapiens.

Comparison to DiffusionRenderer (Inverse renderer)

When there is ambiguity between humans and objects, our method leans toward a human-centric reconstruction of intrinsics, whereas DiffusionRenderer's inverse renderer does not, which naturally makes our outputs better suited for human-centric relighting.