Given a single image and driving keypoints, our method synthesizes 4D point maps by unprojecting the generated depth maps. We demonstrate results on a variety of in-the-wild input images.
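For illustration, here is a minimal sketch of the unprojection step under a pinhole-camera assumption; the function name `unproject_depth`, the intrinsics matrix `K`, and the coordinate conventions are assumptions for this sketch, not the released implementation.

```python
import numpy as np

def unproject_depth(depth, K):
    """Unproject a depth map (H, W) into a per-pixel 3D point map (H, W, 3)
    using a pinhole camera with intrinsics K (3x3).

    Note: a minimal sketch; the actual camera model and conventions may differ.
    """
    H, W = depth.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # pixel coordinates
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

# Stacking the per-frame point maps over time gives a 4D point map (T, H, W, 3):
# points_4d = np.stack([unproject_depth(d, K) for d in depth_video], axis=0)
```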
We demonstrate that the generated human-centric modalities can be readily applied to relighting renderers such as DiffusionRenderer.
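As a rough sketch of how generated intrinsics (albedo and normal maps) can drive relighting, the snippet below applies simple Lambertian shading under a new directional light. This is a hypothetical stand-in for illustration only, not DiffusionRenderer's actual neural relighting pipeline.

```python
import numpy as np

def relight_lambertian(albedo, normals, light_dir, light_color=(1.0, 1.0, 1.0)):
    """Re-shade a frame from generated intrinsics under a new directional light.

    albedo:    (H, W, 3) base color in [0, 1]
    normals:   (H, W, 3) unit surface normals in camera space
    light_dir: (3,) direction toward the light source
    """
    l = np.asarray(light_dir, dtype=np.float32)
    l = l / np.linalg.norm(l)
    shading = np.clip(normals @ l, 0.0, None)[..., None]  # n·l, clamped at 0
    return np.clip(albedo * shading * np.asarray(light_color), 0.0, 1.0)
```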
We evaluate our approach by running state-of-the-art estimation models directly on the RGB videos generated by HumanAnything. Across all modalities, our method produces substantially higher-fidelity human-centric outputs than existing video estimators, and delivers more temporally consistent predictions than specialized human-centric foundation models such as Sapiens.
When there is ambiguity between humans and objects, our method leans toward a human-centric reconstruction of intrinsics compared with DiffusionRenderer's inverse renderer, which naturally makes it better suited for human-centric relighting.