HOComp: Interaction-Aware Human-Object Composition

1Tongji University, 2City University of Hong Kong, 3HKUST(GZ)
†: Joint Corresponding Authors

We propose HOComp, a new approach for interaction-aware human-object composition. It seamlessly integrates a foreground object into a human-centric background image while ensuring harmonious human-object interactions and preserving the visual consistency of both the foreground object and the background person.


Framework

We introduce HOComp, a novel approach for compositing a foreground object onto a human-centric background image, while ensuring harmonious interactions between the foreground object and the background person and preserving the appearance of both.

  • HOComp incorporates two innovative designs: MLLMs-driven region-based pose guidance (MRPG) for constraining human-object interaction via a coarse-to-fine strategy, and detail-consistent appearance preservation (DCAP) for maintaining consistent foreground/background appearances.
  • We introduce the Interaction-aware Human-Object Composition (IHOC) dataset, and conduct extensive experiments on this dataset to demonstrate the superiority of our method.
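The coarse-to-fine strategy of MRPG can be sketched as follows. This is a minimal illustration only: `query_mllm`, its task strings, and the return format are hypothetical placeholders standing in for the paper's actual MLLM prompting interface.

```python
# Hypothetical sketch of MRPG's coarse-to-fine guidance at inference time.
# query_mllm and the task strings below are illustrative assumptions,
# not the paper's actual prompting interface.

def mrpg_guidance(background_image, object_image, query_mllm):
    """Coarse-to-fine: first locate a coarse object box Bo, then refine
    an interaction region Br and a text prompt C for the interaction."""
    # Coarse stage: where should the object be placed relative to the person?
    object_box = query_mllm(background_image, object_image,
                            task="locate_object_box")
    # Fine stage: which human region interacts with the object?
    interaction_region = query_mllm(background_image, object_image,
                                    task="locate_interaction_region",
                                    hint=object_box)
    # Describe the intended human-object interaction as a text prompt.
    text_prompt = query_mllm(background_image, object_image,
                             task="describe_interaction",
                             hint=interaction_region)
    return text_prompt, object_box, interaction_region
```

The prompt C and region Br are then encoded to condition the DiT, as described in the framework overview below.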
Framework Overview

Inference Phase (left): MRPG uses MLLMs to generate a text prompt C, an object box Bo, and an interaction region Br. Br and C are encoded and, together with the object ID features, detail features, and background features, condition the DiT for final composition generation.

Training Phase (right): MRPG constrains the interaction by applying a pose-guided loss Lpose with keypoint supervision. DCAP enforces appearance consistency via: (1) shape-aware attention modulation, which adjusts the attention maps to follow the object's shape prior Mshape; (2) a multi-view appearance loss Lappearance, which semantically aligns the synthesized foreground with multiple input views of the foreground object; and (3) a background loss Lbackground, which preserves the original background details.
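The training losses above can be viewed as one weighted objective. The sketch below is an illustration under stated assumptions: the base diffusion loss term and the loss weights are hypothetical, and the actual formulation and weighting are those of the paper.

```python
# Hypothetical sketch of HOComp's combined training objective.
# The individual loss names follow the overview above; the weights
# and the base diffusion term are assumptions for illustration only.

def total_loss(l_diffusion, l_pose, l_appearance, l_background,
               w_pose=1.0, w_app=1.0, w_bg=1.0):
    """Weighted sum of the training losses described in the overview."""
    return (l_diffusion
            + w_pose * l_pose        # MRPG: pose-guided keypoint supervision
            + w_app * l_appearance   # DCAP: multi-view appearance alignment
            + w_bg * l_background)   # DCAP: background detail preservation
```

In practice each term would be computed from the DiT's outputs during fine-tuning; the weights balance interaction fidelity against appearance preservation.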

Diversity

Our method generates diverse and realistic results. By automatically identifying the target region and generating a suitable text prompt, it guides the interaction between the foreground object and the background person. This allows visually harmonious interactions to be composed across different contexts and scenarios while preserving diversity in the results.


More Results

Applications

By integrating with an Image-to-Video (I2V) model, our approach can support applications like human-product demonstration video generation, where dynamic interactions between people and objects are synthesized with remarkable fluidity and coherence.

BibTeX


@article{liang2025hocomp,
  title={HOComp: Interaction-Aware Human-Object Composition},
  author={Dong Liang and Jinyuan Jia and Yuhao Liu and Rynson W. H. Lau},
  journal={arXiv preprint arXiv:2507.16813},
  year={2025}
}