HOComp: Interaction-Aware Human-Object Composition

1Tongji University, 2City University of Hong Kong, 3HKUST(GZ)
†: Joint Corresponding Authors

We propose HOComp, a new approach for interaction-aware human-object composition. It seamlessly integrates a foreground object into a human-centric background image while ensuring harmonious human-object interactions and preserving the visual consistency of both the foreground object and the background person.


Framework

We introduce HOComp, a novel approach for compositing a foreground object onto a human-centric background image, while ensuring harmonious interactions between the foreground object and the background person and preserving the appearance of both.

  • HOComp incorporates two innovative designs: MLLMs-driven region-based pose guidance (MRPG) for constraining human-object interaction via a coarse-to-fine strategy, and detail-consistent appearance preservation (DCAP) for maintaining consistent foreground/background appearances.
  • We introduce the Interaction-aware Human-Object Composition (IHOC) dataset, and conduct extensive experiments on this dataset to demonstrate the superiority of our method.
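The coarse-to-fine strategy of MRPG can be sketched as follows. This is a minimal illustration only: `query_mllm`, its task strings, and the return format are hypothetical placeholders standing in for the paper's actual MLLM prompting interface.

```python
# Hypothetical sketch of MRPG's coarse-to-fine guidance at inference time.
# query_mllm and the task strings below are illustrative assumptions,
# not the paper's actual prompting interface.

def mrpg_guidance(background_image, object_image, query_mllm):
    """Coarse-to-fine: first locate a coarse object box Bo, then refine
    an interaction region Br and a text prompt C for the interaction."""
    # Coarse stage: where should the object be placed relative to the person?
    object_box = query_mllm(background_image, object_image,
                            task="locate_object_box")
    # Fine stage: which human region interacts with the object?
    interaction_region = query_mllm(background_image, object_image,
                                    task="locate_interaction_region",
                                    hint=object_box)
    # Describe the intended human-object interaction as a text prompt.
    text_prompt = query_mllm(background_image, object_image,
                             task="describe_interaction",
                             hint=interaction_region)
    return text_prompt, object_box, interaction_region
```

The prompt C and region Br are then encoded to condition the DiT, as described in the framework overview below.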
Framework Overview

Inference Phase (left): MRPG uses MLLMs to generate a text prompt C, an object box Bo, and an interaction region Br. Br and C are encoded and, together with the object ID features, detail features, and background features, condition the DiT for final composition generation.

Training Phase (right): MRPG constrains the interaction by applying a pose-guided loss Lpose with keypoint supervision. DCAP enforces appearance consistency via: (1) shape-aware attention modulation, which adjusts the attention maps to follow the object's shape prior Mshape; (2) a multi-view appearance loss Lappearance, which semantically aligns the synthesized foreground with multiple input views of the foreground object; and (3) a background loss Lbackground, which preserves the original background details.
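The training losses above can be viewed as one weighted objective. The sketch below is an illustration under stated assumptions: the base diffusion loss term and the loss weights are hypothetical, and the actual formulation and weighting are those of the paper.

```python
# Hypothetical sketch of HOComp's combined training objective.
# The individual loss names follow the overview above; the weights
# and the base diffusion term are assumptions for illustration only.

def total_loss(l_diffusion, l_pose, l_appearance, l_background,
               w_pose=1.0, w_app=1.0, w_bg=1.0):
    """Weighted sum of the training losses described in the overview."""
    return (l_diffusion
            + w_pose * l_pose        # MRPG: pose-guided keypoint supervision
            + w_app * l_appearance   # DCAP: multi-view appearance alignment
            + w_bg * l_background)   # DCAP: background detail preservation
```

In practice each term would be computed from the DiT's outputs during fine-tuning; the weights balance interaction fidelity against appearance preservation.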

Diversity

Our method generates diverse and realistic results. By automatically identifying the target region and generating a suitable text prompt, it guides the interaction between the foreground object and the background person. This allows visually harmonious interactions to be composed across different contexts and scenarios while preserving diversity in the results.


More Results

Applications

By integrating with an Image-to-Video (I2V) model, our approach can support applications like human-product demonstration video generation, where dynamic interactions between people and objects are synthesized with remarkable fluidity and coherence.

BibTeX


@article{liang2025hocomp,
  title={HOComp: Interaction-Aware Human-Object Composition},
  author={Dong Liang and Jinyuan Jia and Yuhao Liu and Rynson W. H. Lau},
  journal={arXiv preprint arXiv:2507.16813},
  year={2025}
}