Learning 2D Invariant Affordance Knowledge for 3D Affordance Grounding

Xianqiang Gao *1,2   Pingrui Zhang *2   Delin Qu 2   Dong Wang 2   Zhigang Wang 2   Yan Ding 2  
Bin Zhao 2,3   Xuelong Li 4  
*These authors contributed equally.

Abstract


3D Object Affordance Grounding aims to predict the functional regions on a 3D object and lays the foundation for a wide range of applications in robotics. Recent advances tackle this problem by learning a mapping between 3D regions and a single human-object interaction image. However, the geometric structure of the 3D object is not always consistent with that of the object in the human-object interaction image, which leads to poor generalization. To address this issue, we propose to learn generalizable, invariant affordance knowledge from multiple human-object interaction images within the same affordance category. Specifically, we introduce the Multi-Image Guided Invariant-Feature-Aware 3D Affordance Grounding (MIFAG) framework, which grounds 3D object affordance regions by identifying common interaction patterns across multiple human-object interaction images. First, the Invariant Affordance Knowledge Extraction Module (IAM) uses an iterative updating strategy to gradually extract aligned affordance knowledge from multiple images and integrate it into an affordance dictionary. Then, the Affordance Dictionary Adaptive Fusion Module (ADM) learns comprehensive point cloud representations that consider all affordance candidates across the images. Finally, we construct the Multi-Image and Point Affordance (MIPA) benchmark, on which our method outperforms existing state-of-the-art methods across a range of experimental comparisons.
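
To make the iterative extraction idea concrete, below is a minimal sketch, not the authors' released code, of an IAM-style update: a learnable affordance dictionary is refined layer by layer with cross attention over features pooled from multiple human-object interaction images, so that only patterns shared across the images accumulate in the dictionary. The dimensions, layer count, and class name are illustrative assumptions.

```python
# Minimal sketch of an IAM-style iterative dictionary update (assumed design,
# not the official implementation). Shapes and hyperparameters are illustrative.
import torch
import torch.nn as nn


class IterativeDictionaryExtractor(nn.Module):
    def __init__(self, dim=256, num_entries=16, num_layers=3, num_heads=8):
        super().__init__()
        # Learnable affordance dictionary shared across all reference images.
        self.dictionary = nn.Parameter(torch.randn(num_entries, dim))
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_layers)]
        )
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(num_layers)])

    def forward(self, image_feats):
        # image_feats: (B, N_images * N_tokens, dim) -- tokens from all
        # human-object interaction images of one affordance category.
        B = image_feats.size(0)
        dictionary = self.dictionary.unsqueeze(0).expand(B, -1, -1)
        for attn, norm in zip(self.layers, self.norms):
            # Dictionary entries query the multi-image features; cues shared
            # across images repeatedly reinforce the same entries.
            update, _ = attn(dictionary, image_feats, image_feats)
            dictionary = norm(dictionary + update)
        return dictionary  # (B, num_entries, dim) invariant affordance dictionary


# Usage: 4 reference images, each encoded into 49 tokens of width 256.
feats = torch.randn(2, 4 * 49, 256)
print(IterativeDictionaryExtractor()(feats).shape)  # torch.Size([2, 16, 256])
```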


Motivation of Our Method. The reference human-object interaction images exhibit significant variations in appearance, yet they consistently imply the same affordance knowledge. We propose to iteratively extract this invariant affordance knowledge from multiple images, leading to improved performance.

Framework


Overview of our proposed MIFAG. (a) The IAM uses a multi-layer network with a dual-branch structure to gradually extract invariant affordance knowledge and suppress interference caused by appearance variations across the images. (b) The ADM takes the invariant affordance knowledge dictionary produced by (a) and applies dictionary-based cross attention and self-weighted attention to comprehensively fuse the affordance knowledge with the point cloud representations.
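
The sketch below illustrates one plausible form of the ADM fusion step under the same assumptions as above: point features attend to the affordance dictionary (dictionary-based cross attention), and a learned per-point gate (a stand-in for the self-weighted attention) re-scales the fused features before a per-point affordance head. The module name, gate design, and output head are hypothetical and only meant to convey the data flow.

```python
# Minimal sketch of an ADM-style fusion step (assumed design, not the paper's
# exact architecture). Point features query the affordance dictionary, then a
# per-point gate re-weights the fused representation.
import torch
import torch.nn as nn


class DictionaryFusion(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Per-point scalar gate standing in for self-weighted attention.
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.head = nn.Linear(dim, 1)  # per-point affordance score

    def forward(self, point_feats, dictionary):
        # point_feats: (B, N_points, dim); dictionary: (B, num_entries, dim)
        fused, _ = self.cross_attn(point_feats, dictionary, dictionary)
        fused = self.norm(point_feats + fused)
        weights = torch.sigmoid(self.gate(fused))            # (B, N_points, 1)
        fused = fused * weights                               # emphasize informative points
        return torch.sigmoid(self.head(fused)).squeeze(-1)    # (B, N_points) affordance map


points = torch.randn(2, 2048, 256)
dictionary = torch.randn(2, 16, 256)
print(DictionaryFusion()(points, dictionary).shape)  # torch.Size([2, 2048])
```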

Experimental Results


Affordance Prediction Metrics on the MIPA Benchmark. Comparison between the proposed MIFAG and baseline methods on MIPA. MIFAG significantly surpasses existing methods and achieves state-of-the-art performance.



Affordance Visualization on the MIPA dataset. Compared with LASO (Li et al. 2024b) and IAGNet (Yang et al. 2023b), the proposed MIFAG produces more accurate results in both the seen and unseen settings.



Real-World Visualization. Left: Original 3D point clouds scanned by an iPhone 15 Pro. Middle: Reference images. Right: Affordance prediction results on the scanned point cloud.

t-SNE visualization of affordance queries. Query tokens corresponding to the same operation cluster in the same region across different objects.

More Results


More qualitative results of MIFAG in the seen setting.




More qualitative results of MIFAG in the unseen setting.



Check out our paper for more details.

Citation

@misc{gao2024learning2dinvariantaffordance,
    title={Learning 2D Invariant Affordance Knowledge for 3D Affordance Grounding}, 
    author={Xianqiang Gao and Pingrui Zhang and Delin Qu and Dong Wang and Zhigang Wang and Yan Ding and Bin Zhao and Xuelong Li},
    year={2024},
    eprint={2408.13024},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2408.13024}
}

The website template was borrowed from Frank Dou.