InvSlotGNN: Unsupervised Discovery of Viewpoint Invariant Multi-Object Representations and Visual Dynamics


Abstract

Learning multi-object dynamics from visual data with unsupervised techniques is challenging because it requires robust object representations that can be learned through robot interactions. In previous work, we introduced a framework with two architectures: SlotTransport, which discovers object-centric representations, referred to as slots, from single-view RGB images, and SlotGNN, which predicts scene dynamics from single-view RGB images and robot interactions using the discovered slots.
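Since this section gives no architectural detail, the following is only a minimal sketch of the general slot-plus-GNN pattern the abstract describes: an encoder maps an RGB image to K slot vectors, and a fully connected graph network, conditioned on the robot action, predicts the slots at the next timestep. The module names, dimensions, and simple mean-aggregated message passing are illustrative assumptions; the actual SlotTransport and SlotGNN architectures are more involved.

```python
# Illustrative sketch only: names, sizes, and aggregation are assumptions,
# not the paper's SlotTransport / SlotGNN architectures.
import torch
import torch.nn as nn


class SlotEncoder(nn.Module):
    """Maps an RGB image to K slot vectors (stand-in for SlotTransport)."""
    def __init__(self, num_slots=5, slot_dim=64):
        super().__init__()
        self.num_slots, self.slot_dim = num_slots, slot_dim
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_slots = nn.Linear(64, num_slots * slot_dim)

    def forward(self, img):                      # img: (B, 3, H, W)
        feat = self.backbone(img)                # (B, 64)
        return self.to_slots(feat).view(-1, self.num_slots, self.slot_dim)


class SlotDynamicsGNN(nn.Module):
    """One message-passing step over a fully connected slot graph,
    conditioned on the robot action (stand-in for SlotGNN)."""
    def __init__(self, slot_dim=64, action_dim=4):
        super().__init__()
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * slot_dim, slot_dim), nn.ReLU())
        self.node_mlp = nn.Linear(2 * slot_dim + action_dim, slot_dim)

    def forward(self, slots, action):            # slots: (B, K, D), action: (B, A)
        B, K, D = slots.shape
        # Build all ordered (receiver, sender) slot pairs and compute messages.
        recv = slots.unsqueeze(2).expand(B, K, K, D)
        send = slots.unsqueeze(1).expand(B, K, K, D)
        msgs = self.edge_mlp(torch.cat([recv, send], dim=-1)).mean(dim=2)
        act = action.unsqueeze(1).expand(B, K, action.size(-1))
        # Residual update: predicted slots at the next timestep.
        return slots + self.node_mlp(torch.cat([slots, msgs, act], dim=-1))


encoder, dynamics = SlotEncoder(), SlotDynamicsGNN()
img = torch.randn(1, 3, 128, 128)
action = torch.randn(1, 4)
next_slots = dynamics(encoder(img), action)      # (1, 5, 64)
```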

This paper extends our previous work by introducing InvSlotGNN, a novel framework for multiview slot discovery and dynamics learning that is invariant to the camera viewpoint.
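The abstract does not state how viewpoint invariance is enforced. One common way to encourage it, shown purely as an illustration and not as the paper's objective, is a cross-view consistency loss: encode two views of the same scene, match slots across views with the Hungarian algorithm, and penalize the feature distance between matched pairs. The function name and L2 cost below are assumptions.

```python
# Hedged illustration of a cross-view slot consistency loss; the actual
# InvSlotGNN training objective may differ.
import torch
from scipy.optimize import linear_sum_assignment


def cross_view_consistency(slots_a, slots_b):
    """slots_a, slots_b: (K, D) slot sets from two views of the same scene."""
    # Pairwise feature distance between every slot in view A and view B.
    cost = torch.cdist(slots_a, slots_b)                    # (K, K)
    row, col = linear_sum_assignment(cost.detach().cpu().numpy())
    # Penalize the distance between matched pairs so the same object is
    # encoded the same way regardless of camera viewpoint.
    return cost[torch.as_tensor(row), torch.as_tensor(col)].mean()


slots_view1 = torch.randn(5, 64, requires_grad=True)
slots_view2 = torch.randn(5, 64, requires_grad=True)
cross_view_consistency(slots_view1, slots_view2).backward()
```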

We demonstrate the effectiveness of SlotTransport in learning multiview object-centric features that accurately encode visual and positional information. We further show that InvSlotGNN is accurate on downstream robotic tasks, including long-horizon prediction and multi-object rearrangement. Finally, our approach proves effective in the real world: with minimal real data, our framework robustly predicts slots and their dynamics in multiview real-world scenarios.
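Long-horizon prediction reduces to applying the learned dynamics autoregressively: encode the first frame once, then feed each predicted slot set back into the model along with the next action. The sketch below reuses the illustrative `encoder` and `dynamics` modules from the first sketch; `rollout` is a hypothetical helper, not the paper's API.

```python
# Autoregressive slot rollout; reuses the illustrative SlotEncoder and
# SlotDynamicsGNN defined above (assumptions, not the paper's modules).
import torch


def rollout(encoder, dynamics, first_img, actions):
    """first_img: (B, 3, H, W); actions: (T, B, A).
    Returns the predicted slot trajectory as a list of (B, K, D) tensors."""
    slots = encoder(first_img)
    trajectory = [slots]
    for action in actions:
        # Predictions are fed back in, so errors compound over the horizon,
        # which is why long-horizon accuracy is a meaningful evaluation.
        slots = dynamics(slots, action)
        trajectory.append(slots)
    return trajectory


actions = torch.randn(10, 1, 4)                  # a 10-step action sequence
traj = rollout(encoder, dynamics, torch.randn(1, 3, 128, 128), actions)
```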


Results

  • Multiview Slot Discovery:

    [Figure: ground-truth image (I_gt), reconstruction (I_rec), predicted masks, and reconstructed slots]

  • Long-Horizon Dynamics Rollout:

    [Figure: ground-truth images (I_gt), reconstructions (I_rec), and reconstructed slots over the rollout]

  • Action Planning (one possible slot-space planner is sketched after this list):

    [Figure: goal image, current image (I_current), predicted next image (Pred I_next), predicted masks, and reconstructed slots]
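A learned slot dynamics model supports planning by searching over candidate actions in slot space. The random-shooting loop below is one plausible instantiation, not necessarily the paper's planner: sample candidate actions, roll each through the dynamics model in a single batch, and return the action whose predicted slots fall closest to the slots of the goal image. It reuses the illustrative modules from the sketches above; `plan_action` and the 4-D action space are assumptions.

```python
# Illustrative random-shooting planner in slot space; the paper's actual
# planning procedure may differ. Reuses the sketch modules from above.
import torch


def plan_action(encoder, dynamics, current_img, goal_img, num_candidates=256):
    """Pick the single action whose predicted next slots best match the goal."""
    with torch.no_grad():
        slots = encoder(current_img)                         # (1, K, D)
        goal_slots = encoder(goal_img)                       # (1, K, D)
        candidates = torch.rand(num_candidates, 4) * 2 - 1   # actions in [-1, 1]
        # Roll every candidate through the dynamics model in one batch.
        pred = dynamics(slots.expand(num_candidates, -1, -1), candidates)
        # Score by slot-space distance to the goal; lower is better.
        cost = (pred - goal_slots).pow(2).sum(dim=(1, 2))
        return candidates[cost.argmin()]


best = plan_action(encoder, dynamics,
                   torch.randn(1, 3, 128, 128), torch.randn(1, 3, 128, 128))
```

Scoring in slot space rather than pixel space is the natural fit here: because the slots encode both visual and positional information, distance between predicted and goal slots serves as a compact proxy for how close the predicted scene is to the desired arrangement.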