InvSlotGNN: Unsupervised Discovery of Viewpoint Invariant Multi-Object Representations and Visual Dynamics


Abstract

Learning multi-object dynamics from visual data with unsupervised techniques is challenging because it requires robust object representations that can be learned through robot interactions. In previous work, we introduced a framework with two architectures: SlotTransport, which discovers object-centric representations, referred to as slots, from single-view RGB images, and SlotGNN, which predicts scene dynamics from single-view RGB images and robot interactions using the discovered slots.
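Since this section gives no architectural detail, the following is only a minimal sketch of the general slot-plus-GNN pattern the abstract describes: an encoder maps an RGB image to K slot vectors, and a fully connected graph network, conditioned on the robot action, predicts the slots at the next timestep. The module names, dimensions, and simple mean-aggregated message passing are illustrative assumptions; the actual SlotTransport and SlotGNN architectures are more involved.

```python
# Illustrative sketch only: names, sizes, and aggregation are assumptions,
# not the paper's SlotTransport / SlotGNN architectures.
import torch
import torch.nn as nn


class SlotEncoder(nn.Module):
    """Maps an RGB image to K slot vectors (stand-in for SlotTransport)."""
    def __init__(self, num_slots=5, slot_dim=64):
        super().__init__()
        self.num_slots, self.slot_dim = num_slots, slot_dim
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_slots = nn.Linear(64, num_slots * slot_dim)

    def forward(self, img):                      # img: (B, 3, H, W)
        feat = self.backbone(img)                # (B, 64)
        return self.to_slots(feat).view(-1, self.num_slots, self.slot_dim)


class SlotDynamicsGNN(nn.Module):
    """One message-passing step over a fully connected slot graph,
    conditioned on the robot action (stand-in for SlotGNN)."""
    def __init__(self, slot_dim=64, action_dim=4):
        super().__init__()
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * slot_dim, slot_dim), nn.ReLU())
        self.node_mlp = nn.Linear(2 * slot_dim + action_dim, slot_dim)

    def forward(self, slots, action):            # slots: (B, K, D), action: (B, A)
        B, K, D = slots.shape
        # Build all ordered (receiver, sender) slot pairs and compute messages.
        recv = slots.unsqueeze(2).expand(B, K, K, D)
        send = slots.unsqueeze(1).expand(B, K, K, D)
        msgs = self.edge_mlp(torch.cat([recv, send], dim=-1)).mean(dim=2)
        act = action.unsqueeze(1).expand(B, K, action.size(-1))
        # Residual update: predicted slots at the next timestep.
        return slots + self.node_mlp(torch.cat([slots, msgs, act], dim=-1))


encoder, dynamics = SlotEncoder(), SlotDynamicsGNN()
img = torch.randn(1, 3, 128, 128)
action = torch.randn(1, 4)
next_slots = dynamics(encoder(img), action)      # (1, 5, 64)
```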

This paper extends our previous work by introducing InvSlotGNN, a novel framework for multiview slot discovery and dynamics learning that is invariant to the camera viewpoint.
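The abstract does not state how viewpoint invariance is enforced. One common way to encourage it, shown purely as an illustration and not as the paper's objective, is a cross-view consistency loss: encode two views of the same scene, match slots across views with the Hungarian algorithm, and penalize the feature distance between matched pairs. The function name and L2 cost below are assumptions.

```python
# Hedged illustration of a cross-view slot consistency loss; the actual
# InvSlotGNN training objective may differ.
import torch
from scipy.optimize import linear_sum_assignment


def cross_view_consistency(slots_a, slots_b):
    """slots_a, slots_b: (K, D) slot sets from two views of the same scene."""
    # Pairwise feature distance between every slot in view A and view B.
    cost = torch.cdist(slots_a, slots_b)                    # (K, K)
    row, col = linear_sum_assignment(cost.detach().cpu().numpy())
    # Penalize the distance between matched pairs so the same object is
    # encoded the same way regardless of camera viewpoint.
    return cost[torch.as_tensor(row), torch.as_tensor(col)].mean()


slots_view1 = torch.randn(5, 64, requires_grad=True)
slots_view2 = torch.randn(5, 64, requires_grad=True)
cross_view_consistency(slots_view1, slots_view2).backward()
```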

We demonstrate the effectiveness of SlotTransport in learning multiview object-centric features that accurately encode visual and positional information. We further show that InvSlotGNN is accurate on downstream robotic tasks, including long-horizon prediction and multi-object rearrangement. Finally, our approach proves effective in the real world: with minimal real data, our framework robustly predicts slots and their dynamics in multiview real-world scenarios.
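Long-horizon prediction reduces to applying the learned dynamics autoregressively: encode the first frame once, then feed each predicted slot set back into the model along with the next action. The sketch below reuses the illustrative `encoder` and `dynamics` modules from the first sketch; `rollout` is a hypothetical helper, not the paper's API.

```python
# Autoregressive slot rollout; reuses the illustrative SlotEncoder and
# SlotDynamicsGNN defined above (assumptions, not the paper's modules).
import torch


def rollout(encoder, dynamics, first_img, actions):
    """first_img: (B, 3, H, W); actions: (T, B, A).
    Returns the predicted slot trajectory as a list of (B, K, D) tensors."""
    slots = encoder(first_img)
    trajectory = [slots]
    for action in actions:
        # Predictions are fed back in, so errors compound over the horizon,
        # which is why long-horizon accuracy is a meaningful evaluation.
        slots = dynamics(slots, action)
        trajectory.append(slots)
    return trajectory


actions = torch.randn(10, 1, 4)                  # a 10-step action sequence
traj = rollout(encoder, dynamics, torch.randn(1, 3, 128, 128), actions)
```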


Results

  • Multiview Slot Discovery:

    [Figure: ground-truth image (I_gt), reconstruction (I_rec), predicted masks, and reconstructed slots]

  • Long-Horizon Dynamics Rollout:

    [Figure: ground-truth images (I_gt), reconstructions (I_rec), and reconstructed slots over the rollout]

  • Action Planning (one possible slot-space planner is sketched after this list):

    [Figure: goal image, current image (I_current), predicted next image (Pred I_next), predicted masks, and reconstructed slots]
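A learned slot dynamics model supports planning by searching over candidate actions in slot space. The random-shooting loop below is one plausible instantiation, not necessarily the paper's planner: sample candidate actions, roll each through the dynamics model in a single batch, and return the action whose predicted slots fall closest to the slots of the goal image. It reuses the illustrative modules from the sketches above; `plan_action` and the 4-D action space are assumptions.

```python
# Illustrative random-shooting planner in slot space; the paper's actual
# planning procedure may differ. Reuses the sketch modules from above.
import torch


def plan_action(encoder, dynamics, current_img, goal_img, num_candidates=256):
    """Pick the single action whose predicted next slots best match the goal."""
    with torch.no_grad():
        slots = encoder(current_img)                         # (1, K, D)
        goal_slots = encoder(goal_img)                       # (1, K, D)
        candidates = torch.rand(num_candidates, 4) * 2 - 1   # actions in [-1, 1]
        # Roll every candidate through the dynamics model in one batch.
        pred = dynamics(slots.expand(num_candidates, -1, -1), candidates)
        # Score by slot-space distance to the goal; lower is better.
        cost = (pred - goal_slots).pow(2).sum(dim=(1, 2))
        return candidates[cost.argmin()]


best = plan_action(encoder, dynamics,
                   torch.randn(1, 3, 128, 128), torch.randn(1, 3, 128, 128))
```

Scoring in slot space rather than pixel space is the natural fit here: because the slots encode both visual and positional information, distance between predicted and goal slots serves as a compact proxy for how close the predicted scene is to the desired arrangement.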