3D understanding of the environment is critical for the robustness and performance of robot learning systems. For example, 2D image-based policies can fail under even a slight change in camera viewpoint. However, when constructing a 3D representation, previous approaches often sacrifice either the rich semantics of 2D foundation models or the fast update rate that is crucial for real-time robotic manipulation. In this work, we propose a 3D representation based on 3D Gaussians that is both semantic and dynamic. With only a single or a few camera views, our representation captures a dynamic scene at 30 Hz in real time in response to robot and object movements, which is sufficient for most manipulation tasks. Our key insight in achieving this fast update frequency is to make object-centric updates to the representation. Semantic information is extracted once from pretrained foundation models at initialization, circumventing the inference bottleneck of large models during policy rollouts. Leveraging our object-centric Gaussian representation, we demonstrate a straightforward yet effective way to achieve view robustness for visuomotor policies. Our representation also enables language-conditioned dynamic grasping, in which the robot performs geometric grasps of moving objects specified by open-vocabulary queries.
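One way to read the object-centric update is as a rigid transform applied only to the Gaussians belonging to the tracked object, leaving the rest of the scene untouched. The sketch below illustrates this under our own simplifying assumptions (hypothetical function names, (w, x, y, z) quaternions, and a 4x4 per-frame pose update coming from some external tracker); it is not the system's actual implementation.

```python
import numpy as np

def quat_multiply(q1, q2):
    # Hamilton product of two (w, x, y, z) quaternions.
    w1, x1, y1, z1 = q1
    w2, x2, y2, z2 = q2
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def rotmat_to_quat(R):
    # Rotation matrix -> (w, x, y, z); assumes the rotation angle is not
    # 180 degrees, which holds for small per-frame pose increments.
    w = np.sqrt(max(0.0, 1.0 + R[0, 0] + R[1, 1] + R[2, 2])) / 2.0
    return np.array([w,
                     (R[2, 1] - R[1, 2]) / (4 * w),
                     (R[0, 2] - R[2, 0]) / (4 * w),
                     (R[1, 0] - R[0, 1]) / (4 * w)])

def update_object_gaussians(means, quats, T_obj):
    # Rigidly move one tracked object's Gaussians by the 4x4 pose update
    # T_obj. Only that object's N Gaussians are touched, so each frame's
    # update is O(N) rather than a full-scene re-optimization -- the
    # property that makes a 30 Hz update rate plausible.
    R, t = T_obj[:3, :3], T_obj[:3, 3]
    q_R = rotmat_to_quat(R)
    new_means = means @ R.T + t
    new_quats = np.stack([quat_multiply(q_R, q) for q in quats])
    return new_means, new_quats
```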
When the test view differs from the train view, we render the scene from the train view to close the perception gap. We further restrict rendering to the foreground Gaussians, avoiding perception differences caused by changes in the background.
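A minimal sketch of this idea, using a pinhole projection of foreground Gaussian centers as a stand-in for a full Gaussian rasterizer (the function name, camera convention, and intrinsics matrix K here are illustrative assumptions, not the system's renderer):

```python
import numpy as np

def project_foreground(means, fg_mask, T_cam_world, K):
    # Project only the foreground Gaussians' centers into the *stored
    # training* camera. The policy then always observes the training
    # viewpoint with no background, regardless of where the physical
    # test camera sits.
    pts = means[fg_mask]                              # drop background Gaussians
    pts_cam = pts @ T_cam_world[:3, :3].T + T_cam_world[:3, 3]
    uv = pts_cam @ K.T                                # pinhole projection
    return uv[:, :2] / uv[:, 2:3]                     # perspective divide
```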
Below, we show clips of the raw camera view in the top-right corner and the rendered view using foreground Gaussians in the top-left corner.
With object-centric Gaussian splatting, language-conditioned dynamic grasping becomes a straightforward application. Because the representation is 3D, we can compute grasp poses directly with geometric optimization. Because the representation is semantic, we can generate segmentation masks and CLIP embeddings on the fly. Finally, because the representation is dynamic, we can handle fast-moving objects.
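As one concrete reading of this pipeline, the sketch below pairs a cosine-similarity lookup over per-object CLIP embeddings (computed once at initialization) with a crude principal-axis grasp proposal. The function names and the simplistic grasp rule are our illustrative assumptions, not the actual geometric optimization.

```python
import numpy as np

def select_object(text_emb, object_embs):
    # Pick the object whose stored CLIP embedding best matches the
    # open-vocabulary query, by cosine similarity.
    # text_emb: (D,) query embedding; object_embs: (K, D), one per object.
    sims = object_embs @ text_emb / (
        np.linalg.norm(object_embs, axis=1) * np.linalg.norm(text_emb) + 1e-8)
    return int(np.argmax(sims))

def grasp_axis(means):
    # A crude geometric grasp proposal: grasp at the centroid of the
    # object's Gaussians, closing the gripper along the minor principal
    # axis (the object's thinnest direction).
    center = means.mean(axis=0)
    _, _, Vt = np.linalg.svd(means - center)
    return center, Vt[-1]   # approach point and closing direction
```

Because only the selected object's Gaussians move between frames, both the centroid and the closing axis can be recomputed at the representation's update rate, which is what makes grasping moving targets feasible.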