Learning Grasping Interaction with 3D Geometry-aware Representations
Xinchen Yan*, Mohi Khansari+, Yunfei Bai+, Jasmine Hsu#, Arkanath Pathakx, Abhinav Gupta&, James Davidson#, Honglak Lee#
#Google Brain, +X Inc, &Google Research, xGoogle, *University of Michigan
Learning to interact with objects in the environment is a fundamental AI problem involving perception, motion planning, and control. However, learning representations of such interactions is very challenging due to a high dimensional state space, difficulty in collecting large-scale data, and many variations of an object’s visual appearance (i.e. geometry, material, texture, and illumination). We argue that knowledge of 3D geometry is at the heart of grasping interactions and propose the notion of a geometry-aware learning agent. Our key idea is constraining and regularizing interaction learning through 3D geometry prediction. Specifically, we formulate the learning process of a geometry-aware agent as a two-step procedure: First, the agent learns to construct its geometry-aware representation of the scene from 2D sensory input via generative 3D shape modeling. Finally, it learns to predict grasping outcome with its built-in geometry-aware representation. The geometry-aware representation plays a key role in relating geometry and interaction via a novel learning-free depth projection layer. Our contributions are threefold: (1) we build a grasping dataset from demonstrations in virtual reality (VR*) with rich sensory and interaction annotations; (2) we demonstrate that the learned geometry-aware representation results in a more robust grasping outcome prediction compared to a baseline model; and (3) we demonstrate the benefits of the learned geometry-aware representation in grasping planning.
* We use pybullet for VR and simulation (http://pybullet.org).
We develop a two-stage learning framework that performs 3D shape prediction and grasping outcome prediction with geometry-aware representation. Being able to generate 3D object shapes (e.g., volumetric representation) from any scene given 2D sensory input is a very important feature of our geometry-aware agent. More specifically, in our formulation, the geometry-aware representation is (1) an occupancy grid representation of the scene centered at camera target in the world frame and (2) invariant to camera viewpoint and distance.
Our geometry-aware encoder-decoder network has two components: a 3D shape generation network (generative) and a grasping outcome prediction network (predictive). The shape generation network has a 2D convolutional shape encoder and a 3D deconvolutional shape decoder followed by a global projection layer. The outcome prediction network has a 2D convolutional state encoder and a fully connected outcome predictor with an additional local shape projection layer.
A short video demo on experimental results (please click on the figure below):
We thank Brain Research, Brain Robotics and X colleagues for their help to this project!