# Learning Grasping Interaction with 3D Geometry-aware Representations

### Authors

Xinchen Yan^{*}, Mohi Khansari^{+}, Yunfei Bai^{+}, Jasmine Hsu^{#}, Arkanath Pathak^{x},
Abhinav Gupta^{&}, James Davidson^{#}, Honglak Lee^{#}

#### Affiliations

^{#}Google Brain,
^{+}X Inc,
^{&}Google Research,
^{x}Google,
^{*}University of Michigan

### Resources

### Abstract

Learning to interact with objects in the environment is a fundamental AI problem involving perception, motion planning, and control.
However, learning representations of such interactions is very challenging due to a high-dimensional state space, difficulty in collecting large-scale data, and many variations of an object's visual appearance (i.e. geometry, material, texture, and illumination).
We argue that knowledge of 3D geometry is at the heart of grasping interactions and propose the notion of a geometry-aware learning agent.
Our key idea is constraining and regularizing interaction learning through 3D geometry prediction.
Specifically, we formulate the learning process of a geometry-aware agent as a two-step procedure:
First, the agent learns to construct its geometry-aware representation of the scene from 2D sensory input via generative 3D shape modeling.
Second, it learns to predict the grasping outcome with its built-in geometry-aware representation.
The geometry-aware representation plays a key role in relating geometry and interaction via a novel learning-free depth projection layer.
Our contributions are threefold:
(1) we build a grasping dataset from demonstrations in virtual reality (VR^{*}) with rich sensory and interaction annotations;
(2) we demonstrate that the learned geometry-aware representation results in a more robust grasping outcome prediction compared to a baseline model; and
(3) we demonstrate the benefits of the learned geometry-aware representation in grasping planning.

^{*} We use pybullet for VR and simulation (http://pybullet.org).

### Approach

We develop a two-stage learning framework that performs 3D shape prediction and grasping outcome prediction with a geometry-aware representation. The ability to generate 3D object shapes (e.g., a volumetric representation) of any scene from 2D sensory input is a key capability of our geometry-aware agent. More specifically, in our formulation, the geometry-aware representation is (1) an occupancy grid representation of the scene centered at the camera target in the world frame and (2) invariant to camera viewpoint and distance.
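As a minimal illustration of why a world-frame occupancy grid centered at the camera target is viewpoint-invariant, the sketch below voxelizes a set of world-frame points into such a grid. The function name, grid resolution, and extent are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def points_to_occupancy(points_world, center, grid_size=32, extent=0.5):
    """Voxelize world-frame points into an occupancy grid centered at `center`.

    Because voxelization happens in the world frame around the camera
    target, the resulting grid does not depend on camera pose or distance.
    """
    grid = np.zeros((grid_size,) * 3, dtype=np.float32)
    # Map each point into [0, 1) relative to the cube around `center`,
    # then into integer voxel coordinates.
    rel = (points_world - center + extent) / (2 * extent)
    idx = np.floor(rel * grid_size).astype(int)
    # Keep only points that fall inside the grid volume.
    inside = np.all((idx >= 0) & (idx < grid_size), axis=1)
    idx = idx[inside]
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return grid
```

Translating the whole scene (object plus camera target) leaves the grid unchanged, which is the invariance property the representation relies on.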

### Model Architecture

Our geometry-aware encoder-decoder network has two components: a 3D shape generation network (generative) and a grasping outcome prediction network (predictive). The shape generation network has a 2D convolutional shape encoder and a 3D deconvolutional shape decoder followed by a global projection layer. The outcome prediction network has a 2D convolutional state encoder and a fully connected outcome predictor with an additional local shape projection layer.
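To give a concrete sense of a learning-free projection layer, the sketch below renders a depth map from a voxel occupancy grid by finding the first occupied voxel along each ray. For simplicity it uses an orthographic projection along the z-axis; the paper's projection layer handles camera viewpoint, and all names here are illustrative assumptions:

```python
import numpy as np

def project_depth(occupancy, far=1.0):
    """Learning-free orthographic depth projection of a voxel grid.

    For each (x, y) ray along the z-axis, return the normalized depth of
    the first occupied voxel, or `far` if the ray hits nothing. The
    operation has no trainable parameters, so gradients from a depth
    loss can flow back into the shape decoder that produced `occupancy`.
    """
    depth_res = occupancy.shape[2]
    occupied = occupancy > 0.5
    # argmax over z gives the index of the first occupied voxel
    # (it returns 0 for empty rays, so mask those with `hit`).
    first = np.argmax(occupied, axis=2)
    hit = occupied.any(axis=2)
    return np.where(hit, first / depth_res, far)
```

In the full model, such a projection ties the generative 3D branch to 2D supervision: the decoder's predicted occupancy can be compared against observed depth images without learning the projection itself.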

### Experiments

A short video demo of the experimental results is available (please click on the figure below):

### Acknowledgements

We thank our Brain Research, Brain Robotics, and X colleagues for their help with this project!