Transporter Networks

Jul 21, 2022

Abstract

A simple model that learns to attend to a local region and predict its spatial displacement
10 unique tabletop manipulation tasks
The model could achieve more than 90% success in new configurations using 100 expert demos
Opensource Ravens
Basically like a deep template matching, where you crop an observation around the object and use it as a template

Method

A problem setting is as follows:

f(o_t) \rightarrow a_t = (\mathcal{T}_\text{pick}, \mathcal{T}_\text{place}) \in \mathcal{A}

$\mathcal{T}_\text{pick}, \mathcal{T}_\text{place}$ : the pose of the end effector used to pick / place an object

Learning to Transport

Assumptions:

$\mathcal{T}_\text{pick}, \mathcal{T}_\text{place} \in \mathbb{R}^2$
Immobilizing grasp (i.e., suction gripper) These assumptions provides the following setting:

$\mathcal{T}_\textrm{pick}$ is sampled from a distribution of successful pick poses
For each successful pick pose, there's a corresponding distr. of successful place pose ( $\mathcal{T}_\text{place}$ )

In equation,

f_\text{pick}(o_t) \rightarrow \mathcal{T}_\text{pick},~~~f_\text{place}(o_t, \mathcal{T}_\text{pick}) \rightarow \mathcal{T}_\text{place}

Learning picking

\mathcal{T}_\text{pick} = \arg \max_{(u, v)} \mathcal{Q}_\text{pick}((u, v)| o_t)

$(u, v)$ : pixel location --> We can map each pixel to a pick action :)
$\mathcal{Q}$ is Fully Convolutional Network (FCN)
- Translationally equivariant (i.e., $f_\text{pick}(g \circ o_t ) = g \circ f_\text{pick}(o_t}$ )

Spatially consistent visual representations

What is spatially consistent?: The appearance of an object remains constant across different camera views

They convert RGB-D images into a spartially consistent form by unprojecting to a 3D point cloud and then rendering into an orthographic projection.

\mathcal{Q}_\text{place}(\tau| o_t, \mathcal{T}_\text{pick}) = \psi(o_t [\mathcal{T}_\text{pick}]) * \phi(o_t)[\tau],~~~\mathcal{T}_\text{place} = \arg \max_{{\tau}_i} \mathcal{Q}_\text{place}(\tau_i | o_t, \mathcal{T}_\text{pick})

$\psi(o_t[\mathcal{T}_\text{pick}])$ : a dense feature of a cropped observation (template)
$\phi(o_t)[\tau]$ : a dense feature of a crop at pose $\tau$ . A pose here is a pixel location (search area)

:::messages Learning with Planer Rotations: SE(2) ? They discretize the rotation into $k$ bins, and then rotate the observation accordingly. A trick is to apply FCN $k$ times in parallel, for each rotated $o_t$ . :::