Transporter Networks


  • A simple model that learns to attend to a local region and predict its spatial displacement
  • 10 unique tabletop manipulation tasks
  • The model achieves more than 90% success on new configurations using 100 expert demonstrations
  • Ravens, the simulated benchmark, is open source
  • Essentially deep template matching: crop the observation around the object and use the crop as a template


The problem setting is as follows:

$$f(o_t) \rightarrow a_t = (\mathcal{T}_\text{pick}, \mathcal{T}_\text{place}) \in \mathcal{A}$$
  • $\mathcal{T}_\text{pick}, \mathcal{T}_\text{place}$: the poses of the end effector used to pick / place an object

Learning to Transport


  • $\mathcal{T}_\text{pick}, \mathcal{T}_\text{place} \in \mathbb{R}^2$
  • Immobilizing grasp (e.g., suction gripper)

These assumptions provide the following setting:
  1. $\mathcal{T}_\text{pick}$ is sampled from a distribution of successful pick poses
  2. For each successful pick pose, there is a corresponding distribution of successful place poses ($\mathcal{T}_\text{place}$)

As equations:

$$f_\text{pick}(o_t) \rightarrow \mathcal{T}_\text{pick}, \quad f_\text{place}(o_t, \mathcal{T}_\text{pick}) \rightarrow \mathcal{T}_\text{place}$$

Learning picking

$$\mathcal{T}_\text{pick} = \arg \max_{(u, v)} \mathcal{Q}_\text{pick}((u, v) | o_t)$$
  • $(u, v)$: pixel location, so each pixel can be mapped to a pick action
  • $\mathcal{Q}_\text{pick}$ is a Fully Convolutional Network (FCN)
    • Translationally equivariant (i.e., $f_\text{pick}(g \circ o_t) = g \circ f_\text{pick}(o_t)$ for a translation $g$)
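
The argmax and the equivariance property above can be sketched in NumPy. This is only an illustration: `toy_fcn` is a hypothetical, hand-written 3x3 filter with wrap-around padding that stands in for the learned $\mathcal{Q}_\text{pick}$, and $(u, v)$ is assumed to mean (row, column).

```python
import numpy as np

def toy_fcn(obs):
    """Hypothetical stand-in for Q_pick: a fixed 3x3 filter with wrap-around padding.

    The centre tap has weight 2 and the eight neighbours weight 1.
    """
    out = obs.copy()  # first contribution of the centre tap
    for du in (-1, 0, 1):
        for dv in (-1, 0, 1):
            # Sum of all nine shifted copies (includes the centre once more).
            out += np.roll(obs, shift=(du, dv), axis=(0, 1))
    return out

def pick_pixel(q_values):
    """Return the (u, v) pixel with the highest value in an (H, W) map."""
    u, v = np.unravel_index(np.argmax(q_values), q_values.shape)
    return int(u), int(v)

# Toy observation: a single bright pixel plays the role of the object.
obs = np.zeros((6, 8))
obs[2, 5] = 1.0
pick = pick_pixel(toy_fcn(obs))

# Translate the observation by g = (+1 row, -2 columns).
g = (1, -2)
shifted_obs = np.roll(obs, shift=g, axis=(0, 1))
shifted_pick = pick_pixel(toy_fcn(shifted_obs))

print(pick, shifted_pick)  # (2, 5) (3, 3)
```

Because the filter is convolutional, shifting the input shifts the whole value map, and therefore the selected pick pixel, by the same amount.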

Spatially consistent visual representations

What does "spatially consistent" mean? The appearance of an object remains constant across different camera views.

They convert RGB-D images into a spatially consistent form by unprojecting them to a 3D point cloud and then rendering that point cloud as an orthographic (top-down) projection.
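
A minimal sketch of that unproject-then-render step, assuming a pinhole camera model. The function names, the intrinsics `fx, fy, cx, cy`, the camera pose, and the workspace bounds are all made up for illustration and are not taken from the paper or from Ravens; only depth is rendered here, colour channels would be handled the same way.

```python
import numpy as np

def unproject(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth image to (H*W, 3) camera-frame points."""
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def to_world(points, rotation, translation):
    """Apply an assumed camera-to-world rigid transform to (N, 3) points."""
    return points @ rotation.T + translation

def heightmap(points, bounds, pixel_size):
    """Orthographic top-down render: each cell keeps the largest z it sees.

    Empty cells stay at 0, which is assumed to be the table height.
    """
    (x0, x1), (y0, y1) = bounds
    w = int(round((x1 - x0) / pixel_size))
    h = int(round((y1 - y0) / pixel_size))
    hm = np.zeros((h, w))
    cols = np.floor((points[:, 0] - x0) / pixel_size).astype(int)
    rows = np.floor((points[:, 1] - y0) / pixel_size).astype(int)
    ok = (cols >= 0) & (cols < w) & (rows >= 0) & (rows < h)
    np.maximum.at(hm, (rows[ok], cols[ok]), points[ok, 2])
    return hm

# Toy scene: a camera 1.0 m above the table looking straight down
# sees a flat surface 0.8 m away, i.e. 0.2 m above the table.
depth = np.full((2, 2), 0.8)
points_cam = unproject(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
rotation = np.diag([1.0, -1.0, -1.0])      # camera z axis points down
translation = np.array([0.25, 0.25, 1.0])  # camera position in the world
points_world = to_world(points_cam, rotation, translation)
hm = heightmap(points_world, bounds=((0.0, 0.5), (0.0, 0.5)), pixel_size=0.25)
print(hm)  # each of the 4 cells ends up at height ~0.2
```

Whatever the camera pose, the same world-frame surface lands in the same heightmap cells, which is what makes the representation spatially consistent.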

$$\mathcal{Q}_\text{place}(\tau | o_t, \mathcal{T}_\text{pick}) = \psi(o_t[\mathcal{T}_\text{pick}]) * \phi(o_t)[\tau], \quad \mathcal{T}_\text{place} = \arg \max_{\tau_i} \mathcal{Q}_\text{place}(\tau_i | o_t, \mathcal{T}_\text{pick})$$
  • $\psi(o_t[\mathcal{T}_\text{pick}])$: a dense feature of the observation cropped around the pick (the template)
  • $\phi(o_t)[\tau]$: a dense feature of a crop at pose $\tau$ (the search area). A pose here is a pixel location
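
The cross-correlation between the template and the search features can be sketched as follows. The learned networks $\psi$ and $\phi$ are replaced here by hand-picked stand-ins (an "object" channel and a "slot" channel of a toy observation), which is purely an assumption for illustration; `crop` and `place_heatmap` are hypothetical helper names, and the template is assumed to have odd height and width and to lie fully inside the image.

```python
import numpy as np

def crop(x, center, size):
    """Crop a (C, size, size) window around `center` from a (C, H, W) array."""
    u, v = center
    r = size // 2
    return x[:, u - r:u + r + 1, v - r:v + r + 1]

def place_heatmap(query_feat, key_feat):
    """Cross-correlate a (C, h, w) template with a (C, H, W) feature map.

    Zero padding keeps the output at (H, W), one value per candidate pose.
    """
    _, h, w = query_feat.shape
    _, H, W = key_feat.shape
    padded = np.pad(key_feat, ((0, 0), (h // 2, h // 2), (w // 2, w // 2)))
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[:, i:i + h, j:j + w] * query_feat)
    return out

# Toy observation: channel 0 holds an L-shaped object around (2, 2),
# channel 1 holds a slot of the same shape around (5, 5).
obs = np.zeros((2, 8, 8))
for du, dv in [(-1, 0), (0, 0), (0, 1)]:
    obs[0, 2 + du, 2 + dv] = 1.0
    obs[1, 5 + du, 5 + dv] = 1.0

pick = (2, 2)
query = crop(obs[0:1], pick, size=3)  # stand-in for psi(o_t[T_pick])
key = obs[1:2]                        # stand-in for phi(o_t)

q_place = place_heatmap(query, key)
place = tuple(int(i) for i in np.unravel_index(np.argmax(q_place), q_place.shape))
print(place)  # (5, 5)
```

The heatmap peaks where the slot matches the shape of the picked object, which is the template-matching view of placing.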

:::messages Learning with Planar Rotations (SE(2)): The rotation is discretized into $k$ bins, and the observation is rotated accordingly. The trick is to apply the FCN $k$ times in parallel, once for each rotated $o_t$. :::
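
A rough sketch of the rotation bins, under simplifying assumptions: $k$ is fixed to 4 so that `np.rot90` gives exact rotations without interpolation, the cropped template is rotated rather than the full observation, and the learned features are again replaced by raw toy arrays. `correlate_same` and `place_with_rotation` are hypothetical names.

```python
import numpy as np

K_BINS = 4  # assumed here so that each bin is an exact 90-degree rotation

def correlate_same(query, key):
    """Cross-correlate a (C, h, w) template with a (C, H, W) map, output (H, W)."""
    _, h, w = query.shape
    _, H, W = key.shape
    padded = np.pad(key, ((0, 0), (h // 2, h // 2), (w // 2, w // 2)))
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[:, i:i + h, j:j + w] * query)
    return out

def place_with_rotation(query, key):
    """Score every (rotation bin, u, v) and return the best one plus the volume."""
    rotated = [np.rot90(query, k=b, axes=(1, 2)) for b in range(K_BINS)]
    # In a real model the k correlations would run as one batch; here it is a loop.
    volume = np.stack([correlate_same(q, key) for q in rotated])  # (K, H, W)
    b, u, v = np.unravel_index(np.argmax(volume), volume.shape)
    return int(b), int(u), int(v), volume

# L-shaped template, and a slot that is the template rotated by bin 3.
template = np.zeros((1, 3, 3))
for du, dv in [(-1, 0), (0, 0), (0, 1)]:
    template[0, 1 + du, 1 + dv] = 1.0
key = np.zeros((1, 8, 8))
key[:, 4:7, 4:7] = np.rot90(template, k=3, axes=(1, 2))  # centred at (5, 5)

b, u, v, volume = place_with_rotation(template, key)
print(b, u, v)  # 3 5 5
```

The best rotation bin `b` corresponds to an angle of `b * 360 / K_BINS` degrees in this sketch; finer bins would need an interpolating rotation.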