CLIPort: What and Where Pathways for Robotic Manipulation

Project Webpage:


Language-conditioned imitation learning built on top of CLIP and Transporter Nets (hence CLIPort).

  • A two-stream architecture with semantic and spatial pathways for vision-based manipulation
  • The problem formulation is identical to Transporter, except that they condition it on language instructions
    • Consider a table top manipulation task as a series of pick and place predictions
    • Use representations that are equivariant to translations and rotations
  • The novelty lies in the proposed two-stream architecture, which fuses information from CLIP and Transporter Networks

Input / Output

$$(o_t, l_t) \rightarrow \text{[CLIPort policy]} \rightarrow (\mathcal{T}_\text{pick}, \mathcal{T}_\text{place}) = a_t \in \mathcal{A}$$
  • $o_t$: top-down visual observation (RGB-D)
  • $l_t$: language instruction
  • $\mathcal{T}_\text{pick}, \mathcal{T}_\text{place}$: position and orientation $(x, y, \theta)$ on the top-down view for pick and place

An expert demonstration $\zeta_i$ is:

$$\zeta_i = \{(o_1, l_1, a_1), (o_2, l_2, a_2), ~\ldots~\}$$


Pick Network

$$\mathcal{T}_\text{pick} = \text{argmax}_{(u,v)}~Q_\text{pick} ((u, v) | (o_t, l_t))$$
  • $(u, v)$: predicted pixel position for pick
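
The argmax over pixels can be illustrated with a minimal NumPy sketch. It assumes the pick network has already produced a dense $H \times W$ score map; the function name `pick_pixel` and the (row, column) reading of $(u, v)$ are illustrative choices, not taken from the paper or its code.

```python
import numpy as np


def pick_pixel(q_pick):
    """Return the (u, v) index of the highest-scoring pixel in an H x W map.

    Assumption: (u, v) is read as (row, column) of the score map.
    """
    u, v = np.unravel_index(np.argmax(q_pick), q_pick.shape)
    return int(u), int(v)


# Toy 4 x 5 map standing in for Q_pick((u, v) | (o_t, l_t)).
q_pick = np.zeros((4, 5))
q_pick[2, 3] = 1.0
print(pick_pixel(q_pick))  # (2, 3)
```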

What if pick orientation matters?
In their parallel gripper experiment, they "separate the pick module $Q_\text{pick}$ into two components: locator and rotator. ... The rotator takes a $64 \times 64$ crop of the observation at $(u, v)$ along with the language input and predicts a discrete rotation angle by selecting from one of $k$ rotated crops."

Note that the main experiments use a suction gripper, for which pick orientation can be ignored.

Place Network

$$\mathcal{T}_\text{place} = \text{argmax}_{\Delta \tau}~Q_\text{place} (\Delta \tau | (o_t, l_t), \mathcal{T}_\text{pick}),$$

$$Q_\text{place}(\Delta \tau | (o_t, l_t), \mathcal{T}_\text{pick}) = ({\color{blue} \Phi_\text{query} (o'_t, l_t)} * {\color{green} \Phi_\text{key}(o_t, l_t)})[\Delta \tau]$$
$${\color{blue} o'_t} = \text{Crop}(o_t, \mathcal{T}_\text{pick})$$
  • $*$ denotes cross-correlation. This is exactly what Transporter does
  • $\text{Crop}(o_t, \mathcal{T}_\text{pick})$: crops a $c \times c$ patch from $o_t$ centered at $\mathcal{T}_\text{pick}$
  • $\Delta \tau$: placement pose $(x, y, \theta)$. $\theta$ is discretized into $k = 36$ angles

In practice, the cropped patch is rotated in $k = 36$ ways before being passed to the query network, and the cross-correlation is computed for each rotation.
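
The rotate-then-correlate search can be sketched in plain NumPy. This is a simplified illustration under stated assumptions: it rotates the query features directly (rather than rotating the crop before the query network), uses only multiples of 90 degrees via `np.rot90` so that no interpolation is needed (so $k = 4$ instead of $k = 36$), and reports the top-left corner of the best-matching window. The names `cross_correlate` and `best_placement` are hypothetical.

```python
import numpy as np


def cross_correlate(key, query):
    """Valid-mode cross-correlation of an H x W x D key with an h x w x D query."""
    H, W, _ = key.shape
    h, w, _ = query.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Dot product between the query and the key window at (i, j).
            out[i, j] = np.sum(key[i:i + h, j:j + w] * query)
    return out


def best_placement(key_feat, query_feat, k=4):
    """Return (row, col, rotation index) with the highest correlation score.

    Assumption: the query is square and rotations are multiples of 90 degrees.
    """
    assert query_feat.shape[0] == query_feat.shape[1], "query must be square"
    scores = np.stack([
        cross_correlate(key_feat, np.rot90(query_feat, r, axes=(0, 1)))
        for r in range(k)
    ])  # shape: k x (H - h + 1) x (W - w + 1)
    r, i, j = np.unravel_index(np.argmax(scores), scores.shape)
    return int(i), int(j), int(r)


# Toy example: hide a once-rotated copy of the query inside an empty key map.
query_feat = np.arange(1, 9, dtype=float).reshape(2, 2, 2)
key_feat = np.zeros((6, 6, 2))
key_feat[3:5, 2:4] = np.rot90(query_feat, 1, axes=(0, 1))
print(best_placement(key_feat, query_feat))  # (3, 2, 1)
```

The returned rotation index plays the role of the discretized $\theta$, and the window position plays the role of $(x, y)$ in $\Delta \tau$.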


$f_\text{pick}, \Phi_\text{query}, \Phi_\text{key}$: These are all implemented with the same two-stream NN architecture (described below).

Two stream architecture details

![architecture](/media/posts/cliport/architecture.png =950x)

Spatial stream

Identical to the original Transporter.
The output of each hidden layer, $d^{(l)}_t \in \mathbb{R}^{h \times w \times C'}$, is merged into the semantic stream as described below.

Semantic stream

The language instruction $l_t$:

$$l_t \rightarrow \text{[CLIP sentence encoder]} \rightarrow g_t \in \mathbb{R}^{1 \times 1 \times C} \rightarrow \text{[tile spatially]} \rightarrow g_t^{(l)} \in \mathbb{R}^{h \times w \times C}$$

Top-down observation $\tilde{o}_t$ (excluding depth, as CLIP cannot handle it):

$$\tilde{o}_t \rightarrow \text{[CLIP vis. enc.]} \rightarrow v_t^{(0)} \in \mathbb{R}^{7 \times 7 \times 2048} \rightarrow \text{[dec. layers]}$$
$$\rightarrow v_t^{(l)} \in \mathbb{R}^{h \times w \times C} \rightarrow {\color{blue} [v_t^{(l)} \odot g_t^{(l)}; d_t^{(l)}]} \rightarrow \text{[dec. layer]} \rightarrow v_t^{(l+1)} \rightarrow \cdots$$
  • $\odot$: element-wise product
  • concatenation happens along the channel dimension (the spatial dims $(h \times w)$ are identical)
    • a 1 x 1 conv is immediately applied to bring it back to $\mathbb{R}^{h \times w \times C}$
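
A single fusion step (tile, element-wise product, channel concatenation, 1 x 1 conv) can be sketched in NumPy as follows. This is an illustration of the tensor shapes only, not the authors' implementation: the function name `fuse`, the toy dimensions, and the random weights are assumptions, and the 1 x 1 convolution is written as a bias-free per-pixel linear map over channels.

```python
import numpy as np


def fuse(v, g, d, w_1x1):
    """Fuse semantic features with language and spatial features.

    v:      h x w x C   semantic-stream features v_t^(l)
    g:      C           sentence embedding g_t
    d:      h x w x C'  spatial-stream features d_t^(l)
    w_1x1:  (C + C') x C weights of a bias-free 1 x 1 convolution
    """
    h, w, C = v.shape
    g_tiled = np.broadcast_to(g, (h, w, C))               # tile spatially
    conditioned = v * g_tiled                             # element-wise product
    stacked = np.concatenate([conditioned, d], axis=-1)   # h x w x (C + C')
    return stacked @ w_1x1                                # back to h x w x C


# Toy sizes (hypothetical; real feature maps are much larger).
h, w, C, C_prime = 7, 7, 8, 4
rng = np.random.default_rng(0)
v = rng.normal(size=(h, w, C))
g = rng.normal(size=C)
d = rng.normal(size=(h, w, C_prime))
w_1x1 = rng.normal(size=(C + C_prime, C))

fused = fuse(v, g, d, w_1x1)
print(fused.shape)  # (7, 7, 8)
```

Because the 1 x 1 conv acts independently on every pixel, it changes only the channel count and leaves the $h \times w$ layout untouched.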

What is the output of this two-stream network?
Since $f_\text{pick}, \Phi_\text{query}, \Phi_\text{key}$ are all implemented as this two-stream network, the output is always in $\mathbb{R}^{H \times W \times 1}$, where $H, W$ match the observation size for $f_\text{pick}$ and $\Phi_\text{key}$ (which take the full observation $o_t$), and the cropped observation size for $\Phi_\text{query}$ (which takes the crop $o'_t$).