CLIPort: What and Where Pathways for Robotic Manipulation
Project Webpage: https://cliport.github.io/
- A two-stream architecture with semantic and spatial pathways for vision-based manipulation
- The problem formulation is identical to Transporter, except that they condition it on language instructions
- Considers a tabletop manipulation task as a series of pick and place predictions
- Use representations that are equivariant to translations and rotations
- The novelty lies in the proposed two-stream architecture that fuses information from CLIP and Transporter Networks
Input / Output
- $o_t$: top-down visual observation (RGB-D)
- $l_t$: language instruction
- $a_t = (\mathcal{T}_{\text{pick}}, \mathcal{T}_{\text{place}})$: positions and orientations ($\mathcal{T} \in \mathbb{R}^2 \times SO(2)$) on the top-down view for pick and place
An expert demonstration is a sequence of observation, instruction, and action tuples: $\zeta = \{(o_1, l_1, a_1), (o_2, l_2, a_2), \ldots\}$
Pick: $\mathcal{T}_{\text{pick}} = \arg\max_{(u,v)} Q_{\text{pick}}((u, v) \mid (o_t, l_t))$, where $Q_{\text{pick}} = f_{\text{pick}}(o_t, l_t)$ is a dense pixel-wise heatmap
- $\mathcal{T}_{\text{pick}}$: predicted pixel position for pick
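As a quick illustration, here is a minimal sketch (not the authors' code) of turning a dense pick heatmap into pixel coordinates; `q_pick` is a stand-in for the output of $f_{\text{pick}}(o_t, l_t)$:

```python
import torch

def pick_from_heatmap(q_pick: torch.Tensor):
    """Return the (u, v) pixel with the highest pick affordance.

    q_pick: dense heatmap of shape (H, W), e.g. the output of f_pick(o_t, l_t).
    """
    flat_idx = int(torch.argmax(q_pick))        # index into the flattened H*W map
    u, v = divmod(flat_idx, q_pick.shape[1])    # unravel to (row, column)
    return u, v

# Example with a random heatmap.
print(pick_from_heatmap(torch.rand(320, 160)))
```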
What if pick orientation matters?:
In their parallel gripper experiment, they
separate the pick module into two components: locator and rotator. ... The rotator takes a crop of the observation at $\mathcal{T}_{\text{pick}}$ along with the language input and predicts a discrete rotation angle by selecting from one of $k$ rotated crops.
Note that they use a suction gripper in their main experiments, which can ignore pick orientation.
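As a rough sketch of the rotator idea (assuming a hypothetical scoring module `score_net`; the actual rotator architecture and inputs differ), the crop is rotated $k$ ways and the best-scoring rotation is selected:

```python
import torch
import torchvision.transforms.functional as TF

def select_pick_rotation(crop: torch.Tensor, score_net, k: int = 36) -> float:
    """Score k rotated copies of a pick-centered crop and return the best angle.

    crop: (C, h, w) patch of the observation centered at the predicted pick pixel.
    score_net: hypothetical module that maps a (1, C, h, w) patch to a scalar score.
    """
    angles = [i * 360.0 / k for i in range(k)]
    scores = []
    for angle in angles:
        rotated = TF.rotate(crop, angle)              # in-plane rotation of the patch
        scores.append(score_net(rotated.unsqueeze(0)).reshape(()))
    return angles[int(torch.argmax(torch.stack(scores)))]
```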
Place: $Q_{\text{place}}(\Delta\tau \mid (o_t, l_t), \mathcal{T}_{\text{pick}}) = \Psi(o_t[\mathcal{T}_{\text{pick}}], l_t) * \Phi(o_t, l_t)[\Delta\tau]$
- $*$ denotes cross-correlation. This is exactly what Transporter does
- $o_t[\mathcal{T}_{\text{pick}}]$: crops a patch from $o_t$ centered at $\mathcal{T}_{\text{pick}}$
- $\Delta\tau$: placement pose; $\mathcal{T}_{\text{place}} = \arg\max_{\Delta\tau} Q_{\text{place}}(\Delta\tau \mid (o_t, l_t), \mathcal{T}_{\text{pick}})$. The rotation component is discretized into $k = 36$ angles
In practice, the cropped patch is rotated in $k$ ways before the query network $\Psi$, and cross-correlation is computed for each rotation.
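A minimal sketch of this placement step, assuming the query/key networks have already produced dense feature maps (shapes are illustrative). PyTorch's `conv2d` computes cross-correlation rather than true convolution, so it can serve directly as the $*$ above:

```python
import torch
import torch.nn.functional as F

def place_from_features(query_feats: torch.Tensor, key_feats: torch.Tensor):
    """Cross-correlate k rotated query crops against the key feature map.

    query_feats: (k, d, h, w) -- Psi(o_t[T_pick], l_t) for k rotations of the crop
    key_feats:   (1, d, H, W) -- Phi(o_t, l_t) over the full observation
    Returns (rotation index, row, column) of the best placement.
    """
    # Using the rotated crops as conv kernels yields one H x W heatmap per rotation.
    q_place = F.conv2d(key_feats, query_feats, padding="same")   # (1, k, H, W)
    _, _, H, W = q_place.shape
    flat = int(torch.argmax(q_place))
    rot_idx, rem = divmod(flat, H * W)
    row, col = divmod(rem, W)
    return rot_idx, row, col

# Example with dummy features: 36 rotations, 3-channel embeddings.
psi = torch.rand(36, 3, 64, 64)
phi = torch.rand(1, 3, 160, 320)
print(place_from_features(psi, phi))
```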
$f_{\text{pick}}$, $\Phi$, $\Psi$: These are implemented as the same two-stream NN architecture (described below).
Two-stream architecture details
Spatial stream: identical to the original Transporter (a fully convolutional encoder-decoder over the RGB-D observation).
Outputs of its hidden (decoder) layers are merged into the semantic stream as described below.
The language instruction $l_t$: encoded with CLIP's Transformer-based sentence encoder into a goal embedding, which is tiled spatially and fused into the decoder features.
Top-down observation $o_t$ (RGB only, excluding depth as CLIP cannot handle it): encoded with the frozen CLIP ResNet50 visual encoder, then progressively upsampled by the decoder.
- $\odot$: element-wise product of the tiled language embedding with the decoder features
- Concatenation with the spatial-stream features happens along the channel dimension (spatial dims are identical)
- A $1 \times 1$ conv is immediately applied to bring the fused features back to the original channel dimension
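A hedged sketch of one such fusion step, with assumed channel sizes (the real CLIPort decoder repeats this fusion at several resolutions and uses its own projection layers):

```python
import torch
import torch.nn as nn

class FuseBlock(nn.Module):
    """Fuse semantic-stream features with a tiled language embedding and
    spatial-stream features, then reduce channels with a 1x1 conv."""

    def __init__(self, sem_ch: int, spa_ch: int, lang_dim: int):
        super().__init__()
        self.lang_proj = nn.Linear(lang_dim, sem_ch)     # project goal embedding to sem_ch
        self.reduce = nn.Conv2d(sem_ch + spa_ch, sem_ch, kernel_size=1)

    def forward(self, sem_feats, spa_feats, lang_emb):
        # sem_feats: (B, sem_ch, H, W), spa_feats: (B, spa_ch, H, W), lang_emb: (B, lang_dim)
        g = self.lang_proj(lang_emb)[:, :, None, None]   # tile over spatial dims
        sem = sem_feats * g                              # element-wise product
        fused = torch.cat([sem, spa_feats], dim=1)       # concat along channel dim
        return self.reduce(fused)                        # 1x1 conv back to sem_ch channels

# Example: 64-channel semantic + 32-channel spatial features, 1024-d language embedding.
block = FuseBlock(sem_ch=64, spa_ch=32, lang_dim=1024)
out = block(torch.rand(1, 64, 40, 20), torch.rand(1, 32, 40, 20), torch.rand(1, 1024))
print(out.shape)  # torch.Size([1, 64, 40, 20])
```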
What is the output of this two-stream network?
Since $f_{\text{pick}}$, $\Phi$, and $\Psi$ are all implemented as this two-stream network, the output is always a dense map in $\mathbb{R}^{H \times W}$, where $H \times W$ matches the observation size for $f_{\text{pick}}$ and $\Phi$, and the cropped observation size for $\Psi$.