CLIPort: What and Where Pathways for Robotic Manipulation
Project Webpage: https://cliport.github.io/
Summary
Language-conditioned imitation learning built on top of CLIP and Transporter Nets (hence CLIPort).
- A two-stream architecture with semantic and spatial pathways for vision-based manipulation
- The problem formulation is identical to Transporter, except that they condition it on language instructions
- Consider a tabletop manipulation task as a series of pick-and-place predictions
- Use representations that are equivariant to translations and rotations
- The novelty lies in the proposed two-stream architecture that fuses information from CLIP and Transporter Networks
Input / Output
- $o_t$: top-down visual observation (RGB-D)
- $l_t$: language instruction
- $(\mathcal{T}_\text{pick}, \mathcal{T}_\text{place})$: positions and orientation $(x, y, \theta)$ on the top-down view for pick and place
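An expert demonstration is a sequence of $(o_t, l_t, \mathcal{T}_\text{pick}, \mathcal{T}_\text{place})$ tuples, one per pick-and-place step.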
Models
Pick Network
- $(u, v)$: predicted pixel position for pick
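The pick prediction can be read as an argmax over a dense per-pixel value map. The snippet below is a minimal sketch of that step only, assuming the network has already produced a 2D map of scores; the helper name `argmax_2d`, the toy numbers, and the convention that $(u, v)$ means (row, column) are illustrative assumptions, not taken from the paper's code.

```python
def argmax_2d(q_pick):
    """Return (u, v) of the highest-scoring pixel in a 2D value map.

    Assumption: u indexes rows and v indexes columns.
    """
    best_uv, best_val = (0, 0), float("-inf")
    for u, row in enumerate(q_pick):
        for v, val in enumerate(row):
            if val > best_val:
                best_uv, best_val = (u, v), val
    return best_uv


# Toy 3x4 value map standing in for the pick network output (made-up numbers).
q_pick = [
    [0.1, 0.2, 0.1, 0.0],
    [0.0, 0.3, 0.9, 0.1],
    [0.2, 0.1, 0.4, 0.0],
]
pick_uv = argmax_2d(q_pick)
print(pick_uv)
```

In a real pipeline the value map would have the same height and width as the top-down observation, so the returned index maps directly to a location on the table.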
What if pick orientation matters?
In their parallel gripper experiment, they separate the pick module $Q_\text{pick}$ into two components: locator and rotator.
...
The rotator takes a $64 \times 64$ crop of the observation at $(u, v)$ along with the language input and predicts a discrete rotation angle by selecting one of $k$ rotated crops.
Note that they use a suction gripper in their main experiments, which makes it possible to ignore pick orientation.
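Selecting one of $k$ rotated crops amounts to a classification over rotation bins followed by a lookup from bin index to angle. The sketch below illustrates only that last step. It assumes the bins are evenly spaced over 360 degrees, which the notes do not state, and the function names and scores are hypothetical.

```python
def index_to_angle(index, k):
    """Map a rotation bin index to degrees.

    Assumption: k evenly spaced bins covering 360 degrees.
    """
    return index * 360.0 / k


def select_rotation(scores):
    """Pick the highest-scoring rotated crop; return (bin index, angle in degrees)."""
    k = len(scores)
    best = max(range(k), key=lambda i: scores[i])
    return best, index_to_angle(best, k)


# Made-up scores for k = 4 rotated crops.
scores = [0.05, 0.10, 0.70, 0.15]
best_bin, angle = select_rotation(scores)
print(best_bin, angle)
```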
Place Network
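The place value map is obtained by cross-correlating features of the crop with features of the full observation, where $*$ denotes cross-correlation. This is exactly what Transporter does.
- $\text{Crop}(o_t, \mathcal{T}_\text{pick})$: crops a $c \times c$ patch from $o_t$ centered at $\mathcal{T}_\text{pick}$
- $\Delta \tau$: placement pose $(x, y, \theta)$. $\theta$ is discretized into $k = 36$ angles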
In practice, the cropped patch is rotated in
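The cross-correlation step of the place network can be sketched on toy data as follows. This is not the paper's implementation: it correlates raw numbers rather than learned feature maps, reports the top-left corner of the best window rather than its center, and uses only 4 rotations (exact 90-degree turns, clockwise by assumption) instead of $k = 36$, so that no interpolation is needed. All function names are hypothetical.

```python
NUM_ROTATIONS = 4  # assumption for the sketch; the notes use k = 36


def rotate90(patch):
    """Rotate a square patch by 90 degrees clockwise."""
    return [list(row) for row in zip(*patch[::-1])]


def cross_correlate(image, kernel):
    """Valid-mode 2D cross-correlation: slide the kernel without flipping it."""
    c = len(kernel)
    h, w = len(image), len(image[0])
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(c) for b in range(c))
            for j in range(w - c + 1)
        ]
        for i in range(h - c + 1)
    ]


def best_placement(image, crop):
    """Score every (rotation, offset) pair; return the argmax as (x, y, theta_deg)."""
    best, best_val = None, float("-inf")
    kernel = crop
    for r in range(NUM_ROTATIONS):
        scores = cross_correlate(image, kernel)
        for i, row in enumerate(scores):
            for j, val in enumerate(row):
                if val > best_val:
                    best, best_val = (i, j, r * 360 // NUM_ROTATIONS), val
        kernel = rotate90(kernel)  # next rotation bin
    return best


# Toy scene: the crop appears in the image rotated by 90 degrees at offset (1, 2).
crop = [[1, 2],
        [0, 0]]
image = [[0, 0, 0, 0],
         [0, 0, 0, 1],
         [0, 0, 0, 2],
         [0, 0, 0, 0]]
placement = best_placement(image, crop)
print(placement)
```

Stacking the rotated crops as separate kernels turns the search over $(x, y, \theta)$ into a single batched convolution, which is why the rotation is discretized.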
Architecture
Two-stream architecture details
Spatial stream
Identical to the original Transporter.
Output of hidden layer:
Semantic stream
The language instruction
Top-down observation
- $\odot$: element-wise product
- <span style="color:blue">concatenation</span> happens along the channel dimension (spatial dims $(h \times w)$ are identical)
- 1 x 1 conv is immediately applied to bring it back to $\mathbb{R}^{h\times w \times C}$
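The three operations above (element-wise product, channel concatenation, 1 x 1 conv) can be illustrated with nested lists standing in for $h \times w \times C$ tensors. This is a sketch under assumptions: the language vector is taken to be already projected to $C$ channels and tiled over every pixel, the two concatenated maps are taken to be the language-conditioned semantic features and the spatial-stream features from the same layer, and the weights are hand-picked rather than learned. All names are hypothetical.

```python
def tile_and_multiply(features, lang_vec):
    """Element-wise product of an h x w x C map with a C-dim vector tiled over pixels."""
    return [[[f * g for f, g in zip(pixel, lang_vec)] for pixel in row]
            for row in features]


def concat_channels(a, b):
    """Concatenate two h x w x C maps along the channel axis, giving h x w x 2C."""
    return [[pa + pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]


def conv1x1(features, weight):
    """1 x 1 convolution: apply the same (C_out x C_in) linear map at every pixel."""
    return [[[sum(w * x for w, x in zip(w_row, pixel)) for w_row in weight]
             for pixel in row] for row in features]


# Toy tensors with h = 1, w = 2, C = 2 (made-up values).
semantic = [[[1.0, 2.0], [3.0, 4.0]]]
spatial = [[[1.0, 1.0], [0.0, 1.0]]]
lang_vec = [2.0, 0.5]

conditioned = tile_and_multiply(semantic, lang_vec)   # h x w x C
stacked = concat_channels(conditioned, spatial)       # h x w x 2C
# Hand-picked weights that sum the two streams channel by channel.
weight = [[1, 0, 1, 0],
          [0, 1, 0, 1]]
fused = conv1x1(stacked, weight)                      # back to h x w x C
print(fused)
```

Because the 1 x 1 conv mixes channels only, the spatial layout of both streams is preserved, which is why the concatenation requires identical spatial dimensions.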
What is the output of this two-stream network?
Since