CLIPort: What and Where Pathways for Robotic Manipulation
Project Webpage: https://cliport.github.io/
- A two-stream architecture with semantic and spatial pathways for vision-based manipulation
- The problem formulation is identical to Transporter, except that they condition it on language instructions
- Considers a tabletop manipulation task as a series of pick and place predictions
- Use representations that are equivariant to translations and rotations
- The novelty lies in the proposed two-stream architecture that fuses information from CLIP and Transporter Networks
Input / Output
- $o_t$: top-down visual observation (RGB-D)
- $l_t$: language instruction
- $a_t = (\mathcal{T}_{\text{pick}}, \mathcal{T}_{\text{place}})$: positions and orientations ($\mathcal{T} \in \mathbb{R}^2 \times SO(2)$) on the top-down view for pick and place
An expert demonstration is a sequence of observation, instruction, and action tuples: $\zeta = \{(o_1, l_1, a_1), (o_2, l_2, a_2), \ldots\}$
Pick: $\mathcal{T}_{\text{pick}} = \arg\max_{(u,v)} Q_{\text{pick}}((u, v) \mid (o_t, l_t))$, where $Q_{\text{pick}} = f_{\text{pick}}(o_t, l_t)$ is a dense pixel-wise heatmap
- $\mathcal{T}_{\text{pick}}$: predicted pixel position for pick
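As a quick illustration, here is a minimal sketch (not the authors' code) of turning a dense pick heatmap into pixel coordinates; `q_pick` is a stand-in for the output of $f_{\text{pick}}(o_t, l_t)$:

```python
import torch

def pick_from_heatmap(q_pick: torch.Tensor):
    """Return the (u, v) pixel with the highest pick affordance.

    q_pick: dense heatmap of shape (H, W), e.g. the output of f_pick(o_t, l_t).
    """
    flat_idx = int(torch.argmax(q_pick))        # index into the flattened H*W map
    u, v = divmod(flat_idx, q_pick.shape[1])    # unravel to (row, column)
    return u, v

# Example with a random heatmap.
print(pick_from_heatmap(torch.rand(320, 160)))
```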
What if pick orientation matters?:
In their parallel gripper experiment, they
separate the pick module into two components: locator and rotator. ... The rotator takes a crop of the observation at $\mathcal{T}_{\text{pick}}$ along with the language input and predicts a discrete rotation angle by selecting from one of $k$ rotated crops.
Note that they use a suction gripper in their main experiments, which can ignore pick orientation.
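As a rough sketch of the rotator idea (assuming a hypothetical scoring module `score_net`; the actual rotator architecture and inputs differ), the crop is rotated $k$ ways and the best-scoring rotation is selected:

```python
import torch
import torchvision.transforms.functional as TF

def select_pick_rotation(crop: torch.Tensor, score_net, k: int = 36) -> float:
    """Score k rotated copies of a pick-centered crop and return the best angle.

    crop: (C, h, w) patch of the observation centered at the predicted pick pixel.
    score_net: hypothetical module that maps a (1, C, h, w) patch to a scalar score.
    """
    angles = [i * 360.0 / k for i in range(k)]
    scores = []
    for angle in angles:
        rotated = TF.rotate(crop, angle)              # in-plane rotation of the patch
        scores.append(score_net(rotated.unsqueeze(0)).reshape(()))
    return angles[int(torch.argmax(torch.stack(scores)))]
```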
Place: $Q_{\text{place}}(\Delta\tau \mid (o_t, l_t), \mathcal{T}_{\text{pick}}) = \Psi(o_t[\mathcal{T}_{\text{pick}}], l_t) * \Phi(o_t, l_t)[\Delta\tau]$
- $*$ denotes cross-correlation. This is exactly what Transporter does
- $o_t[\mathcal{T}_{\text{pick}}]$: crops a patch from $o_t$ centered at $\mathcal{T}_{\text{pick}}$
- $\Delta\tau$: placement pose; $\mathcal{T}_{\text{place}} = \arg\max_{\Delta\tau} Q_{\text{place}}(\Delta\tau \mid (o_t, l_t), \mathcal{T}_{\text{pick}})$. The rotation component is discretized into $k = 36$ angles
In practice, the cropped patch is rotated in $k$ ways before the query network $\Psi$, and cross-correlation is computed for each rotation.
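A minimal sketch of this placement step, assuming the query/key networks have already produced dense feature maps (shapes are illustrative). PyTorch's `conv2d` computes cross-correlation rather than true convolution, so it can serve directly as the $*$ above:

```python
import torch
import torch.nn.functional as F

def place_from_features(query_feats: torch.Tensor, key_feats: torch.Tensor):
    """Cross-correlate k rotated query crops against the key feature map.

    query_feats: (k, d, h, w) -- Psi(o_t[T_pick], l_t) for k rotations of the crop
    key_feats:   (1, d, H, W) -- Phi(o_t, l_t) over the full observation
    Returns (rotation index, row, column) of the best placement.
    """
    # Using the rotated crops as conv kernels yields one H x W heatmap per rotation.
    q_place = F.conv2d(key_feats, query_feats, padding="same")   # (1, k, H, W)
    _, _, H, W = q_place.shape
    flat = int(torch.argmax(q_place))
    rot_idx, rem = divmod(flat, H * W)
    row, col = divmod(rem, W)
    return rot_idx, row, col

# Example with dummy features: 36 rotations, 3-channel embeddings.
psi = torch.rand(36, 3, 64, 64)
phi = torch.rand(1, 3, 160, 320)
print(place_from_features(psi, phi))
```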
$f_{\text{pick}}$, $\Phi$, $\Psi$: These are implemented as the same two-stream NN architecture (described below).
Two-stream architecture details
Spatial stream: identical to the original Transporter (a fully convolutional encoder-decoder over the RGB-D observation).
Outputs of its hidden (decoder) layers are merged into the semantic stream as described below.
The language instruction $l_t$: encoded with CLIP's Transformer-based sentence encoder into a goal embedding, which is tiled spatially and fused into the decoder features.
Top-down observation $o_t$ (RGB only, excluding depth as CLIP cannot handle it): encoded with the frozen CLIP ResNet50 visual encoder, then progressively upsampled by the decoder.
- $\odot$: element-wise product of the tiled language embedding with the decoder features
- Concatenation with the spatial-stream features happens along the channel dimension (spatial dims are identical)
- A $1 \times 1$ conv is immediately applied to bring the fused features back to the original channel dimension
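A hedged sketch of one such fusion step, with assumed channel sizes (the real CLIPort decoder repeats this fusion at several resolutions and uses its own projection layers):

```python
import torch
import torch.nn as nn

class FuseBlock(nn.Module):
    """Fuse semantic-stream features with a tiled language embedding and
    spatial-stream features, then reduce channels with a 1x1 conv."""

    def __init__(self, sem_ch: int, spa_ch: int, lang_dim: int):
        super().__init__()
        self.lang_proj = nn.Linear(lang_dim, sem_ch)     # project goal embedding to sem_ch
        self.reduce = nn.Conv2d(sem_ch + spa_ch, sem_ch, kernel_size=1)

    def forward(self, sem_feats, spa_feats, lang_emb):
        # sem_feats: (B, sem_ch, H, W), spa_feats: (B, spa_ch, H, W), lang_emb: (B, lang_dim)
        g = self.lang_proj(lang_emb)[:, :, None, None]   # tile over spatial dims
        sem = sem_feats * g                              # element-wise product
        fused = torch.cat([sem, spa_feats], dim=1)       # concat along channel dim
        return self.reduce(fused)                        # 1x1 conv back to sem_ch channels

# Example: 64-channel semantic + 32-channel spatial features, 1024-d language embedding.
block = FuseBlock(sem_ch=64, spa_ch=32, lang_dim=1024)
out = block(torch.rand(1, 64, 40, 20), torch.rand(1, 32, 40, 20), torch.rand(1, 1024))
print(out.shape)  # torch.Size([1, 64, 40, 20])
```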
What is the output of this two-stream network?
Since $f_{\text{pick}}$, $\Phi$, and $\Psi$ are all implemented as this two-stream network, the output is always a dense map in $\mathbb{R}^{H \times W}$, where $H \times W$ matches the observation size for $f_{\text{pick}}$ and $\Phi$, and the cropped observation size for $\Psi$.