Language-conditioned imitation learning built on top of CLIP and Transporter Nets (hence CLIPort).
A two-stream architecture with semantic and spatial pathways for vision-based manipulation
The problem formulation is identical to Transporter, except that they condition it on language instructions
Considers a tabletop manipulation task as a series of pick-and-place predictions
Uses representations that are equivariant to translations and rotations
The novelty lies in the proposed two-stream architecture that fuses information from CLIP and Transporter Networks
Input / Output
$(o_t, l_t) \rightarrow [\text{CLIPort policy}] \rightarrow (\mathcal{T}_{pick}, \mathcal{T}_{place}) = a_t \in \mathcal{A}$
$o_t$: top-down visual observation (RGB-D)
$l_t$: language instruction
$\mathcal{T}_{pick}, \mathcal{T}_{place}$: position and orientation $(x, y, \theta)$ on the top-down view for pick and place
An expert demonstration $\zeta_i$ is:
$\zeta_i = \{(o_1, l_1, a_1), (o_2, l_2, a_2), \dots\}$
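A minimal sketch of how one such demonstration could be stored, assuming a simple (observation, language, action) step structure; the class and field names below are illustrative, not the released CLIPort data format.

```python
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class Step:
    obs: np.ndarray                    # top-down RGB-D observation o_t, e.g. shape (H, W, 4)
    lang: str                          # language instruction l_t
    pick: Tuple[float, float, float]   # pick pose (x, y, theta)
    place: Tuple[float, float, float]  # place pose (x, y, theta)

Demonstration = List[Step]             # zeta_i = [(o_1, l_1, a_1), (o_2, l_2, a_2), ...]
```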
Models
Pick Network
$\mathcal{T}_{pick} = \arg\max_{(u,v)} Q_{pick}\big((u,v) \mid (o_t, l_t)\big)$
$(u, v)$: predicted pixel position for pick
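A minimal sketch of this argmax step, assuming $Q_{pick}$ has already produced a dense $(H, W)$ affordance heatmap; the function name is illustrative.

```python
import torch

def select_pick_pixel(q_pick: torch.Tensor):
    """Return the pixel (u, v) with the highest pick affordance in an (H, W) heatmap."""
    flat_idx = int(torch.argmax(q_pick))        # index into the row-major flattened heatmap
    u, v = divmod(flat_idx, q_pick.shape[1])    # recover (row, column) coordinates
    return u, v
```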
What if pick orientation matters?
In their parallel-gripper experiments, they separate the pick module $Q_{pick}$ into two components: a locator and a rotator.
...
The rotator takes a 64×64 crop of the observation at $(u, v)$ along with the language input and predicts a discrete rotation angle by selecting one of $k$ rotated crops (sketched below).
Note that they use a suction gripper in their main experiments, so pick orientation can be ignored.
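A hedged sketch of that rotator idea: rotate the crop $k$ times, score each rotated crop with a language-conditioned network, and keep the best-scoring angle. `rotator_net`, the crop handling, and the scoring interface are assumptions, not the released code.

```python
import torch
import torchvision.transforms.functional as TF

def select_pick_rotation(obs, u, v, lang_feat, rotator_net, k=36, crop_size=64):
    """Score k rotated crops centered at (u, v) and return the best angle in degrees."""
    half = crop_size // 2
    crop = obs[:, u - half:u + half, v - half:v + half]   # (C, 64, 64); assumes the crop fits
    angles = [i * 360.0 / k for i in range(k)]
    # one forward pass per rotated crop; rotator_net is assumed to return a scalar score
    scores = [rotator_net(TF.rotate(crop.unsqueeze(0), a), lang_feat).item() for a in angles]
    return angles[max(range(k), key=scores.__getitem__)]
```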
Concatenation happens along the channel dimension (spatial dims $(h \times w)$ are identical)
A 1×1 conv is immediately applied to bring it back to $\mathbb{R}^{h \times w \times C}$
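A minimal PyTorch sketch of this fusion step, assuming channels-first tensors and matching spatial dims; the class and layer names are illustrative.

```python
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    """Fuse semantic (CLIP) and spatial (Transporter) features of identical spatial size."""
    def __init__(self, c_semantic, c_spatial, c_out):
        super().__init__()
        # 1x1 conv projects the concatenated channels back down to c_out
        self.proj = nn.Conv2d(c_semantic + c_spatial, c_out, kernel_size=1)

    def forward(self, semantic, spatial):
        # both inputs are (B, C_i, h, w) with the same h, w
        fused = torch.cat([semantic, spatial], dim=1)   # concatenate along the channel dim
        return self.proj(fused)                         # (B, c_out, h, w)
```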
What is the output of this two-stream network?
Since $f_{pick}$, $\Phi_{query}$, and $\Phi_{key}$ are implemented as this two-stream network, the output is always $\mathbb{R}^{H \times W \times 1}$, where $H, W$ match the observation size for $f_{pick}$ and $\Phi_{query}$, and the cropped observation size for $\Phi_{key}$.
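A quick shape check for the fusion sketch above, using dummy tensors and illustrative sizes (for $\Phi_{key}$ the spatial dims would instead match the 64×64 crop):

```python
# illustrative channel counts and top-down resolution, not the paper's exact values
fusion = TwoStreamFusion(c_semantic=512, c_spatial=64, c_out=1)
semantic = torch.randn(1, 512, 320, 160)
spatial = torch.randn(1, 64, 320, 160)
heatmap = fusion(semantic, spatial)
print(heatmap.shape)   # torch.Size([1, 1, 320, 160]) -> an H x W x 1 affordance map
```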