CLIPort: What and Where Pathways for Robotic Manipulation
Project Webpage: https://cliport.github.io/
- A two-stream architecture with semantic and spatial pathways for vision-based manipulation
- The problem formulation is identical to Transporter, except that they condition it on language instructions
- Considers a tabletop manipulation task as a series of pick and place predictions
- Use representations that are equivariant to translations and rotations
- The novelty lies in the proposed two-stream architecture that fuses information from CLIP and Transporter Networks
Input / Output
- $o_t$: top-down visual observation (RGB-D)
- $l_t$: language instruction
- $a_t = (\mathcal{T}_{\text{pick}}, \mathcal{T}_{\text{place}})$: positions and orientations ($\mathcal{T} \in \mathbb{R}^2 \times SO(2)$) on the top-down view for pick and place
An expert demonstration is a sequence of observation, instruction, and action tuples: $\zeta = \{(o_1, l_1, a_1), (o_2, l_2, a_2), \ldots\}$
Pick: $\mathcal{T}_{\text{pick}} = \arg\max_{(u,v)} Q_{\text{pick}}((u, v) \mid (o_t, l_t))$, where $Q_{\text{pick}} = f_{\text{pick}}(o_t, l_t)$ is a dense pixel-wise heatmap
- $\mathcal{T}_{\text{pick}}$: predicted pixel position for pick
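As a quick illustration, here is a minimal sketch (not the authors' code) of turning a dense pick heatmap into pixel coordinates; `q_pick` is a stand-in for the output of $f_{\text{pick}}(o_t, l_t)$:

```python
import torch

def pick_from_heatmap(q_pick: torch.Tensor):
    """Return the (u, v) pixel with the highest pick affordance.

    q_pick: dense heatmap of shape (H, W), e.g. the output of f_pick(o_t, l_t).
    """
    flat_idx = int(torch.argmax(q_pick))        # index into the flattened H*W map
    u, v = divmod(flat_idx, q_pick.shape[1])    # unravel to (row, column)
    return u, v

# Example with a random heatmap.
print(pick_from_heatmap(torch.rand(320, 160)))
```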
What if pick orientation matters?:
In their parallel gripper experiment, they
separate the pick module into two components: locator and rotator. ... The rotator takes a crop of the observation at $\mathcal{T}_{\text{pick}}$ along with the language input and predicts a discrete rotation angle by selecting from one of $k$ rotated crops.
Note that they use a suction gripper in their main experiments, which can ignore pick orientation.
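As a rough sketch of the rotator idea (assuming a hypothetical scoring module `score_net`; the actual rotator architecture and inputs differ), the crop is rotated $k$ ways and the best-scoring rotation is selected:

```python
import torch
import torchvision.transforms.functional as TF

def select_pick_rotation(crop: torch.Tensor, score_net, k: int = 36) -> float:
    """Score k rotated copies of a pick-centered crop and return the best angle.

    crop: (C, h, w) patch of the observation centered at the predicted pick pixel.
    score_net: hypothetical module that maps a (1, C, h, w) patch to a scalar score.
    """
    angles = [i * 360.0 / k for i in range(k)]
    scores = []
    for angle in angles:
        rotated = TF.rotate(crop, angle)              # in-plane rotation of the patch
        scores.append(score_net(rotated.unsqueeze(0)).reshape(()))
    return angles[int(torch.argmax(torch.stack(scores)))]
```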
Place: $Q_{\text{place}}(\Delta\tau \mid (o_t, l_t), \mathcal{T}_{\text{pick}}) = \Psi(o_t[\mathcal{T}_{\text{pick}}], l_t) * \Phi(o_t, l_t)[\Delta\tau]$
- $*$ denotes cross-correlation. This is exactly what Transporter does
- $o_t[\mathcal{T}_{\text{pick}}]$: crops a patch from $o_t$ centered at $\mathcal{T}_{\text{pick}}$
- $\Delta\tau$: placement pose; $\mathcal{T}_{\text{place}} = \arg\max_{\Delta\tau} Q_{\text{place}}(\Delta\tau \mid (o_t, l_t), \mathcal{T}_{\text{pick}})$. The rotation component is discretized into $k = 36$ angles
In practice, the cropped patch is rotated in $k$ ways before the query network $\Psi$, and cross-correlation is computed for each rotation.
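A minimal sketch of this placement step, assuming the query/key networks have already produced dense feature maps (shapes are illustrative). PyTorch's `conv2d` computes cross-correlation rather than true convolution, so it can serve directly as the $*$ above:

```python
import torch
import torch.nn.functional as F

def place_from_features(query_feats: torch.Tensor, key_feats: torch.Tensor):
    """Cross-correlate k rotated query crops against the key feature map.

    query_feats: (k, d, h, w) -- Psi(o_t[T_pick], l_t) for k rotations of the crop
    key_feats:   (1, d, H, W) -- Phi(o_t, l_t) over the full observation
    Returns (rotation index, row, column) of the best placement.
    """
    # Using the rotated crops as conv kernels yields one H x W heatmap per rotation.
    q_place = F.conv2d(key_feats, query_feats, padding="same")   # (1, k, H, W)
    _, _, H, W = q_place.shape
    flat = int(torch.argmax(q_place))
    rot_idx, rem = divmod(flat, H * W)
    row, col = divmod(rem, W)
    return rot_idx, row, col

# Example with dummy features: 36 rotations, 3-channel embeddings.
psi = torch.rand(36, 3, 64, 64)
phi = torch.rand(1, 3, 160, 320)
print(place_from_features(psi, phi))
```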
$f_{\text{pick}}$, $\Phi$, $\Psi$: These are implemented as the same two-stream NN architecture (described below).
Two-stream architecture details
Spatial stream: identical to the original Transporter (a fully convolutional encoder-decoder over the RGB-D observation).
Outputs of its hidden (decoder) layers are merged into the semantic stream as described below.
The language instruction $l_t$: encoded with CLIP's Transformer-based sentence encoder into a goal embedding, which is tiled spatially and fused into the decoder features.
Top-down observation $o_t$ (RGB only, excluding depth as CLIP cannot handle it): encoded with the frozen CLIP ResNet50 visual encoder, then progressively upsampled by the decoder.
- $\odot$: element-wise product of the tiled language embedding with the decoder features
- Concatenation with the spatial-stream features happens along the channel dimension (spatial dims are identical)
- A $1 \times 1$ conv is immediately applied to bring the fused features back to the original channel dimension
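A hedged sketch of one such fusion step, with assumed channel sizes (the real CLIPort decoder repeats this fusion at several resolutions and uses its own projection layers):

```python
import torch
import torch.nn as nn

class FuseBlock(nn.Module):
    """Fuse semantic-stream features with a tiled language embedding and
    spatial-stream features, then reduce channels with a 1x1 conv."""

    def __init__(self, sem_ch: int, spa_ch: int, lang_dim: int):
        super().__init__()
        self.lang_proj = nn.Linear(lang_dim, sem_ch)     # project goal embedding to sem_ch
        self.reduce = nn.Conv2d(sem_ch + spa_ch, sem_ch, kernel_size=1)

    def forward(self, sem_feats, spa_feats, lang_emb):
        # sem_feats: (B, sem_ch, H, W), spa_feats: (B, spa_ch, H, W), lang_emb: (B, lang_dim)
        g = self.lang_proj(lang_emb)[:, :, None, None]   # tile over spatial dims
        sem = sem_feats * g                              # element-wise product
        fused = torch.cat([sem, spa_feats], dim=1)       # concat along channel dim
        return self.reduce(fused)                        # 1x1 conv back to sem_ch channels

# Example: 64-channel semantic + 32-channel spatial features, 1024-d language embedding.
block = FuseBlock(sem_ch=64, spa_ch=32, lang_dim=1024)
out = block(torch.rand(1, 64, 40, 20), torch.rand(1, 32, 40, 20), torch.rand(1, 1024))
print(out.shape)  # torch.Size([1, 64, 40, 20])
```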
What is the output of this two-stream network?
Since $f_{\text{pick}}$, $\Phi$, and $\Psi$ are all implemented as this two-stream network, the output is always a dense map in $\mathbb{R}^{H \times W}$, where $H \times W$ matches the observation size for $f_{\text{pick}}$ and $\Phi$, and the cropped observation size for $\Psi$.