The (Un)surprising Effectiveness of Pre-trained Vision Models for Control

A fundamaental question:

Can we make a single vision model, pre-trained entirely on out-of-domain datasets, work for different control tasks?


  • A model pretrained on completely out-of-domain datasets can be competitive or even better than ground-truth state to learn policies (imitation learning)
  • Self-supervised learning provides better features for control policies
    • Crop augmentations seem more important in SSL compared to color augmentations
  • Early conv layer features are better for MuJoCo (fine-grained control), while later conv layer features a better for semantic tasks (Habitat)
  • Combining features from multiple layers can be competitive or outperform ground-truth state features


Habitat_ImageNav_ task:

  • a photo-realistic navigation environment.
  • The agent is given a pair of images at each timestep (current view and target location)

DeepMind Control Suite:

  • A collection of control envrionments simulated in MuJoCo


  • 28-DoF anthromorphic hand manipulation
  • Relocate and Reorient Pen

Franka Kitchen

  • Franka arm to perform various tasks


  • ResNet
    • pre-trained on ImageNet
  • Momentum Contrast (MoCo)
    • self-supervised method
    • performs instance discrimination
    • uses multiple data augmentations (crop, horizontal flip, color jitter)
  • CLIP
  • Random Features
  • From Scratch
  • Ground-Truth Features

Imitation Learning

Habitat --> use native solver 10000 trajectories per scene LSTM is used to maintain observation history

MuJoCo --> use a policy trained using state