3D Neural Scene Representations for Visuomotor Control
Paper: https://arxiv.org/abs/2107.04004
Summary
- Learn viewpoint-invariant scene representations with NeRF & TCN
- The learned 3D representations perform well on visuomotor control tasks
- Main selling point: This method can cope with new viewpoints by incorporating a 3D representation!
In sentences:
3D Neural Scene Representations for Visuomotor Control
A novel approach to representing a 3D scene with NeRF and a time contrastive loss.
The trained encoder, coupled with an "auto-decoding trick", is shown to produce representations that generalize well to unseen viewpoints.
Experiments are performed on a water-pouring task where the goal configuration is given by an image from an unseen viewpoint.
Key related works:
NeRF, TCN
Main experiment settings
Tasks
FluidPour
- Pouring water with a robot arm
- 20 cameras are placed around the robot
- Goal: reach a goal configuration (specific water-pouring pose)
- The goal configuration is given as images from (1) a viewpoint used during training, (2) an interpolated viewpoint between training viewpoints, and (3) an extrapolated viewpoint
They also have FluidShake, RigidStack, and RigidDrop.
- The goal configuration is given in the same three viewpoint settings as FluidPour.
Overview
Methods
Training objective
- $L_\text{dyn}$: L2 loss between two consecutive states given the action
- $L_\text{tc}$: max-margin (time contrastive) loss across time (see the sketch below)
- $L_\text{rec}$: NeRF reconstruction loss that depends on the state representation $s_t$
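Below is a minimal sketch of the max-margin time contrastive (TCN-style) loss, assuming precomputed embeddings; the function name, margin value, and shapes are illustrative, not the paper's code:

```python
import torch
import torch.nn.functional as F

def time_contrastive_loss(anchor, positive, negative, margin=1.0):
    """Max-margin (triplet) time contrastive loss, TCN-style.

    anchor:   embedding of a frame at time t, viewpoint A
    positive: embedding of the same time t from another viewpoint (pull close)
    negative: embedding of a different time t' (push apart)
    """
    d_pos = (anchor - positive).pow(2).sum(dim=-1)   # squared distance to positive
    d_neg = (anchor - negative).pow(2).sum(dim=-1)   # squared distance to negative
    return F.relu(d_pos - d_neg + margin).mean()

# Usage with dummy embeddings:
a, p, n = torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256)
loss = time_contrastive_loss(a, p, n)
```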
Training procedure:
- Train $f_\text{enc}$ and $f_\text{dec}$ together by minimizing $L_\text{rec}$ and $L_\text{tc}$
- Fix encoder params, and train the dynamics model $f_\text{dyn}$ by minimizing $L_\text{dyn}$ (see the sketch below)
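A minimal sketch of the second stage with dummy modules; every name and dimension here is hypothetical, not taken from the paper:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the paper's modules (shapes are illustrative).
f_enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256))                 # image -> s_t
f_dyn = nn.Sequential(nn.Linear(256 + 4, 256), nn.ReLU(), nn.Linear(256, 256))   # (s_t, a_t) -> s_{t+1}

# Encoder is frozen; only the dynamics model is optimized with L_dyn.
for p in f_enc.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(f_dyn.parameters(), lr=1e-4)

obs_t, obs_next = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)  # dummy batch
action = torch.randn(8, 4)

with torch.no_grad():
    s_t, s_next = f_enc(obs_t), f_enc(obs_next)
pred = f_dyn(torch.cat([s_t, action], dim=-1))
loss = nn.functional.mse_loss(pred, s_next)   # L_dyn: L2 between predicted and actual next state
opt.zero_grad(); loss.backward(); opt.step()
```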
A trick: Inference by Optimization (i.e., Auto-decoding)
Apply the inference-by-optimization (i.e., auto-decoding) framework: backprop through the volumetric renderer and the neural implicit representation into the state estimate.
This is inspired by the fact that the rendering function $f_\text{dec}(x, d, s_t) = (\sigma_t, c_t)$ is viewpoint equivariant: given a state $s_t$, the decoder can render the scene from an arbitrary camera pose. They leverage this and optimize the state estimate by gradient descent so that the rendered image matches the given goal image. Note that this only updates the state estimate; the decoder parameters stay fixed. The resulting state serves as the goal representation for the downstream controller (a sketch follows).
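A minimal sketch of this auto-decoding step, with a frozen single-layer stand-in for the renderer (the real one is the volumetric renderer over $f_\text{dec}$); all shapes and names are illustrative:

```python
import torch

# Frozen stand-in for "render the scene from the goal camera given state s".
render = torch.nn.Linear(256, 3 * 64 * 64)
for p in render.parameters():
    p.requires_grad_(False)

goal_image = torch.randn(3 * 64 * 64)       # goal image from an unseen viewpoint (dummy)

s = torch.zeros(256, requires_grad=True)    # the state estimate is the only free variable
opt = torch.optim.Adam([s], lr=1e-2)
for _ in range(200):
    loss = torch.nn.functional.mse_loss(render(s), goal_image)  # photometric loss
    opt.zero_grad(); loss.backward(); opt.step()
# `s` now encodes the goal configuration and can be handed to the controller.
```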
Not sure
- How does NeRF take an extra state vector in addition to $(x, y, z, \theta, \phi)$? (a conditioning sketch follows)
- Should read pixel-NeRF stuff
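For reference, one common way conditional NeRF variants (pixelNeRF-style) take an extra latent is to concatenate it with the (encoded) position input; a minimal sketch, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ConditionedNeRF(nn.Module):
    """State-conditioned NeRF MLP (illustrative; positional encoding omitted)."""
    def __init__(self, pos_dim=3, dir_dim=3, state_dim=256, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim + state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)            # density sigma_t
        self.color_head = nn.Linear(hidden + dir_dim, 3)  # view-dependent RGB c_t

    def forward(self, x, d, s):
        # s is the scene latent, broadcast to every sampled 3D point x.
        h = self.trunk(torch.cat([x, s], dim=-1))
        sigma = self.sigma_head(h)
        color = torch.sigmoid(self.color_head(torch.cat([h, d], dim=-1)))
        return sigma, color

# Usage: 1024 sampled points, one 256-d scene latent expanded per point.
x, d = torch.randn(1024, 3), torch.randn(1024, 3)
s = torch.randn(256).expand(1024, -1)
sigma, color = ConditionedNeRF()(x, d, s)
```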