Planning with Diffusion for Flexible Behavior Synthesis
Planning with Diffusion
Most trajectory optimization techniques require knowledge of the environment dynamics. The dynamics model is typically learned and then plugged into a conventional planning routine. This has a serious problem: the planner exploits the inaccuracies of the learned dynamics, producing plans that look good under the model but fail in the real environment.
The iterative denoising process of a diffusion model: sampling from perturbed distributions of the form
\tilde{p}_\theta(\tau) \propto p_\theta(\tau) \, h(\tau)
where h(\tau) can contain information about prior evidence.
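A minimal sketch of that iterative denoising loop, assuming a trained noise-prediction network `eps_model(tau, i)` (trained as in the next section) and precomputed DDPM schedule tensors `alphas`, `alphas_cumprod`, `betas`; all names and the fixed-variance step are illustrative assumptions, not the paper's code:

```python
import torch

@torch.no_grad()
def sample_trajectory(eps_model, horizon, transition_dim, alphas, alphas_cumprod, betas):
    """Unconditional reverse diffusion: start from Gaussian noise and
    iteratively denoise a full trajectory array of shape (1, horizon, |s|+|a|)."""
    tau = torch.randn(1, horizon, transition_dim)               # tau^N ~ N(0, I)
    for i in reversed(range(len(betas))):                       # i = N-1, ..., 0
        eps = eps_model(tau, torch.tensor([i]))                 # predicted noise eps_theta(tau^i, i)
        coef = betas[i] / torch.sqrt(1.0 - alphas_cumprod[i])
        mean = (tau - coef * eps) / torch.sqrt(alphas[i])       # posterior mean of tau^{i-1}
        noise = torch.randn_like(tau) if i > 0 else torch.zeros_like(tau)
        tau = mean + torch.sqrt(betas[i]) * noise               # simple fixed-variance step
    return tau                                                  # tau^0: the sampled plan
```

Conditioning enters by modifying this loop, either through gradients of h(\tau) (guided sampling) or by clamping observed values (inpainting), as in the sections below.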
Generative model for trajectory planning
- Temporal ordering: Diffuser predicts all timesteps of a plan concurrently
- Temporal locality: each step of the denoising process makes predictions based only on a local neighborhood of timesteps; composing many denoising steps turns local consistency into global coherence
- Trajectory representation: states and actions are predicted jointly:
\tau = \begin{bmatrix} s_0 & s_1 & \cdots & s_T \\ a_0 & a_1 & \cdots & a_T \end{bmatrix}
- Architecture: a U-Net built from 1-D (temporal) convolutions
- Training (a simplified sketch is given below):
\mathcal{L}(\theta) = \mathbb{E}_{i, \epsilon, \tau^0}\left[ \| \epsilon - \epsilon_\theta(\tau^i, i) \|^2 \right]
The reverse-process covariances follow the cosine schedule of Nichol & Dhariwal (2021).
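A minimal training sketch under the loss above. `TinyNoiseModel` is only a stand-in for the paper's temporal U-Net (it keeps the 1-D convolutions over the horizon but drops the noise-level conditioning), and the linear beta schedule and dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TinyNoiseModel(nn.Module):
    """Stand-in for the temporal U-Net: 1-D convolutions along the planning
    horizon, so each layer only mixes information between nearby timesteps."""
    def __init__(self, transition_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(transition_dim, hidden, kernel_size=5, padding=2), nn.Mish(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.Mish(),
            nn.Conv1d(hidden, transition_dim, kernel_size=5, padding=2),
        )

    def forward(self, tau, i):
        # tau: (batch, horizon, |s|+|a|); i (the noise level) is ignored here,
        # whereas the real model conditions on it via an embedding.
        return self.net(tau.transpose(1, 2)).transpose(1, 2)

def diffusion_loss(model, tau_0, alphas_cumprod):
    """L(theta) = E_{i, eps, tau^0} || eps - eps_theta(tau^i, i) ||^2"""
    batch = tau_0.shape[0]
    i = torch.randint(0, len(alphas_cumprod), (batch,))                 # random diffusion step per sample
    eps = torch.randn_like(tau_0)                                       # noise to be predicted
    a_bar = alphas_cumprod[i].view(batch, 1, 1)
    tau_i = torch.sqrt(a_bar) * tau_0 + torch.sqrt(1.0 - a_bar) * eps   # forward-noised trajectory
    return ((eps - model(tau_i, i)) ** 2).mean()

# States and actions are stacked per timestep into a single 2-D array per trajectory.
state_dim, action_dim, horizon = 4, 2, 32
tau_0 = torch.randn(8, horizon, state_dim + action_dim)                 # (batch, T, |s|+|a|)
alphas_cumprod = torch.cumprod(1.0 - torch.linspace(1e-4, 2e-2, 100), dim=0)
model = TinyNoiseModel(state_dim + action_dim)
print(diffusion_loss(model, tau_0, alphas_cumprod).item())
```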
Reinforcement Learning as Guided Sampling
Procedure (see the sketch after this list):
- Train a diffusion model on all available trajectory data
- Train a separate model to predict the cumulative reward of trajectory samples
- Guide the denoising process with the gradients of the reward predictor, biasing sampled trajectories toward high cumulative reward
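A sketch of value-guided sampling with those two learned models. It reuses the denoising loop from the earlier sketch and shifts each step's mean along the gradient of the return predictor, classifier-guidance style; `value_model`, `guide_scale`, and the fixed variance are assumptions for illustration, not the paper's exact procedure:

```python
import torch

def guided_sample(eps_model, value_model, tau_shape, alphas, alphas_cumprod, betas,
                  guide_scale=0.1):
    """Value-guided reverse diffusion: bias each denoising step toward
    trajectories with high predicted cumulative reward."""
    tau = torch.randn(tau_shape)
    for i in reversed(range(len(betas))):
        with torch.no_grad():
            eps = eps_model(tau, torch.tensor([i]))
            coef = betas[i] / torch.sqrt(1.0 - alphas_cumprod[i])
            mean = (tau - coef * eps) / torch.sqrt(alphas[i])   # unguided posterior mean
        # Gradient of the predicted return w.r.t. the current noisy trajectory.
        tau_grad = tau.detach().requires_grad_(True)
        returns = value_model(tau_grad, torch.tensor([i])).sum()
        grad = torch.autograd.grad(returns, tau_grad)[0]
        mean = mean + guide_scale * betas[i] * grad             # shift the mean uphill on the return
        noise = torch.randn_like(tau) if i > 0 else torch.zeros_like(tau)
        tau = mean + torch.sqrt(betas[i]) * noise
    return tau
```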
Goal-Conditioned RL as Inpainting
Some planning problems are naturally posed as constraint satisfaction problems (e.g., terminating at a goal location).
The perturbation function h(\tau) required for this task is a Dirac delta for observed values and constant elsewhere:
- c_t: state constraint at timestep t
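A sketch of conditioning by inpainting, applying that Dirac-delta perturbation by clamping the constrained state entries after every denoising step; `conditions` and the other names are illustrative assumptions:

```python
import torch

@torch.no_grad()
def inpaint_sample(eps_model, tau_shape, conditions, state_dim,
                   alphas, alphas_cumprod, betas):
    """Goal-conditioned sampling as inpainting. `conditions` maps a timestep t
    to the state c_t the plan must pass through, e.g. {0: start, horizon - 1: goal}."""
    tau = torch.randn(tau_shape)                                # (1, horizon, |s|+|a|)
    for i in reversed(range(len(betas))):
        eps = eps_model(tau, torch.tensor([i]))
        coef = betas[i] / torch.sqrt(1.0 - alphas_cumprod[i])
        mean = (tau - coef * eps) / torch.sqrt(alphas[i])
        noise = torch.randn_like(tau) if i > 0 else torch.zeros_like(tau)
        tau = mean + torch.sqrt(betas[i]) * noise
        for t, c_t in conditions.items():                       # enforce the constraints:
            tau[:, t, :state_dim] = c_t                          # overwrite the state at timestep t
    return tau
```

This is the same conditioning used in the Maze2D experiments below to fix both the start and goal states of the plan.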
Properties of Diffusion Planners
Experiments
Maze2D
- Sparse reward
- Inpainting to condition on both start and goal
- Multi2D: multi-task version where the goal location changes at the beginning of each episode
Block stacking
- All methods are trained on 10,000 demonstration trajectories generated by PDDLStream
- Sparse reward