Planning with Diffusion for Flexible Behavior Synthesis

Planning with Diffusion

Most trajectory optimization techniques require knowledge of the dynamics f. The dynamics model is typically learned from data and then plugged into a conventional planning routine. This has a serious problem: the planner can exploit inaccuracies in the learned dynamics, finding plans that look good under the model but fail in reality.

The iterative denoising process of a diffusion model lends itself to flexible conditioning: it amounts to sampling from a perturbed distribution of the form

\tilde{p}_\theta(\tau) \propto p_\theta(\tau) h(\tau)

The perturbation h(τ) can contain information about prior evidence, desired outcomes (such as a goal), or general functions to optimize (such as rewards).
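
As a rough illustration of this framework, here is a minimal sketch of sampling from the perturbed distribution by folding evidence into each denoising step. The callables denoise_step and apply_evidence are hypothetical stand-ins for a trained model and a particular choice of h(τ); they are not part of the paper's code.

```python
import numpy as np

def sample_perturbed(denoise_step, apply_evidence, shape, n_steps, seed=0):
    """Sample approximately from p~(τ) ∝ p(τ) h(τ) by interleaving the learned
    reverse (denoising) process with an evidence hook.

    denoise_step(tau, i)   -> (mean, std) of p_θ(τ^{i-1} | τ^i)   [hypothetical]
    apply_evidence(tau, i) -> τ adjusted to respect h(τ)           [hypothetical]
    """
    rng = np.random.default_rng(seed)
    tau = rng.standard_normal(shape)              # τ^N ~ N(0, I)
    for i in reversed(range(n_steps)):
        mean, std = denoise_step(tau, i)          # one learned denoising step
        tau = mean + std * rng.standard_normal(shape)
        tau = apply_evidence(tau, i)              # fold in h(τ), e.g. reward guidance or constraints
    return tau
```

The two concrete instantiations below (reward guidance and goal inpainting) are different choices of this evidence hook.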

Generative model for trajectory planning

  • Temporal ordering: Diffuser predicts all timesteps of a plan concurrently
  • Temporal locality: Each step of the denoising process makes predictions based only on local consistency of the trajectory; composed over many denoising steps, local consistency yields global coherence.
  • Trajectory representation: States and actions are predicted jointly:
\tau = \begin{bmatrix} s_0 & s_1 & \ldots & s_T \\ a_0 & a_1 & \ldots & a_T \end{bmatrix}
  • Architecture: a temporal U-Net built from 1-D convolutions over the planning horizon
  • Training:
\mathcal{L}(\theta) = \mathbb{E}_{i, \epsilon, \tau^0} \left[ \| \epsilon - \epsilon_\theta(\tau^i, i) \|^2 \right]

The reverse process covariances Σ^i follow the cosine schedule of Nichol & Dhariwal (2021).
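
A minimal PyTorch sketch of this objective, assuming a hypothetical eps_model (e.g., the temporal U-Net above) and a precomputed noise schedule alphas_cumprod; trajectories are stored as the (state_dim + action_dim) × horizon arrays shown earlier.

```python
import torch
import torch.nn.functional as F

def diffuser_loss(eps_model, tau_0, alphas_cumprod):
    """Simplified ε-prediction objective L(θ) = E_{i,ε,τ⁰} ||ε − ε_θ(τ^i, i)||².

    eps_model(tau_i, i): hypothetical noise-prediction network (e.g., a temporal U-Net)
    tau_0:               clean trajectories, shape (batch, state_dim + action_dim, horizon)
    alphas_cumprod:      cumulative products ᾱ_i of the noise schedule, shape (n_steps,)
    """
    batch, n_steps = tau_0.shape[0], alphas_cumprod.shape[0]
    i = torch.randint(0, n_steps, (batch,), device=tau_0.device)   # random diffusion step
    eps = torch.randn_like(tau_0)                                  # ε ~ N(0, I)
    a_bar = alphas_cumprod[i].view(batch, 1, 1)
    tau_i = a_bar.sqrt() * tau_0 + (1.0 - a_bar).sqrt() * eps      # forward-noised τ^i
    return F.mse_loss(eps_model(tau_i, i), eps)
```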

Reinforcement Learning as Guided Sampling

With binary optimality variables O_t satisfying p(O_t = 1 | s_t, a_t) ∝ exp(r(s_t, a_t)), conditioning the reverse process on optimality gives

p_\theta(\tau^{i-1} \mid \tau^i, \mathcal{O}_{1:T}) \approx \mathcal{N}(\tau^{i-1}; \mu + \Sigma g, \Sigma)
g = \nabla_\tau \log p(\mathcal{O}_{1:T} \mid \tau) \big|_{\tau = \mu} = \sum_{t=0}^{T} \nabla_{s_t, a_t} r(s_t, a_t) \big|_{(s_t, a_t) = \mu_t} = \nabla \mathcal{J}(\mu)

Procedures:

  1. Train a diffusion model p_θ(τ) on all available trajectory data
  2. Train a separate model J_φ to predict the cumulative rewards of trajectory samples τ^i
  3. At planning time, use the gradients of J_φ to guide the reverse process via the update above (see the sketch below)
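
A sketch of one guided reverse step under these assumptions: reward_model stands in for J_φ (returning one scalar return estimate per trajectory in the batch), mu and sigma are the mean and standard deviation of the unguided reverse step, and guide_scale is an extra strength hyperparameter not mentioned above.

```python
import torch

def guided_step(mu, sigma, reward_model, guide_scale=1.0):
    """One reward-guided reverse step: τ^{i-1} ~ N(μ + Σ g, Σ) with g = ∇J_φ(μ).

    mu, sigma:    mean and std of the unguided reverse step p_θ(τ^{i-1} | τ^i)
    reward_model: hypothetical return predictor J_φ, mapping a batch of
                  trajectories to one scalar return estimate per trajectory
    """
    mu = mu.detach().requires_grad_(True)
    g = torch.autograd.grad(reward_model(mu).sum(), mu)[0]   # g = ∇_τ J_φ evaluated at μ
    with torch.no_grad():
        # Σ = σ² for a diagonal Gaussian reverse step; shift the mean by Σ g, then sample
        return mu + guide_scale * sigma ** 2 * g + sigma * torch.randn_like(mu)
```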

Goal-Conditioned RL as Inpainting

Some planning problems are naturally posed as constraint satisfaction problems rather than reward maximization (e.g., terminating at a goal location).

The perturbation function required for this task is a Dirac delta for observed values and constant elsewhere.

h(\tau) = \delta_{c_t}(s_0, a_0, \ldots, s_T, a_T) = \begin{cases} +\infty & \text{if } c_t = s_t \\ 0 & \text{otherwise} \end{cases}
  • c_t: state constraint at timestep t
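
In practice this conditioning amounts to inpainting: after every denoising step, the constrained states are overwritten with their required values. A small sketch under the trajectory representation above; the function and the constraints dict are hypothetical, not part of the paper's API.

```python
import numpy as np

def apply_state_constraints(tau, constraints, state_dim):
    """Inpainting-style conditioning: overwrite constrained states after each
    denoising step so that sampled plans satisfy h(τ) exactly.

    tau:         plan array of shape (state_dim + action_dim, horizon)
    constraints: hypothetical dict {timestep t: constraint value c_t}
    """
    for t, c_t in constraints.items():
        tau[:state_dim, t] = c_t          # clamp s_t to the constraint c_t
    return tau

# e.g., conditioning on a start state and a goal state:
# tau = apply_state_constraints(tau, {0: s_start, horizon - 1: s_goal}, state_dim)
```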

Properties of Diffusion Planners

Experiments

Maze2D

  • sparse reward
  • Inpainting to condition on both start and goal
  • Multi2D: multi-task version where the goal location changes at the beginning of each episode

Block Stacking

  • All methods are trained on 10,000 trajectories from demonstrations generated by PDDLStream
  • Sparse reward