# Planning with Diffusion for Flexible Behavior Synthesis

## Planning with Diffusion

Most trajectory optimization techniques require knowledge of the dynamics $\mathbf{f}$. In practice the dynamics are learned from data and plugged into a conventional planning routine. This has a serious problem: the planner can exploit inaccuracies in the learned dynamics, producing plans that score well under the model but fail in the real environment.

Diffuser instead folds planning into the iterative denoising process of a diffusion model: sampling from a perturbed distribution of the form

$\tilde{p}_\theta(\tau) \propto p_\theta(\tau) h(\tau)$

where the perturbation function $h(\tau)$ can contain information about prior evidence, such as an observation history, a goal, or a reward to optimize.
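As a toy illustration (not Diffuser's mechanism, which perturbs the denoising process itself), one can sample from a one-dimensional distribution of the form $\tilde{p}(\tau) \propto p(\tau) h(\tau)$ with self-normalized importance sampling: draw from $p$ and reweight by $h$. All names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Base model p: standard normal. Perturbation h: prefers large tau.
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)
h = np.exp(2.0 * samples)  # h(tau) = exp(2 tau), so p~(tau) is N(2, 1)

# Self-normalized importance weights give expectations under p~.
weights = h / h.sum()
mean_tilde = np.sum(weights * samples)

print(mean_tilde)  # close to 2.0, since N(0,1) * exp(2 tau) is proportional to N(2,1)
```

The product $\mathcal{N}(0,1) \cdot e^{2\tau} \propto e^{-(\tau-2)^2/2}$, so the perturbed mean shifts from 0 to 2 even though all samples come from the base model.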

## Generative model for trajectory planning

• Temporal ordering: Diffuser predicts all timesteps of a plan concurrently
• Temporal locality: Each step of the denoising process can only make predictions based on local consistency of the trajectory; composed over many denoising steps, local consistency yields global coherence
• Trajectory representation: States and actions are predicted jointly:
$\tau = \begin{bmatrix} s_0 & s_1 & \ldots & s_T \\ a_0 & a_1 & \ldots & a_T \end{bmatrix}$
• Architecture: a U-Net built from one-dimensional (temporal) convolutions
• Training:
$\mathcal{L}(\theta) = \mathbb{E}_{i, \epsilon, \tau^0} \left[ \| \epsilon - \epsilon_\theta(\tau^i, i) \|^2 \right]$

The reverse-process covariances $\Sigma^i$ follow the cosine schedule of Nichol & Dhariwal (2021).
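A minimal sketch of this training objective under toy assumptions: a simplified cosine $\bar{\alpha}$ schedule and a placeholder zero network standing in for $\epsilon_\theta$ (a real Diffuser uses the temporal U-Net). All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

horizon, transition_dim = 32, 6   # T steps, (state, action) stacked per step
n_diffusion_steps = 100

# Simplified cosine alpha-bar schedule (Nichol & Dhariwal, 2021).
s = 0.008
steps = np.arange(n_diffusion_steps + 1)
alphas_bar = np.cos((steps / n_diffusion_steps + s) / (1 + s) * np.pi / 2) ** 2
alphas_bar /= alphas_bar[0]

def eps_model(tau_i, i, W):
    """Placeholder epsilon-prediction network (a trained U-Net in Diffuser)."""
    return tau_i @ W  # shape preserved: (horizon, transition_dim)

def loss(W):
    tau_0 = rng.normal(size=(horizon, transition_dim))   # clean trajectory
    i = rng.integers(1, n_diffusion_steps)               # random diffusion step
    eps = rng.normal(size=tau_0.shape)
    # Forward process: tau^i = sqrt(abar_i) * tau^0 + sqrt(1 - abar_i) * eps
    tau_i = np.sqrt(alphas_bar[i]) * tau_0 + np.sqrt(1 - alphas_bar[i]) * eps
    return np.mean((eps - eps_model(tau_i, i, W)) ** 2)

W = np.zeros((transition_dim, transition_dim))
print(loss(W))  # MSE of the noise against a zero prediction, roughly 1.0
```

In training, $W$ would be the network parameters $\theta$ and the loss would be minimized by gradient descent over sampled $(i, \epsilon, \tau^0)$ triples.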

## Reinforcement Learning as Guided Sampling

$p_\theta (\tau^{i-1} \mid \tau^i, \mathcal{O}_{1:T}) \approx \mathcal{N}(\tau^{i-1}; \mu + \Sigma g, \Sigma)$
$g = \nabla_\tau \log p(\mathcal{O}_{1:T} \mid \tau) \big|_{\tau=\mu} = \sum_{t=0}^{T} \nabla_{s_t, a_t} r(s_t, a_t) \big|_{(s_t, a_t) = \mu_t} = \nabla \mathcal{J}(\mu)$

Procedure:

1. Train a diffusion model $p_\theta(\tau)$ on all available trajectory data
2. Train a separate model $\mathcal{J}_\phi$ to predict the cumulative rewards of noisy trajectory samples $\tau^i$
3. At planning time, sample trajectories with the reverse-process means perturbed by the gradients of $\mathcal{J}_\phi$
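A minimal sketch of guided reverse diffusion under toy assumptions: the denoising mean is a placeholder identity map and $\mathcal{J}$ is a known quadratic return, so its gradient is analytic (in Diffuser both come from trained networks). All names are illustrative.

```python
import numpy as np

horizon, transition_dim = 32, 6
target = np.ones((horizon, transition_dim))  # toy: return is maximal at tau = 1

def grad_J(mu):
    """Gradient of a toy quadratic return J(mu) = -||mu - target||^2 / 2.
    In Diffuser this gradient comes from the learned return model J_phi."""
    return target - mu

def sample(scale, seed=1):
    """Reverse diffusion with guidance: each step draws from
    N(mu + scale * sigma * g, sigma), with a placeholder identity mean."""
    rng = np.random.default_rng(seed)
    tau = rng.normal(size=(horizon, transition_dim))
    for sigma in np.linspace(0.5, 1e-3, 50):  # shrinking variance schedule
        mu = tau  # stand-in for the model's denoised mean
        tau = mu + scale * sigma * grad_J(mu) \
                 + np.sqrt(sigma) * rng.normal(size=tau.shape)
    return tau

guided = np.abs(sample(scale=1.0) - target).mean()
unguided = np.abs(sample(scale=0.0) - target).mean()
print(guided < unguided)  # guidance pulls samples toward high return: True
```

Setting `scale = 0` recovers unguided sampling; the comparison shows that adding $\Sigma g$ to the mean at every step biases the final samples toward high-return trajectories.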

## Goal-Conditioned RL as Inpainting

Some planning problems are naturally posed as constraint satisfaction problems (e.g., terminating at a goal location).
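A minimal sketch of conditioning by inpainting, with a placeholder shrinkage map standing in for the trained denoising step (all names are illustrative): the constrained timesteps are simply overwritten with their observed values after every denoising step.

```python
import numpy as np

rng = np.random.default_rng(0)

horizon, state_dim = 16, 2
start, goal = np.zeros(state_dim), np.ones(state_dim)

def denoise_step(tau, sigma):
    """Placeholder reverse step: shrink noise (a trained model goes here)."""
    return 0.9 * tau + np.sqrt(sigma) * rng.normal(size=tau.shape)

tau = rng.normal(size=(horizon, state_dim))
for sigma in np.linspace(0.5, 1e-3, 50):
    tau = denoise_step(tau, sigma)
    # Inpainting: the Dirac-delta perturbation amounts to clamping the
    # constrained timesteps to their observed values after every step.
    tau[0] = start
    tau[-1] = goal

print(np.allclose(tau[0], start) and np.allclose(tau[-1], goal))  # True
```

The unconstrained timesteps are still denoised jointly with the clamped ones, so the model fills in a trajectory consistent with both the start and the goal.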

The perturbation function required for this task is a Dirac delta for observed values and constant elsewhere:

$h(\tau) = \delta_{c_t}(s_0, a_0, \ldots, s_T, a_T) = \begin{cases} +\infty & \text{if } c_t = s_t \\ 0 & \text{otherwise} \end{cases}$

• $c_t$: state constraint at timestep $t$

## Properties of Diffusion Planners

## Experiments

Maze2D

• sparse reward
• Inpainting to condition on both start and goal
• Multi2D: multi-task version where the goal location changes at the beginning of each episode

Block stacking
• All methods are trained on 10,000 trajectories from demonstrations generated by PDDLStream
• Sparse reward