# Planning with Diffusion for Flexible Behavior Synthesis

## Planning with Diffusion

Most trajectory optimization techniques require knowledge of the dynamics $\mathbf{f}$. In practice the dynamics are learned from data and plugged into a conventional planning routine. This has a serious problem: the planner can exploit inaccuracies in the learned dynamics, producing plans that score well under the model but fail in the real environment.

Diffuser instead folds planning into the iterative denoising process of a diffusion model: sampling from a perturbed distribution of the form

$\tilde{p}_\theta(\tau) \propto p_\theta(\tau) h(\tau)$

where the perturbation function $h(\tau)$ can contain information about prior evidence, such as an observation history, a goal, or a reward to optimize.
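As a toy illustration (not Diffuser's mechanism, which perturbs the denoising process itself), one can sample from a one-dimensional distribution of the form $\tilde{p}(\tau) \propto p(\tau) h(\tau)$ with self-normalized importance sampling: draw from $p$ and reweight by $h$. All names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Base model p: standard normal. Perturbation h: prefers large tau.
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)
h = np.exp(2.0 * samples)  # h(tau) = exp(2 tau), so p~(tau) is N(2, 1)

# Self-normalized importance weights give expectations under p~.
weights = h / h.sum()
mean_tilde = np.sum(weights * samples)

print(mean_tilde)  # close to 2.0, since N(0,1) * exp(2 tau) is proportional to N(2,1)
```

The product $\mathcal{N}(0,1) \cdot e^{2\tau} \propto e^{-(\tau-2)^2/2}$, so the perturbed mean shifts from 0 to 2 even though all samples come from the base model.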

## Generative model for trajectory planning

• Temporal ordering: Diffuser predicts all timesteps of a plan concurrently
• Temporal locality: Each step of the denoising process can only make predictions based on local consistency of the trajectory; composed over many denoising steps, local consistency yields global coherence
• Trajectory representation: States and actions are predicted jointly:
$\tau = \begin{bmatrix} s_0 & s_1 & \ldots & s_T \\ a_0 & a_1 & \ldots & a_T \end{bmatrix}$
• Architecture: a U-Net built from one-dimensional (temporal) convolutions
• Training:
$\mathcal{L}(\theta) = \mathbb{E}_{i, \epsilon, \tau^0} \left[ \| \epsilon - \epsilon_\theta(\tau^i, i) \|^2 \right]$

The reverse-process covariances $\Sigma^i$ follow the cosine schedule of Nichol & Dhariwal (2021).
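A minimal sketch of this training objective under toy assumptions: a simplified cosine $\bar{\alpha}$ schedule and a placeholder zero network standing in for $\epsilon_\theta$ (a real Diffuser uses the temporal U-Net). All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

horizon, transition_dim = 32, 6   # T steps, (state, action) stacked per step
n_diffusion_steps = 100

# Simplified cosine alpha-bar schedule (Nichol & Dhariwal, 2021).
s = 0.008
steps = np.arange(n_diffusion_steps + 1)
alphas_bar = np.cos((steps / n_diffusion_steps + s) / (1 + s) * np.pi / 2) ** 2
alphas_bar /= alphas_bar[0]

def eps_model(tau_i, i, W):
    """Placeholder epsilon-prediction network (a trained U-Net in Diffuser)."""
    return tau_i @ W  # shape preserved: (horizon, transition_dim)

def loss(W):
    tau_0 = rng.normal(size=(horizon, transition_dim))   # clean trajectory
    i = rng.integers(1, n_diffusion_steps)               # random diffusion step
    eps = rng.normal(size=tau_0.shape)
    # Forward process: tau^i = sqrt(abar_i) * tau^0 + sqrt(1 - abar_i) * eps
    tau_i = np.sqrt(alphas_bar[i]) * tau_0 + np.sqrt(1 - alphas_bar[i]) * eps
    return np.mean((eps - eps_model(tau_i, i, W)) ** 2)

W = np.zeros((transition_dim, transition_dim))
print(loss(W))  # MSE of the noise against a zero prediction, roughly 1.0
```

In training, $W$ would be the network parameters $\theta$ and the loss would be minimized by gradient descent over sampled $(i, \epsilon, \tau^0)$ triples.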

## Reinforcement Learning as Guided Sampling

$p_\theta (\tau^{i-1} \mid \tau^i, \mathcal{O}_{1:T}) \approx \mathcal{N}(\tau^{i-1}; \mu + \Sigma g, \Sigma)$
$g = \nabla_\tau \log p(\mathcal{O}_{1:T} \mid \tau) \big|_{\tau=\mu} = \sum_{t=0}^{T} \nabla_{s_t, a_t} r(s_t, a_t) \big|_{(s_t, a_t) = \mu_t} = \nabla \mathcal{J}(\mu)$

Procedure:

1. Train a diffusion model $p_\theta(\tau)$ on all available trajectory data
2. Train a separate model $\mathcal{J}_\phi$ to predict the cumulative rewards of noisy trajectory samples $\tau^i$
3. At planning time, sample trajectories with the reverse-process means perturbed by the gradients of $\mathcal{J}_\phi$
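A minimal sketch of guided reverse diffusion under toy assumptions: the denoising mean is a placeholder identity map and $\mathcal{J}$ is a known quadratic return, so its gradient is analytic (in Diffuser both come from trained networks). All names are illustrative.

```python
import numpy as np

horizon, transition_dim = 32, 6
target = np.ones((horizon, transition_dim))  # toy: return is maximal at tau = 1

def grad_J(mu):
    """Gradient of a toy quadratic return J(mu) = -||mu - target||^2 / 2.
    In Diffuser this gradient comes from the learned return model J_phi."""
    return target - mu

def sample(scale, seed=1):
    """Reverse diffusion with guidance: each step draws from
    N(mu + scale * sigma * g, sigma), with a placeholder identity mean."""
    rng = np.random.default_rng(seed)
    tau = rng.normal(size=(horizon, transition_dim))
    for sigma in np.linspace(0.5, 1e-3, 50):  # shrinking variance schedule
        mu = tau  # stand-in for the model's denoised mean
        tau = mu + scale * sigma * grad_J(mu) \
                 + np.sqrt(sigma) * rng.normal(size=tau.shape)
    return tau

guided = np.abs(sample(scale=1.0) - target).mean()
unguided = np.abs(sample(scale=0.0) - target).mean()
print(guided < unguided)  # guidance pulls samples toward high return: True
```

Setting `scale = 0` recovers unguided sampling; the comparison shows that adding $\Sigma g$ to the mean at every step biases the final samples toward high-return trajectories.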

## Goal-Conditioned RL as Inpainting

Some planning problems are naturally posed as constraint satisfaction problems (e.g., terminating at a goal location).
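A minimal sketch of conditioning by inpainting, with a placeholder shrinkage map standing in for the trained denoising step (all names are illustrative): the constrained timesteps are simply overwritten with their observed values after every denoising step.

```python
import numpy as np

rng = np.random.default_rng(0)

horizon, state_dim = 16, 2
start, goal = np.zeros(state_dim), np.ones(state_dim)

def denoise_step(tau, sigma):
    """Placeholder reverse step: shrink noise (a trained model goes here)."""
    return 0.9 * tau + np.sqrt(sigma) * rng.normal(size=tau.shape)

tau = rng.normal(size=(horizon, state_dim))
for sigma in np.linspace(0.5, 1e-3, 50):
    tau = denoise_step(tau, sigma)
    # Inpainting: the Dirac-delta perturbation amounts to clamping the
    # constrained timesteps to their observed values after every step.
    tau[0] = start
    tau[-1] = goal

print(np.allclose(tau[0], start) and np.allclose(tau[-1], goal))  # True
```

The unconstrained timesteps are still denoised jointly with the clamped ones, so the model fills in a trajectory consistent with both the start and the goal.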

The perturbation function required for this task is a Dirac delta for observed values and constant elsewhere:

$h(\tau) = \delta_{c_t}(s_0, a_0, \ldots, s_T, a_T) = \begin{cases} +\infty & \text{if } c_t = s_t \\ 0 & \text{otherwise} \end{cases}$

• $c_t$: state constraint at timestep $t$

## Properties of Diffusion Planners

## Experiments

Maze2D

• sparse reward
• Inpainting to condition on both start and goal
• Multi2D: multi-task version where the goal location changes at the beginning of each episode

Block stacking
• All methods are trained on 10,000 trajectories from demonstrations generated by PDDLStream
• Sparse reward