# Planning with Diffusion for Flexible Behavior Synthesis

## Planning with Diffusion

Most trajectory optimization techniques require knowledge of the dynamics $\mathbf{f}$. The dynamics model is typically learned from data and plugged into a conventional planning routine. This has a serious problem: a planner can exploit inaccuracies in the learned dynamics, producing plans that score well under the model but fail in the real environment.

The iterative denoising process of a diffusion model lends itself to flexible conditioning: it amounts to sampling from a perturbed distribution of the form

$\tilde{p}_\theta(\tau) \propto p_\theta(\tau) h(\tau)$

$h(\tau)$ can contain information about prior evidence (such as an observation history), desired outcomes (such as a goal), or general functions to optimize (such as rewards).
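The perturbed distribution $\tilde{p}_\theta(\tau) \propto p_\theta(\tau) h(\tau)$ can be illustrated in one dimension with self-normalized importance sampling. This is only a toy sketch: the Gaussian base distribution and the exponentiated-reward $h$ below are stand-ins, not anything from Diffuser itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Base model p(tau): a 1-D Gaussian standing in for the trajectory prior.
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

def h(tau):
    # Perturbation h(tau): an exponentiated "reward" favoring tau near 2.
    # (Placeholder; in Diffuser h encodes rewards or constraints.)
    return np.exp(-0.5 * (tau - 2.0) ** 2)

# Self-normalized importance sampling from p_tilde(tau) ∝ p(tau) h(tau).
w = h(samples)
w /= w.sum()
tilde_mean = np.sum(w * samples)

# Product of unit Gaussians centered at 0 and 2 has mean 1.
print(tilde_mean)  # ≈ 1.0
```

The reweighted samples concentrate where both the base density and the perturbation are large, which is exactly the effect guided denoising achieves without explicit reweighting.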

## Generative model for trajectory planning

**Temporal ordering**: Diffuser predicts all timesteps of a plan concurrently rather than autoregressively.

**Temporal locality**: Each step of the denoising process can only make predictions based on local consistency of the trajectory; composed over many denoising steps, local consistency yields global coherence.

**Trajectory representation**: States and actions are predicted jointly:

$\begin{align}
\tau &= \begin{bmatrix}
s_0 & s_1 & \ldots & s_T \\
a_0 & a_1 & \ldots & a_T
\end{bmatrix}
\end{align}$
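The two-dimensional trajectory array above can be sketched directly; the dimensions below are illustrative, not taken from the paper.

```python
import numpy as np

state_dim, action_dim, horizon = 4, 2, 8  # illustrative sizes

states = np.zeros((state_dim, horizon))
actions = np.zeros((action_dim, horizon))

# Stack states over actions so column t holds (s_t, a_t), matching the
# matrix above; denoising operates on this whole array at once.
tau = np.concatenate([states, actions], axis=0)
print(tau.shape)  # (6, 8)
```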

**Architecture**: a U-Net built from one-dimensional (temporal) convolutions

**Training**: the simplified epsilon-prediction objective

$\mathcal{L}(\theta) = \mathbb{E}_{i, \epsilon, \tau^0}\left[ \| \epsilon - \epsilon_\theta(\tau^i, i) \|^2 \right]$

Reverse-process covariances $\Sigma^i$ follow the cosine schedule of Nichol & Dhariwal (2021).
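A minimal numpy sketch of this objective, assuming a stand-in (untrained) denoiser `eps_theta` and a single illustrative noise level `alpha_bar` rather than the actual cosine schedule:

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_theta(tau_i, i):
    # Stand-in for the U-Net denoiser; a trained model would predict
    # the noise that was added to tau^0.
    return np.zeros_like(tau_i)

tau0 = rng.normal(size=(6, 8))  # clean trajectory array
alpha_bar = 0.7                 # illustrative noise level for step i
i = 10

# Forward process: tau^i = sqrt(abar) * tau^0 + sqrt(1 - abar) * eps
eps = rng.normal(size=tau0.shape)
tau_i = np.sqrt(alpha_bar) * tau0 + np.sqrt(1 - alpha_bar) * eps

# Simplified objective: || eps - eps_theta(tau^i, i) ||^2
loss = np.mean((eps - eps_theta(tau_i, i)) ** 2)
print(loss > 0.0)
```

Training minimizes this loss over random diffusion steps $i$, noise draws $\epsilon$, and dataset trajectories $\tau^0$.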

## Reinforcement Learning as Guided Sampling

$p_\theta (\tau^{i-1} \mid \tau^i, \mathcal{O}_{1:T}) \approx \mathcal{N}(\tau^{i-1}; \mu + \Sigma g, \Sigma)$

$g = \nabla_\tau \log p(\mathcal{O}_{1:T} \mid \tau) \big|_{\tau=\mu}
= \sum_{t=0}^{T} \nabla_{s_t, a_t} r(s_t, a_t) \big|_{(s_t, a_t) = \mu_t} = \nabla \mathcal{J}(\mu)$

Procedure:

- Train a diffusion model $p_\theta(\tau)$ on all available trajectory data
- Train a separate model $\mathcal{J}_\phi$ to predict the cumulative rewards of noisy trajectory samples $\tau^i$
- At planning time, sample with the guided reverse process, shifting each denoising mean $\mu$ by $\Sigma \nabla \mathcal{J}(\mu)$
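One guided denoising step can be sketched as follows. The quadratic stand-in reward and scalar covariance are assumptions for illustration; in Diffuser the gradient comes from the learned return model $\mathcal{J}_\phi$ and $\Sigma$ from the noise schedule.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_J(mu):
    # Gradient of a stand-in return estimate J(mu): a quadratic reward
    # pulling every (s_t, a_t) toward zero. In Diffuser this gradient
    # comes from a learned model J_phi.
    return -mu

def guided_step(mu, sigma):
    # One reverse-diffusion step with the mean shifted by Sigma * g,
    # i.e. sampling from N(mu + Sigma * g, Sigma).
    g = grad_J(mu)
    return mu + sigma * g + np.sqrt(sigma) * rng.normal(size=mu.shape)

mu = rng.normal(size=(6, 8))  # denoising mean for the trajectory array
sigma = 0.01                  # scalar stand-in for the covariance

tau_next = guided_step(mu, sigma)
print(tau_next.shape)  # (6, 8)
```

Iterating this step from pure noise down to $i = 0$ yields trajectories biased toward high predicted return.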

## Goal-Conditioned RL as Inpainting

Some planning problems are naturally posed as constraint satisfaction problems (e.g., terminating at a goal location).
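In sampling terms, such constraints act like inpainting: after each denoising step, the constrained entries are overwritten with their known values. The state-only layout and goal values below are hypothetical, chosen just to illustrate the mechanic.

```python
import numpy as np

rng = np.random.default_rng(0)

horizon, state_dim = 8, 2
start = np.array([0.0, 0.0])  # hypothetical start state
goal = np.array([1.0, 1.0])   # hypothetical goal state

# Trajectory of states only, one row per timestep (illustrative layout).
tau = rng.normal(size=(horizon, state_dim))

def inpaint(tau, start, goal):
    # Overwrite constrained timesteps with their known values after a
    # denoising step -- the sampling analogue of the Dirac delta
    # perturbation h(tau).
    tau = tau.copy()
    tau[0] = start
    tau[-1] = goal
    return tau

tau = inpaint(tau, start, goal)
print(tau[0], tau[-1])  # constrained endpoints
```

The unconstrained timesteps remain free, so repeated denoising fills in a coherent path between the pinned endpoints.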

The perturbation function required for this task is a Dirac delta for observed values and constant elsewhere:

$h(\tau) = \delta_{c_t}(s_0, a_0, \ldots, s_T, a_T) =
\begin{cases}
+\infty & \text{if } c_t = s_t \\
0 & \text{otherwise}
\end{cases}$

- $c_t$: state constraint at timestep $t$

## Properties of Diffusion Planners

## Experiments

**Maze2D**

- Sparse reward
- Inpainting to condition on both start and goal
**Multi2D**: multi-task version where the goal location changes at the beginning of each episode

**Block stacking**

- All methods are trained on 10,000 trajectories from demonstrations generated by PDDLStream
- Sparse reward