# SE(3)-DiffusionFields: Learning cost functions for joint grasp and motion optimization through diffusion

## Highlights

- learning task-space, data-driven cost functions as diffusion models
- such cost functions are usually trained with cross-entropy or contrastive divergence
- these objectives try to create hard discriminative regions in the modeled data -> this leads to large plateaus in the learned field, with zero or noisy slope

- it is common to rely on task-specific samplers that first generate samples close to low cost regions (??)

Think about a cost function that involves multiple objectives.

Such functions are typically trained with cross-entropy or contrastive divergence.

These create hard discriminative regions in the modeled data, leading to regions with zero gradient.

-> A common solution is to rely on task-specific samplers that provide samples from the low-cost regions before optimization.

Instead, the proposed method forms a *smooth* cost function that provides informative gradients over the entire space.
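To make the plateau problem concrete, here is a minimal 1D sketch. Both cost functions are hypothetical toys of my own (a steep sigmoid standing in for a classifier-like cost, a quadratic standing in for a smooth energy); neither comes from the paper. It only shows that far from the low-cost region the hard cost has no usable gradient while the smooth one does.

```python
import numpy as np

def hard_cost(x, sharpness=50.0):
    """Toy classifier-like cost: ~0 inside |x| < 1, ~1 outside, steep boundary."""
    return 1.0 / (1.0 + np.exp(-sharpness * (np.abs(x) - 1.0)))

def smooth_cost(x):
    """Toy energy-like cost that keeps growing away from the low-cost region."""
    return 0.5 * x ** 2

def numerical_grad(f, x, eps=1e-4):
    """Central finite-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

x_far = 3.0  # a sample far away from the low-cost region around 0
g_hard = numerical_grad(hard_cost, x_far)      # sits on the plateau
g_smooth = numerical_grad(smooth_cost, x_far)  # still points towards 0
print(g_hard, g_smooth)
```

A gradient-based optimizer started at `x_far` has nothing to follow under `hard_cost`, which is why such models need a sampler to initialize near the low-cost region.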

To train a grasp pose generation model, they use the Acronym dataset.

It's a large-scale dataset that contains pairs of `grasp pose` vs `grasp success` data for many different simulated objects.

## Contributions

- learning **smooth cost functions** (i.e., models that can provide informative gradients over the entire space) **in SE(3)** as diffusion models

- Sampling from SE(3) space is not trivial. They provide some mathematical foundation for that

- a single **gradient-based optimization** framework for **jointly resolving grasp and motion generation**

- Their method can generate the grasp and a trajectory that reaches it at the same time

## Background of Diffusion Models

They base their method on DSM: Denoising Score Matching, and sample from the model via Langevin MCMC.
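As a reminder of what DSM optimizes, here is a minimal sketch in plain Euclidean space at a single noise level. This is an assumption-heavy simplification: the paper works in SE(3) and learns an energy, whereas `dsm_loss` and `toy_score` below are hypothetical names of my own operating on vectors. The idea is that the model's score at a noised sample should match the score of the Gaussian perturbation kernel.

```python
import numpy as np

def dsm_loss(score_fn, x, sigma, rng):
    """Monte Carlo estimate of the denoising score matching loss at noise level sigma."""
    eps = rng.standard_normal(x.shape)
    x_noisy = x + sigma * eps
    # Score of the perturbation kernel N(x_noisy; x, sigma^2 I) w.r.t. x_noisy
    target = -(x_noisy - x) / sigma ** 2
    diff = score_fn(x_noisy, sigma) - target
    return 0.5 * np.mean(np.sum(diff ** 2, axis=-1))

def toy_score(x_noisy, sigma):
    """Exact score when every data point sits at the origin (toy assumption)."""
    return -x_noisy / sigma ** 2

def zero_score(x_noisy, sigma):
    """A deliberately wrong score model, for comparison."""
    return np.zeros_like(x_noisy)

data = np.zeros((128, 2))  # toy dataset: all points at the origin
loss_good = dsm_loss(toy_score, data, sigma=0.5, rng=np.random.default_rng(0))
loss_bad = dsm_loss(zero_score, data, sigma=0.5, rng=np.random.default_rng(0))
print(loss_good, loss_bad)
```

With the exact score the loss vanishes, while the wrong model is penalized; training a network amounts to minimizing this quantity over many noise levels.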

## Gaussian Distribution on SE(3) Lie Group

After massaging some equations on the SE(3) Lie group, they represent the probability of a grasp

where

Logmap:
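For reference, a Gaussian on a Lie group such as SE(3) is commonly written in terms of the Logmap of the relative pose. The block below is a sketch under the assumption that this is the form meant here; the symbols $\mathbf{H}_\mu$ (mean pose) and $\Sigma$ (covariance in the tangent space) are labels I introduce for illustration.

$$
q(\mathbf{H} \mid \mathbf{H}_\mu, \Sigma) \propto \exp\left( -\frac{1}{2} \left\| \text{Logmap}(\mathbf{H}_\mu^{-1} \mathbf{H}) \right\|^2_{\Sigma^{-1}} \right)
$$

Here $\mathbf{H}, \mathbf{H}_\mu \in SE(3)$, $\Sigma \in \mathbb{R}^{6 \times 6}$, and $\text{Logmap}: SE(3) \rightarrow \mathbb{R}^6$ sends the relative pose to the tangent space, where an ordinary Mahalanobis distance can be measured.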

## SE(3)-Diffusion Fields (SE(3)-DiF)

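SE(3)-DiF is a mapping from a query point in SE(3), together with a noise level $k$, to a scalar energy: $E_\theta(\mathbf{H}, k)$ with $\mathbf{H} \in SE(3)$.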

Other parts are identical to DSM except that they work in SE(3) space.

## Training SE(3)-GraspDiffusionFields

- They assume the pose of the object is known
- They learn a latent feature for every object (similar to DeepSDF: Deep Signed Distance Field)
- They jointly learn the SDF of an object and grasp diffusion model
- They don't use grasp as input, but they rather use a fixed set of N points around the grasp's center to inform the grasp pose

**Setup & Architecture**

- They pre-define a fixed set of N 3D points around the grasp's center ($\mathbf{x}_g \in \mathbb{R}^{N \times 3}$; grasp frame)
- Given a grasp pose ($\mathbf{H}^w_g$) in world frame, they transform the points to world frame ($\mathbf{x}_w = \mathbf{H}^w_g \mathbf{x}_g \in \mathbb{R}^{N \times 3}$; world frame)
- Then they use the pose of object *m* ($\mathbf{H}_w^{o_m}$) to transform the points again to the object frame ($\mathbf{x}_{o_m} = \mathbf{H}^{o_m}_w \mathbf{x}_w \in \mathbb{R}^{N \times 3}$; object frame)
- Encoder $F_\theta$ takes in the shape code $\mathbf{z}_m \in \mathbb{R}^z$ and the points in object frame $\mathbf{x}_{o_m} \in \mathbb{R}^{N \times 3}$, and outputs the SDF prediction (scalar for each point) and additional features $\psi \in \mathbb{R}^{N \times \psi}$
- Decoder takes both the SDF prediction and the features, and outputs the energy (scalar)
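The chain of frame changes in the list above can be sketched with plain homogeneous transforms. This is only an illustration: the point set, the grasp pose, and the helper `transform_points` are hypothetical values and names of my own, and the encoder/decoder networks are left out.

```python
import numpy as np

def transform_points(H, points):
    """Apply a 4x4 homogeneous transform H to an (N, 3) array of points."""
    return points @ H[:3, :3].T + H[:3, 3]

# Hypothetical fixed point set around the grasp center (grasp frame), N = 4
x_g = np.array([[0.00, 0.00, 0.00],
                [0.01, 0.00, 0.00],
                [0.00, 0.01, 0.00],
                [0.00, 0.00, 0.01]])

# Hypothetical grasp pose in world frame: 90 degree rotation about z plus a translation
H_w_g = np.eye(4)
H_w_g[:3, :3] = np.array([[0.0, -1.0, 0.0],
                          [1.0, 0.0, 0.0],
                          [0.0, 0.0, 1.0]])
H_w_g[:3, 3] = [0.5, 0.0, 0.2]

# World-to-object transform; identity here, as the notes say it is during training
H_om_w = np.eye(4)

x_w = transform_points(H_w_g, x_g)     # grasp frame -> world frame
x_om = transform_points(H_om_w, x_w)   # world frame -> object frame
print(x_om)
```

The array `x_om` is what the encoder would receive together with the shape code, so the grasp pose enters the network only through where these points land on the object.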

TODO: I should just draw a figure by myself.

**Training Objective**

- $o_m$: object
- $z_m \in \mathbb{R}^z$: latent code of the object shape (SDF encoding)
- $\mathcal{L}_\text{dsm}$: Denoising Score Matching loss
- $\mathbf{H}^w_g \in SE(3)$: grasp pose in world frame
- $\mathbf{H}^{o_m}_{w} \in SE(3)$: object pose in world frame (during training, it's identity because it doesn't matter)
- $\mathcal{L}_\text{sdf}$: SDF prediction loss
- $\mathbf{x}_o \in \mathbb{R}^{N \times 3}$: a random 3D point in object frame
- $\text{sdf}_{o_m}: \mathbb{R}^3 \rightarrow \mathbb{R}^1$: Signed Distance Field

## Grasp and motion generation

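They try to find the minimum cost trajectory

$$
\tau^* = \arg\min_{\tau} \mathcal{J}(\tau), \quad \mathcal{J}(\tau) = \sum_j w_j c_j(\tau)
$$

- $c_j(\tau)$: a cost function
- $w_j > 0$: weights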

The grasp generation model (SE(3)-DiF) forms one of the cost functions.

With that, this optimization can take into account grasping pose.

## Cost function for a grasp pose

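Cost function with SE(3)-DiF:

$$
c(\mathbf{q}_t, k) = E_\theta (\phi_{ee} (\mathbf{q}_t), k)
$$

where $\phi_{ee}$ is the forward kinematics that maps the robot configuration $\mathbf{q}_t$ to the end-effector pose in SE(3), and $k$ is the noise level.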

Intuitively, this cost function provides low cost to those robot configurations that lead to good grasps.

How to solve the minimization / optimization problem? --> Diffusion reverse process!

Note that this does NOT involve any training, this is simply the reverse process of diffusion.

- Define the distribution over trajectories as $q(\mathbf{\tau}|k) \propto \exp(- \mathcal{J}(\mathbf{\tau}, k))$ (see Planning as Inference)

- This $q(\mathbf{\tau}|k)$ will be the target distribution

- Run reverse diffusion process: Langevin diffusion process

A small caveat: all cost functions must be differentiable, since the Langevin updates need the gradient $\nabla_{\mathbf{\tau}} \mathcal{J}(\mathbf{\tau}, k)$.
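A minimal sketch of this kind of Langevin-style optimization over a weighted sum of costs, with several assumptions of my own: the trajectory lives in plain Euclidean space rather than SE(3) or joint space, the learned grasp cost is replaced by a quadratic goal cost, the costs do not depend on the noise level, and the noise is scaled by `sigma` so that the last level is pure gradient descent. All function names and values are hypothetical.

```python
import numpy as np

def start_grad(tau, start):
    """Gradient of 0.5 * ||tau[0] - start||^2 (stand-in for a start constraint)."""
    g = np.zeros_like(tau)
    g[0] = tau[0] - start
    return g

def goal_grad(tau, goal):
    """Gradient of 0.5 * ||tau[-1] - goal||^2 (stand-in for the learned grasp cost)."""
    g = np.zeros_like(tau)
    g[-1] = tau[-1] - goal
    return g

def smooth_grad(tau):
    """Gradient of 0.5 * sum_t ||tau[t+1] - tau[t]||^2 (trajectory smoothness)."""
    d = tau[1:] - tau[:-1]
    g = np.zeros_like(tau)
    g[:-1] -= d
    g[1:] += d
    return g

def annealed_langevin(tau0, grad_fns, weights, noise_levels, steps_per_level, alpha, rng):
    """Noisy gradient steps on J = sum_j w_j c_j, with the noise annealed to zero."""
    tau = tau0.copy()
    for sigma in noise_levels:
        for _ in range(steps_per_level):
            grad = sum(w * g(tau) for w, g in zip(weights, grad_fns))
            noise = rng.standard_normal(tau.shape)
            tau = tau - 0.5 * alpha * grad + np.sqrt(alpha) * sigma * noise
    return tau

rng = np.random.default_rng(0)
start, goal = np.zeros(2), np.ones(2)
grad_fns = [lambda t: start_grad(t, start), lambda t: goal_grad(t, goal), smooth_grad]
weights = [1.0, 1.0, 1.0]
tau0 = rng.standard_normal((4, 2))      # random initial trajectory, 4 waypoints in 2D
noise_levels = np.linspace(1.0, 0.0, 11)  # from noisy to noise-free
tau = annealed_langevin(tau0, grad_fns, weights, noise_levels, 200, 0.2, rng)
print(tau)
```

No training happens here: the loop only needs gradients of the costs, which is why every cost term has to be differentiable. For this toy problem the waypoints end up evenly spaced between the start and the goal.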

Scratch

*Smooth* function --> the cost function exposing informative gradients in the entire space.

They learn smooth cost functions in the robot's SE(3) workspace -> task-specific costs.

Acronym dataset

The resulting models allow moving samples to the low-cost regions via an inverse diffusion process.

They combine their learned diffusion model for 6D grasp pose generation with other smooth costs (trajectory smoothness, collision avoidance).

## Contributions

- learning smooth cost functions in SE(3) as diffusion models
- A single gradient-based optimization framework for grasp and motion generation

## Experiments

Three tasks

- Picking with occlusions
- Picking and reorienting an object
- Pick and place in shelves

## Terms

- tangent space: the tangent space at the identity is called the Lie algebra.
- logmap (SE(3) -> R^6) and expmap (R^6 -> SE(3))
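To make the two maps concrete, here is a small NumPy sketch of expmap and logmap for SE(3) using the standard closed-form expressions. The ordering of the 6-vector (translation part first, rotation part second) is an assumption of this sketch, and `logmap` is only valid for rotation angles below pi.

```python
import numpy as np

def hat(w):
    """Map a 3-vector to its skew-symmetric matrix."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def coeffs(theta):
    """Series coefficients for rotation angle theta, with a small-angle fallback."""
    if theta < 1e-6:
        return 1.0, 0.5, 1.0 / 6.0
    A = np.sin(theta) / theta
    B = (1.0 - np.cos(theta)) / theta ** 2
    C = (theta - np.sin(theta)) / theta ** 3
    return A, B, C

def expmap(xi):
    """R^6 -> SE(3). Assumed ordering: xi = (translation part v, rotation part w)."""
    v, w = xi[:3], xi[3:]
    A, B, C = coeffs(np.linalg.norm(w))
    W = hat(w)
    H = np.eye(4)
    H[:3, :3] = np.eye(3) + A * W + B * W @ W          # rotation (Rodrigues formula)
    H[:3, 3] = (np.eye(3) + B * W + C * W @ W) @ v     # translation
    return H

def logmap(H):
    """SE(3) -> R^6, inverse of expmap for rotation angles below pi."""
    R, t = H[:3, :3], H[:3, 3]
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    scale = 0.5 if theta < 1e-6 else theta / (2.0 * np.sin(theta))
    W = scale * (R - R.T)
    w = np.array([W[2, 1], W[0, 2], W[1, 0]])
    _, B, C = coeffs(theta)
    V = np.eye(3) + B * W + C * W @ W
    v = np.linalg.solve(V, t)
    return np.concatenate([v, w])

xi = np.array([0.1, -0.2, 0.3, 0.4, -0.5, 0.6])
H = expmap(xi)
xi_back = logmap(H)
print(np.round(xi_back, 6))
```

The round trip `logmap(expmap(xi))` recovers `xi`, which is the property that lets noise be added and gradients be taken in R^6 while the samples themselves stay on SE(3).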

## Q & A

- What is the main goal of the paper?
- What data format do they use as input?
- What assumptions do they make (object-centric / dense features)?
- What different assumption do we want to make?
- Is it right/wrong/preferable?
- What did they show in figure 1?
- What did they show in figure 2?
- What did they show in exp 1?
- How did they show it?
- Compared to what?
- What is the real world demo?
- Why is this demo difficult? (they did 6 DOF)