SE(3)-DiffusionFields: Learning cost functions for joint grasp and motion optimization through diffusion
- learning task-space, data-driven cost functions as diffusion models
- such cost functions are commonly learned with cross-entropy or contrastive divergence
- these objectives try to create hard discriminative regions in the modeled data -> this leads to large plateaus in the learned field, with zero or noisy slope
- so it is common to rely on task-specific samplers that first generate samples close to low-cost regions
Think about a cost function that involves multiple objectives. Such functions are typically learned with cross-entropy or contrastive divergence. These objectives create hard discriminative regions in the modeled data, which leads to zero gradient over large parts of the space. -> The common workaround is to rely on task-specific samplers that provide samples from low-cost regions before optimization.
Instead, the proposed method learns a smooth cost function that provides informative gradients over the entire space.
To train a grasp pose generation model, they use the Acronym dataset.
It is a large-scale dataset that contains pairs of grasp pose and grasp success data for many different simulated objects.
- learning smooth cost functions in SE(3) as diffusion models (i.e., models that provide informative gradients over the entire space)
- Sampling in SE(3) is not trivial; they provide some mathematical foundation for it
- a single gradient-based optimization framework for jointly resolving grasp and motion generation
- Their method generates the grasp and a trajectory that reaches it at the same time
Background of Diffusion Models
They base their method on DSM: Denoising Score Matching, and sample from the model via Langevin MCMC.
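As a rough illustration of these two ingredients, here is a minimal sketch in plain Euclidean space (not SE(3)), using NumPy. The 1D toy data, the closed-form `true_score`, and all function names are assumptions made for this example; the paper trains a neural energy model instead.

```python
import numpy as np

def dsm_loss(score_fn, x, sigma, rng):
    """Monte Carlo estimate of the DSM loss for a single noise level sigma."""
    eps = rng.standard_normal(x.shape)
    x_noisy = x + sigma * eps
    # The score of the Gaussian perturbation kernel q(x_noisy | x) is -eps / sigma.
    target = -eps / sigma
    return np.mean((score_fn(x_noisy) - target) ** 2)

def langevin_sample(score_fn, x0, step, n_steps, rng):
    """Unadjusted Langevin MCMC: follow the score and inject Gaussian noise."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + 0.5 * step * score_fn(x) + np.sqrt(step) * noise
    return x

rng = np.random.default_rng(0)
sigma = 0.5
data = rng.standard_normal(20000) + 2.0  # toy data ~ N(2, 1)

def true_score(x):
    # Perturbed data is N(2, 1 + sigma^2), so its score is known in closed form.
    return -(x - 2.0) / (1.0 + sigma**2)

def wrong_score(x):
    # Same shape, but centered at the wrong mean.
    return -(x + 1.0) / (1.0 + sigma**2)

# Same noise draw for both evaluations, so the comparison is fair.
loss_true = dsm_loss(true_score, data, sigma, np.random.default_rng(1))
loss_wrong = dsm_loss(wrong_score, data, sigma, np.random.default_rng(1))

samples = langevin_sample(true_score, np.zeros(5000), step=0.01, n_steps=2000, rng=rng)
```

The DSM loss is lower for the correct score than for the shifted one, and the Langevin samples end up distributed like the perturbed data. With an energy model, the score is the negative gradient of the energy.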
Gaussian Distribution on SE(3) Lie Group
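After massaging some equations on the SE(3) Lie group, they represent the probability of a grasp pose $H$ as:

$$
q(H \mid H_\mu, \Sigma) \propto \exp\left(-\frac{1}{2} \left\| \mathrm{Logmap}(H_\mu^{-1} H) \right\|^2_{\Sigma^{-1}}\right)
$$

where $H_\mu \in SE(3)$ is the mean and $\Sigma \in \mathbb{R}^{6 \times 6}$ is the covariance matrix. Logmap: projects an SE(3) pose to a 6-dim vector.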
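A minimal NumPy sketch of this Gaussian on SE(3) follows: an unnormalized log-density built from the logmap of the relative pose, and sampling by pushing a 6-dim Gaussian perturbation through the expmap. The function names, the translation-first ordering of the 6-vector, and the closed-form expmap/logmap used here are assumptions for illustration, not the authors' code. The logmap below is not numerically robust for rotation angles close to pi.

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix of a 3-vector."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def rotation_and_V(w):
    """Rodrigues rotation R and the matrix V that couples translation and rotation."""
    theta = np.linalg.norm(w)
    W = hat(w)
    if theta < 1e-8:  # first-order approximation near the identity
        return np.eye(3) + W, np.eye(3) + 0.5 * W
    A = np.sin(theta) / theta
    B = (1.0 - np.cos(theta)) / theta**2
    C = (theta - np.sin(theta)) / theta**3
    return np.eye(3) + A * W + B * (W @ W), np.eye(3) + B * W + C * (W @ W)

def expmap(xi):
    """R^6 -> SE(3); xi = (v, w), translation part first (an assumed convention)."""
    R, V = rotation_and_V(xi[3:])
    H = np.eye(4)
    H[:3, :3] = R
    H[:3, 3] = V @ xi[:3]
    return H

def logmap(H):
    """SE(3) -> R^6, inverse of expmap for rotation angles below pi."""
    R, t = H[:3, :3], H[:3, 3]
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    skew = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    scale = 0.5 if theta < 1e-8 else theta / (2.0 * np.sin(theta))
    w = scale * skew
    _, V = rotation_and_V(w)
    return np.concatenate([np.linalg.solve(V, t), w])

def log_density_unnormalized(H, H_mu, Sigma):
    """-0.5 * squared Mahalanobis norm of Logmap(H_mu^{-1} H)."""
    d = logmap(np.linalg.inv(H_mu) @ H)
    return -0.5 * d @ np.linalg.solve(Sigma, d)

def sample_pose(H_mu, Sigma, rng):
    """Perturb the mean pose by a Gaussian sample in the 6-dim tangent space."""
    eps = rng.multivariate_normal(np.zeros(6), Sigma)
    return H_mu @ expmap(eps)

rng = np.random.default_rng(0)
xi = np.array([0.1, -0.2, 0.3, 0.4, -0.5, 0.6])
H_mu = expmap(xi)
Sigma = 0.01 * np.eye(6)
H_sample = sample_pose(H_mu, Sigma, rng)
```

Here `logmap(expmap(xi))` recovers `xi`, and the log-density is highest (zero) at the mean pose `H_mu`.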
SE(3)-Diffusion Fields (SE(3)-DiF)
SE(3)-DiF is a mapping from a query point $H \in SE(3)$ to a scalar value:

$$
e = E_\theta(H, k)
$$

where $k$ is the diffusion timestep. Other parts are identical to DSM, except that they work in SE(3) space.
- They assume the pose of the object is known
- They learn a latent feature for every object (similar to DeepSDF: Deep Signed Distance Field)
- They jointly learn the SDF of an object and the grasp diffusion model
- They don't feed the grasp pose to the network directly; instead, they use a fixed set of N points around the grasp's center to represent the grasp pose
Setup & Architecture
- They pre-define a fixed set of N 3D points around the grasp's center (expressed in the grasp frame)
- Given a grasp pose in the world frame, they transform the points to the world frame
- Then they use the pose of object m to transform the points again, into the object frame
- The encoder takes the shape code and the points in the object frame, and outputs an SDF prediction (a scalar for each point) plus additional features
- The decoder takes both the SDF predictions and the features, and outputs the energy (a scalar)
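The two frame changes in the list above can be sketched with homogeneous transforms. This is a minimal NumPy example; the helper names, the example poses, and the random point set are hypothetical. It assumes the object pose is given as an object-to-world transform, so its inverse is what maps world points into the object frame.

```python
import numpy as np

def make_pose(yaw, translation):
    """Build a 4x4 homogeneous pose from a rotation about z and a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    H = np.eye(4)
    H[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    H[:3, 3] = translation
    return H

def transform_points(H, points):
    """Apply a 4x4 pose to an (N, 3) array of points."""
    return points @ H[:3, :3].T + H[:3, 3]

def grasp_points_in_object_frame(points_grasp, H_world_grasp, H_world_object):
    """Grasp frame -> world frame -> object frame."""
    points_world = transform_points(H_world_grasp, points_grasp)
    # Assumption: H_world_object maps object coordinates to world coordinates,
    # so going from world to object requires its inverse.
    return transform_points(np.linalg.inv(H_world_object), points_world)

rng = np.random.default_rng(0)
points_grasp = rng.uniform(-0.05, 0.05, size=(8, 3))  # N = 8 points near the grasp center
H_world_grasp = make_pose(0.3, [0.4, 0.0, 0.2])
H_world_object = make_pose(-1.0, [0.5, 0.1, 0.0])

points_object = grasp_points_in_object_frame(points_grasp, H_world_grasp, H_world_object)
```

The resulting `points_object` array (one row per point) is what the encoder would consume together with the shape code. With an identity object pose, the object-frame points coincide with the world-frame points.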
TODO: I should just draw a figure by myself.
- : object
- : latent code of the object shape (SDF encoding)
- : Denoising Score Matching loss
- : grasp pose in world frame
- : object pose in world frame (during training it is set to identity, because it doesn't matter)
- : SDF prediction loss
- : a random 3D point in object frame
- : Signed Distance Field
Grasp and motion generation
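They try to find the minimum-cost trajectory $\tau^*$:

$$
\tau^* = \arg\min_{\tau} J(\tau) = \arg\min_{\tau} \sum_j \omega_j c_j(\tau)
$$

- $c_j$: a cost function
- $\omega_j$: weights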
The grasp generation model (SE(3)-DiF) forms one of the cost functions, so the optimization can take the grasp pose into account.
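:::message
Cost function with SE(3)-DiF

$$
c(q_t, k) = E_\theta(\phi_{ee}(q_t), k)
$$

where $\phi_{ee}$ is the forward kinematics model, which maps the robot configuration $q_t$ to the end-effector pose.
:::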
Intuitively, this cost function provides low cost to those robot configurations that lead to good grasps.
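To make the composition "energy of the forward kinematics" concrete, here is a toy sketch with a 2-link planar arm. Everything in it is an assumption for illustration: the arm, the quadratic `toy_energy` standing in for the learned SE(3)-DiF energy, and the fact that only the end-effector position (not a full SE(3) pose) is used.

```python
import numpy as np

LINK_LENGTHS = np.array([1.0, 1.0])  # hypothetical 2-link planar arm

def forward_kinematics(q):
    """End-effector position of the planar arm for joint angles q."""
    l1, l2 = LINK_LENGTHS
    return np.array([l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1]),
                     l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1])])

def jacobian(q):
    """Derivative of the end-effector position with respect to q."""
    l1, l2 = LINK_LENGTHS
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [l1 * c1 + l2 * c12, l2 * c12]])

def toy_energy(p, target):
    """Stand-in for the learned energy: low when p is near a good grasp location."""
    return 0.5 * np.sum((p - target) ** 2)

def grasp_cost(q, target):
    """Cost on the configuration = energy evaluated at the forward kinematics."""
    return toy_energy(forward_kinematics(q), target)

def grasp_cost_grad(q, target):
    """Chain rule: J(q)^T times the energy gradient at the end-effector."""
    return jacobian(q).T @ (forward_kinematics(q) - target)

target = np.array([1.2, 0.8])
q = np.array([0.3, 0.5])
cost_before = grasp_cost(q, target)
for _ in range(500):
    q = q - 0.1 * grasp_cost_grad(q, target)  # plain gradient descent in joint space
cost_after = grasp_cost(q, target)
```

Because the cost is differentiable through the kinematics, gradients with respect to the joint angles pull the end-effector toward low-energy regions.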
How to solve the minimization problem? --> Run the reverse diffusion process! Note that this step does NOT involve any training; it is simply the reverse process of diffusion.
- Define the distribution over trajectories as $q(\tau \mid k) \propto \exp(-J(\tau, k))$, where $J$ is the total trajectory cost (see Planning as Inference)
- This will be the target distribution
- Run reverse diffusion process: Langevin diffusion process
A small caveat: all cost functions must be differentiable, because each Langevin step needs the gradient of the total cost with respect to the trajectory.
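The procedure above can be sketched on a toy problem: a 2D trajectory of waypoints, a weighted sum of differentiable costs, and annealed Langevin steps followed by a few noise-free refinement steps. The costs, weights, noise schedule, and step size are all assumptions chosen for this example. Unlike in the paper, the toy costs do not depend on the noise level, and the quadratic goal cost merely stands in for the learned grasp cost.

```python
import numpy as np

START = np.array([0.0, 0.0])
GOAL = np.array([1.0, 1.0])
WEIGHTS = {"smooth": 1.0, "start": 10.0, "goal": 10.0}  # hypothetical weights

def total_cost(tau):
    """Weighted sum of smoothness, start, and goal costs for a (T, 2) trajectory."""
    diffs = tau[1:] - tau[:-1]
    return (WEIGHTS["smooth"] * 0.5 * np.sum(diffs ** 2)
            + WEIGHTS["start"] * 0.5 * np.sum((tau[0] - START) ** 2)
            + WEIGHTS["goal"] * 0.5 * np.sum((tau[-1] - GOAL) ** 2))

def total_cost_grad(tau):
    """Analytic gradient of total_cost with respect to every waypoint."""
    grad = np.zeros_like(tau)
    diffs = tau[1:] - tau[:-1]
    grad[:-1] -= WEIGHTS["smooth"] * diffs
    grad[1:] += WEIGHTS["smooth"] * diffs
    grad[0] += WEIGHTS["start"] * (tau[0] - START)
    grad[-1] += WEIGHTS["goal"] * (tau[-1] - GOAL)
    return grad

def langevin_trajectory(tau0, n_noisy, n_refine, step, rng):
    """Annealed Langevin dynamics targeting exp(-total_cost), then noise-free steps."""
    tau = tau0.copy()
    for k in range(n_noisy + n_refine):
        # Noise scale decays linearly to zero and stays there during refinement.
        temperature = max(0.0, 0.1 * (1.0 - k / n_noisy))
        noise = temperature * np.sqrt(step) * rng.standard_normal(tau.shape)
        tau = tau - 0.5 * step * total_cost_grad(tau) + noise
    return tau

rng = np.random.default_rng(0)
tau_init = rng.normal(scale=0.5, size=(10, 2))  # random initial trajectory
tau_final = langevin_trajectory(tau_init, n_noisy=750, n_refine=750, step=0.2, rng=rng)
```

No training happens here: the loop only needs gradients of the cost terms, which is why every term has to be differentiable.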
Highlights from experiment section
Smooth function --> a cost function that exposes informative gradients over the entire space. They learn smooth cost functions over the robot's SE(3) workspace, which then serve as task-specific costs.
The resulting models make it possible to move samples to low-cost regions via the inverse diffusion process.
They combine their learned diffusion model for 6D grasp pose generation with other smooth costs (trajectory smoothness, collision avoidance).
- learning smooth cost functions in SE(3) as diffusion models
- A single gradient-based optimization framework for grasp and motion generation
- Picking with occlusions
- Picking and reorienting an object
- Pick and place in shelves
- tangent space: the tangent space at the identity is called the Lie algebra.
- logmap (SE(3) -> R^6) and expmap (R^6 -> SE(3))
Q & A
- What is the main goal of the paper?
- What data format do they use as input?
- What assumptions do they make (object-centric/dense features)?
- What different assumption do we want to make?
- Is it right/wrong/preferable?
- What did they show in figure 1?
- What did they show in figure 2?
- What did they show in exp 1?
- How did they show it?
- Compared to what?
- What is the real world demo?
- Why is this demo difficult? (they did 6 DOF)