# Neural Sparse Voxel Fields

# Abstract

- NSVF defines a set of voxel-bounded implicit fields
- These are organized in a sparse voxel octree

- (NeRF uses a single implicit function for an entire scene)
- NSVF is ~10 times faster than NeRF at inference
- Assign a voxel embedding to each of the eight vertices of a voxel, and aggregate them to obtain a representation of a query point
- sparse voxels containing no scene info will be pruned

What is ray marching?

You just march (walk) along the ray in small steps, sampling the field at each point.
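The idea can be sketched as a loop that steps along the ray at fixed intervals and queries a density function at each sample (a minimal illustration; `sigma_fn`, the step counts, and the toy ball scene are made up here, not from the paper):

```python
import numpy as np

def ray_march(origin, direction, sigma_fn, t_near=0.0, t_far=4.0, n_steps=64):
    """Naive ray marching: walk along the ray at a fixed step size,
    querying a density function at each sample point."""
    ts = np.linspace(t_near, t_far, n_steps)
    points = origin[None, :] + ts[:, None] * direction[None, :]
    return ts, np.array([sigma_fn(p) for p in points])

# Toy example: a "scene" whose density is a unit ball at the origin
ts, sigmas = ray_march(np.array([0.0, 0.0, -2.0]), np.array([0.0, 0.0, 1.0]),
                       lambda p: 1.0 if np.linalg.norm(p) < 1.0 else 0.0)
```

Most samples fall in empty space, which is exactly the inefficiency NSVF attacks.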

# Building block

## Voxel-bounded implicit fields

The scene is modeled as a set of voxel-bounded implicit functions:

$$
(\mathbf{c}, \sigma) = F_\theta^i(g_i(\mathbf{p}), \mathbf{v}), \quad \forall \mathbf{p} \in V_i
$$

- $\mathbf{c}, \sigma$: color and density of the 3D point $\mathbf{p}$
- $\mathbf{v}$: ray direction
- $g_i(\mathbf{p})$: the representation at point $\mathbf{p}$, defined as follows:

$$
g_i(\mathbf{p}) = \zeta\left(\chi\left(\tilde{g}_i(\mathbf{p}^*_1), \ldots, \tilde{g}_i(\mathbf{p}^*_8)\right)\right)
$$

- $\mathbf{p}^*_1, \ldots, \mathbf{p}^*_8 \in \mathbb{R}^3$: the eight vertices of the voxel $V_i$
- $\tilde{g}_i(\mathbf{p}^*_1), \ldots, \tilde{g}_i(\mathbf{p}^*_8) \in \mathbb{R}^d$: feature vectors stored at each vertex
- $\chi$: trilinear interpolation
- $\zeta$: positional encoding
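The interpolation step $\chi$ can be sketched as plain trilinear blending of the eight per-vertex feature vectors (the function name and array layout below are assumptions for illustration):

```python
import numpy as np

def interpolate_voxel_embedding(p, voxel_min, voxel_size, vertex_feats):
    """Trilinear interpolation (the chi above) of the eight per-vertex
    feature vectors, giving a representation of query point p.
    vertex_feats: (2, 2, 2, d) array indexed by the (x, y, z) corner bits."""
    u = (p - voxel_min) / voxel_size  # local coordinates in [0, 1]^3
    out = np.zeros(vertex_feats.shape[-1])
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                # weight of corner (i, j, k) is a product of per-axis factors
                w = ((u[0] if i else 1 - u[0]) *
                     (u[1] if j else 1 - u[1]) *
                     (u[2] if k else 1 - u[2]))
                out += w * vertex_feats[i, j, k]
    return out
```

At a corner this returns exactly that vertex's feature; at the voxel center it returns the mean of all eight.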

**NeRF is a special case of NSVF** (a single voxel covering the whole scene)

## Rendering algorithm for NSVF

NSVF is more efficient than NeRF because there's no need to sample from the empty space!

Rendering is performed in two steps:

- ray-voxel intersection
- ray-marching inside voxels

### Ray-voxel intersection

Apply an axis-aligned bounding box intersection test (AABB test) to each ray

- This checks if a ray intersects with a voxel
- This runs efficiently, especially with a hierarchical octree structure
- Their experiments show 10k~100k sparse voxels are enough for photorealistic results
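The standard way to implement an AABB test is the slab method: intersect the ray with the three pairs of axis-aligned planes and check that the resulting parameter intervals overlap (a sketch; the function name and return convention are my own):

```python
import numpy as np

def ray_aabb_intersect(origin, direction, box_min, box_max):
    """Slab-method AABB test. Returns (hit, t_near, t_far).
    Assumes no exactly-zero direction component (handle via np.where
    or inf-aware logic in production code)."""
    inv_d = 1.0 / direction
    t0 = (box_min - origin) * inv_d  # entry/exit params per axis
    t1 = (box_max - origin) * inv_d
    t_near = np.minimum(t0, t1).max()  # latest entry across the three slabs
    t_far = np.maximum(t0, t1).min()   # earliest exit across the three slabs
    hit = (t_near <= t_far) and (t_far >= 0.0)
    return hit, t_near, t_far
```

In an octree, the same test is applied top-down: if a ray misses a node's box, all its children are skipped at once, which is why the hierarchy makes this cheap.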

### Ray-marching inside Voxels

Volume rendering requires dense samples along the ray in non-empty space.

But sampling only from non-empty space is not trivial...

People have explored several approaches to fix this.

But the NSVF representation explicitly encodes only the non-empty parts!!

They create a set of query points using rejection sampling based on sparse voxels.

Color accumulation is done with the *midpoint rule*.
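The accumulation step can be sketched as front-to-back alpha compositing, with densities and colors evaluated at the midpoints of the ray sections (a sketch of standard volume rendering; the function name and signature are assumptions, not the paper's code):

```python
import numpy as np

def accumulate_color(ts, sigmas, colors):
    """Volume rendering with the midpoint rule.
    ts: (N+1,) section boundaries along the ray;
    sigmas: (N,) densities at section midpoints;
    colors: (N, 3) colors at section midpoints.
    Returns (rendered color, accumulated transparency A)."""
    deltas = np.diff(ts)                      # section lengths
    alphas = 1.0 - np.exp(-sigmas * deltas)   # per-section opacity
    # trans[i] = transmittance before section i; trans[-1] = A for the whole ray
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas)))
    weights = trans[:-1] * alphas
    color = (weights[:, None] * colors).sum(axis=0)
    return color, trans[-1]
```

A fully opaque section returns its own color with transparency near zero; an empty ray returns black with transparency one.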

## Learning

Optimization is end-to-end through back-propagation.

$$
\mathcal{L} = \sum_{\text{rays} \in R} \left\| C - C^* \right\|_2^2 + \lambda \cdot \Omega(A)
$$

- $R$: batch of sampled rays
- $C^*$: ground-truth color of the camera ray
- $\Omega$: beta-distribution regularizer
- $A$: accumulated transparency
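A common form for a beta-distribution regularizer (used e.g. in Neural Volumes) is $\log A + \log(1 - A)$, which pushes $A$ toward 0 or 1. The sketch below assumes that form and a made-up weight `lam`; neither is confirmed by these notes:

```python
import numpy as np

def nsvf_loss(pred_colors, gt_colors, accum_transparency, lam=0.01, eps=1e-6):
    """Training-loss sketch: L2 color error plus a beta-distribution
    regularizer Omega on the accumulated transparency A.
    Omega's exact form and lam are ASSUMPTIONS, not the paper's code."""
    color_loss = np.sum((pred_colors - gt_colors) ** 2, axis=-1).mean()
    A = np.clip(accum_transparency, eps, 1.0 - eps)
    # log A + log(1 - A) peaks at A = 0.5, so minimizing it drives A to 0 or 1
    omega = (np.log(A) + np.log(1.0 - A)).mean()
    return color_loss + lam * omega
```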

### Self-pruning

This removes non-essential voxels during training.

Prune voxel $V_i$ if

$$
\min_{j = 1, \ldots, G} \exp\left(-\sigma(g_i(\mathbf{p}_j))\right) > \gamma
$$

- $V_i$: a voxel ($G = 16^3$ sampled points $\mathbf{p}_j$ in their experiments)
- $\sigma(g_i(\mathbf{p}_j))$: the predicted density at point $\mathbf{p}_j$
- $\gamma$: a threshold (they use $\gamma = 0.5$)
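The check can be sketched as: sample a grid of points inside the voxel and prune it if even the densest sample is nearly transparent (the function name, grid layout, and defaults are my own; only the criterion follows the notes):

```python
import numpy as np

def should_prune(sigma_fn, voxel_min, voxel_size, G=16, gamma=0.5):
    """Self-pruning check: sample G^3 points uniformly inside the voxel
    and prune it if min_j exp(-sigma(p_j)) > gamma, i.e. the voxel
    contains essentially no density anywhere."""
    lin = (np.arange(G) + 0.5) / G            # cell-centered grid in [0, 1]
    xs, ys, zs = np.meshgrid(lin, lin, lin, indexing="ij")
    pts = voxel_min + voxel_size * np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)
    sigmas = np.array([sigma_fn(p) for p in pts])
    return np.exp(-sigmas).min() > gamma
```

An empty voxel (density 0 everywhere gives $\exp(0) = 1 > \gamma$) is pruned; a voxel with high density anywhere is kept.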

### Progressive Training

At certain steps of training, they halve the ray-marching step size

When subdividing, each voxel is split into $2^3 = 8$ sub-voxels of half the side length

They train synthetic scenes with 4 stages, and real scenes with 3 stages.
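The subdivision step can be sketched as follows (a hypothetical helper, not the paper's code):

```python
import numpy as np

def subdivide(voxel_min, voxel_size):
    """Split one voxel into 2^3 = 8 sub-voxels of half the side length,
    as done at each progressive-training stage.
    Returns a list of (sub_voxel_min, sub_voxel_size) pairs."""
    half = voxel_size / 2.0
    return [(voxel_min + half * np.array([i, j, k]), half)
            for i in (0, 1) for j in (0, 1) for k in (0, 1)]
```

In the full method, the vertex features of each new sub-voxel would be initialized by trilinear interpolation from the parent's vertices, so the represented field is unchanged at the moment of subdivision.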