Neural Sparse Voxel Fields
- NSVF defines a set of voxel-bounded implicit fields
- These are organized in a sparse voxel octree
- (NeRF uses a single implicit function for an entire scene)
- NSVF is ~10 times faster than NeRF at inference
- Each voxel stores an embedding at each of its eight vertices; these embeddings are aggregated to obtain the representation of a query point inside the voxel
- Sparse voxels that contain no scene information are pruned during training
What is ray marching? You just march (walk) along the ray in small steps and evaluate the field at each sample point.
Building block
Voxel-bounded implicit fields
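The scene is modeled as a set of voxel-bounded implicit functions $F_\theta^i$, one per voxel $V_i$:
$$F_\theta^i : (g_i(\mathbf{p}), \mathbf{v}) \rightarrow (\mathbf{c}, \sigma), \quad \forall \mathbf{p} \in V_i$$
- $\mathbf{c}, \sigma$: color and density of the 3D point $\mathbf{p}$
- $\mathbf{v}$: ray direction
- $g_i(\mathbf{p})$: the representation at point $\mathbf{p}$ is defined as follows:
$$g_i(\mathbf{p}) = \zeta\left(\chi\left(\tilde{g}_i(\mathbf{p}_1^*), \ldots, \tilde{g}_i(\mathbf{p}_8^*)\right)\right)$$
- $\mathbf{p}_1^*, \ldots, \mathbf{p}_8^*$: the eight vertices of the voxel $V_i$
- $\tilde{g}_i(\mathbf{p}_k^*)$: feature vectors stored at each vertex
- $\chi(\cdot)$: trilinear interpolation
- $\zeta(\cdot)$: positional encoding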
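NeRF is a special case of NSVF: if the feature stored at each vertex is just the vertex coordinate, trilinear interpolation returns the query point itself, so the representation reduces to the positional encoding of the point, which is exactly NeRF's input.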
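To make the representation concrete, here is a minimal sketch of trilinear interpolation of the eight vertex features followed by a NeRF-style positional encoding. The function names, the corner ordering, and `num_freqs=4` are my own assumptions for illustration, not taken from the NSVF code.

```python
import math

def trilinear_interpolate(vertex_features, local):
    """Blend 8 vertex feature vectors at local coordinates (x, y, z) in [0, 1]^3.

    Assumed ordering: vertex k sits at corner offsets
    ((k >> 2) & 1, (k >> 1) & 1, k & 1) along x, y, z.
    """
    x, y, z = local
    dim = len(vertex_features[0])
    out = [0.0] * dim
    for k, feat in enumerate(vertex_features):
        ox, oy, oz = (k >> 2) & 1, (k >> 1) & 1, k & 1
        # Weight is the product of per-axis linear weights.
        w = (x if ox else 1 - x) * (y if oy else 1 - y) * (z if oz else 1 - z)
        for d in range(dim):
            out[d] += w * feat[d]
    return out

def positional_encoding(vec, num_freqs=4):
    """sin/cos of every component at frequencies 2^0*pi, 2^1*pi, ... (assumed form)."""
    enc = []
    for value in vec:
        for i in range(num_freqs):
            angle = (2 ** i) * math.pi * value
            enc.extend((math.sin(angle), math.cos(angle)))
    return enc

def voxel_representation(vertex_features, voxel_min, voxel_size, point):
    """Representation of a query point inside an axis-aligned cubic voxel."""
    local = [(p - m) / voxel_size for p, m in zip(point, voxel_min)]
    return positional_encoding(trilinear_interpolate(vertex_features, local))
```

If the vertex features are set to the corner coordinates themselves, `trilinear_interpolate` returns the local coordinates of the query point, which is the NeRF special case.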
Rendering algorithm for NSVF
NSVF is more efficient than NeRF because there's no need to sample from the empty space! Rendering is performed in two steps:
- ray-voxel intersection
- ray-marching inside voxels
Ray-voxel intersection
Apply Axis Aligned Bounding Box intersection test (AABB-test) for each ray
- This checks if a ray intersects with a voxel
- This runs efficiently esp. for a hierarchical octree structure
- Their experiments show 10k~100k sparse voxels are enough for photorealistic results
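A minimal sketch of the AABB test for a single ray and a single voxel, using the standard slab method. This is a generic version for illustration (the function name and return convention are my own); the paper's implementation runs this over the octree and is not reproduced here.

```python
def ray_aabb_intersect(origin, direction, box_min, box_max):
    """Slab test: return (t_near, t_far) such that origin + t * direction
    lies inside the box for t_near <= t <= t_far, or None if the ray misses.
    Only hits in front of the origin (t >= 0) are reported."""
    t_near, t_far = 0.0, float("inf")
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0.0:
            # Ray is parallel to this pair of planes: it must start between them.
            if o < lo or o > hi:
                return None
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        # Intersect this axis interval with the running interval.
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return None
    return t_near, t_far
```

For example, a ray starting at `(-1, 0.5, 0.5)` and travelling along `+x` enters the unit cube at `t = 1` and leaves at `t = 2`; ray marching then only needs samples in that interval.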
Ray-marching inside Voxels
Volume rendering requires dense samples along the ray in non-empty space, but sampling only from non-empty space is not trivial. People have explored several approaches to this problem. The NSVF representation sidesteps it, because the sparse voxels explicitly cover only the non-empty parts of the scene.
They create a set of query points using rejection sampling based on sparse voxels. Color accumulation is done with the midpoint rule.
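A rough sketch of the color accumulation, assuming the usual volume-rendering discretization: each segment along the ray is summarized by the density and color at its midpoint, and its transparency is `exp(-density * length)`. The names are hypothetical, and the sampling of the midpoints themselves is left out.

```python
import math

def accumulate_color(densities, colors, step_sizes):
    """Accumulate color front to back along one ray.

    densities[i], colors[i]: field values at the midpoint of segment i
    step_sizes[i]: length of segment i
    Returns (accumulated_color, accumulated_transparency).
    """
    transmittance = 1.0          # fraction of light that survives so far
    color = [0.0, 0.0, 0.0]
    for sigma, c, delta in zip(densities, colors, step_sizes):
        alpha = math.exp(-sigma * delta)        # transparency of this segment
        weight = transmittance * (1.0 - alpha)  # light absorbed by this segment
        color = [acc + weight * ch for acc, ch in zip(color, c)]
        transmittance *= alpha
    return color, transmittance
```

The second return value is the accumulated transparency of the whole ray, which is the quantity the regularizer in the training loss acts on.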
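Optimization is end-to-end through back-propagation, minimizing the loss between the rendered color $C$ and the ground truth:
$$\mathcal{L} = \sum_{(\mathbf{p}_0, \mathbf{v}) \in R} \left\| C(\mathbf{p}_0, \mathbf{v}) - C^*(\mathbf{p}_0, \mathbf{v}) \right\|_2^2 + \lambda \cdot \Omega\left(A(\mathbf{p}_0, \mathbf{v})\right)$$
- $R$: batch of sampled rays
- $C^*$: ground-truth color of the camera ray
- $\Omega(\cdot)$: beta-distribution regularizer
- $A$: accumulated transparency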
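Self-pruning removes non-essential voxels during training. A voxel $V_i$ is pruned if
$$\min_{j = 1, \ldots, G} \exp\left(-\sigma(g_i(\mathbf{p}_j))\right) > \gamma, \quad \mathbf{p}_j \in V_i$$
- $\{\mathbf{p}_j\}_{j=1}^{G}$: points sampled uniformly inside a voxel $V_i$ ($G = 16^3$ in their experiments)
- $\sigma(g_i(\mathbf{p}_j))$ is the predicted density at point $\mathbf{p}_j$
- $\gamma$: a threshold (they use $\gamma = 0.5$)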
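The pruning rule can be sketched as follows: sample a regular grid of points inside the voxel, turn each predicted density into a transparency with `exp(-density)`, and prune only if even the most opaque sample is still more transparent than the threshold. `density_fn` is a hypothetical stand-in for the trained field, and the defaults (threshold 0.5, 16 samples per axis) are the values as I recall them from the paper, so treat them as assumptions.

```python
import math

def grid_points(voxel_min, voxel_size, n):
    """Centers of an n x n x n regular grid inside an axis-aligned cubic voxel."""
    step = voxel_size / n
    return [
        (voxel_min[0] + (i + 0.5) * step,
         voxel_min[1] + (j + 0.5) * step,
         voxel_min[2] + (k + 0.5) * step)
        for i in range(n) for j in range(n) for k in range(n)
    ]

def should_prune(density_fn, voxel_min, voxel_size, gamma=0.5, n=16):
    """True if every sampled point in the voxel is nearly transparent."""
    transparencies = (
        math.exp(-density_fn(p)) for p in grid_points(voxel_min, voxel_size, n)
    )
    return min(transparencies) > gamma
```

A voxel with uniformly low density is pruned, while a voxel with a single dense region is kept, because the minimum transparency drops below the threshold.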
Progressive Training
At certain steps of training, they halve the ray-marching step size $\tau$ and the voxel size $l$. Each voxel is subdivided into $2^3$ sub-voxels, and the feature representations of the new vertices are initialized by trilinear interpolation of the original eight vertex features. They train synthetic scenes with 4 stages and real scenes with 3 stages.
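A minimal sketch of one subdivision step for a single voxel, assuming cubic axis-aligned voxels and the corner ordering noted in the comments. For simplicity each child gets its own copy of its eight vertex features; a real implementation would presumably share vertices between neighboring voxels. All names here are hypothetical.

```python
def trilinear(vertex_features, local):
    """Trilinear blend of 8 vertex features at local (x, y, z) in [0, 1]^3.
    Vertex k sits at corner ((k >> 2) & 1, (k >> 1) & 1, k & 1)."""
    x, y, z = local
    out = [0.0] * len(vertex_features[0])
    for k, feat in enumerate(vertex_features):
        ox, oy, oz = (k >> 2) & 1, (k >> 1) & 1, k & 1
        w = (x if ox else 1 - x) * (y if oy else 1 - y) * (z if oz else 1 - z)
        for d, value in enumerate(feat):
            out[d] += w * value
    return out

def subdivide(voxel_min, voxel_size, vertex_features):
    """Split one voxel into 8 children of half the size.

    Returns a list of (child_min, child_size, child_vertex_features), where
    each child vertex feature is interpolated from the parent's 8 features.
    """
    half = voxel_size / 2
    children = []
    for k in range(8):
        offset = ((k >> 2) & 1, (k >> 1) & 1, k & 1)
        child_min = tuple(m + o * half for m, o in zip(voxel_min, offset))
        feats = []
        for c in range(8):
            corner = ((c >> 2) & 1, (c >> 1) & 1, c & 1)
            # Position of this child corner inside the parent's unit cube.
            local = tuple((o + cc) / 2 for o, cc in zip(offset, corner))
            feats.append(trilinear(vertex_features, local))
        children.append((child_min, half, feats))
    return children
```

Because the new features are interpolated from the old ones, the field represented right after subdivision matches the field before it at every new vertex, so training can continue from the same state at a finer resolution.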