Neural Sparse Voxel Fields

Abstract

  • NSVF defines a set of voxel-bounded implicit fields
    • These are organized in a sparse voxel octree
  • (NeRF uses a single implicit function for an entire scene)
  • NSVF is ~10 times faster than NeRF at inference
  • Assign a feature embedding to each of the eight vertices of a voxel and aggregate them to obtain the representation of a query point
  • Sparse voxels containing no scene information are pruned during training

What is ray marching? You just march (walk) along the ray, querying the field at regular steps.
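To make the walk concrete, here is a minimal NumPy sketch of generating query points along a ray with a fixed step size; the function name and arguments are illustrative, not from the paper.

```python
import numpy as np

def ray_march_points(origin, direction, t_near, t_far, tau):
    """Walk along the ray p(t) = origin + t * direction with step size tau."""
    ts = np.arange(t_near, t_far, tau)        # depths of the marching steps
    return origin + ts[:, None] * direction   # (num_steps, 3) query points
```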

Building block

Voxel-bounded implicit fields

The scene is modeled as a set of voxel-bounded implicit functions:

$$F_\theta^i: (g_i(\mathbf{p}), \mathbf{v}) \rightarrow (\mathbf{c}, \sigma), \quad \forall \mathbf{p} \in V_i$$
  • $\mathbf{c}, \sigma$: color and density of the 3D point $\mathbf{p}$
  • $\mathbf{v}$: ray direction
  • $g_i(\mathbf{p})$: the representation at point $\mathbf{p}$, defined as follows (a sketch follows this list):
$$g_i(\mathbf{p}) = \zeta(\chi(\tilde{g}_i(\mathbf{p}^*_1), \ldots, \tilde{g}_i(\mathbf{p}^*_8)))$$
  • $\mathbf{p}^*_1, \ldots, \mathbf{p}^*_8 \in \mathbb{R}^3$: the eight vertices of the voxel $V_i$
  • $\tilde{g}_i(\mathbf{p}^*_1), \ldots, \tilde{g}_i(\mathbf{p}^*_8) \in \mathbb{R}^d$: feature vectors stored at each vertex
  • $\chi$: trilinear interpolation
  • $\zeta$: positional encoding
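A minimal NumPy sketch of $g_i(\mathbf{p})$ under the definitions above: trilinear interpolation ($\chi$) of the eight vertex embeddings followed by positional encoding ($\zeta$). The function names, feature dimension, and number of frequency bands are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def positional_encoding(x, num_freqs=6):
    """zeta: append sin/cos frequency bands to each feature dimension."""
    out = [x]
    for k in range(num_freqs):
        out.append(np.sin(2.0 ** k * x))
        out.append(np.cos(2.0 ** k * x))
    return np.concatenate(out, axis=-1)

def voxel_feature(p, voxel_min, voxel_size, vertex_feats):
    """g_i(p): trilinear interpolation (chi) of the eight vertex embeddings,
    then positional encoding (zeta).

    vertex_feats: (2, 2, 2, d) embeddings stored at the voxel corners.
    """
    tx, ty, tz = (p - voxel_min) / voxel_size      # local coords in [0, 1]^3
    feat = np.zeros(vertex_feats.shape[-1])
    for ix in (0, 1):
        for iy in (0, 1):
            for iz in (0, 1):
                w = ((tx if ix else 1 - tx) *
                     (ty if iy else 1 - ty) *
                     (tz if iz else 1 - tz))
                feat += w * vertex_feats[ix, iy, iz]
    return positional_encoding(feat)
```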

NeRF is a special case of NSVF: with a single voxel covering the whole scene and $g_i(\mathbf{p})$ reduced to the positional encoding of $\mathbf{p}$ itself, NSVF degenerates to NeRF.

Rendering algorithm for NSVF

NSVF is more efficient than NeRF because there's no need to sample from empty space! Rendering is performed in two steps:

  1. ray-voxel intersection
  2. ray-marching inside voxels

![sampling](/media/posts/nsvf/nsvf-sampling.png =600x) ![pipeline](/media/posts/nsvf/nsvf-pipeline.png =600x)

Ray-voxel intersection

Apply the axis-aligned bounding box intersection test (AABB test) to each ray (a sketch follows this list):

  • This checks whether a ray intersects a voxel
  • It runs efficiently, especially with a hierarchical octree structure
  • Their experiments show 10k~100k sparse voxels are enough for photorealistic results
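A minimal NumPy sketch of the standard slab-method AABB test for a single ray and box; NSVF applies this hierarchically over the octree, which is not shown here.

```python
import numpy as np

def ray_aabb_intersect(origin, direction, box_min, box_max):
    """Slab-method AABB test: returns (hit, t_near, t_far) for one ray."""
    inv_d = 1.0 / direction                   # assumes no zero components
    t0 = (box_min - origin) * inv_d
    t1 = (box_max - origin) * inv_d
    t_near = np.max(np.minimum(t0, t1))       # latest entry across the slabs
    t_far = np.min(np.maximum(t0, t1))        # earliest exit across the slabs
    hit = (t_near <= t_far) and (t_far >= 0.0)
    return hit, t_near, t_far
```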

Ray-marching inside Voxels

Volume rendering requires dense samples along the ray in non-empty space. But sampling only from non-empty space is not trivial... People have explored several approaches to fix this. The NSVF representation, however, explicitly encodes only the non-empty parts!

They create a set of query points using rejection sampling based on the sparse voxels, and accumulate color along each ray with the midpoint rule (sketched below).
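A minimal NumPy sketch of midpoint-rule color accumulation along one ray, assuming the sample depths already lie inside non-empty voxels (i.e., rejection sampling has been done); array shapes and names are illustrative.

```python
import numpy as np

def render_ray(ts, sigmas, colors):
    """Accumulate color along a ray with the midpoint rule.

    ts:     (N+1,) sorted sample depths inside non-empty voxels
    sigmas: (N,)   densities evaluated at the interval midpoints
    colors: (N, 3) colors evaluated at the interval midpoints
    """
    deltas = ts[1:] - ts[:-1]                      # interval lengths
    alphas = 1.0 - np.exp(-sigmas * deltas)        # per-interval opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas                       # contribution of each interval
    rgb = (weights[:, None] * colors).sum(axis=0)
    A = trans[-1] * (1.0 - alphas[-1])             # accumulated transparency
    return rgb, A
```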

Learning

Optimization is end-to-end through back-propagation.

$$\mathcal{L} = \sum_{(\mathbf{p}_0, \mathbf{v}) \in R} \| C(\mathbf{p}_0, \mathbf{v}) - C^*(\mathbf{p}_0, \mathbf{v}) \|^2_2 + \lambda \, \Omega(A(\mathbf{p}_0, \mathbf{v}))$$
  • $R$: batch of sampled rays
  • $C^*$: ground-truth color of the camera ray
  • $\Omega$: beta-distribution regularizer
  • $A$: accumulated transparency
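A minimal NumPy sketch of this loss. The form of $\Omega$ below, $\log A + \log(1 - A)$, follows the beta-distribution prior of Neural Volumes (Lombardi et al., 2019) and is my assumption here, as is the value of `lam`; it pushes each ray toward being fully opaque or fully transparent.

```python
import numpy as np

def nsvf_loss(pred_rgb, gt_rgb, A, lam=1e-3, eps=1e-6):
    """L2 color loss plus a beta-distribution regularizer on transparency A.

    pred_rgb, gt_rgb: (num_rays, 3) predicted / ground-truth colors
    A:                (num_rays,)   accumulated transparency per ray
    lam, eps:         illustrative hyperparameter and numerical guard
    """
    color_loss = np.sum((pred_rgb - gt_rgb) ** 2, axis=-1)
    A = np.clip(A, eps, 1.0 - eps)
    beta_reg = np.log(A) + np.log(1.0 - A)       # maximal at A = 0.5, so
    return np.mean(color_loss + lam * beta_reg)  # minimizing pushes A to 0 or 1
```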

Self-pruning

This removes non-essential voxels during training (a sketch follows the list below).

$$V_i \text{ is pruned if } \min_{j=1,\ldots,G} \exp(-\sigma(g_i(\mathbf{p}_j))) > \gamma, \quad \mathbf{p}_j \in V_i, \; V_i \in \mathcal{V}$$
  • $V_i$: a voxel; $\mathbf{p}_j$ are $G$ points sampled inside it ($G = 16^3$ in their experiments)
  • $\sigma(g_i(\mathbf{p}_j))$: the predicted density at point $\mathbf{p}_j$
  • $\gamma$: a threshold (they use $\gamma = 0.5$)
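A minimal NumPy sketch of the pruning check under this rule; `density_fn` is an assumed interface standing in for $\sigma(g_i(\cdot))$.

```python
import numpy as np

def should_prune(voxel_min, voxel_size, density_fn, n_per_axis=16, gamma=0.5):
    """Prune V_i if even its densest sampled point is nearly transparent.

    density_fn: callable mapping (G, 3) points to (G,) densities sigma(g_i(p)).
    Samples G = n_per_axis**3 points on a uniform grid inside the voxel.
    """
    lin = (np.arange(n_per_axis) + 0.5) / n_per_axis      # cell-centered grid
    xs, ys, zs = np.meshgrid(lin, lin, lin, indexing="ij")
    pts = voxel_min + voxel_size * np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)
    transparency = np.exp(-density_fn(pts))               # exp(-sigma) per point
    return transparency.min() > gamma                     # all points ~empty
```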

Progressive Training

At certain steps of training, they halve the ray-marching step size $\tau$ and the voxel size $l$: each voxel is subdivided into $2^3$ sub-voxels, and the feature representations of the new vertices are initialized by trilinear interpolation of the original eight vertices (sketched below). They train synthetic scenes with 4 stages and real scenes with 3 stages.
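A minimal NumPy sketch of the subdivision step: the new $3 \times 3 \times 3$ vertex grid shared by the $2^3$ sub-voxels is trilinearly interpolated from the original corner features. Shapes and names are illustrative.

```python
import numpy as np

def subdivide_vertex_feats(vertex_feats):
    """Initialize the 3x3x3 vertex grid of the 2^3 sub-voxels by trilinear
    interpolation of the original (2, 2, 2, d) corner features."""
    out = np.zeros((3, 3, 3) + vertex_feats.shape[3:])
    for ix in range(3):
        for iy in range(3):
            for iz in range(3):
                tx, ty, tz = ix / 2.0, iy / 2.0, iz / 2.0   # local coords
                f = np.zeros(vertex_feats.shape[3:])
                for cx in (0, 1):
                    for cy in (0, 1):
                        for cz in (0, 1):
                            w = ((tx if cx else 1 - tx) *
                                 (ty if cy else 1 - ty) *
                                 (tz if cz else 1 - tz))
                            f += w * vertex_feats[cx, cy, cz]
                out[ix, iy, iz] = f
    return out
```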