# Neural Sparse Voxel Fields

# Abstract

- NSVF defines a set of voxel-bounded implicit fields
- These are organized in a sparse voxel octree

- (NeRF uses a single implicit function for an entire scene)
- NSVF is ~10 times faster than NeRF at inference
- Assign a voxel embedding to each of the eight vertices of a voxel, and aggregate them to obtain a representation of a query point
- sparse voxels containing no scene info will be pruned

What is ray marching?

You just march (walk) along the ray in small steps, sampling the field at each point.
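The idea can be sketched as a loop that steps along the ray at fixed intervals and queries a density function at each sample (a minimal illustration; `sigma_fn`, the step counts, and the toy ball scene are made up here, not from the paper):

```python
import numpy as np

def ray_march(origin, direction, sigma_fn, t_near=0.0, t_far=4.0, n_steps=64):
    """Naive ray marching: walk along the ray at a fixed step size,
    querying a density function at each sample point."""
    ts = np.linspace(t_near, t_far, n_steps)
    points = origin[None, :] + ts[:, None] * direction[None, :]
    return ts, np.array([sigma_fn(p) for p in points])

# Toy example: a "scene" whose density is a unit ball at the origin
ts, sigmas = ray_march(np.array([0.0, 0.0, -2.0]), np.array([0.0, 0.0, 1.0]),
                       lambda p: 1.0 if np.linalg.norm(p) < 1.0 else 0.0)
```

Most samples fall in empty space, which is exactly the inefficiency NSVF attacks.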

# Building block

## Voxel-bounded implicit fields

The scene is modeled as a set of voxel-bounded implicit functions:

$$
(\mathbf{c}, \sigma) = F_\theta^i(g_i(\mathbf{p}), \mathbf{v}), \quad \forall \mathbf{p} \in V_i
$$

- $\mathbf{c}, \sigma$: color and density of the 3D point $\mathbf{p}$
- $\mathbf{v}$: ray direction
- $g_i(\mathbf{p})$: the representation at point $\mathbf{p}$, defined as follows:

$$
g_i(\mathbf{p}) = \zeta\left(\chi\left(\tilde{g}_i(\mathbf{p}^*_1), \ldots, \tilde{g}_i(\mathbf{p}^*_8)\right)\right)
$$

- $\mathbf{p}^*_1, \ldots, \mathbf{p}^*_8 \in \mathbb{R}^3$: the eight vertices of the voxel $V_i$
- $\tilde{g}_i(\mathbf{p}^*_1), \ldots, \tilde{g}_i(\mathbf{p}^*_8) \in \mathbb{R}^d$: feature vectors stored at each vertex
- $\chi$: trilinear interpolation
- $\zeta$: positional encoding
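The interpolation step $\chi$ can be sketched as plain trilinear blending of the eight per-vertex feature vectors (the function name and array layout below are assumptions for illustration):

```python
import numpy as np

def interpolate_voxel_embedding(p, voxel_min, voxel_size, vertex_feats):
    """Trilinear interpolation (the chi above) of the eight per-vertex
    feature vectors, giving a representation of query point p.
    vertex_feats: (2, 2, 2, d) array indexed by the (x, y, z) corner bits."""
    u = (p - voxel_min) / voxel_size  # local coordinates in [0, 1]^3
    out = np.zeros(vertex_feats.shape[-1])
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                # weight of corner (i, j, k) is a product of per-axis factors
                w = ((u[0] if i else 1 - u[0]) *
                     (u[1] if j else 1 - u[1]) *
                     (u[2] if k else 1 - u[2]))
                out += w * vertex_feats[i, j, k]
    return out
```

At a corner this returns exactly that vertex's feature; at the voxel center it returns the mean of all eight.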

**NeRF is a special case of NSVF** (a single voxel covering the whole scene)

## Rendering algorithm for NSVF

NSVF is more efficient than NeRF because there's no need to sample from the empty space!

Rendering is performed in two steps:

- ray-voxel intersection
- ray-marching inside voxels

### Ray-voxel intersection

Apply an axis-aligned bounding box intersection test (AABB test) to each ray

- This checks if a ray intersects with a voxel
- This runs efficiently, especially with a hierarchical octree structure
- Their experiments show 10k~100k sparse voxels are enough for photorealistic results
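The standard way to implement an AABB test is the slab method: intersect the ray with the three pairs of axis-aligned planes and check that the resulting parameter intervals overlap (a sketch; the function name and return convention are my own):

```python
import numpy as np

def ray_aabb_intersect(origin, direction, box_min, box_max):
    """Slab-method AABB test. Returns (hit, t_near, t_far).
    Assumes no exactly-zero direction component (handle via np.where
    or inf-aware logic in production code)."""
    inv_d = 1.0 / direction
    t0 = (box_min - origin) * inv_d  # entry/exit params per axis
    t1 = (box_max - origin) * inv_d
    t_near = np.minimum(t0, t1).max()  # latest entry across the three slabs
    t_far = np.maximum(t0, t1).min()   # earliest exit across the three slabs
    hit = (t_near <= t_far) and (t_far >= 0.0)
    return hit, t_near, t_far
```

In an octree, the same test is applied top-down: if a ray misses a node's box, all its children are skipped at once, which is why the hierarchy makes this cheap.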

### Ray-marching inside Voxels

Volume rendering requires dense samples along the ray in non-empty space.

But sampling only from non-empty space is not trivial...

People have explored several approaches to fix this.

But the NSVF representation explicitly encodes only the non-empty parts!!

They create a set of query points using rejection sampling based on sparse voxels.

Color accumulation is done with the *midpoint rule*.
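The accumulation step can be sketched as front-to-back alpha compositing, with densities and colors evaluated at the midpoints of the ray sections (a sketch of standard volume rendering; the function name and signature are assumptions, not the paper's code):

```python
import numpy as np

def accumulate_color(ts, sigmas, colors):
    """Volume rendering with the midpoint rule.
    ts: (N+1,) section boundaries along the ray;
    sigmas: (N,) densities at section midpoints;
    colors: (N, 3) colors at section midpoints.
    Returns (rendered color, accumulated transparency A)."""
    deltas = np.diff(ts)                      # section lengths
    alphas = 1.0 - np.exp(-sigmas * deltas)   # per-section opacity
    # trans[i] = transmittance before section i; trans[-1] = A for the whole ray
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas)))
    weights = trans[:-1] * alphas
    color = (weights[:, None] * colors).sum(axis=0)
    return color, trans[-1]
```

A fully opaque section returns its own color with transparency near zero; an empty ray returns black with transparency one.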

## Learning

Optimization is end-to-end through back-propagation.

$$
\mathcal{L} = \sum_{\text{rays} \in R} \left\| C - C^* \right\|_2^2 + \lambda \cdot \Omega(A)
$$

- $R$: batch of sampled rays
- $C^*$: ground-truth color of the camera ray
- $\Omega$: beta-distribution regularizer
- $A$: accumulated transparency
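A common form for a beta-distribution regularizer (used e.g. in Neural Volumes) is $\log A + \log(1 - A)$, which pushes $A$ toward 0 or 1. The sketch below assumes that form and a made-up weight `lam`; neither is confirmed by these notes:

```python
import numpy as np

def nsvf_loss(pred_colors, gt_colors, accum_transparency, lam=0.01, eps=1e-6):
    """Training-loss sketch: L2 color error plus a beta-distribution
    regularizer Omega on the accumulated transparency A.
    Omega's exact form and lam are ASSUMPTIONS, not the paper's code."""
    color_loss = np.sum((pred_colors - gt_colors) ** 2, axis=-1).mean()
    A = np.clip(accum_transparency, eps, 1.0 - eps)
    # log A + log(1 - A) peaks at A = 0.5, so minimizing it drives A to 0 or 1
    omega = (np.log(A) + np.log(1.0 - A)).mean()
    return color_loss + lam * omega
```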

### Self-pruning

This removes non-essential voxels during training.

Prune voxel $V_i$ if

$$
\min_{j = 1, \ldots, G} \exp\left(-\sigma(g_i(\mathbf{p}_j))\right) > \gamma
$$

- $V_i$: a voxel ($G = 16^3$ sampled points $\mathbf{p}_j$ in their experiments)
- $\sigma(g_i(\mathbf{p}_j))$: the predicted density at point $\mathbf{p}_j$
- $\gamma$: a threshold (they use $\gamma = 0.5$)
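The check can be sketched as: sample a grid of points inside the voxel and prune it if even the densest sample is nearly transparent (the function name, grid layout, and defaults are my own; only the criterion follows the notes):

```python
import numpy as np

def should_prune(sigma_fn, voxel_min, voxel_size, G=16, gamma=0.5):
    """Self-pruning check: sample G^3 points uniformly inside the voxel
    and prune it if min_j exp(-sigma(p_j)) > gamma, i.e. the voxel
    contains essentially no density anywhere."""
    lin = (np.arange(G) + 0.5) / G            # cell-centered grid in [0, 1]
    xs, ys, zs = np.meshgrid(lin, lin, lin, indexing="ij")
    pts = voxel_min + voxel_size * np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)
    sigmas = np.array([sigma_fn(p) for p in pts])
    return np.exp(-sigmas).min() > gamma
```

An empty voxel (density 0 everywhere gives $\exp(0) = 1 > \gamma$) is pruned; a voxel with high density anywhere is kept.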

### Progressive Training

At certain steps of training, they halve the ray-marching step size

When subdividing, each voxel is split into $2^3 = 8$ sub-voxels of half the side length

They train synthetic scenes with 4 stages, and real scenes with 3 stages.
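The subdivision step can be sketched as follows (a hypothetical helper, not the paper's code):

```python
import numpy as np

def subdivide(voxel_min, voxel_size):
    """Split one voxel into 2^3 = 8 sub-voxels of half the side length,
    as done at each progressive-training stage.
    Returns a list of (sub_voxel_min, sub_voxel_size) pairs."""
    half = voxel_size / 2.0
    return [(voxel_min + half * np.array([i, j, k]), half)
            for i in (0, 1) for j in (0, 1) for k in (0, 1)]
```

In the full method, the vertex features of each new sub-voxel would be initialized by trilinear interpolation from the parent's vertices, so the represented field is unchanged at the moment of subdivision.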