NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

Apr 17, 2022

Paper: https://arxiv.org/abs/2003.08934

Input / Output

$\{(x, y, z), (\theta, \phi)\} \rightarrow \text{NeRF Network} \rightarrow \{(r, g, b), \sigma\}$

$x, y, z$ : 3D location
$\theta, \phi$ : Viewing direction
$r, g, b$ : emitted color
$\sigma$ : volume density (much like opacity)
- This depends only on $(x, y, z)$, which keeps the geometry consistent across viewpoints

Rendering (Basics)

Rendering is straightforward: (classic) volume rendering

$$
C(\mathbf{r}) = \int_{t_\text{near}}^{t_\text{far}} T(t) \cdot {\color{blue} \sigma(\mathbf{r}(t))} \cdot {\color{green} \mathbf{c}(\mathbf{r}(t), \mathbf{d})}~dt
$$

$\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ : the camera ray with origin $\mathbf{o}$ and direction $\mathbf{d}$
${\color{blue} \sigma(\mathbf{r}(t))}$ : volume density at position $\mathbf{r}(t)$
${\color{green} \mathbf{c}(\mathbf{r}(t), \mathbf{d})}$ : the color at position $\mathbf{r}(t)$ viewed from direction $\mathbf{d}$
$T(t)$ : the accumulated transmittance, i.e., the probability that the ray travels from $t_\text{near}$ to $t$ without hitting anything : $~~~T(t) = \exp \left(-\int_{t_\text{near}}^{t} \sigma(\mathbf{r}(s))\,ds\right)$
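Numerically, the integral is evaluated on a finite set of samples along the ray. The following is a minimal sketch of that idea in plain Python, using the discrete rule $w_i = T_i \cdot (1 - \exp(-\sigma_i \delta_i))$ that these notes use later for the coarse rendering. The function name `render_ray` and the choice to close the last interval at `t_far` are assumptions made for illustration, not the authors' implementation.

```python
import math

def render_ray(sigmas, colors, ts, t_far):
    """Approximate C(r) as sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i.

    sigmas: densities at the sample points
    colors: (r, g, b) tuples at the sample points
    ts:     sorted sample depths along the ray
    t_far:  end of the ray segment (assumption: it closes the last interval)
    """
    pixel = [0.0, 0.0, 0.0]
    weights = []
    transmittance = 1.0  # T_1 = 1: nothing has blocked the ray yet
    for i, (sigma, color) in enumerate(zip(sigmas, colors)):
        t_next = ts[i + 1] if i + 1 < len(ts) else t_far
        delta = t_next - ts[i]                   # distance to the next sample
        alpha = 1.0 - math.exp(-sigma * delta)   # chance the ray stops in this interval
        w = transmittance * alpha
        weights.append(w)
        for k in range(3):
            pixel[k] += w * color[k]
        transmittance *= 1.0 - alpha             # T_{i+1} = T_i * exp(-sigma_i * delta_i)
    return pixel, weights
```

With empty space (zero density) in front of a very dense red sample, nearly all of the weight lands on the red sample and everything behind it is hidden.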

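Approximating this integral with a sum over a fixed set of discrete bins limits the resolution of the representation, because the network would only ever be queried at the same fixed locations.

In practice, they use a hierarchical version of stratified sampling.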

stratified sampling: Helps to simulate smoother integration than relying on discrete uniform bins.
hierarchical: Helps to allocate more samples to the region that affects rendering (i.e., avoid sampling a lot from empty space!)

![method](/media/posts/nerf/method.png =900x)

:::details Stratified Sampling
It partitions $[t_n, t_f]$ into $N$ evenly-spaced bins and then draws one sample uniformly at random from within each bin:

$$
t_i \sim \mathcal{U}\left[t_n + \frac{i-1}{N} (t_f - t_n),~t_n + \frac{i}{N} (t_f - t_n)\right]
$$

Because the sample positions change at every iteration, the network ends up being evaluated at continuous positions over the course of training.
:::
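The stratified sampling rule above can be sketched in a few lines of plain Python. The function name `stratified_samples` and the optional `rng` argument are assumptions made for illustration.

```python
import random

def stratified_samples(t_near, t_far, n_bins, rng=random):
    """Draw one depth uniformly at random from each of n_bins equal-width bins."""
    width = (t_far - t_near) / n_bins
    # Bin i (0-based) covers [t_near + i * width, t_near + (i + 1) * width).
    return [t_near + (i + rng.random()) * width for i in range(n_bins)]
```

Calling `stratified_samples(2.0, 6.0, 64)` twice gives two different sets of 64 sorted depths, each with exactly one depth per bin.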

Key ideas / components

Naively training the network with the above idea doesn't work well. The key ideas are:

Encouraging the representation to be multiview consistent
- restricting the network to predict $\sigma$ as a function of only the location $x$
- $c$ is predicted as a function of both location and viewing direction
Positional encoding
- It is well known that (sinusoidal) positional encoding helps neural networks fit high-frequency signals
Hierarchical sampling procedure
- Details below
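As a sketch of the positional encoding mentioned above: the paper maps each input coordinate $p$ to $\gamma(p) = (\sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p))$, with $L = 10$ for location coordinates. The function names below are assumptions made for illustration, and the code uses plain Python lists rather than any particular tensor library.

```python
import math

def positional_encoding(p, num_freqs):
    """Map a scalar p to (sin(2^k * pi * p), cos(2^k * pi * p)) for k = 0 .. num_freqs - 1."""
    out = []
    for k in range(num_freqs):
        angle = (2.0 ** k) * math.pi * p
        out.extend((math.sin(angle), math.cos(angle)))
    return out

def encode_point(xyz, num_freqs=10):
    """Encode each coordinate separately and concatenate the results."""
    return [v for coord in xyz for v in positional_encoding(coord, num_freqs)]
```

With `num_freqs=10`, a 3D location becomes a 60-dimensional vector (3 coordinates, 10 frequencies, sine and cosine for each).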

Minor:

Viewing direction $(\theta, \phi)$ is concatenated to the feature vector in a middle layer of the network

Training NeRF: Hierarchical sampling

Hierarchical sampling allocates more samples to the region that affects final rendering.

They simultaneously optimize two networks: a coarse one and a fine one.

Sample a set of $N_c$ locations along the ray $\mathbf{r}$ using stratified sampling
- $r_1 \ldots r_{N_c}$
Evaluate the coarse network at these locations:
- $r_i \rightarrow \text{NeRF Network (Coarse)} \rightarrow \{c_i, \sigma_i\}$
Compute coarse rendering based on the samples:
- $\hat{C}_\text{coarse}(\mathbf{r}) = \sum_{i=1}^{N_c}w_i c_i,~~w_i = T_i \cdot (1 - \exp(-\sigma_i \delta_i))$
- $\delta_i = t_{i+1} - t_i$ : the distance between adjacent samples
- $T_i = \exp(-\sum_{j=1}^{i-1} \sigma_j \delta_j)$ : the probability that the ray reaches sample $i$
- $\sigma_i$ : volume density at sample $i$ (much like opacity)
Normalize the above weights to form a piecewise-constant PDF along the ray, and sample a second set of $N_f$ locations from this distribution.
- $r'_1 \ldots r'_{N_f}$
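Evaluate the fine network at all $N_c + N_f$ locations:
- $r'_i \rightarrow \text{NeRF Network (Fine)} \rightarrow \{c'_i, \sigma'_i\}$
- $r_i \rightarrow \text{NeRF Network (Fine)} \rightarrow \{c_i, \sigma_i\}$
Compute the final rendered color using all $N_c + N_f$ samples
- $\hat{C}_\text{fine}(\mathbf{r}) = \sum_{i=1}^{N_c}w_i c_i + \sum_{i=1}^{N_f}w'_i c'_i$
- Here the colors, densities, and weights all come from the fine network, with $T_i$ and $\delta_i$ recomputed over the sorted union of both sample sets (they are not the coarse values from the earlier step)
- Notice that the second set of $N_f$ samples is biased towards regions with larger coarse weights $w_i$, i.e., regions expected to contain visible content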
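The resampling step (normalize the coarse weights into a piecewise-constant PDF, then draw $N_f$ new locations) can be sketched with inverse transform sampling. This is a minimal plain-Python sketch under stated assumptions: the name `sample_fine`, the explicit `bin_edges` argument, and the small `eps` padding are all hypothetical choices for illustration, not the authors' code.

```python
import bisect
import random

def sample_fine(bin_edges, weights, n_fine, rng=random):
    """Draw n_fine depths from the piecewise-constant PDF given by weights.

    bin_edges: len(weights) + 1 sorted depths; weights[i] belongs to the
               interval [bin_edges[i], bin_edges[i + 1]].
    """
    eps = 1e-5  # assumption: guards against an all-zero weight vector
    padded = [w + eps for w in weights]
    total = sum(padded)
    cdf = [0.0]
    for w in padded:
        cdf.append(cdf[-1] + w / total)  # running sum of the normalized weights
    samples = []
    for _ in range(n_fine):
        u = rng.random()
        # Find the bin whose CDF range contains u (clamped against rounding).
        i = min(bisect.bisect_right(cdf, u) - 1, len(padded) - 1)
        frac = (u - cdf[i]) / (cdf[i + 1] - cdf[i])
        # The PDF is constant inside a bin, so interpolate linearly within it.
        samples.append(bin_edges[i] + frac * (bin_edges[i + 1] - bin_edges[i]))
    return sorted(samples)
```

If one bin carries almost all of the coarse weight, almost all of the new samples fall inside that bin, which is how the extra samples end up where they matter for the final rendering.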

Training loss

$\mathcal{L} = \sum_{\mathbf{r}}[\|\hat{C}_\text{coarse}(\mathbf{r}) - C(\mathbf{r})\|^2_2 + \|\hat{C}_\text{fine}(\mathbf{r}) - C(\mathbf{r})\|^2_2]$
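The loss above is a sum of two squared errors per ray. The coarse term is kept because the coarse network's weights decide where the fine samples are placed. A minimal sketch in plain Python follows. The function names and the representation of a batch as a list of `(coarse, fine, true)` color tuples are assumptions for illustration.

```python
def squared_error(a, b):
    """Squared L2 distance between two (r, g, b) colors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ray_loss(c_coarse, c_fine, c_true):
    """Coarse and fine squared errors for one ray, against the true pixel color."""
    return squared_error(c_coarse, c_true) + squared_error(c_fine, c_true)

def batch_loss(batch):
    """Sum the per-ray losses over a batch of (c_coarse, c_fine, c_true) triples."""
    return sum(ray_loss(*ray) for ray in batch)
```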