Takuma Yoneda

NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

Paper: https://arxiv.org/abs/2003.08934

Input / Output

\{(x, y, z), (\theta, \phi)\} \rightarrow \text{NeRF Network} \rightarrow \{(r, g, b), \sigma\}

  • x, y, z: 3D location
  • \theta, \phi: Viewing direction
  • r, g, b: emitted color
  • \sigma: volume density (much like opacity)
    • This depends only on (x, y, z) so that the geometry stays consistent across viewpoints

Rendering (Basics)

Rendering is straightforward: (classic) volume rendering

C(r) = \int_{t_\text{near}}^{t_\text{far}} T(t) \cdot {\color{blue} \sigma(\mathbf{r}(t))} \cdot {\color{green} \mathbf{c}(\mathbf{r}(t), \mathbf{d})}~dt
  • {\color{blue} \sigma(\mathbf{r}(t))}: volume density at position \mathbf{r}(t)
  • {\color{green} \mathbf{c}(\mathbf{r}(t), \mathbf{d})}: the color at position \mathbf{r}(t) viewed from direction \mathbf{d}
  • T(t): the accumulated transmittance, i.e., the probability that the ray travels from t_\text{near} to t without being blocked: ~~~T(t) = \exp (-\int_{t_\text{near}}^{t} \sigma(\mathbf{r}(s))\,ds)
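Given a set of discrete samples along a ray, the integral above is approximated by quadrature (the same rule that reappears in the hierarchical-sampling section below). A minimal NumPy sketch, with function and argument names of my own choosing:

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Quadrature approximation of the volume-rendering integral.

    sigmas: (N,) volume densities at the sampled points along the ray
    colors: (N, 3) RGB colors emitted at those points
    deltas: (N,) distances between adjacent samples
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)            # opacity of each segment
    # T_i: probability that the ray survives up to sample i
    T = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = T * alphas                               # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)     # (3,) rendered RGB
```

With a fully opaque first sample the rendered color is just that sample's color, since all later samples are occluded.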

Approximating this integral with a sum over fixed, discrete bins would suffer from low resolution.

In practice, they use a hierarchical version of stratified sampling.

  • stratified sampling: Helps to simulate a smoother integration than relying on fixed, discrete uniform bins.
  • hierarchical: Helps to allocate more samples to the regions that affect the rendering (i.e., avoid sampling a lot from empty space!)


Stratified Sampling

It partitions [t_n, t_f] into N evenly-spaced bins and then draws one sample uniformly at random from within each bin:

t_i \sim \mathcal{U}[t_n + \frac{i-1}{N} (t_f - t_n),~t_n + \frac{i}{N} (t_f - t_n)]

Because the draws land at different locations every iteration, training effectively sees the entire interval rather than a fixed set of points.
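The sampling rule above can be sketched in a few lines of NumPy (function name is mine):

```python
import numpy as np

def stratified_sample(t_near, t_far, N, rng=None):
    """Draw one uniform sample from each of N evenly-spaced bins in [t_near, t_far]."""
    rng = rng if rng is not None else np.random.default_rng()
    edges = np.linspace(t_near, t_far, N + 1)  # bin boundaries
    lower, upper = edges[:-1], edges[1:]
    u = rng.random(N)                          # one uniform draw per bin
    return lower + u * (upper - lower)         # (N,) sorted sample locations
```

Since exactly one sample comes from each bin, the result is already sorted along the ray.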

Key ideas / components

Naively training the network with the above idea doesn't work well. The key ideas are:


  • Positional encoding: the inputs are mapped through high-frequency sinusoidal functions before being fed to the network, which lets the MLP represent high-frequency detail
  • Viewing direction (\theta, \phi) is concatenated to the feature vector in a middle layer of the network

Training NeRF: Hierarchical sampling

Hierarchical sampling allocates more samples to the region that affects final rendering.

They simultaneously optimize two networks: a coarse one and a fine one.

  1. Sample a set of N_c locations along the ray \mathbf{r} using stratified sampling

    • r_1 \ldots r_{N_c}
  2. Evaluate the coarse network at these locations:

    • r_i \rightarrow \text{NeRF Network (Coarse)} \rightarrow \{c_i, \sigma_i\}
  3. Compute coarse rendering based on the samples:

    • \hat{C}_\text{coarse}(\mathbf{r}) = \sum_{i=1}^{N_c}w_i c_i,~~w_i = T_i \cdot (1 - \exp(-\sigma_i \delta_i))
    • \delta_i: the distance between adjacent samples
    • T_i: the probability that the ray reaches the point i
    • \sigma_i: volume density (i.e., opacity)
  4. Normalize the above weights to form a piecewise-constant PDF along the ray, and sample a second set of N_f locations from this distribution.

    • r'_1 \ldots r'_{N_f}
  5. Evaluate the fine network at all N_c + N_f locations:

    • r'_i \rightarrow \text{NeRF Network (Fine)} \rightarrow \{c'_i, \sigma'_i\}
    • r_i \rightarrow \text{NeRF Network (Fine)} \rightarrow \{c_i, \sigma_i\}
  6. Compute the final rendered color using all N_c + N_f samples

    • \hat{C}_\text{fine}(\mathbf{r}) = \sum_{i=1}^{N_c}w_i c_i + \sum_{i=1}^{N_f}w'_i c'_i
    • Notice that the second set of N_f samples is biased towards regions with higher \sigma
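Step 4 (sampling from the piecewise-constant PDF built from the coarse weights) is inverse-transform sampling. A hedged NumPy sketch, with names of my own choosing; `bin_edges` holds the N_c + 1 boundaries of the coarse bins:

```python
import numpy as np

def sample_pdf(bin_edges, weights, N_f, rng=None):
    """Inverse-transform sampling from the piecewise-constant PDF
    defined by the (unnormalized) coarse weights w_i."""
    rng = rng if rng is not None else np.random.default_rng()
    pdf = weights / weights.sum()                    # normalize the weights
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])    # (N_c + 1,)
    u = rng.random(N_f)                              # uniform draws in [0, 1)
    idx = np.searchsorted(cdf, u, side="right") - 1  # bin each u falls in
    idx = np.clip(idx, 0, len(weights) - 1)
    # linear position of u within its bin -> location along the ray
    denom = np.where(pdf[idx] > 0, pdf[idx], 1.0)
    frac = (u - cdf[idx]) / denom
    return bin_edges[idx] + frac * (bin_edges[idx + 1] - bin_edges[idx])
```

If all the coarse weight sits in one bin, every fine sample lands inside that bin, which is exactly the "biased towards higher \sigma" behavior noted above.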

Training loss

\mathcal{L} = \sum_{\mathbf{r}}[\|\hat{C}_\text{coarse}(\mathbf{r}) - C(\mathbf{r})\|^2_2 + \|\hat{C}_\text{fine}(\mathbf{r}) - C(\mathbf{r})\|^2_2]
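The loss is just the sum of squared errors of both renderings against the ground-truth pixel colors, over a batch of rays. A minimal sketch (function name is mine):

```python
import numpy as np

def nerf_loss(C_coarse, C_fine, C_gt):
    """Sum over rays of squared error for both the coarse and fine renderings.

    C_coarse, C_fine, C_gt: (num_rays, 3) arrays of RGB colors.
    """
    return np.sum((C_coarse - C_gt) ** 2) + np.sum((C_fine - C_gt) ** 2)
```

The coarse term is what lets the coarse network's weights define a useful PDF for the fine-sample allocation; only the fine rendering is used at test time.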