NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

Paper: https://arxiv.org/abs/2003.08934

Input / Output

$\{(x, y, z), (\theta, \phi)\} \rightarrow \text{NeRF Network} \rightarrow \{(r, g, b), \sigma\}$

  • $x, y, z$: 3D location
  • $\theta, \phi$: viewing direction
  • $r, g, b$: emitted color
  • $\sigma$: volume density (much like opacity)
    • This depends only on $(x, y, z)$ so that density stays consistent across viewpoints
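As a rough sketch, the network can be thought of as a small MLP like the one below (a minimal PyTorch stand-in with names of my own choosing; it omits details such as the positional encoding of the inputs, but shows how $\sigma$ is predicted before the viewing direction is injected):

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Illustrative stand-in for the NeRF MLP: 5D input -> (RGB, density)."""

    def __init__(self, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)  # density: depends on position only
        self.rgb_head = nn.Sequential(          # color: position features + direction
            nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, xyz, view_dir):
        h = self.trunk(xyz)                     # features from position alone
        sigma = torch.relu(self.sigma_head(h))  # keep density non-negative
        rgb = self.rgb_head(torch.cat([h, view_dir], dim=-1))
        return rgb, sigma
```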

Rendering (Basics)

Rendering is straightforward: (classic) volume rendering.

$$C(\mathbf{r}) = \int_{t_\text{near}}^{t_\text{far}} T(t) \cdot {\color{blue} \sigma(\mathbf{r}(t))} \cdot {\color{green} \mathbf{c}(\mathbf{r}(t), \mathbf{d})}~dt$$
  • ${\color{blue} \sigma(\mathbf{r}(t))}$: volume density at position $\mathbf{r}(t)$
  • ${\color{green} \mathbf{c}(\mathbf{r}(t), \mathbf{d})}$: the color at position $\mathbf{r}(t)$ viewed from direction $\mathbf{d}$
  • $T(t)$: the probability that the ray travels from $t_\text{near}$ to $t$: $~~~T(t) = \exp\left(-\int_{t_\text{near}}^{t} \sigma(\mathbf{r}(s))\,ds\right)$

Approximating this integral with a sum over a fixed set of discrete bins would suffer from low resolution, since the network would only ever be queried at the same fixed locations.
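For concreteness, here is what such a discrete quadrature looks like: a minimal PyTorch sketch (variable names are mine) of the weighted sum that also appears in step 3 of the training procedure below.

```python
import torch

def render_ray(rgb, sigma, t_vals):
    """Numerical quadrature for C(r) along a single ray.

    rgb: (N, 3) sampled colors, sigma: (N,) densities, t_vals: (N,) sorted depths.
    """
    delta = t_vals[1:] - t_vals[:-1]                      # distances between samples
    delta = torch.cat([delta, delta.new_tensor([1e10])])  # pad the last interval
    alpha = 1.0 - torch.exp(-sigma * delta)               # per-segment opacity
    # T_i = prod_{j<i} (1 - alpha_j): probability that the ray reaches sample i
    trans = torch.cumprod(torch.cat([alpha.new_ones(1), 1.0 - alpha[:-1]]), dim=0)
    weights = trans * alpha                               # the w_i of step 3 below
    return (weights[:, None] * rgb).sum(dim=0)            # rendered color, (3,)
```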

In practice, they use a hierarchical version of stratified sampling.

  • stratified sampling: Helps to simulate smoother integration than relying on discrete uniform bins.
  • hierarchical: Helps to allocate more samples to the region that affects rendering (i.e., avoid sampling a lot from empty space!)

![method](/media/posts/nerf/method.png =900x)

:::details Stratified Sampling
It partitions $[t_n, t_f]$ into $N$ evenly-spaced bins and then draws one sample uniformly at random from within each bin:

$$t_i \sim \mathcal{U}\left[t_n + \frac{i-1}{N} (t_f - t_n),~t_n + \frac{i}{N} (t_f - t_n)\right]$$

It can simulate sampling from the entire continuous space.
:::
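A minimal sketch of this sampler (assuming scalar `t_near` / `t_far`; names are mine):

```python
import torch

def stratified_samples(t_near, t_far, n_bins):
    """One uniform draw per evenly-spaced bin in [t_near, t_far]."""
    edges = torch.linspace(t_near, t_far, n_bins + 1)    # bin boundaries
    lower, upper = edges[:-1], edges[1:]
    return lower + (upper - lower) * torch.rand(n_bins)  # (n_bins,) sorted samples
```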

Key ideas / components

Naively training the network with the above idea doesn't work well. The key ideas are:

Minor:

  • The viewing direction $(\theta, \phi)$ is concatenated to the feature vector in a middle layer of the network (as in the sketch in the Input / Output section above)

Training NeRF: Hierarchical sampling

Hierarchical sampling allocates more samples to the regions that affect the final rendering.

They simultaneously optimize two networks: a coarse one and a fine one.

  1. Sample a set of $N_c$ locations along the ray $\mathbf{r}$ using stratified sampling

    • $r_1 \ldots r_{N_c}$
  2. Evaluate the coarse network at these locations:

    • $r_i \rightarrow \text{NeRF Network (Coarse)} \rightarrow \{c_i, \sigma_i\}$
  3. Compute coarse rendering based on the samples:

    • $\hat{C}_\text{coarse}(\mathbf{r}) = \sum_{i=1}^{N_c} w_i c_i,~~w_i = T_i \cdot (1 - \exp(-\sigma_i \delta_i))$
    • $\delta_i$: the distance between adjacent samples
    • $T_i$: the probability that the ray reaches point $i$
    • $\sigma_i$: volume density (i.e., opacity)
  4. Normalize the above weights to form a piecewise-constant PDF along the ray, and sample a second set of $N_f$ locations from this distribution (see the sketch after this list).

    • $r'_1 \ldots r'_{N_f}$
  5. Evaluate the fine network at all $N_c + N_f$ locations:

    • $r'_i \rightarrow \text{NeRF Network (Fine)} \rightarrow \{c'_i, \sigma'_i\}$
    • $r_i \rightarrow \text{NeRF Network (Fine)} \rightarrow \{c_i, \sigma_i\}$
  6. Compute the final rendered color using all $N_c + N_f$ samples

    • $\hat{C}_\text{fine}(\mathbf{r}) = \sum_{i=1}^{N_c} w_i c_i + \sum_{i=1}^{N_f} w'_i c'_i$
    • Notice that the second set of $N_f$ samples is biased towards regions with higher $\sigma$
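As referenced in step 4, here is a sketch of drawing the $N_f$ fine samples by inverse-transform sampling from the piecewise-constant PDF (a minimal PyTorch version with hypothetical names, operating on a single ray):

```python
import torch

def sample_pdf(bin_edges, weights, n_fine):
    """Draw n_fine depths from the piecewise-constant PDF given by `weights`.

    bin_edges: (N + 1,) sorted bin boundaries along the ray; weights: (N,) coarse w_i.
    """
    pdf = weights / (weights.sum() + 1e-8)       # normalize weights into a PDF
    cdf = torch.cumsum(pdf, dim=0)
    cdf = torch.cat([cdf.new_zeros(1), cdf])     # (N + 1,), starts at 0
    u = torch.rand(n_fine)                       # uniform draws in [0, 1)
    # Invert the CDF: find the bin each u falls into, then place the sample
    # proportionally within that bin.
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, len(weights))
    c0, c1 = cdf[idx - 1], cdf[idx]              # surrounding CDF values
    e0, e1 = bin_edges[idx - 1], bin_edges[idx]  # surrounding bin edges
    frac = ((u - c0) / (c1 - c0 + 1e-8)).clamp(0.0, 1.0)
    return e0 + frac * (e1 - e0)                 # (n_fine,) new sample depths
```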

Training loss

$$\mathcal{L} = \sum_{\mathbf{r}} \left[ \|\hat{C}_\text{coarse}(\mathbf{r}) - C(\mathbf{r})\|^2_2 + \|\hat{C}_\text{fine}(\mathbf{r}) - C(\mathbf{r})\|^2_2 \right]$$
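In code, this is just two squared-error terms against the ground-truth pixel colors. A minimal sketch (note that `F.mse_loss` averages rather than sums, which differs from the paper's total squared error only by a constant scale):

```python
import torch
import torch.nn.functional as F

def nerf_loss(c_coarse, c_fine, c_gt):
    """Photometric loss: both renderings are regressed to the same ground truth."""
    return F.mse_loss(c_coarse, c_gt) + F.mse_loss(c_fine, c_gt)

# Example with dummy tensors: a batch of 1024 rays with RGB targets in [0, 1].
c_gt = torch.rand(1024, 3)
loss = nerf_loss(torch.rand(1024, 3), torch.rand(1024, 3), c_gt)
```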