NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
Paper: https://arxiv.org/abs/2003.08934
Input / Output
- Input: 3D location $(x, y, z)$ and viewing direction $(\theta, \phi)$
- Output: emitted color $(r, g, b)$ and volume density $\sigma$ (much like opacity)
    - $\sigma$ depends only on the location $(x, y, z)$, so that the density is consistent across viewpoints
Rendering (Basics)
Rendering is straightforward: (classic) volume rendering
$$\hat{C}(\mathbf{r}) = \int_{t_\text{near}}^{t_\text{far}} T(t)\, {\color{blue} \sigma(\mathbf{r}(t))}\, {\color{green} \mathbf{c}(\mathbf{r}(t), \mathbf{d})}\, dt$$

- ${\color{blue} \sigma(\mathbf{r}(t))}$: volume density at position $\mathbf{r}(t)$
- ${\color{green} \mathbf{c}(\mathbf{r}(t), \mathbf{d})}$: the color at position $\mathbf{r}(t)$ viewed from direction $\mathbf{d}$
- $T(t)$: the probability that the ray travels from $t_\text{near}$ to $t$: $T(t) = \exp\left(-\int_{t_\text{near}}^{t} \sigma(\mathbf{r}(s))\, ds\right)$
Approximating this integral with a sum over fixed discrete bins suffers from low resolution.
In practice, they use a hierarchical version of stratified sampling.
- stratified sampling: helps to simulate smoother integration than relying on fixed discrete uniform bins
- hierarchical: helps to allocate more samples to the regions that actually affect the rendering (i.e., avoid sampling a lot from empty space!)
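To make the discrete approximation concrete, here is a minimal NumPy sketch of the standard quadrature (weights $w_i = T_i (1 - \exp(-\sigma_i \delta_i))$, the same form used for the coarse rendering later); it assumes `sigma` and `rgb` have already been evaluated at sample positions `t` along one ray, and all names are illustrative:

```python
import numpy as np

def render_ray(t, sigma, rgb):
    """Discrete volume-rendering quadrature along a single ray.

    t:     (N,)   sorted sample positions along the ray
    sigma: (N,)   volume densities at the samples
    rgb:   (N, 3) emitted colors at the samples
    """
    # delta_i: distance between adjacent samples (last segment capped large)
    delta = np.diff(t, append=1e10)
    # alpha_i = 1 - exp(-sigma_i * delta_i): opacity of segment i
    alpha = 1.0 - np.exp(-sigma * delta)
    # T_i = exp(-sum_{j<i} sigma_j * delta_j): prob. the ray reaches sample i
    T = np.exp(-np.concatenate([[0.0], np.cumsum(sigma * delta)[:-1]]))
    w = T * alpha                                   # quadrature weights w_i
    return (w[:, None] * rgb).sum(axis=0), w        # rendered color, weights
```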
Stratified Sampling
It partitions $[t_\text{near}, t_\text{far}]$ into evenly-spaced bins and draws one sample uniformly at random from within each bin.
It can simulate sampling from the entire (continuous) space, since every position can be hit over the course of training.
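A minimal NumPy sketch of this stratified sampling for a single ray (the `t_near` / `t_far` bounds and all names are illustrative):

```python
import numpy as np

def stratified_samples(t_near, t_far, n_bins, rng=np.random.default_rng()):
    """Draw one uniform sample from each of n_bins evenly-spaced bins."""
    edges = np.linspace(t_near, t_far, n_bins + 1)
    lower, upper = edges[:-1], edges[1:]
    # jitter uniformly inside each bin: coverage stays roughly even,
    # but the network sees continuous positions across training iterations
    return lower + (upper - lower) * rng.uniform(size=n_bins)
```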
Key ideas / components
Naively training the network with the above idea doesn't work. The key ideas are:
- Encouraging the representation to be multiview-consistent
    - The network is restricted to predict the density $\sigma$ as a function of only the location $\mathbf{x}$,
    - whereas the color $\mathbf{c}$ is predicted as a function of both the location and the viewing direction.
- Positional encoding
    - It is common knowledge that a (sinusoidal) positional encoding helps NNs fit high-frequency signals: $\gamma(p) = \left(\sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p)\right)$ (see the sketch after this list)
- Hierarchical sampling procedure
- Details below
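A minimal NumPy sketch of the encoding $\gamma$ above, applied elementwise to each coordinate (the paper uses $L = 10$ for the location and $L = 4$ for the direction; function and variable names are illustrative):

```python
import numpy as np

def positional_encoding(p, L=10):
    """gamma(p) = (sin(2^0 pi p), cos(2^0 pi p), ..., sin(2^{L-1} pi p), cos(2^{L-1} pi p)).

    p: (..., D) raw coordinates; returns (..., D * 2L).
    """
    freqs = (2.0 ** np.arange(L)) * np.pi                      # 2^k * pi
    angles = p[..., None] * freqs                              # (..., D, L)
    enc = np.stack([np.sin(angles), np.cos(angles)], axis=-1)  # (..., D, L, 2)
    return enc.reshape(*p.shape[:-1], -1)
```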
Minor:
- The viewing direction $(\theta, \phi)$ is concatenated to the feature vector in a middle layer of the network (see the sketch below)
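A condensed PyTorch sketch of how the two restrictions above can be wired up (this is a simplification of the paper's 8-layer MLP; layer counts and sizes here are illustrative assumptions, not the exact architecture):

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Density depends on position only; color also sees the viewing direction."""

    def __init__(self, pos_dim=60, dir_dim=24, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)        # sigma from position features only
        self.color_head = nn.Sequential(              # color from features + direction
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, x_enc, d_enc):
        h = self.trunk(x_enc)                              # encoded position -> features
        sigma = torch.relu(self.sigma_head(h))             # density >= 0
        rgb = self.color_head(torch.cat([h, d_enc], -1))   # direction joins mid-network
        return rgb, sigma
```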
Training NeRF: Hierarchical sampling
Hierarchical sampling allocates more samples to the regions that affect the final rendering.
They simultaneously optimize two networks, a coarse one and a fine one, with the total squared error between the rendered and ground-truth pixel colors of both: $\mathcal{L} = \sum_{\mathbf{r}} \left[ \left\| \hat{C}_\text{coarse}(\mathbf{r}) - C(\mathbf{r}) \right\|^2 + \left\| \hat{C}_\text{fine}(\mathbf{r}) - C(\mathbf{r}) \right\|^2 \right]$
- Sample a set of $N_c$ locations along the ray $\mathbf{r}$ using stratified sampling:
    - $r_1, \ldots, r_{N_c}$
- Evaluate the coarse network at these locations:
    - $r_i \rightarrow \text{NeRF Network (Coarse)} \rightarrow \{c_i, \sigma_i\}$
- Compute the coarse rendering based on these samples:
    - $\hat{C}_\text{coarse}(\mathbf{r}) = \sum_{i=1}^{N_c} w_i c_i, \quad w_i = T_i \cdot (1 - \exp(-\sigma_i \delta_i))$
    - $\delta_i$: the distance between adjacent samples
    - $T_i$: the probability that the ray reaches point $i$, i.e., $T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)$
    - $\sigma_i$: volume density (i.e., opacity)
- Normalize the above weights to form a piecewise-constant PDF along the ray, and sample a second set of $N_f$ locations from this distribution via inverse transform sampling (see the sketch after this list):
    - $r'_1, \ldots, r'_{N_f}$
- Evaluate the fine network at all $N_c + N_f$ locations:
    - $r_i \rightarrow \text{NeRF Network (Fine)} \rightarrow \{c_i, \sigma_i\}$
    - $r'_i \rightarrow \text{NeRF Network (Fine)} \rightarrow \{c'_i, \sigma'_i\}$
- Compute the final rendered color using all $N_c + N_f$ samples:
    - $\hat{C}_\text{fine}(\mathbf{r}) = \sum_{i=1}^{N_c} w_i c_i + \sum_{i=1}^{N_f} w'_i c'_i$
    - Notice that the second set of $N_f$ samples is biased towards regions with higher $\sigma$
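A minimal NumPy sketch of the resampling step above: the coarse weights are normalized into a piecewise-constant PDF over the ray's bins, and $N_f$ new locations are drawn by inverting the CDF. All names are illustrative, and the simplifications (single ray, no batching) are assumptions:

```python
import numpy as np

def sample_pdf(bin_edges, weights, n_f, rng=np.random.default_rng()):
    """Draw n_f samples from the piecewise-constant PDF given by coarse weights.

    bin_edges: (N_c + 1,) edges of the coarse bins along the ray
    weights:   (N_c,)     coarse rendering weights w_i (>= 0)
    """
    # normalize the weights into a PDF over bins, then build the CDF
    pdf = weights / (weights.sum() + 1e-8)
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])

    # inverse transform sampling: u ~ U[0, 1), then invert the CDF
    u = rng.uniform(size=n_f)
    idx = np.searchsorted(cdf, u, side="right") - 1     # bin containing each u
    idx = np.clip(idx, 0, len(weights) - 1)

    # place each sample proportionally inside its bin
    denom = cdf[idx + 1] - cdf[idx]
    denom = np.where(denom < 1e-8, 1.0, denom)
    frac = (u - cdf[idx]) / denom
    return bin_edges[idx] + frac * (bin_edges[idx + 1] - bin_edges[idx])
```

Because each bin is chosen with probability proportional to its weight $w_i$, the $N_f$ fine samples concentrate exactly where the coarse pass found visible content.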