Rough Summary: DNC

TL;DR

  • This paper introduces the Differentiable Neural Computer (DNC)
  • Conventional neural nets struggle with tasks that require complex data structures because they lack external memory
  • The DNC is fully differentiable and can use an external memory matrix, much like the random-access memory in a conventional computer
  • A DNC learns to solve tasks by being trained, just like other machine learning models, rather than by being explicitly programmed
  • Because it can read from and write to external memory, a DNC can solve complex, structured tasks such as moving-block puzzles, finding shortest paths, and inferring missing links in a graph

Existing neural networks cannot dynamically allocate new memory when a task's memory demands grow (e.g., the size of an LSTM's hidden state vector is fixed).

Model: Differentiable Neural Computer (DNC)

First, let's describe the model at a high level. It can be decomposed into four components:

  1. Controller
  2. Read and write heads
  3. Memory
  4. Memory usage and temporal links

![architecture](/media/posts/dnc/dnc_architecture.png =800x)

As you can guess from the structure, this is conceptually similar to a Turing machine, which has a controller that moves its tape (memory) left or right, and heads that read from and write to the tape.

In this paper, the Controller is just a simple variant of an LSTM that emits what they call the interface vector $\xi_t$.

This vector determines where and what to read and write in memory. $\xi_t$ is simply a concatenation of the different vectors used to operate the memory: the write vector, erase vector, write key, read keys, read modes, and other parameters for gating and so on.
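To make this concrete, here is a minimal NumPy sketch of how $\xi_t$ might be sliced into its components. The component names follow the paper, but the layout, the sizes `W` and `R`, and the helper `split_interface` are illustrative assumptions, not the paper's exact parameterization (the paper also applies activations such as sigmoids to the gates and a softmax to the read modes).

```python
import numpy as np

# Illustrative slicing of the interface vector xi_t (layout/sizes are assumptions).
W, R = 16, 4  # memory word width and number of read heads (assumed values)

def split_interface(xi, W=W, R=R):
    idx = 0
    def take(n):
        nonlocal idx
        chunk = xi[idx:idx + n]
        idx += n
        return chunk
    return {
        "read_keys":  take(R * W).reshape(R, W),  # k_t^{r,1..R}
        "read_betas": take(R),                    # read key strengths
        "write_key":  take(W),                    # k_t^w
        "write_beta": take(1),                    # write key strength
        "erase_vec":  take(W),                    # e_t
        "write_vec":  take(W),                    # v_t
        "free_gates": take(R),                    # free gates, one per read head
        "alloc_gate": take(1),                    # allocation gate
        "write_gate": take(1),                    # write gate
        "read_modes": take(3 * R).reshape(R, 3),  # backward / content / forward
    }

xi_t = np.random.randn(R * W + R + W + 1 + W + W + R + 1 + 1 + 3 * R)
parts = split_interface(xi_t)
```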

In the following, I'll describe at a high level how a DNC writes to and reads from memory using the interface vector $\xi_t$.

--- Writing ---

First, the write key (bottom of the green box) is used to look up memory locations whose content is similar to the key. The model then checks whether those locations are already in use, via the memory usage (d in the figure), to decide where to write (green arrow). The result is expressed as the write weighting $w_t^w$.
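As a rough sketch (assuming the content weighting from the write key and the allocation weighting from memory usage have already been computed elsewhere), the write weighting interpolates between the two, gated by the allocation and write gates in $\xi_t$:

```python
def write_weighting(content_w, alloc_w, alloc_gate, write_gate):
    """Sketch: w_t^w = g_t^w [ g_t^a * a_t + (1 - g_t^a) * c_t^w ].

    content_w: content weighting from the write key
    alloc_w:   allocation weighting derived from memory usage
    """
    return write_gate * (alloc_gate * alloc_w + (1.0 - alloc_gate) * content_w)
```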

Once $w_t^w$ is calculated, the content is erased and overwritten based on the erase vector $e_t$ and the write vector $v_t$. The erase-and-overwrite equation shown in the paper is pretty straightforward:

$$M_t = M_{t-1} \odot (\boldsymbol{1} - w_t^w e_t^\top) + w_t^w v_t^\top$$

where $M_t$ denotes the memory matrix at timestep $t$, and $e_t$ and $v_t$ denote the erase and write vectors, respectively.
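A minimal NumPy sketch of this update, with assumed sizes ($N$ locations of width $W$) and random placeholder values:

```python
import numpy as np

N, W = 8, 16                               # assumed memory shape
M_prev = np.zeros((N, W))                  # M_{t-1}
w_w    = np.random.dirichlet(np.ones(N))   # write weighting w_t^w
e      = np.random.rand(W)                 # erase vector e_t, entries in [0, 1]
v      = np.random.randn(W)                # write vector v_t

# M_t = M_{t-1} ∘ (1 - w_t^w e_t^T) + w_t^w v_t^T
M = M_prev * (1.0 - np.outer(w_w, e)) + np.outer(w_w, v)
```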

--- Reading ---

As in writing, the memory locations whose content is similar to the read key are looked up first. Then, instead of simply reading from those locations, the model performs a read operation determined by the read mode. There are three read modes: Backward, Content, and Forward.

In Content mode, the model simply reads from the location looked up by the read key. In Forward or Backward mode, it ignores the key and jumps to the memory location that was written just after or just before the currently read location.

These modes often make it easy for the model to read from memory in the correct order (reading memory in the order it was written makes sense in many cases, as with a stack or a queue).

The order in which memory locations were written is retained in the temporal links shown in d.
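For one read head, a hedged sketch of how these pieces might combine: the temporal link matrix $L$ (where $L[i, j]$ roughly encodes "location $i$ was written right after location $j$") yields the Forward and Backward weightings from the previous read weighting, and the 3-way read mode mixes them with the content weighting. The computation of $L$ itself is omitted here.

```python
def read_weighting(L, w_prev, content_w, pi):
    """Mix the three read modes for one head (sketch).

    L:         N x N temporal link matrix
    w_prev:    this head's read weighting from the previous timestep
    content_w: content weighting from the read key
    pi:        3-way read mode (backward, content, forward), softmax-normalized
    """
    backward = L.T @ w_prev   # locations written just before the last-read one
    forward  = L @ w_prev     # locations written just after the last-read one
    return pi[0] * backward + pi[1] * content_w + pi[2] * forward
```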

Following the process described above, the model calculates read weightings $\{w_t^{r,1}, \ldots, w_t^{r,R}\}$ that represent where to read from. The read equation itself is even simpler:

$$r_t^i = M_t^\top w_t^{r,i}$$

Having multiple read weightings corresponds to having multiple read heads in a Turing machine. The model reads from memory to create $R$ read vectors $\{r_t^{1}, \ldots, r_t^{R}\}$. These vectors are concatenated and then fed to a linear layer to compute the model's output $y_t$.
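In code, reading for all $R$ heads is a single matrix product. A small sketch with assumed sizes, reusing an $N \times W$ memory `M` and an $R \times N$ stack of read weightings:

```python
import numpy as np

def read(M, w_r):
    """M: N x W memory, w_r: R x N read weightings -> concatenated read vectors."""
    r = w_r @ M               # row i equals r_t^i = M_t^T w_t^{r,i}
    return r.reshape(-1)      # concatenated and fed to the output layer

# Example with assumed sizes
N, W, R = 8, 16, 4
M = np.random.randn(N, W)
w_r = np.abs(np.random.randn(R, N))
w_r /= w_r.sum(axis=1, keepdims=True)     # each head's weighting sums to 1
read_vectors = read(M, w_r)               # shape (R * W,)
```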

A bit more detail

We haven't dug deep into how the write weighting $w_t^w$ and the read weightings $\{w_t^{r,1}, \ldots, w_t^{r,R}\}$ are calculated.

In this post, instead of fully describing how to calculate those weightings, I will just explain one of the important concepts used in the paper.

Content-based addressing

We have seen that, for both reading and writing, the model looks up memory locations whose content is similar to a key. Content-based addressing is exactly this lookup operation.

Now, say we have a key vector $k$ and a memory matrix $M$. The equation shown in the paper is:

$$\mathcal{C}(M, \boldsymbol{k}, \beta)[i] = \frac{\exp\{\mathcal{D}(\boldsymbol{k}, M[i, \cdot])\,\beta\}}{\sum_{j} \exp\{\mathcal{D}(\boldsymbol{k}, M[j, \cdot])\,\beta\}}$$

where $M[i, \cdot]$ denotes the $i$-th row of the memory matrix and $\mathcal{D}$ is cosine similarity. $\beta$ is a scalar called the key strength: it controls the peakiness of the distribution (the equation is nothing but a softmax, and sharpening via a temperature-like scalar is one of its well-known properties).

$\mathcal{C}(M, \boldsymbol{k}, \beta)$ is a similarity distribution over memory locations. It is used to calculate the read and write weightings, in combination with memory allocation and the temporal linkage (the order in which memory was written).
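A small NumPy sketch of $\mathcal{C}$, assuming the rows of $M$ and the key are non-zero (the epsilon only guards against division by zero):

```python
import numpy as np

def content_addressing(M, k, beta, eps=1e-8):
    """Softmax over cosine similarities between key k and each memory row."""
    sims = (M @ k) / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + eps)
    logits = beta * sims
    logits -= logits.max()          # numerical stability
    w = np.exp(logits)
    return w / w.sum()              # C(M, k, beta): distribution over rows
```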


Links