Takuma Yoneda

Multi-step returns in RL

Monte-Carlo return

\begin{align*} R_t &= r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots + \gamma^{T-t} r_{T} \\ &= \sum_{l=0}^{T-t} \gamma^l r_{t+l} \end{align*}

n-step return

\begin{align*} R^{(n)}_t &= r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots + \gamma^{n-1}r_{t+n-1} + \gamma^{n} V(s_{t+n}) \\ &= \sum_{l=0}^{n-1}\big[ \gamma^l r_{t+l} \big] + \gamma^n V(s_{t+n}) \end{align*}

Basically everything after n-th step is bootstrapped with the value function

n what? expression biased? variance?
1 1-step return
r_t + \gamma V(s_{t+1})
🙅‍♂️ biased 🙆‍♂️ low-variance
\vdots
\infty Monte-Carlo return
\sum_{l=0}^{T-t} \gamma^l r_{t+l}
🙆‍♂️ unbiased 🙅‍♂️ high-variance

1-step return is common in Q-learning.

Notice that n acts as a trade-off between bias and variance.
Why don't we mix these? --> \lambda-return

λ-return (TD(λ))

Exponentially weighted average of n-step returns

R_t(\lambda) = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}

Q. Why do we need (1-\lambda) there?:

It's necessary for normalization, as you can see from

(1 - r) \sum_{k=1}^{\infty} ar^{k-1} = a
Basics of Geometric progression

Geometric Progression

\sum_{k=1}^n ar^{k-1} = ar^0 + ar^1 + ar^2 + \cdots + ar^{n-1}

A trick: multiply (1-r) on both sides

\begin{align*} (1-r)\sum_{k=1}^{n} ar^{k-1} &= (1-r)(ar^0 + ar^1 + ar^2 + \cdots + ar^{n-1}) \\ &= (1-r)(ar^0 + ar^1 + ar^2 + \cdots + ar^{n-1}) \\ &= (ar^0 + ar^1 + ar^2 + \cdots + ar^{n-1}) - (ar^1 + ar^2 + ar^3 + \cdots + ar^n) \\ &= a - ar^n \\ &= a(1-r^n) \\ \end{align*}

Thus,

\sum_{k=1}^{n}ar^{k-1} = \frac{a(1 - r^n)}{1 - r}

If n \rightarrow \infty,

\sum_{k=1}^{\infty} ar^{k-1} = \frac{a}{1 - r}
  • \lambda=0 \Longrightarrow R^{(1)}_t: 1-step return
  • \lambda=1 \Longrightarrow R^{(\infty)}_t (=R_t): Monte-Carlo return

Generalized Advantage Estimator: GAE(λ)

In one word: we simply repeat the same discussion for advantages (TD residuals), instead of returns.

This simply estimates advantage with λ-return: \hat{\mathcal{A}}_t = R_t(\lambda) - V(s_t).

Basics: Value function and Q-function

\begin{align*} V^{\pi, \gamma}(s_t) &\coloneqq \mathbb{E_{ \substack{s_{t+1:\infty}, \\ {\color{blue} a_{t:\infty}}} }}[\sum_{l=0}^\infty \gamma^l r_{t+l}] \\ Q^{\pi, \gamma}(s_t, a_t) &\coloneqq \mathbb{E}_{ \substack{s_{t+1:\infty} \\ {\color{blue} a_{t+1:\infty}}} }[\sum_{l=0}^{\infty} \gamma^l r_{t+l}] \\ \end{align*}

Advantage is defined as their difference:

A^{\pi, \gamma}(s_t, a_t) \coloneqq Q^{\pi, \gamma}(s_t, a_t) - V^{\pi, \gamma}(s_t)

TD residual

Now, we define TD residual \delta^V_t:

\delta_t^{V^{\pi, \gamma}} = r_t + \gamma V^{\pi, \gamma}(s_{t+1}) - V^{\pi, \gamma}(s_t)

This is in fact, an unbiased estimator of A^{\pi, \gamma}(s_t, a_t):

\begin{align*} \mathbb{E}_{s_{t+1}} \big[ \delta_t^{V^{\pi, \gamma}} \big] &= \mathbb{E}_{s_{t+1}} [r_t + \gamma V^{\pi, \gamma}(s_{t+1}) - V^{\pi, \gamma}(s_t)] \\ &= \mathbb{E}_{s_{t+1}} [Q^{\pi, \gamma}(s_t, a_t) - V^{\pi, \gamma}(s_t)] \\ &= A^{\pi, \gamma}(s_t, a_t) \end{align*}

n-step TD residual

\begin{align*} \hat{A}_t^{(n)} &\coloneqq \sum_{l=0}^{n-1} \gamma^l \delta_{t+l}^{V} = \delta_t^V + \gamma \delta_{t+1}^V + \gamma^2 \delta_{t+2}^V + \cdots + \gamma^{n-1} \delta_{t+n-1}^{n-1} \\ &=^{*} r_t + \gamma r_{t+1} + ~\cdots~ + \gamma^{n-1}r_{t+n-1} +~\gamma^n V(s_{t+n}) - V(s_t) \\ &= \sum_{l=0}^{n-1}[\gamma^l r_{t+l}] + \gamma^n V(s_{t+n}) - V(s_t) \\ &= R^{(n)}_t - V(s_t) \end{align*}

* Notice that this is a telescoping sum.

GAE

Exponentially weighted average of n-step TD residuals

\begin{align*} \hat{A}_t^{\text{GAE}(\gamma, \lambda)} &\coloneqq (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}\hat{A}_t^{(n)} \\ &= (1-\lambda)\big(\delta_t^V + \lambda(\delta_t^V + \gamma \delta_{t+1}^V) + \lambda^2(\delta_t^V + \gamma \delta_{t+1}^V + \gamma^2 \delta_{t+2}^V) + \cdots \big) \\ &= (1-\lambda)\big(\delta_t^V(1 + \lambda + \lambda^2 + \cdots) + \gamma \delta_{t+1}^V(\lambda + \lambda^2 + \lambda^3 + \cdots) + \cdots \big) \\ &= (1-\lambda)\big(\delta_t^V(\frac{1}{1-\lambda}) + \gamma \delta_{t+1}^V(\frac{\lambda}{1 - \lambda}) + \gamma^2 \delta_{t+1}^V(\frac{\lambda^2}{1 - \lambda}) + \cdots \big) \\ &= \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}^V \end{align*}
  • \lambda=0 -> unbiased regardless of the accuracy of V, but high variance
  • \lambda=\infty -> unbiased only if V = V^{\pi, \gamma}, and low variance

Reference