# Multi-step returns in RL

## Monte-Carlo return

$$
\begin{align*}
R_t &= r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{T-t} r_{T} \\
&= \sum_{l=0}^{T-t} \gamma^l r_{t+l}
\end{align*}
$$
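
A minimal sketch in Python (the function name is my own): the Monte-Carlo return is just the discounted sum of the rewards observed until the episode ends.

```python
def monte_carlo_return(rewards, gamma=0.99):
    """Discounted sum of the observed rewards r_t, ..., r_T."""
    ret = 0.0
    for r in reversed(rewards):   # accumulate backwards: R = r + gamma * R
        ret = r + gamma * ret
    return ret

# R_0 for a 3-step episode: 1 + 0.99 * 0 + 0.99**2 * 2 = 2.9602
print(monte_carlo_return([1.0, 0.0, 2.0]))
```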

## n-step return

$$
\begin{align*}
R^{(n)}_t &= r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} V(s_{t+n}) \\
&= \sum_{l=0}^{n-1}\big[ \gamma^l r_{t+l} \big] + \gamma^n V(s_{t+n})
\end{align*}
$$

Basically, everything after the $n$-th step is bootstrapped with the value function.

:::message
$n \rightarrow \infty$ recovers the Monte-Carlo return. This assumes the rewards after horizon $T$ are all zeros:

$$
\begin{align*}
R^{(\infty)}_t &= \sum_{l=0}^{T-t} \big[ \gamma^l r_{t+l} \big] + \sum_{l=T-t+1}^{\infty} \big[ \gamma^l \cdot 0 \big] + \gamma^\infty V(s_{t+\infty}) \\
&= \sum_{l=0}^{T-t} \big[ \gamma^l r_{t+l} \big] = R_t.
\end{align*}
$$

:::
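
A sketch of the n-step return under the same convention (the helper name and the terminal-value handling are my assumptions, not from a particular library):

```python
def n_step_return(rewards, values, t, n, gamma=0.99):
    """R_t^{(n)}: n discounted rewards, then bootstrap with V(s_{t+n}).

    rewards[k] = r_k for k = 0..T-1, values[k] = V(s_k) for k = 0..T
    (set values[T] = 0 if the episode really terminated).
    """
    T = len(rewards)
    ret = sum(gamma ** l * rewards[t + l] for l in range(min(n, T - t)))
    if t + n <= T:                        # bootstrap only while s_{t+n} exists
        ret += gamma ** n * values[t + n]
    return ret

rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3, 0.0]             # V(s_0)..V(s_3); terminal value is 0
print(n_step_return(rewards, values, t=0, n=2))    # bootstraps with V(s_2)
print(n_step_return(rewards, values, t=0, n=100))  # large n -> Monte-Carlo return
```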

| $n$ | what? | expression | biased? | variance? |
| --- | --- | --- | --- | --- |
| $1$ | 1-step return | $r_t + \gamma V(s_{t+1})$ | 🙅‍♂️ biased | 🙆‍♂️ low-variance |
| $\vdots$ | | | | |
| $\infty$ | Monte-Carlo return | $\sum_{l=0}^{T-t} \gamma^l r_{t+l}$ | 🙆‍♂️ unbiased | 🙅‍♂️ high-variance |

The 1-step return is common in Q-learning.

Notice that $n$ acts as a trade-off between bias and variance. Why don't we mix these? --> $\lambda$-return

## λ-return (TD(λ))

Exponentially weighted average of n-step returns

$$
R_t(\lambda) = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}
$$

Q. Why do we need $(1-\lambda)$ there?

It's necessary for normalization, as you can see from the following identity:

$$
(1 - r) \sum_{k=1}^{\infty} ar^{k-1} = a
$$

:::details Basics of Geometric progression

Geometric Progression

$$
\sum_{k=1}^n ar^{k-1} = ar^0 + ar^1 + ar^2 + \cdots + ar^{n-1}
$$

A trick: multiply both sides by $(1-r)$

$$
\begin{align*}
(1-r)\sum_{k=1}^{n} ar^{k-1} &= (1-r)(ar^0 + ar^1 + ar^2 + \cdots + ar^{n-1}) \\
&= (ar^0 + ar^1 + ar^2 + \cdots + ar^{n-1}) - (ar^1 + ar^2 + ar^3 + \cdots + ar^n) \\
&= a - ar^n \\
&= a(1-r^n)
\end{align*}
$$

Thus,

$$
\sum_{k=1}^{n} ar^{k-1} = \frac{a(1 - r^n)}{1 - r}
$$

If $n \rightarrow \infty$ (and $|r| < 1$),

$$
\sum_{k=1}^{\infty} ar^{k-1} = \frac{a}{1 - r}
$$
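
A quick numeric check (arbitrary $a$, $r$; assumes $|r| < 1$ for the infinite sum):

```python
a, r, n = 2.0, 0.5, 10
finite = sum(a * r ** (k - 1) for k in range(1, n + 1))
print(finite, a * (1 - r ** n) / (1 - r))   # both 3.99609375
print(a / (1 - r))                          # infinite-sum limit: 4.0
```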

:::

- $\lambda=0 \Longrightarrow R^{(1)}_t$: 1-step return
- $\lambda=1 \Longrightarrow R^{(\infty)}_t \ (= R_t)$: Monte-Carlo return
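
A sketch of the λ-return for a finite episode (my own convention: assuming the terminal value is 0, all n-step returns past the episode end coincide with the Monte-Carlo return, so the leftover geometric weight is lumped onto it):

```python
def lambda_return(rewards, values, t, gamma=0.99, lam=0.95):
    """R_t(lambda) = (1 - lam) * sum_n lam^(n-1) * R_t^(n), truncated at episode end.

    rewards[k] = r_k for k = 0..T-1, values[k] = V(s_k) for k = 0..T
    (set values[T] = 0 if the episode really terminated).
    """
    T = len(rewards)

    def n_step(n):
        ret = sum(gamma ** l * rewards[t + l] for l in range(min(n, T - t)))
        if t + n <= T:
            ret += gamma ** n * values[t + n]
        return ret

    ret = 0.0
    for n in range(1, T - t):                  # weights (1 - lam) * lam^(n-1)
        ret += (1 - lam) * lam ** (n - 1) * n_step(n)
    ret += lam ** (T - t - 1) * n_step(T - t)  # leftover weight lam^(T-t-1)
    return ret

rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3, 0.0]
print(lambda_return(rewards, values, t=0, lam=0.0))  # = 1-step return
print(lambda_return(rewards, values, t=0, lam=1.0))  # = Monte-Carlo return
```

With `lam=0.0` and `lam=1.0` this reproduces the two bullet points above.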

# Generalized Advantage Estimator: GAE(λ)

In short: we simply repeat the same discussion for advantages (TD residuals) instead of returns.

This simply estimates the advantage with the λ-return: $\hat{A}_t = R_t(\lambda) - V(s_t)$.

## Basics: Value function and Q-function

$$
\begin{align*}
V^{\pi, \gamma}(s_t) &\coloneqq \mathbb{E}_{\substack{s_{t+1:\infty}, \\ {\color{blue} a_{t:\infty}}}}\Big[\sum_{l=0}^{\infty} \gamma^l r_{t+l}\Big] \\
Q^{\pi, \gamma}(s_t, a_t) &\coloneqq \mathbb{E}_{\substack{s_{t+1:\infty}, \\ {\color{blue} a_{t+1:\infty}}}}\Big[\sum_{l=0}^{\infty} \gamma^l r_{t+l}\Big]
\end{align*}
$$

The advantage is defined as their difference:

$$
A^{\pi, \gamma}(s_t, a_t) \coloneqq Q^{\pi, \gamma}(s_t, a_t) - V^{\pi, \gamma}(s_t)
$$

## TD residual

Now, we define the TD residual $\delta^V_t$:

$$
\delta_t^{V^{\pi, \gamma}} = r_t + \gamma V^{\pi, \gamma}(s_{t+1}) - V^{\pi, \gamma}(s_t)
$$

This is, in fact, an unbiased estimator of $A^{\pi, \gamma}(s_t, a_t)$:

$$
\begin{align*}
\mathbb{E}_{s_{t+1}} \big[ \delta_t^{V^{\pi, \gamma}} \big] &= \mathbb{E}_{s_{t+1}} \big[ r_t + \gamma V^{\pi, \gamma}(s_{t+1}) - V^{\pi, \gamma}(s_t) \big] \\
&= \mathbb{E}_{s_{t+1}} \big[ Q^{\pi, \gamma}(s_t, a_t) - V^{\pi, \gamma}(s_t) \big] \\
&= A^{\pi, \gamma}(s_t, a_t)
\end{align*}
$$

## n-step TD residual

$$
\begin{align*}
\hat{A}_t^{(n)} &\coloneqq \sum_{l=0}^{n-1} \gamma^l \delta_{t+l}^{V} = \delta_t^V + \gamma \delta_{t+1}^V + \gamma^2 \delta_{t+2}^V + \cdots + \gamma^{n-1} \delta_{t+n-1}^{V} \\
&\overset{*}{=} r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n}) - V(s_t) \\
&= \sum_{l=0}^{n-1} \big[ \gamma^l r_{t+l} \big] + \gamma^n V(s_{t+n}) - V(s_t) \\
&= R^{(n)}_t - V(s_t)
\end{align*}
$$

\* Notice that this is a telescoping sum.

:::message
$n \rightarrow \infty$ becomes: [empirical returns] - [value function baseline]

$$
\hat{A}^{(\infty)}_t = \sum_{l=0}^{\infty} \gamma^l \delta_{t+l}^V = \sum_{l=0}^{\infty} \big[ \gamma^l r_{t+l} \big] - V(s_t)
$$

:::
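
A small numeric sanity check of the telescoping identity (arbitrary numbers, variable names are mine): the discounted sum of TD residuals equals $R^{(n)}_t - V(s_t)$.

```python
gamma, t, n = 0.99, 0, 3
rewards = [1.0, 0.0, 2.0, -1.0]        # r_0 .. r_3
values = [0.5, 0.4, 0.3, 0.2, 0.0]     # V(s_0) .. V(s_4)

# delta_k = r_k + gamma * V(s_{k+1}) - V(s_k)
deltas = [rewards[k] + gamma * values[k + 1] - values[k] for k in range(len(rewards))]

lhs = sum(gamma ** l * deltas[t + l] for l in range(n))   # discounted sum of residuals
n_step_ret = sum(gamma ** l * rewards[t + l] for l in range(n)) + gamma ** n * values[t + n]
rhs = n_step_ret - values[t]                              # R_t^(n) - V(s_t)
print(lhs, rhs)                                           # equal up to floating-point error
```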

## GAE

Exponentially weighted average of n-step TD residuals

$$
\begin{align*}
\hat{A}_t^{\text{GAE}(\gamma, \lambda)} &\coloneqq (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}\hat{A}_t^{(n)} \\
&= (1-\lambda)\big(\delta_t^V + \lambda(\delta_t^V + \gamma \delta_{t+1}^V) + \lambda^2(\delta_t^V + \gamma \delta_{t+1}^V + \gamma^2 \delta_{t+2}^V) + \cdots \big) \\
&= (1-\lambda)\big(\delta_t^V(1 + \lambda + \lambda^2 + \cdots) + \gamma \delta_{t+1}^V(\lambda + \lambda^2 + \lambda^3 + \cdots) + \cdots \big) \\
&= (1-\lambda)\Big(\delta_t^V \cdot \frac{1}{1-\lambda} + \gamma \delta_{t+1}^V \cdot \frac{\lambda}{1-\lambda} + \gamma^2 \delta_{t+2}^V \cdot \frac{\lambda^2}{1-\lambda} + \cdots \Big) \\
&= \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}^V
\end{align*}
$$
- $\lambda=0$ -> unbiased only if $V = V^{\pi, \gamma}$ (biased otherwise), but low variance
- $\lambda=1$ -> unbiased regardless of the accuracy of $V$, but high variance
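
In practice GAE is usually computed with a single backward pass, using the recursion $\hat{A}_t = \delta_t^V + \gamma\lambda\,\hat{A}_{t+1}$ that the last line above unrolls into. A minimal sketch (function name and terminal-value convention are my assumptions):

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE(gamma, lambda) for every timestep of one episode.

    rewards[t] = r_t for t = 0..T-1, values[t] = V(s_t) for t = 0..T
    (values[T] is the bootstrap value; use 0.0 if the episode terminated).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        gae = delta + gamma * lam * gae          # A_t = delta_t + gamma * lam * A_{t+1}
        advantages[t] = gae
    return advantages

rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3, 0.0]
print(compute_gae(rewards, values, lam=0.0))  # per-step TD residuals
print(compute_gae(rewards, values, lam=1.0))  # discounted return minus baseline
```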

# Reference