# Multi-step returns in RL
## Monte-Carlo return
$$
R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{T-t} r_T = \sum_{l=0}^{T-t} \gamma^l r_{t+l}
$$
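As a concrete sketch (the names `rewards`, `gamma`, and the single-episode layout are my assumptions, not from the original), the Monte-Carlo return of every step of one finished episode can be computed with a single backward pass:

```python
import numpy as np

def monte_carlo_return(rewards, gamma):
    """R_t = sum_{l=0}^{T-t} gamma^l * r_{t+l} for every t of one finished episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Accumulate backwards: R_t = r_t + gamma * R_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```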
## n-step return
$$
R_t^{(n)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n}) = \sum_{l=0}^{n-1}\left[\gamma^l r_{t+l}\right] + \gamma^n V(s_{t+n})
$$
Basically, everything after the n-th step is bootstrapped with the value function.
:::message
$n \to \infty$ recovers the Monte-Carlo return.
This assumes the rewards after the horizon $T$ are all zero (and $\gamma < 1$, so the trailing value term vanishes):
$$
R_t^{(\infty)} = \sum_{l=0}^{T-t}\left[\gamma^l r_{t+l}\right] + \sum_{l=T-t+1}^{\infty}\left[\gamma^l \cdot 0\right] + \gamma^{\infty} V(s_{t+\infty}) = \sum_{l=0}^{T-t}\left[\gamma^l r_{t+l}\right] = R_t.
$$
:::
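A matching sketch of the n-step return, under the same assumed layout; here `values` holds $V(s_0), \dots, V(s_T)$ with `values[-1] = 0` at the episode end, so a large $n$ falls back to the Monte-Carlo return exactly as noted above (all names and conventions are mine):

```python
def n_step_return(rewards, values, t, n, gamma):
    """R_t^{(n)} = sum_{l=0}^{n-1} gamma^l r_{t+l} + gamma^n V(s_{t+n}).

    Assumes len(values) == len(rewards) + 1 and values[-1] == 0 at episode end,
    so rewards and values past the horizon contribute nothing."""
    n = min(n, len(rewards) - t)  # everything past the horizon is zero anyway
    ret = sum(gamma**l * rewards[t + l] for l in range(n))
    return ret + gamma**n * values[t + n]  # bootstrap with the value function
```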
| $n$ | what? | expression | biased? | variance? |
|---|---|---|---|---|
| $1$ | 1-step return | $r_t + \gamma V(s_{t+1})$ | 🙅‍♂️ biased | 🙆‍♂️ low-variance |
| $\vdots$ | | | | |
| $\infty$ | Monte-Carlo return | $\sum_{l=0}^{T-t} \gamma^l r_{t+l}$ | 🙆‍♂️ unbiased | 🙅‍♂️ high-variance |
The 1-step return is the one commonly used in Q-learning.
Notice that $n$ controls a trade-off between bias and variance.
Why don't we mix these? --> λ-return
## λ-return (TD(λ))
Exponentially weighted average of n-step returns
$$
R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}
$$
Q. Why do we need the $(1-\lambda)$ there?
It's necessary for normalization, as you can see from
$$
(1-r)\sum_{k=1}^{\infty} a r^{k-1} = a \qquad (|r| < 1);
$$
with $a = 1$ and $r = \lambda$, the weights $(1-\lambda)\lambda^{n-1}$ sum to $1$.
:::details Basics of Geometric progression
Geometric Progression
$$
\sum_{k=1}^{n} a r^{k-1} = a r^0 + a r^1 + a r^2 + \dots + a r^{n-1}
$$
A trick: multiply both sides by $(1-r)$:
$$
\begin{align*}
(1-r)\sum_{k=1}^{n} a r^{k-1} &= (1-r)\left(a r^0 + a r^1 + a r^2 + \dots + a r^{n-1}\right) \\
&= \left(a r^0 + a r^1 + a r^2 + \dots + a r^{n-1}\right) - \left(a r^1 + a r^2 + a r^3 + \dots + a r^{n}\right) \\
&= a - a r^n = a(1 - r^n)
\end{align*}
$$
Thus,
$$
\sum_{k=1}^{n} a r^{k-1} = \frac{a(1 - r^n)}{1 - r}
$$
If $n \to \infty$ (and $|r| < 1$),
$$
\sum_{k=1}^{\infty} a r^{k-1} = \frac{a}{1 - r}
$$
:::
- $\lambda = 0 \implies R_t^{(1)}$: 1-step return
- $\lambda = 1 \implies R_t^{(\infty)}\ (= R_t)$: Monte-Carlo return
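A direct (and deliberately naive) sketch of the λ-return as the weighted average above, reusing the hypothetical `n_step_return` from earlier; the infinite sum is truncated at the episode end, where every longer n-step return equals the Monte-Carlo return:

```python
def lambda_return(rewards, values, t, gamma, lam):
    """R_t^lambda = (1 - lam) * sum_{n>=1} lam^{n-1} * R_t^{(n)}."""
    T = len(rewards)
    total = 0.0
    for n in range(1, T - t):
        total += (1.0 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
    # All n >= T - t give the Monte-Carlo return; their weights sum to lam^(T - t - 1)
    total += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return total
```

Setting `lam=0` reproduces the 1-step return and `lam=1` the Monte-Carlo return, matching the two bullets above.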
## Generalized Advantage Estimator: GAE(λ)
In short: we repeat the same discussion for advantages (TD residuals) instead of returns.
GAE simply estimates the advantage with the λ-return: $\hat{A}_t = R_t^{\lambda} - V(s_t)$.
### Basics: Value function and Q-function
$$
\begin{align*}
V^{\pi, \gamma}(s_t) &\coloneqq \mathbb{E}_{
\substack{s_{t+1:\infty} \\ {\color{blue} a_{t:\infty}}}
}[\sum_{l=0}^\infty \gamma^l r_{t+l}] \\
Q^{\pi, \gamma}(s_t, a_t) &\coloneqq \mathbb{E}_{
\substack{s_{t+1:\infty} \\ {\color{blue} a_{t+1:\infty}}}
}[\sum_{l=0}^{\infty} \gamma^l r_{t+l}]
\end{align*}
$$
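Unrolling the Q-function by one step gives the identity used in the TD-residual derivation below:

$$
Q^{\pi, \gamma}(s_t, a_t) = \mathbb{E}_{s_{t+1}}\left[ r_t + \gamma V^{\pi, \gamma}(s_{t+1}) \right]
$$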
Advantage is defined as their difference:
$$
A^{\pi,\gamma}(s_t, a_t) \coloneqq Q^{\pi,\gamma}(s_t, a_t) - V^{\pi,\gamma}(s_t)
$$
### TD residual
Now, we define the TD residual $\delta_t^{V}$:
$$
\delta_t^{V^{\pi,\gamma}} = r_t + \gamma V^{\pi,\gamma}(s_{t+1}) - V^{\pi,\gamma}(s_t)
$$
This is, in fact, an unbiased estimator of $A^{\pi,\gamma}(s_t, a_t)$:
$$
\begin{align*}
\mathbb{E}_{s_{t+1}}\left[\delta_t^{V^{\pi,\gamma}}\right] &= \mathbb{E}_{s_{t+1}}\left[r_t + \gamma V^{\pi,\gamma}(s_{t+1}) - V^{\pi,\gamma}(s_t)\right] \\
&= \mathbb{E}_{s_{t+1}}\left[Q^{\pi,\gamma}(s_t, a_t) - V^{\pi,\gamma}(s_t)\right] \\
&= A^{\pi,\gamma}(s_t, a_t)
\end{align*}
$$
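In code, the TD residuals of a whole episode are a one-liner; a sketch under the same assumed layout as before (`values` has one extra entry, with `values[-1] = 0` at the episode end or a bootstrap value otherwise):

```python
import numpy as np

def td_residual(rewards, values, gamma):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), vectorized over one episode."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)  # length len(rewards) + 1
    return rewards + gamma * values[1:] - values[:-1]
```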
### n-step TD residual
$$
\begin{align*}
\hat{A}_t^{(n)} &\coloneqq \sum_{l=0}^{n-1} \gamma^l \delta_{t+l}^V = \delta_t^V + \gamma \delta_{t+1}^V + \gamma^2 \delta_{t+2}^V + \dots + \gamma^{n-1} \delta_{t+n-1}^V \\
&\overset{*}{=} r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n}) - V(s_t) \\
&= \sum_{l=0}^{n-1}\left[\gamma^l r_{t+l}\right] + \gamma^n V(s_{t+n}) - V(s_t) \\
&= R_t^{(n)} - V(s_t)
\end{align*}
$$
($*$) The marked equality holds because the sum telescopes: every intermediate $V(s_{t+l})$ cancels.
:::message
$n \to \infty$ gives [empirical return] − [value function baseline]:
$$
\hat{A}_t^{(\infty)} = \sum_{l=0}^{\infty} \gamma^l \delta_{t+l}^V = \sum_{l=0}^{\infty}\left[\gamma^l r_{t+l}\right] - V(s_t)
$$
:::
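A tiny numerical check of the telescoping identity $\hat{A}_t^{(n)} = R_t^{(n)} - V(s_t)$, reusing the hypothetical `n_step_return` and `td_residual` sketches above (random data, arbitrary $t$ and $n$):

```python
import numpy as np

rng = np.random.default_rng(0)
T, gamma, t, n = 10, 0.99, 2, 4
rewards = rng.normal(size=T)
values = np.append(rng.normal(size=T), 0.0)  # V(s_0..s_T), terminal value 0

deltas = td_residual(rewards, values, gamma)
lhs = sum(gamma**l * deltas[t + l] for l in range(n))          # sum of discounted TD residuals
rhs = n_step_return(rewards, values, t, n, gamma) - values[t]  # R_t^{(n)} - V(s_t)
assert np.isclose(lhs, rhs)
```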
### GAE
Exponentially weighted average of n-step TD residuals
$$
\begin{align*}
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} &\coloneqq (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} \hat{A}_t^{(n)} \\
&= (1-\lambda)\left(\delta_t^V + \lambda\left(\delta_t^V + \gamma\delta_{t+1}^V\right) + \lambda^2\left(\delta_t^V + \gamma\delta_{t+1}^V + \gamma^2\delta_{t+2}^V\right) + \dots\right) \\
&= (1-\lambda)\left(\delta_t^V\left(1 + \lambda + \lambda^2 + \dots\right) + \gamma\delta_{t+1}^V\left(\lambda + \lambda^2 + \lambda^3 + \dots\right) + \dots\right) \\
&= (1-\lambda)\left(\delta_t^V\left(\frac{1}{1-\lambda}\right) + \gamma\delta_{t+1}^V\left(\frac{\lambda}{1-\lambda}\right) + \gamma^2\delta_{t+2}^V\left(\frac{\lambda^2}{1-\lambda}\right) + \dots\right) \\
&= \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}^V
\end{align*}
$$
- $\lambda=0$ -> $\hat{A}_t = \delta_t^V$: unbiased only if $V = V^{\pi,\gamma}$, but low variance
- $\lambda=1$ -> $\hat{A}_t = \sum_{l=0}^{\infty} \gamma^l \delta_{t+l}^V$: unbiased regardless of the accuracy of $V$, but high variance
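The last line of the derivation is what implementations typically use: a single backward pass that accumulates $(\gamma\lambda)$-discounted TD residuals. A minimal sketch, reusing the hypothetical `td_residual` above (the truncation at the episode end is the usual practical choice, not part of the infinite-sum definition):

```python
import numpy as np

def gae(rewards, values, gamma, lam):
    """hat{A}_t = sum_l (gamma * lam)^l * delta_{t+l}, via the recursion
    hat{A}_t = delta_t + gamma * lam * hat{A}_{t+1}."""
    deltas = td_residual(rewards, values, gamma)
    advantages = np.zeros_like(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```

The backward recursion avoids materializing all the n-step estimators explicitly, which is why it is the form usually seen in practice.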
## Reference