# Policy Gradient Method

# What is Policy Gradient?

Roughly speaking, Reinforcement Learning methods can be classified into *value-based methods* and *policy-based methods*. *Value-based methods* learn a value function from experience, and the agent chooses the action with the highest value (e.g.,

*Policy-based methods* learn the policy directly. This suits continuous action spaces, which are hard for *value-based methods* because they require a maximization over a continuous action space.

The policy gradient method is one of the *policy-based methods*. It assumes the policy is differentiable, and iteratively updates the policy parameters using the gradient.

# Why/When do we want Policy Gradient?

There are several cases where policy gradient has an advantage over other methods.

**When the policy is easier to model than value function.**

For example, in the game of Breakout (Atari), it might not be easy to predict the score from the display, but a simple policy that just follows the ball might work well.

**When the optimal policy is not deterministic.**

For example, in the game of rock-paper-scissors, the optimal policy is to choose a hand uniformly at random.
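To make this concrete, the sketch below computes the expected payoff of a policy against an opponent who knows the policy and picks the best counter. The payoff values (+1 win, 0 draw, -1 loss) and the name `worst_case_payoff` are illustrative assumptions, not something defined in this post.

```python
# Payoff to our player: +1 win, 0 draw, -1 loss.
# Hands are ordered rock, paper, scissors; rows are ours, columns the opponent's.
PAYOFF = [
    [0, -1, 1],   # rock     vs rock / paper / scissors
    [1, 0, -1],   # paper    vs rock / paper / scissors
    [-1, 1, 0],   # scissors vs rock / paper / scissors
]


def worst_case_payoff(policy):
    """Expected payoff of `policy` against an opponent who best-responds to it."""
    expected_vs_each_hand = [
        sum(policy[a] * PAYOFF[a][b] for a in range(3))
        for b in range(3)
    ]
    # The opponent picks the hand that is worst for us.
    return min(expected_vs_each_hand)


uniform = [1 / 3, 1 / 3, 1 / 3]
always_rock = [1.0, 0.0, 0.0]

print(worst_case_payoff(uniform))
print(worst_case_payoff(always_rock))
```

Under these assumptions, a deterministic policy such as `always_rock` can be exploited for a payoff of -1, while the uniform policy cannot be pushed below 0.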

Value-based RL methods cannot learn this stochastic policy, since the action with the highest state-action value is always selected.

**When the action space is continuous.**

There is no natural way to handle continuous actions in value-based methods, because taking the maximum of a value function over all possible actions is intractable. Policy gradient methods, on the other hand, model the policy directly and can therefore handle continuous action spaces naturally.
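As a minimal sketch of what "directly modeling the policy" can look like for a real-valued action, the code below assumes a Gaussian policy whose mean is `theta * state` with a fixed standard deviation. The Gaussian form, the scalar parameter and all function names are assumptions made for illustration; the post does not prescribe a particular policy class.

```python
import math
import random


def sample_action(theta, state, sigma=0.5, rng=random):
    """Sample a real-valued action from N(theta * state, sigma^2)."""
    return rng.gauss(theta * state, sigma)


def log_prob(theta, state, action, sigma=0.5):
    """Log density of `action` under the Gaussian policy."""
    mean = theta * state
    return (-0.5 * ((action - mean) / sigma) ** 2
            - math.log(sigma * math.sqrt(2.0 * math.pi)))


def grad_log_prob(theta, state, action, sigma=0.5):
    """Analytic derivative of log_prob with respect to theta."""
    return (action - theta * state) * state / sigma ** 2


action = sample_action(theta=0.3, state=2.0, rng=random.Random(0))
print(action, grad_log_prob(0.3, 2.0, action))
```

No maximization over actions is needed: sampling an action and differentiating its log-probability are both closed-form operations.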

# Basics of Policy Gradient

In this post, we focus on the non-discounted (i.e., $\gamma = 1$) setting.

As the name suggests, the policy gradient method updates the policy parameters by gradient ascent.

We assume the policy is differentiable^{[1]}.

Now, we define an objective function $J(\theta)$ that we want to maximize.

In episodic cases, it can be defined as the expected return of following the policy $\pi_\theta$.
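In symbols, a minimal sketch under assumed notation: writing the objective as $J(\theta)$, the start state $s_0$, the rewards $R_{t+1}$, the terminal time $T$ and the step size $\alpha$ are names introduced here for illustration and are not fixed by the text above.

$$
J(\theta) = v_{\pi_\theta}(s_0) = \mathbb{E}_{\pi_\theta}\Big[ \sum_{t=0}^{T-1} R_{t+1} \,\Big|\, S_0 = s_0 \Big],
\qquad
\theta \leftarrow \theta + \alpha \, \nabla J(\theta)
$$

The first expression is the non-discounted expected return from the start state; the second is one step of gradient ascent on it.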

In the following part, we will calculate the gradient of $J(\theta)$.

## How to calculate $\nabla J(\theta)$

All we need is just to take the derivative of $J(\theta)$.

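Well, it's not that easy. It is easy to calculate how the distribution over actions $\pi_\theta(a \mid s)$ changes with $\theta$, but the expected return also depends on how the distribution over states changes, and that is determined by the environment.

Due to the lack of knowledge of the environment, there is no obvious way to calculate the gradient.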

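But surprisingly, the Policy Gradient Theorem^{[2]} shows that the gradient can be written in the following beautiful way.

$$
\nabla J(\theta) \propto \sum_{s} d^{\pi_\theta}(s) \sum_{a} q_{\pi_\theta}(s, a) \, \nabla \pi_\theta(a \mid s)
$$

where $d^{\pi_\theta}(s)$ is the distribution of states visited under $\pi_\theta$, and $q_{\pi_\theta}(s, a)$ is the state-action value of taking action $a$ in state $s$.

Note that the only gradient on the right-hand side is that of the policy $\pi_\theta(a \mid s)$.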

In (Sutton et al., 2000), which presented the proof, the authors comment on this equation as follows:

"their [sic] are no terms of the form $\frac{\partial d^{\pi_{\theta}}(s)}{\partial\theta}$: the effect of policy changes on the distribution of states does not appear. This is convenient for approximating the gradient by sampling."

The full proof is at the end of this post. Here, we show a simpler proof for a one-step MDP, where the episode terminates immediately after choosing an action.

I hope this provides some intuition for the equation above.

Let

One can see the correspondence between this and equation (1).
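For concreteness, the one-step case can be sketched as follows. The notation is assumed for illustration: $d(s)$ is the distribution of the start state and $r(s, a)$ is the expected immediate reward for taking action $a$ in state $s$.

$$
\begin{aligned}
J(\theta) &= \mathbb{E}_{\pi_\theta}[r] = \sum_{s} d(s) \sum_{a} \pi_\theta(a \mid s) \, r(s, a) \\
\nabla J(\theta) &= \sum_{s} d(s) \sum_{a} r(s, a) \, \nabla \pi_\theta(a \mid s)
\end{aligned}
$$

Because the episode ends after a single action, $d(s)$ does not depend on $\theta$, so the gradient passes straight through to $\pi_\theta(a \mid s)$, and $r(s, a)$ plays the role of the state-action value.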

**Interpretation of the equation**

Thus, the policy is updated so that it more often chooses actions for which the agent has experienced a large reward.

On top of that, the outer sum weights each term by the probability of visiting the state.

This means that updates for frequently visited states carry more weight.
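The weighting can be checked numerically on a tiny one-step MDP. The sketch below assumes a tabular softmax policy with one preference per state-action pair, a fixed state distribution `d`, and a reward table `r`; all of these names and numbers are made up for illustration.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over a list of preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def objective(theta, d, r):
    """J(theta) = sum_s d[s] * sum_a pi(a|s) * r[s][a] for a one-step MDP."""
    return sum(
        d[s] * sum(p * r[s][a] for a, p in enumerate(softmax(theta[s])))
        for s in range(len(d))
    )


def policy_gradient(theta, d, r):
    """Gradient of J via d[s] * sum_a r[s][a] * dpi(a|s)/dtheta[s][b]."""
    grad = []
    for s in range(len(d)):
        pi = softmax(theta[s])
        row = []
        for b in range(len(pi)):
            # Softmax derivative: dpi(a|s)/dtheta[s][b] = pi[a] * (1[a == b] - pi[b])
            inner = sum(
                r[s][a] * pi[a] * ((a == b) - pi[b]) for a in range(len(pi))
            )
            row.append(d[s] * inner)
        grad.append(row)
    return grad


d = [0.9, 0.1]                    # state 0 is visited far more often
r = [[1.0, 0.0], [1.0, 0.0]]      # action 0 is better in both states
theta = [[0.0, 0.0], [0.0, 0.0]]  # uniform initial policy
grad = policy_gradient(theta, d, r)
print(grad)
```

Both states push the preference for action 0 upward, but the push in the frequently visited state 0 is nine times larger than in state 1, which is exactly the weighting by the state distribution.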

### Another form of Policy Gradient Theorem

Equation (2) is also written in the following form.
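A common way to write this other form, assuming the standard identity $\nabla \pi_\theta = \pi_\theta \nabla \log \pi_\theta$, is as an expectation over states and actions drawn by following the policy:

$$
\nabla J(\theta) \propto \mathbb{E}_{\pi_\theta}\big[ q_{\pi_\theta}(s, a) \, \nabla \log \pi_\theta(a \mid s) \big]
$$

Since this is an expectation, it can be estimated by sampling. The sketch below does so for a single-state problem with a softmax policy and deterministic rewards (so the reward itself stands in for $q$); the function names and numbers are illustrative assumptions.

```python
import math
import random


def softmax(prefs):
    """Numerically stable softmax over a list of preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def exact_gradient(theta, rewards):
    """sum_a r(a) * dpi(a)/dtheta[b], computed in closed form."""
    pi = softmax(theta)
    return [
        sum(rewards[a] * pi[a] * ((a == b) - pi[b]) for a in range(len(pi)))
        for b in range(len(theta))
    ]


def sampled_gradient(theta, rewards, n_samples, rng):
    """Average of r(a) * grad log pi(a) over actions sampled from pi."""
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for _ in range(n_samples):
        a = rng.choices(range(len(pi)), weights=pi)[0]
        for b in range(len(theta)):
            # For softmax: d log pi(a) / d theta[b] = 1[a == b] - pi[b]
            grad[b] += rewards[a] * ((a == b) - pi[b]) / n_samples
    return grad


theta = [0.2, -0.1, 0.0]
rewards = [1.0, 0.0, 0.5]
print(exact_gradient(theta, rewards))
print(sampled_gradient(theta, rewards, 20000, random.Random(0)))
```

The two printed vectors should agree up to sampling noise, and the sampled version never needs to know how actions affect anything beyond the reward it observed.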

## Full derivation of Policy Gradient Theorem

## Policy Gradient Theorem

Note: this proof is identical to the proof shown on p. 325 of [Sutton and Barto, 2018].

where

Then,
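A sketch of that argument, following Sutton and Barto (2018) and assuming their notation, which is not spelled out here: $\Pr(s \to x, k, \pi)$ is the probability of reaching state $x$ from state $s$ in $k$ steps under $\pi$, $\eta(s)$ is the expected number of visits to $s$ in an episode starting from $s_0$, and $\mu(s) = \eta(s) / \sum_{s'} \eta(s')$ is the on-policy state distribution (the role played by $d^{\pi_\theta}(s)$ earlier). All gradients are with respect to $\theta$, and $\pi$ is short for $\pi_\theta$.

$$
\begin{aligned}
\nabla v_\pi(s) &= \nabla \Big[ \sum_{a} \pi(a \mid s) \, q_\pi(s, a) \Big] \\
&= \sum_{a} \Big[ \nabla \pi(a \mid s) \, q_\pi(s, a) + \pi(a \mid s) \, \nabla q_\pi(s, a) \Big] \\
&= \sum_{a} \Big[ \nabla \pi(a \mid s) \, q_\pi(s, a) + \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \, \nabla v_\pi(s') \Big] \\
&= \sum_{x} \sum_{k=0}^{\infty} \Pr(s \to x, k, \pi) \sum_{a} \nabla \pi(a \mid x) \, q_\pi(x, a)
\end{aligned}
$$

The third line uses $q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a) \, (r + v_\pi(s'))$, where the reward term does not depend on $\theta$; the last line follows from unrolling the recursion in $\nabla v_\pi(s')$ repeatedly. Applying this at the start state:

$$
\begin{aligned}
\nabla J(\theta) = \nabla v_\pi(s_0)
&= \sum_{s} \Big( \sum_{k=0}^{\infty} \Pr(s_0 \to s, k, \pi) \Big) \sum_{a} \nabla \pi(a \mid s) \, q_\pi(s, a) \\
&= \sum_{s} \eta(s) \sum_{a} \nabla \pi(a \mid s) \, q_\pi(s, a) \\
&= \Big( \sum_{s'} \eta(s') \Big) \sum_{s} \mu(s) \sum_{a} \nabla \pi(a \mid s) \, q_\pi(s, a) \\
&\propto \sum_{s} \mu(s) \sum_{a} \nabla \pi(a \mid s) \, q_\pi(s, a)
\end{aligned}
$$

The constant of proportionality $\sum_{s'} \eta(s')$ is the average episode length, which in practice can be absorbed into the step size.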