Surya Dantuluri's Blog

## What is a Policy Gradient and why is it useful?

Arguably, the goal of reinforcement learning is to find an optimal policy: one that yields the highest reward relative to all other policies. With policy gradients, we incentivize distributions of actions that generate higher rewards and deter distributions of actions that generate sub-optimal rewards. Over time we generate better trajectories, converging toward the optimal policy.

## Derivation

The goal of RL is to maximize the following objective function:

$$\definecolor{red}{RGB}{255,59,48}\definecolor{orange}{RGB}{255,149,0}\definecolor{yellow}{RGB}{255,204,0}\definecolor{green}{RGB}{76,217,100}\definecolor{tealblue}{RGB}{90,200,250}\definecolor{blue}{RGB}{0,122,255}\definecolor{purple}{RGB}{88,86,214}\definecolor{pink}{RGB}{255,45,85}$$

$$\pi_{\theta}^\star = \text{arg}\underset{\pi_{\theta}}{\max}\color{orange}E_{\tau\sim p_{\pi_{\theta}}(\tau)}[\sum_{t} r(s_t,a_t)]$$

What we're doing here is taking all the state-action pairs along the trajectory, $(s_t, a_t)$, summing their rewards to get the total reward, and maximizing its expectation with respect to $\theta$, the parameters of the policy.
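As a concrete sketch, this objective can be estimated by Monte Carlo rollouts. The toy MDP below (two states, two actions, made-up transition probabilities and rewards, all hypothetical) just shows what $E_{\tau}[\sum_{t} r(s_t,a_t)]$ means in code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action MDP; numbers are made up for illustration.
P = np.array([[[0.9, 0.1],   # P[s, a, s'] transition probabilities
               [0.2, 0.8]],
              [[0.7, 0.3],
               [0.1, 0.9]]])
R = np.array([[0.0, 1.0],    # R[s, a] reward for taking a in s
              [2.0, 0.0]])

def rollout(policy, T=10):
    """Sample one trajectory, return its total reward sum_t r(s_t, a_t)."""
    s, total = 0, 0.0
    for _ in range(T):
        a = rng.choice(2, p=policy[s])
        total += R[s, a]
        s = rng.choice(2, p=P[s, a])
    return total

def estimate_J(policy, n=5000):
    """Monte Carlo estimate of E_tau[ sum_t r(s_t, a_t) ]."""
    return np.mean([rollout(policy) for _ in range(n)])

uniform = np.full((2, 2), 0.5)   # pi_theta(a|s) = 0.5 for all s, a
print(estimate_J(uniform))
```

A policy-gradient method would adjust the policy's parameters to push this estimate upward.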

Let's denote the expectation of the sum $\color{orange}E_{\tau\sim p_{\pi_{\theta}}(\tau)}[\sum_{t} r(s_t,a_t)]$ as $\color{red}J(\pi_{\theta})$.

Now that we have $\color{red}J(\pi_{\theta})$ what do we do? Find the gradient to optimize this expectation via gradient ascent.

Expanding $\color{red}J(\pi_{\theta})$ from expectation form (using the definition of expectation):

$$\color{red}J(\pi_{\theta}) \color{black}=\color{orange} E_{\tau\sim p_{\pi_{\theta}}(\tau)}[\sum_{t} r(s_t,a_t)] \color{black}= \color{green}\int \color{blue}P(\tau|\pi_{\theta})\color{green}R(\tau)d\tau$$

$$\nabla_{\theta}\color{red}J(\pi_{\theta}) \color{black}= \nabla_{\theta}\color{green}\int \color{blue}P(\tau|\pi_{\theta})\color{green}R(\tau)d\tau$$

$$= \color{green}\int \color{black}\nabla_{\theta}\color{blue}P(\tau|\pi_{\theta})\color{green}R(\tau)d\tau$$

Using the log derivative trick:

$$\nabla_{\theta}\color{blue}P(\tau|\pi_{\theta}) \color{black}= \color{blue}P(\tau|\pi_{\theta})\frac{\color{black}\nabla_{\theta}\color{blue}P(\tau|\pi_{\theta})}{P(\tau|\pi_{\theta})} = \color{yellow}P(\tau|\pi_{\theta})\color{black}\nabla_{\theta}\color{yellow}\log P(\tau|\pi_{\theta})$$
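This identity is easy to sanity-check numerically. Below, a softmax probability stands in for $P(\tau|\pi_{\theta})$ (an assumption for illustration only), and both sides are compared using central finite differences:

```python
import numpy as np

# Numerical check of the log derivative trick:  grad P = P * grad log P.
# A softmax probability p(a; theta) is a stand-in for P(tau | pi_theta).
theta = np.array([0.3, -1.2, 0.8])

def p(theta, a=0):
    z = np.exp(theta - theta.max())
    return z[a] / z.sum()

eps = 1e-6
# Finite-difference gradient of p itself.
grad_p = np.array([
    (p(theta + eps * np.eye(3)[i]) - p(theta - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
# Finite-difference gradient of log p.
grad_logp = np.array([
    (np.log(p(theta + eps * np.eye(3)[i]))
     - np.log(p(theta - eps * np.eye(3)[i]))) / (2 * eps)
    for i in range(3)
])
print(np.allclose(grad_p, p(theta) * grad_logp, atol=1e-6))  # → True
```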

Continuing from our previous step:

$$= \color{green}\int \color{yellow}P(\tau|\pi_{\theta})\color{black}\nabla_{\theta}\color{black}\log P(\tau|\pi_{\theta})\color{green}R(\tau)d\tau$$

Going back to expectation form we get:

$$= \underset{\tau\sim\pi_{\theta}}{E}[\nabla_{\theta}\color{yellow}\log P(\tau|\pi_{\theta})\color{green}R(\tau)\color{black}]$$

But we still don't have $P(\tau|\pi_{\theta})$, which is the probability of the trajectory:

$$P(\tau|\pi_{\theta}) = p(s_{0})\prod_{t=1}^{T} P(s_{t+1}|s_{t},a_{t})\pi_{\theta}(a_{t}|s_{t})$$
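As a sketch, this product is straightforward to compute for a tabular MDP. All the numbers below (initial-state distribution, dynamics, policy) are hypothetical placeholders:

```python
import numpy as np

# P(tau | pi_theta) for a short trajectory in a toy tabular MDP.
rho = np.array([0.6, 0.4])               # initial-state distribution p(s_0)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],  # P[s, a, s'] dynamics
              [[0.7, 0.3], [0.1, 0.9]]])
pi = np.array([[0.5, 0.5],               # pi[s, a] = pi_theta(a | s)
               [0.3, 0.7]])

def traj_prob(states, actions):
    """p(s_0) * prod_t pi_theta(a_t|s_t) * P(s_{t+1}|s_t, a_t)."""
    prob = rho[states[0]]
    for t in range(len(actions)):
        prob *= pi[states[t], actions[t]] * P[states[t], actions[t], states[t + 1]]
    return prob

print(traj_prob(states=[0, 1, 1], actions=[1, 0]))
```

Note that the dynamics $P(s_{t+1}|s_t,a_t)$ appear in the product but are never known to the agent; the derivation below shows why that's fine.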

Taking the log of both sides to help us simplify our previous expectation:

$$\log P(\tau|\pi_{\theta}) = \log p(s_{0}) + \sum_{t=1}^{T} (\log P(s_{t+1}|s_{t},a_{t})+ \log \pi_{\theta}(a_t|s_t))$$

$$\nabla_{\theta}\log P(\tau|\pi_{\theta}) = \nabla_{\theta} \log p(s_{0}) + \sum_{t=1}^{T} (\nabla_{\theta}\log P(s_{t+1}|s_{t},a_{t})+ \nabla_{\theta}\log \pi_{\theta}(a_t|s_t))$$

And since $\log p(s_{0})$ and $\log P(s_{t+1}|s_{t},a_{t})$ don't depend on $\theta$, their gradients vanish and we can simplify the gradient of the expectation to:

$$\nabla_{\theta}J(\pi_{\theta}) = \underset{\tau\sim\pi_{\theta}}{E}[(\cancel{\nabla_{\theta}\log p(s_{0})} + \sum_{t=1}^{T} (\cancel{\nabla_{\theta}\log P(s_{t+1}|s_{t},a_{t})}+ \nabla_{\theta}\color{yellow}\log \pi_{\theta}(a_t|s_t)\color{black}))\color{green}R(\tau)\color{black}]$$

$$\nabla_{\theta}J(\pi_{\theta}) = \underset{\tau\sim\pi_{\theta}}{E}[\sum_{t=1}^{T} \nabla_{\theta}\color{yellow}\log \pi_{\theta}(a_t|s_t)\color{black}\color{green}R(\tau)\color{black}]$$
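This expectation can be estimated by sampling: collect trajectories, weight $\nabla_{\theta}\log\pi_{\theta}(a_t|s_t)$ by the return $R(\tau)$, and step the parameters uphill. Below is a minimal REINFORCE-style sketch on a one-step bandit with a softmax policy; the reward values, noise scale, and step size are all hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# One-step bandit: 3 actions, softmax policy over parameters theta.
# Mean rewards are made up; action 0 is best by construction.
true_rewards = np.array([1.0, 0.0, -1.0])
theta = np.zeros(3)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)
    R = true_rewards[a] + rng.normal(scale=0.1)  # noisy return R(tau)
    # For a softmax policy, grad log pi_theta(a) = one_hot(a) - probs.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += 0.1 * grad_log_pi * R               # gradient ascent step

print(softmax(theta))  # probability mass should concentrate on action 0
```

The same estimator extends to multi-step trajectories by summing the log-probability gradients over time steps before weighting by the return.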

Unfinished derivation (I finished the implementation, which I will soon add here). Currently working on a paper and will come back to finish this post.
