Reinforcement Learning tutorial
Posted March 25 by Rokas Balsys
Introduction to Proximal Policy Optimization:
In 2018 OpenAI made a breakthrough in Deep Reinforcement Learning. This breakthrough was made possible thanks to a strong hardware architecture and by using the state of the art's algorithm: Proximal Policy Optimization.
The main idea of Proximal Policy Optimization is to avoid having too large a policy update. To do that, we use a ratio that tells us the difference between our new and old policy and clip this ratio from 0.8 to 1.2. Doing that will ensure that the policy update will not be too large.
In this tutorial, we'll dive into the understanding of the PPO architecture and we'll implement a Proximal Policy Optimization (PPO) agent that learns to play Pongv0.
However, if you want to understand PPO, you need first to check all my previous tutorial. In this tutorial as a backbone, I will use A3C tutorial code.
Problem with Policy Gradient
If you were reading my tutorial about Policy Gradient, we talked about Policy Loss function:
$$ L^{PG}(Q)=E_t[log\pi_Q(a_ts_t)]*A_t $$
Here:
 $ L^{PG}(Q) $  Policy Loss
 $ E_t $  Expected
 $ log\pi_Q(a_ts_t) $  log probability of taking that action at that state
 $ A_t $  Advantage
The idea of this function is doing gradient ascent step (which is equivalent of taking reversed gradient descent), this way our agent is pushed to take actions that lead to higher rewards and avoid bad actions.
However, the problem comes from the step size:
 Too small, the training process is too slow
 Too high, there is too much variability in the training.
That's where PPO is useful, the idea is that PPO improves the stability of the Actor training by limiting the policy update at each training step.
To be able to do that PPO introduced a new objective function called “Clipped surrogate objective function” that will constraint the policy change in a small range using a clip.
Clipped Surrogate Objective Function
First, as explained in PPO paper, instead of using log pi to trace the impact of the actions, PPO use the ratio between the probability of action under current policy divided by the probability of the action under previous policy: $$ r_t(Q) = \frac{\pi_Q(a_ts_t)}{\pi_{Q_{old}}(a_ts_t} $$ As we can see $ r_t(θ) $ is the probability ratio between the new and old policy:
 If $ r_t(θ) $ >1, it means that the action is more probable in the current policy than the old policy.
 If $ r_t(θ) $ is between 0 and 1: it means that the action is less probable for current policy than for the old one.
As consequence, we can write our new objective function:
$$ L^{CPI}(Q)= \hat E_t [\frac{\pi_Q(a_ts_t)}{\pi_{Q_{old}}(a_ts_t}\hat A_t] = \hat E_t[r_t(Q)\hat A_t] $$
However, if the action taken more probable in our current policy than in our former, this would lead to a large policy gradient step and consequence an excessive policy update.
Consequently, we need to constraint this objective function by penalize changes that lead to a ratio (in the paper ratio can only vary from 0.8 to 1.2). By doing that we'll ensure that not having too large policy update because the new policy can't be too different from the older one. To do that we have to use PPO clip probability ratio directly in the objective function with its Clipped surrogate objective function:
$$ L^{CLIP}(Q)= \hat E_t [min(r_t(Q)\hat A_t, clip(r_t(Q),1\in,1+\in)\hat A_t)] $$
With Clipped Surrogate Objective function, we have two probability ratios, one non clipped and one clipped in a range (between $[1\in, 1+\in]$, epsilon is an hyper parameter that helps us to define this clip range (in the paper $\in$ = 0.2).
If we take the minimum of the clipped and non clipped objective, the final objective would be a lower bound (pessimistic bound) of the unclipped objective. Consequently, we have two cases to consider:

When the advantage is > 0:
If advantage > 0, it means that the action is better than the average of all the actions in that state. Therefore, we should encourage our new policy to increase the probability of taking that action at that state.
This means increasing $ r_t $, because we increase the probability at new policy and the denominator of old policy stay constant: $$ L^{CPI}(Q)= \hat E_t [\frac{\pi_Q(a_ts_t)}{\pi_{Q_{old}}(a_ts_t}\hat A_t] = \hat E_t[r_t(Q)\hat A_t] $$ However, because of the clip, $r_t(Q)$ will only grow to as much as $1+\in$. it means that this action can't be hundred times probable compared to old policy (because of the clip). This is done because we don't want to update our policy too much. Taking action at specific state is only one try, it doesn't mean that it will always lead to a super positive reward, so we don't want to be too much greedy because it also can lead to bad policy.
In summary, in the case of positive advantage, we want to increase the probability of taking that action at that step but not too much. 
When the advantage is < 0:
If advantage < 0, the action should be discouraged because negative effect of the outcome. Consequently, $ r_t $ will be decreased (because action is less probable for current policy than for the old one) but because of the clip, $ r_t $ will only decreases to as little as $1\in$.
Also, we don't want to make a big change in the policy by being too greedy by completely reducing the probability of taking that, action because it leads to negative advantage.
In summary, thanks to this clipped surrogate objective, the range that the new policy can vary from the old one is restricted, because the incentive for the probability ratio to move outside of the interval is removed. If the ratio is > 1+e or < 1e the gradient will be equal to 0 (no slope), because the clip have the effect to gradient. So, both of these clipping regions prevent us from getting too greedy and trying to update too much at once, and updating outside of the region where this sample offers a good approximation.
The final Clipped Surrogate Objective Loss:
$$ L^{CLIP+VF+S}_t(Q)=\hat E_t[L^{CLIP}_t(Q)c_1 L^VF_t(Q)+c_2 S[\pi_Q](s_t)] $$
Here:
 $ c_1 $ and $ c_2 $  coefficients
 $ S $  denotes an entropy bonus
 $ L^VF_t $  squarederror loss
Implementing a PPO agent in A3C to play Pong:
So now, we're ready to implement a PPO agent in A3C style. This means that I will use A3C code as a backbone and there will be the same processes explained in the A3C tutorial.
So, same as before I will bold code which is new and old code will be stroked through, then you can track changes in code.
So, from theory side it looks that it's really difficult to implement, but in practical we already have prepared code to be simply adopted to use PPO loss function, we will need to mainly change our used loss function and defined replay function:
def OurModel(input_shape, action_space, lr): X_input = Input(input_shape) X = Flatten(input_shape=input_shape)(X_input) X = Dense(512, activation="elu", kernel_initializer='he_uniform')(X) action = Dense(action_space, activation="softmax", kernel_initializer='he_uniform')(X) value = Dense(1, kernel_initializer='he_uniform')(X) def ppo_loss(y_true, y_pred): # Defined in https://arxiv.org/abs/1707.06347 advantages, prediction_picks, actions = y_true[:, :1], y_true[:, 1:1+action_space], y_true[:, 1+action_space:] LOSS_CLIPPING = 0.2 ENTROPY_LOSS = 5e3 prob = y_pred * actions old_prob = actions * prediction_picks r = prob/(old_prob + 1e10) p1 = r * advantages p2 = K.clip(r, min_value=1  LOSS_CLIPPING, max_value=1 + LOSS_CLIPPING) * advantages loss = K.mean(K.minimum(p1, p2) + ENTROPY_LOSS * (prob * K.log(prob + 1e10))) return loss Actor = Model(inputs = X_input, outputs = action) Actor.compile(loss=ppo_loss'categorical_crossentropy', optimizer=RMSprop(lr=lr)) Critic = Model(inputs = X_input, outputs = value) Critic.compile(loss='mse', optimizer=RMSprop(lr=lr)) return Actor, Critic
As you can see, we simply replace 'categorical_crossentropy' loss function to our ppo_loss function.
From act function we will need prediction:
def act(self, state): # Use the network to predict the next action to take, using the model prediction = self.Actor.predict(state)[0] action = np.random.choice(self.action_size, p=prediction) return action, prediction
Prediction from above function is used in replay function:
def replay(self, states, actions, rewards, predictions): # reshape memory to appropriate shape for training states = np.vstack(states) actions = np.vstack(actions) predictions = np.vstack(predictions) # Compute discounted rewards discounted_r = np.vstack(self.discount_rewards(rewards)) # Get Critic network predictions values = self.Critic.predict(states)[:, 0]# Compute advantages advantages = discounted_r  values # stack everything to numpy array y_true = np.hstack([advantages, predictions, actions]) # training Actor and Critic networks self.Actor.fit(states, actions, sample_weight=advantages, y_true, epochs=1self.EPOCHS, verbose=0, shuffle=True, batch_size=len(rewards)) self.Critic.fit(states, discounted_r, epochs=1self.EPOCHS, verbose=0, shuffle=True, batch_size=len(rewards))
Actually, you may ask, what is this np.hstack used for? Because we use custom loss function, we must send data in the right shape of data. So we pack all advantages, predictions, actions to y_true, and when they are received in custom loss function, we unpack it.
Last step to do is adding prediction memory to our run function:
def run(self): for e in range(self.EPISODES): state = self.reset(self.env) done, score, SAVING = False, 0, '' # Instantiate or reset games memory states, actions, rewards, predictions = [], [], [], [] while not done: #self.env.render() # Actor picks an action action, prediction = self.act(state) # Retrieve new state, reward, and whether the state is terminal next_state, reward, done, _ = self.step(action, self.env, state) # Memorize (state, action, reward) for training states.append(state) action_onehot = np.zeros([self.action_size]) action_onehot[action] = 1 actions.append(action_onehot) rewards.append(reward) predictions.append(prediction) # Update current state state = next_state score += reward if done: average = self.PlotModel(score, e) # saving best models if average >= self.max_average: self.max_average = average self.save() SAVING = "SAVING" else: SAVING = "" print("episode: {}/{}, score: {}, average: {:.2f} {}".format(e, self.EPISODES, score, average, SAVING)) self.replay(states, actions, rewards, predictions) self.env.close()
We add same lines to defined train_threading function. That's all! We have just created an agent that can learns to play any atari game. That's awesome! You'll need about 12 to 24 hours of training on 1 GPU to have a good agent with Pongv0.
So this was quite long and interesting tutorial, and here is full completed code:
So same as before, I trained 'PongDeterministicv4' and 'Pongv0', from all my previous RL tutorials results were the best. In PongDeterministicv4 we reached best score in around 400 steps (while A3C needed 800 steps):
And here is results for 'Pongv0' environment :
Results are quite nice, comparing with other RL algorithms. My goal wasn't to make it best, just to compare it with same parameters. I am really satisfied with results!
After training it, I thought that this algorythm should be able to learn playing pong with CNN layers, then my agent would be smaller and I would be able to upload it to GitHub.So changed model architecture in wollowing way:
def OurModel(input_shape, action_space, lr): X_input = Input(input_shape) X = Conv2D(32, 8, strides=(4, 4),padding="valid", activation="elu", data_format="channels_first", input_shape=input_shape)(X_input) X = Conv2D(64, 4, strides=(2, 2),padding="valid", activation="elu", data_format="channels_first")(X) X = Conv2D(64, 3, strides=(1, 1),padding="valid", activation="elu", data_format="channels_first")(X) X = Dense(512, activation="elu", kernel_initializer='he_uniform')(X) action = Dense(action_space, activation="softmax", kernel_initializer='he_uniform')(X) value = Dense(1, kernel_initializer='he_uniform')(X) ...
And I was right! I trained this agent for 10k steps, and results were also quite nice:
Here is the gif, how my agent plays pong:
Conclusion:
Don't forget to implement each part of the code by yourself. It's really important to understand how everything works. Try to modify the hyper parameters, use another gym environment. Experimenting is the best way to learn, so have fun!
I hope this tutorial could give you a good idea of the basic PPO algorithm. You can now move onto building upon this by executing multiple environments in parallel in order to collect more training samples and also solve more complicated game scenarios.
For now this is my last Reinforcement Learning tutorial, this was quite long journey where I really learned a lot. I plan to come back to reinforcement learning later this year, this will be more practical and more interesting tutorial, we will use it in finance!