Reinforcement Learning tutorial

Posted March 20 by Rokas Balsys

##### Introduction to Advantage Actor Critic method:

Since the beginning of this RL tutorial series, I've covered two different reinforcement learning methods: Value based methods (Q-learning, Deep Q-learning…) and Policy based methods (REINFORCE with Policy Gradients).

Both of these methods have big drawbacks. That's why, today, I'll try another type of Reinforcement Learning method, which we can call a 'hybrid method': Actor-Critic. The Actor-Critic algorithm is an RL agent that combines the value optimization and policy optimization approaches. More specifically, Actor-Critic combines the Q-learning and PG algorithms. At a high level, the resulting algorithm involves a loop that alternates between:

• Actor: a PG algorithm that decides on an action to take;
• Critic: a Q-learning algorithm that critiques the action that the actor selected, providing feedback on how to adjust. It can take advantage of efficiency tricks in Q-learning, such as experience replay.

The advantage of the Actor-Critic algorithm is that it can solve a broader range of problems than DQN, while it has lower variance in performance relative to REINFORCE. That said, because of the presence of the PG algorithm within it, the Actor-Critic is still somewhat sample inefficient.
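The alternating loop described above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the implementation used later in this tutorial: the two-armed bandit task, the function `run_actor_critic`, and the learning rates are all hypothetical, and the Critic here is a single running value estimate rather than a Q-learning network.

```python
import math
import random

def softmax(preferences):
    """Turn action preferences into a probability distribution."""
    highest = max(preferences)
    exps = [math.exp(p - highest) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]

def run_actor_critic(steps=2000, actor_lr=0.1, critic_lr=0.1, seed=0):
    rng = random.Random(seed)
    arm_rewards = [0.0, 1.0]   # hypothetical toy task: arm 1 always pays more
    preferences = [0.0, 0.0]   # Actor parameters (one preference per action)
    value = 0.0                # Critic's estimate of the expected reward

    for _ in range(steps):
        # Actor: sample an action from the current policy
        probs = softmax(preferences)
        action = 0 if rng.random() < probs[0] else 1
        reward = arm_rewards[action]

        # Critic: was the outcome better or worse than expected?
        feedback = reward - value
        value += critic_lr * feedback

        # Actor: policy-gradient step scaled by the Critic's feedback
        for a in range(len(preferences)):
            grad_log_prob = (1.0 if a == action else 0.0) - probs[a]
            preferences[a] += actor_lr * feedback * grad_log_prob

    return softmax(preferences), value

final_probs, final_value = run_actor_critic()
print("P(better arm):", round(final_probs[1], 3), "Critic value:", round(final_value, 3))
```

In this sketch the Actor ends up preferring the better arm because every update is weighted by how much the reward exceeded the Critic's expectation.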

##### The problem with Policy Gradients:

In my previous tutorial, we derived policy gradients and implemented the REINFORCE algorithm (also known as Monte Carlo policy gradients). There are, however, some issues with vanilla policy gradients: noisy gradients and high variance.

Recall the policy gradient:

$$\nabla_\theta J(\theta) = E_\tau \left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta (a_t | s_t) G_t \right]$$

Here $\theta$ denotes the policy parameters and $G_t$ is the cumulative reward from step $t$ onward. As in the REINFORCE algorithm, we update the policy parameters through Monte Carlo updates (i.e. taking random samples). This introduces inherently high variability in the log probabilities (log of the policy distribution) and the cumulative reward values, because trajectories can deviate from each other to a great degree during training. Consequently, this variability produces noisy gradients, which cause unstable learning and/or the policy distribution skewing in a non-optimal direction.

Besides the high variance of gradients, another problem with policy gradients occurs when trajectories have a cumulative reward of 0. The essence of policy gradient is increasing the probabilities of "good" actions and decreasing those of "bad" actions in the policy distribution; neither "good" nor "bad" actions will be learned if the cumulative reward is 0. Overall, these issues contribute to the instability and slow convergence of vanilla policy gradient methods.

One way to reduce variance and increase stability is to subtract a baseline $b(s)$ from the cumulative reward:

$$\nabla_\theta J(\theta) = E_\tau \left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta (a_t | s_t) (G_t - b(s_t)) \right]$$

Intuitively, making the cumulative reward smaller by subtracting a baseline from it will produce smaller gradients, and thus smaller and more stable updates.
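To make the effect of a baseline concrete, here is a small numerical sketch. It assumes a hypothetical one-step task with two actions (so the cumulative reward is just the immediate reward), a uniform softmax policy, and made-up reward means of 10 and 11. It compares the single-sample gradient estimates for one policy parameter with and without a baseline.

```python
import random
import statistics

def grad_log_prob_theta0(action, p0):
    """d/d(theta_0) of log pi(action) for a two-action softmax policy."""
    return (1.0 - p0) if action == 0 else -p0

def sample_gradient_estimates(baseline, n_samples=20000, seed=0):
    rng = random.Random(seed)
    p0 = 0.5                      # uniform policy over the two actions
    mean_reward = [10.0, 11.0]    # hypothetical rewards: both large, nearly equal
    estimates = []
    for _ in range(n_samples):
        action = 0 if rng.random() < p0 else 1
        reward = mean_reward[action] + rng.gauss(0.0, 1.0)
        # Single-sample policy gradient estimate: grad log pi * (G - b)
        estimates.append(grad_log_prob_theta0(action, p0) * (reward - baseline))
    return estimates

without_baseline = sample_gradient_estimates(baseline=0.0)
with_baseline = sample_gradient_estimates(baseline=10.5)

for name, estimates in [("no baseline", without_baseline), ("baseline 10.5", with_baseline)]:
    print(name,
          "mean:", round(statistics.mean(estimates), 3),
          "variance:", round(statistics.variance(estimates), 3))
```

Both versions estimate the same gradient on average, but the estimates with the baseline are spread far less widely, which is what makes the updates more stable.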

##### How Actor Critic works:

Imagine you play a video game with a friend that provides you some feedback. You're the Actor and your friend is the Critic:

At the beginning, you don’t know how to play, so you try some actions at random. The Critic observes your actions and provides feedback.

Let's first take another look at the vanilla policy gradient to see how the Actor Critic architecture comes in (and what it really is), where $\theta$ denotes the policy parameters:

$$\nabla_\theta J(\theta) = E_\tau \left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta (a_t | s_t) G_t \right]$$

We can then decompose the expectation into an outer part over the trajectory up to step $t$ and an inner part over everything that follows:

$$\nabla_\theta J(\theta) = \sum_{t=0}^{T-1} E_{s_0,a_0,...,s_t,a_t} \left[ \nabla_\theta \log \pi_\theta (a_t | s_t) \, E_{r_{t+1},s_{t+1},...,r_T,s_T}[G_t] \right]$$

The inner expectation term should be familiar; it is the Q value!

$$E_{r_{t+1},s_{t+1},...,r_T,s_T}[G_t] = Q(s_t,a_t)$$

Plugging that in, we can rewrite the update equation as:

$$\nabla_\theta J(\theta) = \sum_{t=0}^{T-1} E_{s_0,a_0,...,s_t,a_t} \left[ \nabla_\theta \log \pi_\theta (a_t | s_t) Q(s_t,a_t) \right] = E_{\tau} \left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta (a_t | s_t) Q(s_t,a_t) \right]$$

As we know, the Q value can be learned by parameterizing the Q function with a neural network. This leads us to Actor Critic Methods, where:

• The "Critic" estimates the value function. This could be the action-value (the Q value) or state-value (the V value).
• The "Actor" updates the policy distribution in the direction suggested by the Critic (for example, with policy gradients).

We simply update both the Actor network and the Critic network at each update step.

So, using the Value function as the baseline function, we subtract the Value from the Q value term. Intuitively, the result tells us how much better it is to take a specific action compared to the average, general action at the given state. We will call this value the advantage value: $$A(s_t,a_t) = Q(s_t,a_t) - V(s_t)$$ This is the so-called Advantage Actor Critic (A2C). In code it looks much simpler, as you will see.
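As a small worked example of the advantage value, the sketch below uses made-up action probabilities and Q values for a single state (none of these numbers come from the Pong agent). It computes the state value as the policy-weighted average of the Q values and then the advantage of each action.

```python
def state_value(policy_probs, q_values):
    """V(s): the average Q value under the current policy."""
    return sum(p * q for p, q in zip(policy_probs, q_values))

def advantages(policy_probs, q_values):
    """A(s, a) = Q(s, a) - V(s) for every action in the state."""
    v = state_value(policy_probs, q_values)
    return [q - v for q in q_values]

# Hypothetical state with three actions
policy_probs = [0.5, 0.25, 0.25]
q_values = [1.0, 2.0, -1.0]

v = state_value(policy_probs, q_values)
adv = advantages(policy_probs, q_values)
expected_adv = sum(p * a for p, a in zip(policy_probs, adv))

print("V(s) =", v)                           # 0.75
print("A(s, a) =", adv)                      # [0.25, 1.25, -1.75]
print("expected advantage =", expected_adv)  # 0.0
```

Actions with a positive advantage are better than the policy's average behaviour in that state and get reinforced; actions with a negative advantage get discouraged. Averaged over the policy, the advantage is zero.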

I am building on the code from my previous tutorial; we just need to add a Critic model to it. In Policy Gradient, our model looked like this:

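# Imports assumed from the standalone Keras package
from keras.models import Model
from keras.layers import Input, Dense, Flatten
from keras.optimizers import RMSprop

def OurModel(input_shape, action_space, lr):
    X_input = Input(input_shape)

    # Flatten the input state into a single feature vector
    X = Flatten(input_shape=input_shape)(X_input)

    X = Dense(512, activation="elu", kernel_initializer='he_uniform')(X)

    # Output layer: one probability per action
    action = Dense(action_space, activation="softmax", kernel_initializer='he_uniform')(X)

    Actor = Model(inputs = X_input, outputs = action)
    # Note: recent Keras releases name this optimizer argument learning_rate instead of lr
    Actor.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=lr))

    return Actor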


To make it Actor-Critic, we simply add a 'value' output and compile not only the Actor model but also a Critic model with 'mse' loss:

def OurModel(input_shape, action_space, lr):
    X_input = Input(input_shape)

    X = Flatten(input_shape=input_shape)(X_input)

    # Hidden layer shared by the Actor and the Critic
    X = Dense(512, activation="elu", kernel_initializer='he_uniform')(X)

    # Actor head: action probabilities; Critic head: a single state value
    action = Dense(action_space, activation="softmax", kernel_initializer='he_uniform')(X)
    value = Dense(1, kernel_initializer='he_uniform')(X)

    Actor = Model(inputs = X_input, outputs = action)
    Actor.compile(loss='categorical_crossentropy', optimizer=RMSprop(lr=lr))

    Critic = Model(inputs = X_input, outputs = value)
    Critic.compile(loss='mse', optimizer=RMSprop(lr=lr))

    return Actor, Critic


The other important function we change is "def replay(self)". In policy gradient it looked like this:

def replay(self):
    # reshape memory to appropriate shape for training
    states = np.vstack(self.states)
    actions = np.vstack(self.actions)

    # Compute discounted rewards
    discounted_r = self.discount_rewards(self.rewards)

    # training PG network
    self.Actor.fit(states, actions, sample_weight=discounted_r, epochs=1, verbose=0)
    # reset training memory
    self.states, self.actions, self.rewards = [], [], []


To make it work as an Actor-Critic algorithm, we run the states through our Critic model to get values, which we subtract from the discounted rewards; this is how we calculate the advantages. Then, instead of training the Actor with discounted rewards, we use the advantages, and for the Critic network we use the discounted rewards:

def replay(self):
    # reshape memory to appropriate shape for training
    states = np.vstack(self.states)
    actions = np.vstack(self.actions)

    # Compute discounted rewards
    discounted_r = self.discount_rewards(self.rewards)

    # Get Critic network predictions
    values = self.Critic.predict(states)[:, 0]
    # Compute advantages
    advantages = discounted_r - values
    # training Actor and Critic networks
    self.Actor.fit(states, actions, sample_weight=advantages, epochs=1, verbose=0)
    self.Critic.fit(states, discounted_r, epochs=1, verbose=0)
    # reset training memory
    self.states, self.actions, self.rewards = [], [], []
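The helper `self.discount_rewards` is not shown in this part, so here is a minimal sketch of what such a function could look like, followed by the advantage calculation on made-up numbers. The standalone `discount_rewards` function, the `gamma` values, and the stand-in Critic predictions are assumptions for illustration; the version used for Pong may differ (for example, it may normalize the result).

```python
import numpy as np

def discount_rewards(rewards, gamma=0.99):
    """Hypothetical helper: discounted cumulative reward for every step."""
    discounted = np.zeros(len(rewards), dtype=np.float64)
    running_sum = 0.0
    # Walk backwards so each step accumulates the discounted future rewards
    for t in reversed(range(len(rewards))):
        running_sum = rewards[t] + gamma * running_sum
        discounted[t] = running_sum
    return discounted

# Toy episode: the only reward arrives at the last step
rewards = [0.0, 0.0, 1.0]
discounted_r = discount_rewards(rewards, gamma=0.5)

# Stand-in for self.Critic.predict(states)[:, 0]
values = np.array([0.5, 0.5, 0.5])
advantages = discounted_r - values

print(discounted_r)  # [0.25 0.5  1.  ]
print(advantages)    # [-0.25  0.    0.5 ]
```

Steps where the discounted reward beats the Critic's prediction get a positive weight in the Actor's training, and steps that fall short get a negative one.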


That's it; we only needed to change a few lines of code. You can also update the 'save' and 'load model' functions accordingly.

Same as in my previous tutorial, I first trained 'PongDeterministic-v4' for 1000 steps; you can see the results in the graph below:

So, from the training results, we can say that the A2C model played Pong noticeably more smoothly. Although it took a little longer to reach the maximum score, its games were much more stable than with PG, where we had a lot of spikes. Then I thought, OK, let's give our 'Pong-v0' environment a chance:

Now our 'Pong-v0' training graph looks much better than with Policy Gradient, with much more stable games. Sadly, our average score couldn't get above 11 points per game. Keep in mind, though, that I am using a network with a single hidden layer; you can play around with the architecture.

##### Conclusion:

So, in this tutorial we implemented a hybrid between value-based and policy-based algorithms. But we still face a problem: training these models takes a lot of time. So in the next part of this tutorial I will implement it as an Asynchronous A2C algorithm, which means that we will run, for example, 4 environments at once while training the same main model. In theory, this means we will be able to train our agent 4 times faster, but you will see how it works in practice in the next part.