Reinforcement Learning tutorial

Posted March 18 by Rokas Balsys

Reinforcement learning agents Beyond DQN

Up to this point, we have covered the most popular tutorials related to DQN, a value-based reinforcement learning algorithm. Among deep reinforcement learning methods, deep Q-learning networks are relatively simple. To their credit, not only are DQNs comparatively simple, but they and many other deep RL agents also make efficient use of the training samples available to them. That said, DQN agents do have drawbacks. The most notable are:

  1. If the number of possible state-action pairs in a given environment is large, the Q-function can become extremely complicated, and estimating the optimal Q-value becomes intractable.
  2. Even in situations where finding Q is computationally tractable, DQNs are not great at exploring relative to some other approaches, so a DQN may not work properly.

Even though DQNs are sample efficient, they aren't applicable to solving every problem.
To wrap up deep reinforcement learning, I'll introduce the types of agents beyond DQNs (image below). The main categories of deep RL agents are:

  • Value optimization: these include DQN agents and their derivatives (e.g., double DQN, dueling DQN) as well as other types of agents that solve reinforcement learning problems by optimizing value functions (including Q-value functions).
  • Imitation learning: agents in this category (e.g., behavioral cloning and conditional imitation learning algorithms) are designed to mimic behaviors that are taught to them through demonstration, for example by showing them how to do some kind of task, such as picking and placing boxes. Although imitation learning is a fascinating approach, its range of applications is relatively small, and I will not discuss it further.
  • Model optimization: agents in this category learn to predict future states based on (state, action) at a given time-step. An example of one such algorithm is Monte Carlo tree search (MCTS).
  • Policy optimization: Agents in this category learn policies directly, that is, they directly learn the policy function π. I'll cover these in further tutorials.


Policy Gradient and the REINFORCE Algorithm:

In DQN, to choose which action to take in each state, we take the action with the highest Q-value (the maximum expected future reward in that state). As a consequence, in value-based learning, a policy exists only because of these action-value estimates. In this tutorial, we'll learn a policy-based reinforcement learning technique called Policy Gradients.

I already said that a Reinforcement Learning agent's main goal is to learn some policy function π that maps the state space S to the action space A. With DQNs, and indeed with any other value optimization agent, π is learned indirectly by estimating a value function, such as the optimal Q-value. With policy optimization agents, π is learned directly instead.

This means that we directly try to optimize our policy function π without worrying about a value function: we directly parameterize π and select actions without consulting a value function.

Policy gradient (PG) algorithms, which perform gradient ascent on π directly, are exemplified by a particularly well-known Reinforcement Learning algorithm called REINFORCE. The advantage of PG algorithms like REINFORCE is that they reliably converge on at least a locally optimal policy, and they apply to problems (such as continuous action spaces) where value optimization algorithms like DQN struggle, so they're more widely applicable.

The trade-off is that PGs have low consistency: they have higher variance in their performance relative to value optimization approaches like DQN, so PGs tend to require a larger number of training samples.

In this tutorial we’ll learn:

  • What Policy Gradient is, with its advantages and disadvantages;
  • How to implement it in TensorFlow 1.15 and Keras 2.2.4.

Why use Policy-Based methods?

First, we are going to define a policy network that implements, for example, an AI car driver. This network will take the state of the car (for example, the velocity of the car and the distance between the car and the track lines) and decide what we should do (steer left or right, hit the gas pedal or hit the brake). It is called Policy-Based Reinforcement Learning because we will directly parametrize the policy. Here is how our policy will look: $${\pi}_{Q}(s, a) = P[a|s, Q]$$ Here:
- $ s $ - state;
- $ a $ - action;
- $ Q $ - model parameters of the policy network. We can think of the policy as the agent's behavior, i.e. a function that maps from state to action.

A policy can be either deterministic or stochastic. A deterministic policy maps each state to a single action: you give it a state and the function returns the action to take:

  • Deterministic policies are used in deterministic environments. These are environments where the actions taken determine the outcome. There is no uncertainty. For instance, when you play chess and you move your pawn from A2 to A3, you’re sure that your pawn will move to A3.
    Deterministic policy: $a=μ(s)$

    On the other hand, a stochastic policy outputs a probability distribution over actions.
  • A stochastic policy means that instead of always taking a given action a (for instance, left), there is some probability that we'll take a different one (in this case, a 30% chance that we take right instead). Stochastic policies are used when the environment is uncertain; such a process is called a Partially Observable Markov Decision Process (POMDP).

    Most of the time we’ll use this second type of policy.
    Stochastic policy: ${\pi}(a|s)=P[a|s]$
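The two policy types above can be sketched in a few lines of NumPy. The action values, probabilities, and function names below are illustrative assumptions, not part of the tutorial's code:

```python
import numpy as np

np.random.seed(0)

# Hypothetical 3-action setting: 0 = steer left, 1 = stay, 2 = steer right
def deterministic_policy(action_values):
    # a = mu(s): always pick the single best-scoring action
    return int(np.argmax(action_values))

def stochastic_policy(action_probs):
    # pi(a|s): sample an action from a probability distribution
    return int(np.random.choice(len(action_probs), p=action_probs))

values = np.array([0.1, 0.5, 0.4])
probs = np.array([0.1, 0.5, 0.4])  # must sum to 1

print(deterministic_policy(values))  # always prints 1
print(stochastic_policy(probs))      # usually 1, but sometimes 0 or 2
```

Calling the deterministic policy twice on the same state always returns the same action, while the stochastic policy keeps some randomness, which is exactly what gives the agent free exploration later on.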

Advantages of using Policy Gradients

Policy-based methods have better convergence properties. The problem with value-based methods is that they can have a big oscillation while training. This is because the choice of action may change dramatically for an arbitrarily small change in the estimated action values.

On the other hand, with policy gradient, we just follow the gradient to find the best parameters. We have a smooth update of our policy at each step.

Because we follow the gradient to find the best parameters, we’re guaranteed to converge on a local maximum (worst case) or global maximum (best case):


The second advantage is that policy gradients are more effective in high dimensional action spaces, or when using continuous actions.

The problem with Deep Q-learning is that their predictions assign a score (maximum expected future reward) for each possible action, at each time step, given the current state.

But what if we have an infinite number of possible actions? Let's take the same example: with a self-driving car, at each state we can have a (near) infinite choice of actions (turning the wheel at 15°, 20°, 24°, …). We'd need to output a Q-value for each possible action!

In policy-based methods, on the other hand, we just adjust the parameters directly: the agent learns which action tends to be best, rather than computing (estimating) a value for every possible action at every step.


A third advantage is that policy gradient can learn a stochastic policy, while value functions can’t. This has two consequences.

One of these is that we don't need to implement an exploration/exploitation trade-off by hand. A stochastic policy allows our agent to explore the state space without always taking the same action, because it outputs a probability distribution over actions. Consequently, it handles the exploration/exploitation trade-off without hard-coding it.

We also get rid of the problem of perceptual aliasing. Perceptual aliasing is when we have two states that seem to be (or actually are) the same but need different actions. For example, consider the game rock-paper-scissors. A deterministic policy, e.g., playing only rock, can be easily exploited, so a stochastic policy tends to perform much better.

Let's take the following example. We have an intelligent gold digger, whose goal is to find the gold and avoid being killed.


Perceptual aliasing is when our agent is not able to differentiate the best action to take in a similar-looking state. The dark grey squares are identical states in the eyes of our agent, and a deterministic agent would take the same action in both states.

Suppose our agent’s goal is to get to the treasure while avoiding the fire pits. The two dark grey states are perceptually aliased; in other words, the agent can’t differentiate the two because they both look identical. In the case of a deterministic policy, the agent would perform the same action for both of those states and never get to the treasure. The only hope is the random action selected by the epsilon-greedy exploration technique. However, a stochastic policy could move either left or right, giving it a higher likelihood of reaching the treasure.

One disadvantage of policy-based methods is that they generally take longer to converge, and evaluating the policy is time-consuming. Another disadvantage is that they tend to converge to a local optimum rather than the global optimum.

Regardless of these pitfalls, policy gradients tend to perform better than value-based reinforcement learning agents at complex tasks. In fact, many of the advancements in reinforcement learning beating humans at complex games such as DOTA use techniques based on policy gradients as we’ll see in future tutorials.



Unlike Deep Q-Learning, which always tries to take the maximum, policy gradients converge step by step, so they can take longer to train.

However, we'll see there are solutions to this problem. I already mentioned that we have a policy π with parameters Q. This π outputs a probability distribution over actions in a given state s, given the parameters Q: $${\pi}_{Q}(s, a) = P[a|s, Q]$$

But how do we know if our policy is good? Remember that finding the policy can be framed as an optimization problem: we must find the best parameters Q to maximize a score function J(Q). There are two steps:

  • Measure the quality of a π (policy) with a policy score function J(Q).
  • Use policy gradient ascent to find the best parameter Q that improves our π.

The main idea here is that J(Q) tells us how good our π is, and policy gradient ascent helps us find the policy parameters that make the good actions most probable.


Policy Score function J(Q)

To measure how good our policy is, we use a function called the objective function (or Policy Score Function) that calculates the expected reward of the policy. Three methods work equally well for optimizing policies; the choice depends only on the environment and the objectives we have. First, in an episodic environment, we can use the start value: calculate the mean of the return from the first time-step (G1). This is the cumulative discounted reward for the entire episode: $$J_1(Q)=E_\pi[G_1=R_1+\gamma R_2+\gamma^2R_3+...]=E_\pi(V(s_1))$$ Here:
- $ G_1=R_1+\gamma R_2+\gamma^2R_3+... $ - cumulative discounted reward starting at the start state;
- $ E_\pi(V(s_1)) $ - expected value of state 1.

The idea is simple: if we always start in some state $s_1$, what's the total reward we'll get from that start state until the end? For example, in Pong, I play a new game, but I lose the ball in the first round (end of the game). A new round always begins at the same state. I calculate the score using $J_1(Q)$. Losing at the beginning is fine, but I want to improve the score. To do that, I'll need to improve the probability distributions of my actions by tuning the parameters.

In a continuing (non-episodic) environment, we can use the average value, because we can't rely on a specific start state.

Each state value is now weighted by the probability of the occurrence of the respective state (because some states occur more often than others): $$ J_{avgv}(Q) = E_\pi(V(s)) = \sum_s d(s)V(s) $$ In the above function, we can calculate d(s) as follows: $$ d(s) = \frac{N(s)}{\sum_{s'}N(s')} $$ Here:
- $ N(s) $ - number of occurrences of the state;
- $ \sum_{s'}N(s') $ - total number of occurrences of all states.
Finally, we can use the average reward per time step. The idea here is that we want to get the most reward per time step: $$ J_{avR}(Q) = E_\pi(r) = \sum_s d(s) \sum_a \pi_Q(s,a) R_s^a $$ Here:
- $ d(s) $ - probability that I am in state s;
- $ \pi_Q(s,a) $ - probability that I take action a from that state under this policy;
- $ R_s^a $ - immediate reward that I will get.
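To make the last objective concrete, here is a tiny made-up numerical example with two states and two actions. All visit counts, probabilities, and rewards below are invented for illustration:

```python
import numpy as np

# d(s) estimated from visit counts, as in the formula above
N = np.array([70.0, 30.0])   # N(s): occurrences of each state
d = N / N.sum()              # d(s) = N(s) / sum_s' N(s')  ->  [0.7, 0.3]

pi = np.array([[0.8, 0.2],   # pi_Q(s, a): action probabilities per state
               [0.4, 0.6]])
R = np.array([[1.0, 0.0],    # R_s^a: immediate reward for each (state, action)
              [0.0, 2.0]])

# J_avR(Q) = sum_s d(s) * sum_a pi_Q(s, a) * R_s^a
J_avR = np.sum(d[:, None] * pi * R)
print(J_avR)  # 0.7*0.8*1.0 + 0.3*0.6*2.0 = 0.92
```

Changing the action probabilities `pi` changes the score, which is exactly the knob that gradient ascent will turn in the next section.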

Policy gradient ascent
We have a Policy score function that tells us how good our policy is. Now, we want to find a parameter Q that maximizes this score function. Maximizing the score function means finding the optimal policy.

To maximize the score function J(Q), we need to do gradient ascent on policy parameters.

Gradient ascent is the mirror image of gradient descent: in gradient descent, we step in the direction of the steepest decrease of the function, while in gradient ascent, we step in the direction of the steepest increase.

Why gradient ascent and not gradient descent? Because we use gradient descent when we have an error function that we want to minimize.

But, the score function is not an error function! It’s a score function, and because we want to maximize the score, we need gradient ascent.

The idea is to find the gradient to the current policy π that updates the parameters in the direction of the greatest increase, and iterate.
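As a minimal illustration of this "step along the gradient and iterate" idea, here is gradient ascent on a made-up one-parameter score function (not the real J(Q)):

```python
# Toy score function J(q) = -(q - 3)^2, maximized at q = 3.
# Its gradient is dJ/dq = -2 * (q - 3).
def grad_J(q):
    return -2.0 * (q - 3.0)

q = 0.0        # initial parameter
alpha = 0.1    # learning rate
for _ in range(200):
    q += alpha * grad_J(q)  # ascent: move *along* the gradient, not against it

print(round(q, 4))  # converges toward 3.0
```

The only difference from gradient descent is the sign of the update: `q += alpha * grad` instead of `q -= alpha * grad`.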

Let’s implement policy gradient ascent mathematically. This part is hard, but it’s fundamental to understand how to arrive at our gradient formula.

So, we want to find the best parameters $Q^*$ that maximize the score: $$ Q^* = \arg\max_Q E_{\pi_Q}[\sum_t R(s_t, a_t)] $$ Here $E_{\pi_Q}[\sum_t R(s_t, a_t)] = J(Q)$, so our score function can be defined as: $$ J(Q)=E_{\pi_Q}[R(\tau)] $$ Here:
- $ E_{\pi_Q} $ - expectation under the given policy;
- $ \tau $ - a trajectory (the sequence of states, actions, and rewards), with $R(\tau)$ its total reward.

The above formula shows the total expected reward under the given policy. Now, because we want to do gradient ascent, we need to differentiate our score function J(Q).

Our score function J(Q) can also be defined as: $$ J_1(Q) = V_{\pi_Q}(s_1) = \sum_{s\in S}d(s) \sum_{a\in A}\pi_Q(s,a)R_s^a $$ Here:
- $ \sum_{s\in S}d(s) $ - sum over the state distribution;
- $ \sum_{a\in A}\pi_Q(s,a)R_s^a $ - expected reward over the action distribution.

I wrote the function in this way to show the problem we face here. We know that policy parameters change how actions are chosen, and as a consequence, what rewards we get and which states we will see and how often.

So, it can be challenging to change the policy in a way that ensures improvement, because the performance depends both on the action selections and on the distribution of the states in which those selections are made.

Both of these are affected by policy parameters. The effect of policy parameters on the actions is simple to find, but how do we find the effect of policy on the state distribution? The function of the environment is unknown. As a consequence, we face a problem: how do we estimate the ∇ (gradient) with respect to policy Q, when the gradient depends on the unknown effect of policy changes on the state distribution?

The solution will be to use the Policy Gradient Theorem. This provides an analytic expression for the gradient ∇ of J(Q) (performance) with respect to policy Q that does not involve the differentiation of the state distribution.

So let's continue from the score function expression above; we can write: $$ ∇_Q J(Q) = ∇_Q \sum_\tau \pi(\tau;Q)R(\tau) $$ In the above formula, we'll move the gradient inside the sum. Remember that we're dealing with a stochastic policy: it outputs a probability distribution $\pi(\tau;Q)$, the probability of taking the series of steps $(s_0, a_0, r_0, ...)$, given our current parameters Q.

But differentiating a probability function is hard, unless we transform it into a logarithm, which is much simpler to differentiate. So we use the likelihood ratio trick, which replaces the gradient of a probability with a log probability. The likelihood ratio trick looks like this: $$ ∇\log x = \frac{∇x}{x} $$ Applying it to the policy, we can write: $$ ∇_Q \pi(\tau;Q) = \pi(\tau;Q) \frac{∇_Q \pi(\tau;Q)}{\pi(\tau;Q)} = \pi(\tau;Q)\,∇_Q \log \pi(\tau;Q) $$ So we can transform $ ∇_Q \sum_\tau \pi(\tau;Q)R(\tau) $ into the following expression: $$ \sum_\tau \pi(\tau;Q) ∇_Q(\log \pi(\tau;Q)) R(\tau) $$ Now we can convert the summation back to an expectation: $$ ∇_Q J(Q) = E_{\pi}[∇_Q(\log \pi(\tau| Q))R(\tau)] $$ Here:
- $ \pi(\tau| Q) $ - policy function;
- $ R(\tau) $ - score (total trajectory reward).

As you can see, we only need to compute the derivative of the log policy function.

Now that we've done that (and it was a lot), we can state the policy gradient: $$ E_\pi[∇_Q(\log \pi(s,a,Q)) R(\tau)] $$ Here:
- $ \pi(s,a,Q) $ - Policy function;
- $ R(\tau) $ - Score function.

And our Update rule looks like this: $$ \triangle Q = \alpha * ∇_Q(\log \pi(s,a,Q)) R(\tau) $$ Here:
- $ \triangle Q $ - Change in parameters;
- $ \alpha $ - Learning rate.
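To make the update rule concrete, here is a sketch for a linear-softmax policy on a made-up 4-dimensional state with 2 actions. Everything here, from the state to the learning rate, is an illustrative assumption, not the tutorial's actual network:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
Q = np.zeros((4, 2))   # policy parameters (the tutorial's "Q")
alpha = 0.01           # learning rate

state = rng.normal(size=4)
probs = softmax(state @ Q)       # pi_Q(a|s)
action = rng.choice(2, p=probs)  # sample an action from the policy
R_tau = 1.0                      # trajectory score R(tau), assumed positive here

# For a linear-softmax policy, the gradient of log pi(a|s) w.r.t. Q is
# outer(s, onehot(a) - probs).
grad_log_pi = np.outer(state, np.eye(2)[action] - probs)

# Update rule: delta_Q = alpha * grad_log_pi * R(tau)
Q = Q + alpha * grad_log_pi * R_tau

# With a positive R(tau), the sampled action becomes more probable
print(softmax(state @ Q)[action] > probs[action])  # True
```

If `R_tau` were negative, the same update would push the probability of the sampled action down instead, which is exactly the behavior described below.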

This Policy gradient is telling us how we should shift the policy distribution through changing parameters Q if we want to achieve a higher score.

$R(\tau)$ acts like a scalar score:

  • If $R(\tau)$ is high, it means that on average we took actions that lead to high rewards, so we want to push up the probabilities of the actions seen (increase the probability of taking these actions).
  • On the other hand, if $R(\tau)$ is low, we want to push down the probabilities of the actions seen.

In simple terms, the policy is the mindset of the agent: the agent sees the current situation (state) and chooses an action (or multiple actions). So we define the policy as a function of state that outputs actions, $ action = f(state) $, and we call this function f the policy.

Reward Engineering and why it is important

In all policy gradient agents, one of the most important things is how we work with the reward: how well our model learns depends on the reward function.
This is an example of how our rewards would look in our Pong game:
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

So the rewards are mostly zero, and sometimes we see a 1 or -1 in the list. This is because the environment only gives us a reward when the ball passes one of the paddles: when the ball passes our paddle, we lose and get a -1 reward, and when the opponent fails to catch the ball, we win and get +1.

If we train our network with this raw reward, it learns almost nothing, because the reward is zero most of the time, so the agent gets no feedback about what is good or bad. Most gradients would become zero, because the reward is a multiplier in the loss function.

Now think about it:
Our paddle moves up and down, hits the ball, and the ball travels to the left side of the screen. The opponent fails to catch it, so we win and get a score of 1. In this scenario, which actions were actually good?
1. The action we took when we hit the ball?
2. The action we took when we got the score of 1?

Of course, the first option is correct. The second option is wrong, because once the ball is travelling toward the opponent, our actions no longer matter; only the actions we took to hit the ball were important to winning, and everything after hitting the ball was irrelevant to our win. (A similar argument applies to losing: the actions taken near the moment we got the score of -1 matter more than actions taken many steps earlier.)

So here is the trick: we set the reward of the actions taken before each actual reward to a discounted copy of that reward. For example, if we got a reward of +1 at time step 200, we say that the reward at time 199 is +0.99, the reward at time 198 is +0.98 (more precisely, 0.99² ≈ 0.98), and so on. With this reward definition, we have rewards for the actions that actually led to a +1 or -1, and we assume that the closer an action is to the reward gained, the more important it is.

We define a function below that does this for us. Note that it may not be applicable to other environments, but it serves our purposes for this Pong game project.

def discount_rewards(self, reward):
    # Compute the gamma-discounted rewards over an episode
    # (assumes numpy is imported as np)
    gamma = 0.99    # discount rate
    running_add = 0
    discounted_r = np.zeros_like(reward, dtype=np.float64)  # keep floats even for integer rewards
    for i in reversed(range(0, len(reward))):
        if reward[i] != 0:
            running_add = 0  # reset the running sum at a point boundary (Pong-specific!)
        running_add = running_add * gamma + reward[i]
        discounted_r[i] = running_add

    return discounted_r
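Before looking at the chart, we can sanity-check the idea on a tiny hand-made reward list. This standalone copy drops `self`, since the function has no other dependency on the class, and assumes a float reward array (because `np.zeros_like` keeps the input's dtype):

```python
import numpy as np

def discount_rewards(reward, gamma=0.99):
    # Standalone copy of the method above, for a quick check
    running_add = 0.0
    discounted_r = np.zeros_like(reward)
    for i in reversed(range(len(reward))):
        if reward[i] != 0:
            running_add = 0.0  # Pong-specific game-boundary reset
        running_add = running_add * gamma + reward[i]
        discounted_r[i] = running_add
    return discounted_r

# One win at index 2, one loss at index 5
print(discount_rewards(np.array([0.0, 0.0, 1.0, 0.0, 0.0, -1.0])))
# -> values 0.9801, 0.99, 1.0, -0.9801, -0.99, -1.0
```

Each ±1 reward is propagated backwards, shrinking by a factor of 0.99 per step, and the running sum resets at every point boundary.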

Let's see what this gives us:


You can see in the above chart that we got a better distribution of rewards. The reward is -1 where it was -1, and the steps before it now have values near -1 (for example -0.8); a similar pattern appears for positive rewards.

Furthermore, we refine the function to subtract the mean of the rewards and divide by their standard deviation, and then discuss the results.

def discount_rewards(self, reward):
    # Compute the gamma-discounted rewards over an episode
    # (assumes numpy is imported as np)
    gamma = 0.99    # discount rate
    running_add = 0
    discounted_r = np.zeros_like(reward, dtype=np.float64)  # keep floats even for integer rewards
    for i in reversed(range(0, len(reward))):
        if reward[i] != 0:
            running_add = 0  # reset the running sum at a point boundary (Pong-specific!)
        running_add = running_add * gamma + reward[i]
        discounted_r[i] = running_add

    discounted_r -= np.mean(discounted_r)  # normalize: subtract the mean
    discounted_r /= np.std(discounted_r)   # divide by the standard deviation (assumes at least one nonzero reward)
    return discounted_r
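By construction, the normalized rewards of an episode have zero mean and unit standard deviation, which we can verify on the same toy reward list (a standalone copy of the refined method, assuming float rewards with at least one nonzero entry):

```python
import numpy as np

def discount_and_normalize(reward, gamma=0.99):
    # Standalone copy of the refined method above
    running_add = 0.0
    discounted_r = np.zeros_like(reward)
    for i in reversed(range(len(reward))):
        if reward[i] != 0:
            running_add = 0.0  # Pong-specific game-boundary reset
        running_add = running_add * gamma + reward[i]
        discounted_r[i] = running_add
    discounted_r -= np.mean(discounted_r)
    discounted_r /= np.std(discounted_r)  # would divide by zero on an all-zero episode
    return discounted_r

out = discount_and_normalize(np.array([0.0, 0.0, 1.0, 0.0, 0.0, -1.0]))
print(round(out.mean(), 10), round(out.std(), 10))  # ~0.0 and ~1.0
```

This rescaling keeps the reward signal in a consistent range across episodes, which stabilizes the gradient updates.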

Let's see our reformatted rewards:


So now we have rewards bigger than -0.9 most of the time.

At the points where we got a -1 reward, the new score is about -0.9, as seen above, and the steps before it also have negative rewards. The rewards become less negative as we move left from each -0.9 score.

Also, you can see that we got some positive scores in the new chart. This is because we subtracted the mean; think about what that results in.

The positive rewards happen in the long sequences of play: situations where we caught the ball and didn't get a -1 immediately, but lost later because we failed to catch the ball when it came toward us again.

Setting a negative score is logical because we ended up getting a -1. Setting a positive score can be justified because we approve of the actions where we caught the ball and did not get a -1 reward immediately (even if we didn't win either).

At some points you can see that, even when we didn't receive a positive score, some hills are higher than others. This is because our agent succeeded in surviving longer than usual, which also helps learning. So now that we have defined our rewards, we can go to the next step and start training the network.

Policy Gradient implementation with Pong:

So, that was quite a long theoretical tutorial, and it's usually hard to understand everything from plain text, so let's work through a basic Pong example and implement it in Keras. Differently from my previous tutorials, I will try the 'PongDeterministic-v4' environment and 'Pong-v0'. 'PongDeterministic-v4' is easier to learn, so if our model is able to learn this simpler environment, we will try 'Pong-v0'. You may ask why one is simpler if it's the same Pong game: 'PongDeterministic-v4' uses a fixed frame skip and fully deterministic actions, while 'Pong-v0' uses a random frame skip and repeats the agent's previous action with some probability, so it is considerably harder to learn.

I took the code from my previous DQN tutorial and rewrote it as Policy Gradient. I am not covering all the details here; if you want to see them, check my YouTube tutorial above.

Also, I am not using convolutional layers: it's quite a simple environment, and it's faster to train the model without a CNN.

So, I trained 'PongDeterministic-v4' for 1000 episodes; you can see the results in the graph below:


From this training graph, we can see that our model reached an average score of 20 somewhere around the 700th episode. Quite nice. I also trained a 'Pong-v0' model; here is how it performed:


From the above graph we can see that our agent learned to play Pong, but it was very unstable. Although it plays Pong much better than a random agent, this model would perform quite badly in testing.

As you can see, the Policy Gradient model learned to play and score some points. I built the model from one dense layer for simplicity; I didn't aim for a perfect agent, but if you want, you can train the network a lot more so it plays better. You can also try playing around with a CNN (you would then need to decrease the learning rate), which would actually give you a smaller model: the current dense model takes around 100 MB, so I can't upload the trained model to GitHub.


From this tutorial's code, we can see that we face a problem with this algorithm. Because we only calculate the reward at the end of the episode, we average over all the actions: even if some of the actions taken were very bad, if our overall score is quite high, we will treat all the actions as good.

So to have a correct policy, we need a lot of samples… which results in slow learning.

How to improve our Model?
We’ll see some improvements in the next tutorials:

Actor-Critic: a hybrid between value-based algorithms and policy-based algorithms.
Proximal Policy Optimization (PPO): ensures that the deviation from the previous policy stays relatively small.