Reinforcement Learning tutorial

Posted November 3 by Rokas Balsys

Epsilon-Greedy in Deep Q learning

In previous tutorial I said, that in next tutorial we'll try to implement Prioritized Experience Replay (PER) method, but before doing that I decided that we should cover Epsilon Greedy method and fix/prepare the source code for PER method. So this will be quite short tutorial.

The epsilon-greedy algorithm is very simple and occurs in several areas of machine learning. One common use of epsilon-greedy is in the so-called multi-armed bandit problem.

Let's take an example. Suppose we are standing in front of k = 3 slot machines. Each machine pays out according to a different probability distribution, and these distributions are unknown to you. And suppose you can play a total of 100 times.

We have two goals. The first goal is to experiment with a few coins to try and determine which machine pays out the best. The second goal is to get as much money as possible. The terms “explore” and “exploit” are used to indicate that we have to use some coins to explore to find the best machine, and we want to use as many coins as possible on the best machine to exploit our knowledge.

Epsilon-greedy is almost too simple. As we play the machines, we keep track of the average payout of each machine. Then, we select the machine with the highest current average payout with probability = (1 – epsilon) + (epsilon / k) where epsilon is a small value like 0.10. And we select machines that don’t have the highest current payout average with probability = epsilon / k.

Its much easier to understand with a concrete example. Suppose, after our first 12 pulls, we played machine #1 four times and won 1 dollar two times and 0 dollar two times. The average win rate for machine #1 is 2/4 = 0.50 dollars.

And suppose we’ve played machine #2 five times and won 1 dollar three times and 0 dollar two times. The average payout for machine #2 is 3/5 = 0.60 dollars.

And suppose we’ve played machine #3 three times and won 1 dollar one time and 0 dollar two times. The average payout for machine #3 is 1/3 = 0.33 dollars.

Now we have to select a machine to play on. We generate a random number p, between 0.0 and 1.0. Suppose we have set epsilon = 0.10. If p > 0.10 (which it will be 90% of the time), we select machine #2 because it has the current highest average payout. But if p < 0.10 (which it will be only 10% of the time), we select a random machine, so each machine has a 1/3 chance of being selected.

Notice that machine #2 might get picked anyway because we select randomly from all machines.

Over time, the best machine will be played more and more often because it will pay out more often. In short, epsilon-greedy means pick the current best option ("greedy") most of the time, but pick a random option with a small (epsilon) probability sometimes.

There are many other algorithms for the multi-armed bandit problem. But epsilon-greedy is incredibly simple, and often works as well as, or even better than, more sophisticated algorithms such as UCB ("upper confidence bound") variations.

The idea is that in the beginning, we’ll use the epsilon greedy strategy:

  • We specify an exploration rate “epsilon,” which we set to 1 in the beginning. This is the rate of steps that we’ll do randomly. In the beginning, this rate must be at its highest value, because we don’t know anything about the values in Q-table. This means we need to do a lot of exploration, by randomly choosing our actions.
  • We generate a random number. If this number > epsilon, then we will do “exploitation” (this means we use what we already know to select the best action at each step). Else, we’ll do exploration.
  • The idea is that we must have a big epsilon at the beginning of the training of the Q-function. Then, reduce it progressively as the agent becomes more confident at estimating Q-values.

So, what we changed in code? Actually, it almost didn't changed. We inserted self.epsilon_greedy = True # use epsilon greedy strategy Boolean option if we want to use this method or not. Exploration hyper-parameters for epsilon and epsilon greedy strategy will be quite the same:

self.epsilon = 1.0  # exploration probability at start
self.epsilon_min = 0.01  # minimum exploration probability
self.epsilon_decay = 0.0005  # exponential decay rate for exploration prob

From our previous tutorial we have following "remember" function:

def remember(self, state, action, reward, next_state, done):
    self.memory.append((state, action, reward, next_state, done))
    if len(self.memory) > self.train_start:
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

We change it to following:

def remember(self, state, action, reward, next_state, done):
    experience = state, action, reward, next_state, done

Before we were checking if our memory list is equal to self.train_start. From now on we'll do training from our first episode of the game, we won't fill memory with samples. So, we are removing all lines and we'll implement epsilon greedy function in different places.

Until now we were doing following "act" function:

def act(self, state):
    if np.random.random() <= self.epsilon:
        return random.randrange(self.action_size)
        return np.argmax(self.model.predict(state))

But now we'll implement another epsilon greedy function, where we could change our used epsilon method with Boolean. Here we'll use an improved version of our epsilon greedy strategy for Q-learning where we reduce epsilon progressively as the agent becomes more confident at estimating Q-values. The function is almost the same, we simply inserted explore_probability calculation into epsilon greedy strategy:

def act(self, state, decay_step):
    if self.epsilon_greedy:
    # Here we'll use an improved version of our epsilon greedy strategy for Q-learning
        explore_probability = self.epsilon_min + (self.epsilon - self.epsilon_min) * np.exp(-self.epsilon_decay * decay_step)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= (1-self.epsilon_decay)
        explore_probability = self.epsilon

    if explore_probability > np.random.rand():
        # Make a random action (exploration)
        return random.randrange(self.action_size), explore_probability
        # Get action from Q-network (exploitation)
        # Estimate the Qs values state
        # Take the biggest Q value (= the best action)
        return np.argmax(self.model.predict(state)), explore_probability

In all code I am removing the following lines, to start training agent from first steps:

if len(self.memory) < self.train_start:

And finally, we are modifying our def run(self): function:

  • Inserting decay_step = 0 variable and decay_step += 1 function in while loop
  • Instead of old action = self.act(state) function we call it in following way: action, explore_probability = self.act(state, decay_step)
  • We change our logging print line, where before we were printing self.epsilon, now we'll print explore_probability in this place.

So, there weren’t a lot of changes, but we still need to do them, here is the full tutorial code on GitHub link:

Here is results of epsilon greedy strategy playing it for 1000 steps:


I won't do a comparison how our agent performs with epsilon greedy and without greedy, because before we were using very similar strategy to reduce epsilon, so there won't be a significant difference. But for your interest above is the results how epsilon greedy performs.

In this short tutorial we simply prepared our code for coming tutorial, where we'll integrate Priority Experience Replay method to our agent, with this method our agent should improve significantly, so keep watching!