Reinforcement Learning tutorial

Posted November 19 by Rokas Balsys

DQN PER with Convolutional Neural Networks

In this tutorial, I am going to show you how to implement one of the most groundbreaking Reinforcement Learning algorithms - DQN from pixels. By the end of this tutorial, you will be able to create an agent that successfully plays almost ‘any’ game using only pixel inputs.

In all my previous DQN tutorials we used game-specific inputs (like cart position or pole angle); now we are going to be more general and use something that all games have in common - pixels. To begin, I would like to come back to our first DQN tutorial, where we wrote our first agent code that takes random actions. Now we'll do the same thing, but instead of using environment-specific inputs, we'll use pixels. This is our random action code from the first tutorial:

import gym
import random

env = gym.make("CartPole-v0")

def Random_games():
    for episode in range(10):
        env.reset()
        for t in range(500):
            action = env.action_space.sample()
            next_state, reward, done, info = env.step(action)
            print(t, next_state, reward, done, info, action)
            if done:
                break

This short code played random games for us; now we need to build our own render (GetImage), reset and step functions to work with pixels.

Previously, we used the env.render() function whenever we wanted to see how our agent plays the game. The gym documentation says that calling env.render(mode='rgb_array') returns the rendered frame as an image, so this is how we'll get our pixels. Yes, it's unfortunate that we'll need to render every frame of our game - it will be much slower than training without rendering - but there is no other way to get game frames in a real situation. We could use a different gym environment that gives us frames without rendering, but for now we'll stick with this real-life example. So, here is the call to get the rendered pixels:

img = env.render(mode='rgb_array') 

We'll see the following RGB image:


The CartPole gym environment renders 600x400 RGB frames (arrays of shape 400x600x3). That's way too many pixels for such a simple task - more than we need. We'll convert the image to grayscale and downsize it with the following lines:

import cv2

img_rgb = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
# note: cv2.resize expects (width, height), so (240, 160) yields a 160x240 array
img_rgb_resized = cv2.resize(img_rgb, (240, 160), interpolation=cv2.INTER_CUBIC)

And we'll receive the following result:


This will use fewer resources during training because the image has a single channel and a lower resolution, but to make everything simpler we'll also threshold it so the cartpole becomes solid black:

img_rgb_resized[img_rgb_resized < 255] = 0

So, for the final training results, we'll use the following images:


But do we have all the information needed to determine what’s really going on? No - we don’t have any information about the movement of the game objects, and I hope we can all agree that movement is crucial in the world of games. So, how can we overcome this?

Let’s stack two consecutive frames for each observation:


Okay, now we can see the direction and the velocity of the pole. But do we know its acceleration? No. Let’s stack three frames then:


Now we can derive the direction, velocity and acceleration of the moving objects. But since not every game is rendered at the same pace, we’ll keep 4 frames - just to be sure that we have all the necessary information:


Assuming that for each step we store the last 4 frames, our input shape will be 4x160x240 (channels first). So how do we implement these steps? First, we define our 4-frame image memory:
image_memory = np.zeros((4, 160, 240))

Every time before adding an image to our image_memory, we need to shift the existing data back by 1 frame (similar to how a deque works), in the following way:
image_memory = np.roll(image_memory, 1, axis = 0)

The last step is to add the new image to the freed slot:
image_memory[0,:,:] = img_rgb_resized
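The roll-then-overwrite pattern above can be verified with a tiny sketch (pure NumPy, using small 2x2 "frames" for readability):

```python
import numpy as np

image_memory = np.zeros((4, 2, 2))

# Push three fake frames; the newest always lands at index 0
for value in (1.0, 2.0, 3.0):
    image_memory = np.roll(image_memory, 1, axis=0)  # shift all frames back one slot
    image_memory[0, :, :] = value                    # overwrite the freed slot

print(image_memory[:, 0, 0])  # [3. 2. 1. 0.] - newest first, oldest last
```

Note that the oldest frame (rolled to index 3 and then beyond) is silently overwritten, which is exactly the behavior we want.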

So, this is how the full code looks for random steps:

import gym
import random
import numpy as np
import cv2

class DQN_CNN_Agent:
    def __init__(self, env_name):
        self.env_name = env_name       
        self.env = gym.make(env_name)
        self.ROWS = 160
        self.COLS = 240
        self.REM_STEP = 4

        self.EPISODES = 10

        self.image_memory = np.zeros((self.REM_STEP, self.ROWS, self.COLS))

    def imshow(self, image, rem_step=0):
        cv2.imshow(self.env_name + str(rem_step), image[rem_step, ...])
        if cv2.waitKey(25) & 0xFF == ord("q"):
            cv2.destroyAllWindows()

    def GetImage(self):
        img = self.env.render(mode='rgb_array')
        img_rgb = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
        img_rgb_resized = cv2.resize(img_rgb, (self.COLS, self.ROWS), interpolation=cv2.INTER_CUBIC)
        img_rgb_resized[img_rgb_resized < 255] = 0
        img_rgb_resized = img_rgb_resized / 255

        self.image_memory = np.roll(self.image_memory, 1, axis = 0)
        self.image_memory[0,:,:] = img_rgb_resized

        return np.expand_dims(self.image_memory, axis=0)

    def reset(self):
        self.env.reset()
        for i in range(self.REM_STEP):
            state = self.GetImage()
        return state

    def step(self,action):
        next_state, reward, done, info = self.env.step(action)
        next_state = self.GetImage()
        return next_state, reward, done, info

    def run(self):
        # Each episode is its own game.
        for episode in range(self.EPISODES):
            state = self.reset()
            # This is each frame, up to 500... but we won't make it that far with random actions.
            for t in range(500):
                # This will just create a sample action in any environment.
                # In this environment, the action can be 0 or 1, which is left or right
                action = self.env.action_space.sample()

                # this executes the environment with an action,
                # and returns the observation of the environment,
                # the reward, if the env is over, and other info.
                next_state, reward, done, info = self.step(action)
                # let's print everything in one line:
                #print(t, next_state, reward, done, info, action)
                if done:
                    break

if __name__ == "__main__":
    env_name = 'CartPole-v1'
    agent = DQN_CNN_Agent(env_name)
    agent.run()

You can test this random code yourself; it's quite simple.

DQN with Convolutional Neural Network:

Before merging everything into one code base, we must make one major improvement - implement Convolutional Neural Networks (CNNs) in our current code. If you are not familiar with CNNs, I have several tutorials about them; I recommend checking them out before moving forward.

So, first we'll implement a few more functions for our CNN. We'll start with the imports:

from keras.models import Model
from keras.layers import Input, Dense, Flatten, Conv2D, Lambda, Add
from keras.optimizers import RMSprop
from keras import backend as K

Below is the agent model, already modified so that we can use a CNN. The input to the neural network consists of a 160 x 240 x 4 image. The first hidden layer convolves 64 filters of 5 x 5 with stride 3 over the input image and applies a ReLU activation. The second hidden layer convolves 64 filters of 4 x 4 with stride 2, again followed by a ReLU activation. This is followed by a third convolutional layer that convolves 64 filters of 3 x 3 with stride 1, followed by a ReLU activation. The convolutional output is then flattened, and after that come the same dense layers as before:

def OurModel(input_shape, action_space, dueling):
    X_input = Input(input_shape)
    X = X_input
    X = Conv2D(64, 5, strides=(3, 3), padding="valid", activation="relu", data_format="channels_first")(X)
    X = Conv2D(64, 4, strides=(2, 2), padding="valid", activation="relu", data_format="channels_first")(X)
    X = Conv2D(64, 3, strides=(1, 1), padding="valid", activation="relu", data_format="channels_first")(X)
    X = Flatten()(X)
    # 'Dense' is the basic form of a neural network layer
    # Hidden Layer with 512 nodes
    X = Dense(512, activation="relu", kernel_initializer='he_uniform')(X)

    # Hidden layer with 256 nodes
    X = Dense(256, activation="relu", kernel_initializer='he_uniform')(X)
    # Hidden layer with 64 nodes
    X = Dense(64, activation="relu", kernel_initializer='he_uniform')(X)

    if dueling:
        state_value = Dense(1, kernel_initializer='he_uniform')(X)
        state_value = Lambda(lambda s: K.expand_dims(s[:, 0], -1), output_shape=(action_space,))(state_value)

        action_advantage = Dense(action_space, kernel_initializer='he_uniform')(X)
        action_advantage = Lambda(lambda a: a[:, :] - K.mean(a[:, :], keepdims=True), output_shape=(action_space,))(action_advantage)

        X = Add()([state_value, action_advantage])
    else:
        # Output Layer with # of actions: 2 nodes (left, right)
        X = Dense(action_space, activation="linear", kernel_initializer='he_uniform')(X)

    model = Model(inputs=X_input, outputs=X, name='CartPole_PER_D3QN_CNN_model')
    model.compile(loss="mean_squared_error", optimizer=RMSprop(lr=0.00025, rho=0.95, epsilon=0.01), metrics=["accuracy"])

    return model
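As a sanity check on the architecture, the spatial size after each "valid" convolution follows floor((n - k) / s) + 1. A quick sketch of that arithmetic for our 160 x 240 input (pure Python, no Keras needed):

```python
def conv_out(n, kernel, stride):
    """Output size of a 'valid' (no padding) convolution along one dimension."""
    return (n - kernel) // stride + 1

rows, cols = 160, 240
# (kernel, stride) pairs for the three conv layers above
for kernel, stride in [(5, 3), (4, 2), (3, 1)]:
    rows, cols = conv_out(rows, kernel, stride), conv_out(cols, kernel, stride)
    print(kernel, stride, (rows, cols))  # 52x79, then 25x38, then 23x36

flat = 64 * rows * cols  # 64 filters in the last conv layer
print(flat)  # 52992 units feeding the Dense(512) layer
```

So the Flatten layer hands roughly 53k features to the first dense layer - far more manageable than flattening the raw 4x160x240 input directly.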

Now it's time to merge above random code functions (GetImage, reset, step) with our previous tutorials code.

From our previous tutorial we'll remove the old self.state_size definition; we don't need it anymore because our input will be pixels. Instead, we'll use some newly defined variables:

self.ROWS = 160
self.COLS = 240
self.REM_STEP = 4

self.image_memory = np.zeros((self.REM_STEP, self.ROWS, self.COLS))
self.state_size = (self.REM_STEP, self.ROWS, self.COLS)

Here we defined the image size that we'll feed to our Neural Network; this is why we needed to redefine self.state_size. Because of this change, we must change self.model and self.target_model to the following:

self.model = OurCnnModel(input_shape=self.state_size, action_space = self.action_size, dueling = self.dueling)
self.target_model = OurCnnModel(input_shape=self.state_size, action_space = self.action_size, dueling = self.dueling)

Because we changed the inputs to our model, we must make changes in the replay function:

  • state = np.zeros((self.batch_size, self.state_size)) we change to state = np.zeros((self.batch_size,) + self.state_size)
  • next_state = np.zeros((self.batch_size, self.state_size)) we change to next_state = np.zeros((self.batch_size,) + self.state_size)
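The reason for this change is that self.state_size is now a tuple rather than a single integer, so it must be concatenated onto the batch dimension instead of being passed as a second argument. A minimal sketch:

```python
import numpy as np

batch_size = 32
state_size = (4, 160, 240)  # (REM_STEP, ROWS, COLS)

# (batch_size,) + state_size concatenates the tuples into (32, 4, 160, 240)
state = np.zeros((batch_size,) + state_size)
print(state.shape)  # (32, 4, 160, 240)
```

Passing the old form np.zeros((batch_size, state_size)) would fail here, because NumPy expects a flat tuple of dimensions, not a nested one.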

Also, as I mentioned, while merging the above code with the previous tutorial's code, we make these changes in the main run function:

  • state = self.env.reset() to state = self.reset()
  • self.env.step(action) to self.step(action)

The full tutorial code is on GitHub:

DQN CNN agent performance

I think you are really curious how our new agent performed with pixel inputs alone. I'll be honest: not as well as I wanted/expected. But at least we can see that it works, and there is a lot of room for improvement.

So, the same as before, I trained the agent, this time for 500 episodes to save time. But as I said, this time we save the current frame plus 3 history frames, so I ran training tests with 1, 2, 3 and 4 stacked frames. Results:


From the above chart, we can see that our agent with just one input frame (channels 0) performed the worst; even our random agent could perform similarly. We can also see that agents with more than one channel perform better, and we can assume that the more channels our agent sees, the better it performs. But there is one downside: the more channels it has, the more resources are needed for training. So, I tried to train the 4-channel agent for longer, around 3600 training steps:


The red dots in the graph are actual episode scores; the blue line is the 50-game moving average score. We can say that our agent reached its best performance at around 1000 training steps and then stopped improving. Our average game score was around 100. That's much worse than using the 4 environment parameters as input, but in real life it's impossible to get such ideal input parameters - they always have some kind of noise or error.
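For reference, a 50-game moving average like the blue line can be computed with a simple NumPy one-liner (a sketch of the idea; the exact plotting code isn't shown here):

```python
import numpy as np

def moving_average(scores, window=50):
    """Trailing moving average over the last `window` scores."""
    scores = np.asarray(scores, dtype=float)
    # Convolving with a normalized window of ones averages each span of `window` scores
    return np.convolve(scores, np.ones(window) / window, mode='valid')

scores = [100.0] * 60
print(moving_average(scores)[:3])  # approximately 100.0 everywhere for a constant series
```

With mode='valid', the average is only reported once a full window of games is available, which is why the smoothed line starts later than the raw scores.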


That’s all! We have just created a smarter agent that learns to play the cartpole balancing game from image pixels. Awesome! Remember that if you want an agent with really good performance on pixel data, you need many more GPU hours (about two days of training).

Don’t forget to implement each part of the code by yourself. It’s important to try to modify the code I gave you: try changing the NN architecture, changing the learning rate, using a harder environment, and so on.

Remember that this was a quite long tutorial series, so be sure you really understand why we use these new strategies, how they work, and the advantages of using them.

If you don't understand how everything works up to this point, it will be much harder later to analyze and understand more complicated systems.

In the next tutorial I will cover more reinforcement learning strategies on simple environments, and only when we have covered all the best-known and most popular strategies will we try reinforcement learning on more complicated tasks.