BipedalWalker-v3 with Continuous Proximal Policy Optimization

Posted November 23, 2020 by Rokas Balsys

Continuous Proximal Policy Optimization Tutorial with OpenAI gym environment

In this tutorial, we'll learn more about continuous Reinforcement Learning agents and how to teach BipedalWalker-v3 to walk!

First of all, I should mention that this tutorial is a continuation of my previous tutorial, where I covered PPO with discrete actions.

To develop a continuous action space Proximal Policy Optimization algorithm, first, we must understand what is the difference between them. Because LunarLander-v2 environment has also and continuous environment called LunarLanderContinuous-v2, I'll mention what is the difference between them:

  • LunarLander-v2 has a Discrete(4) action space. This means, there are 4 outputs (left engine, right engine, main engine, and do nothing) and we send to the environment which one we want to execute. So in practice, we pick the action with the biggest value and send to the environment one number from 0 to 3;
  • LunarLanderContinuous-v2 has a Box(2) output. This means, there are 2 continuous outputs, where we need to send both outputs and this will control the ship. The first one controls the main engine and the second one controls the left and right engines. The expected range of the output is Box[-1..1, -1..1], so we need to normalize the network to give us values inside that range (using TanH for example).

Here is the quote from LunarLanderContinuous-v2 gym website:

Landing pad is always at coordinates (0,0). Coordinates are the first two numbers in state vector. Reward for moving from the top of the screen to landing pad and zero speed is about 100..140 points. If lander moves away from landing pad it loses reward back. Episode finishes if the lander crashes or comes to rest, receiving additional -100 or +100 points. Each leg ground contact is +10. Firing main engine is -0.3 points each frame. Solved is 200 points. Landing outside landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land on its first attempt. Action is two real values vector from -1 to +1. First controls main engine, -1..0 off, 0..+1 throttle from 50% to 100% power. Engine can't work with less than 50% power. Second value -1.0..-0.5 fire left engine, +0.5..+1.0 fire right engine, -0.5..0.5 off.

But this tutorial is not about LunarLanderContinuous-v2, now we'll try to learn to control two legs BipedalWalker robot!


Trained BipedalWalker-v3 agent

Reinforcement Learning in the real world is still an ill-defined problem. The agent has to be greedy, but not too greedy... One might conjecture that an optimal agent should have bayesian behavior, which again is not always what we want, nor the design goal of our brain. We want the agent to be curious so they could exploit the environment whenever possible, but not too curious so that they will continue to work for us.

If you were the head of a company, it could all be compared to training your employee. You want your employee to be exceptionally efficient at his job, while at the same time you want them to stay working for you. Which is hard, if not impossible. (unless you're Google… of course)

BipedalWalker-v3 environment

BipedalWalker-v3 is a very hard environment in the Gym. Your agent should run very fast, should not trip himself off, should use as little energy as possible. If you were checking the link, you may have noticed that it's v2 not v3, that I wrote. But this is because Gym didn't update the link, after installing the v2 environment simply doesn't work…

BipedalWalkerHardcore-v3 rendered environment (image by author)

It's quite hard to find online, what does it mean "Solving" environment, but I found it! The environment requires an average total reward of over 300 over 100 consecutive episodes to be considered as finished, which is incredibly difficult (less than 10 people solved it on Gym). I am not sure if the results are legit and official, but I found them on the following GitHub page.

My implemented PPO agent on this tutorial can't break the 300 score mark, but I hope to do that in my next tutorial. Simply while training my agent the performance stops improving and from time to time agent starts making stupid mistakes and falls, this repeats again and again.

Environment and walking strategies

Here is the quote from the openAI wiki page:

Reward is given for moving forward, total 300+ points up to the far end. If the robot falls, it gets -100. Applying motor torque costs a small amount of points, more optimal agent will get better score. State consists of hull angle speed, angular velocity, horizontal speed, vertical speed, position of joints and joints angular speed, legs contact with ground, and 10 lidar rangefinder measurements. There's no coordinates in the state vector.


Usually, when we are programming a standard Reinforcement Learning agent, we don't care what is the observation space, our agent should fit to whatever it is, but it's better to know what are the inputs in case if our agent is not learning. So here is the observation table from the same link I mentioned above, with 24 different parameters in one state:

BipedalWalker-v3 Observation space

Also, if you paid attention to the above environment image and observation space table, you may have noticed that there is no information about the terrain in the state. This means that our agent, don't know anything about the way where he is running. He must use lidar to scan the terrain (I think so).

Action Space

BipedalWalker has 2 legs. Each leg has 2 joints. You have to teach the Bipedal-walker to walk by applying the torque on these joints. Therefore the size of our action space is 4 which is the torque applied on 4 joints. You can apply the torque in the range of (-1, 1), as shown in the following table:

BipedalWalker-v3 action space


  • The agent gets a positive reward proportional to the distance walked on the terrain. It can get a total of 300+ reward all the way up to the end;
  • If the agent tumbles, it gets a negative reward of -100;
  • There is some negative reward proportional to the torque applied on the joint. So that agent learns to walk smoothly with minimal torque.

Also, I must mention that there are 2 versions of the Bipedal environment based on terrain type:

  • Slightly uneven terrain (BipedalWalker-v3);
  • Hardcore terrain with ladders, stumps, and pitfalls (BipedalWalkerHardcore-v3).

Walking strategies

There are 4 major strategies to walk, usually, our agent tries all of them during the training process:

Walking strategies (Image by author)

Same as in my previous tutorial, first before moving ahead, let's implement this for a random-action AI agent interacting with this environment. Create a new python file named by copying and executing the following code:

import gym
import random
import numpy as np

env = gym.make("BipedalWalker-v3")

def Random_games():
    # Each of this episode is its own game.
    action_size = env.action_space.shape[0]
    for episode in range(10):
        # this is each frame, up to 500...but we wont make it that far with random.
        while True:
            # This will display the environment
            # Only display if you really want to see it.
            # Takes much longer to display it.
            # This will just create a sample action in any environment.
            # In this environment, the action can be any of one how in list on 4, for example [0 1 0 0]
            action = np.random.uniform(-1.0, 1.0, size=action_size)

            # this executes the environment with an action, 
            # and returns the observation of the environment, 
            # the reward, if the env is over, and other info.
            next_state, reward, done, info = env.step(action)
            # lets print everything in one line:
            #print(reward, action)
            if done:

In the above code, the main code line is: action = np.random.uniform(-1.0, 1.0, size=action_size), where four random numbers between -1 and 1 are generated in NumPy list form:

[ 0.79471074 -0.06168061 0.55740988 -0.74866494]
[ 0.98315975 0.73440604 -0.99219407 -0.966013 ]
[ 0.96877238 0.83464859 -0.47381784 -0.79399857]
[-0.84639089 -0.97504654 0.75642557 -0.95554083]
[ 0.4511413 -0.26661095 -0.36721001 0.20417917]
[ 0.15117511 0.32102874 0.14429348 -0.96409202]
[-0.31694119 -0.9644246 -0.89420344 0.92559095]

Now you should understand why this environment is called continuous, there might be thousands of values between -1 and 1 where the actor should decide what action is best. These numbers control our robot legs, it doesn't make a lot of sense for us, but for our AI agent, it will do! Now we successfully ran BipedalWalker random environment, now we can implement our PPO algorithm for continuous action space.

The Continuous Actor-Critic model's structure

The same as discrete PPO environments, continuous also uses the Actor-Critic approach for the agent. This means, that it uses two models, one called the Actor and the other called Critic:

Actor-Critic BipedalWalker-v3 model structure (Image by author)

The Actor model

The Actor model performs the task of learning what action to take under a particular observed state of the environment. In the BipedalWalker-v3 case, it takes 24 values list (mentioned before) of the game as input which represents the current state of our walker and gives particular actions what legs to move:

Continuous Actor model (Image by author)

Let's implement this by creating an Actor class:

class Actor_Model:
    def __init__(self, input_shape, action_space, lr, optimizer):
        X_input = Input(input_shape)
        self.action_space = action_space
        X = Dense(512, activation="relu", kernel_initializer=tf.random_normal_initializer(stddev=0.01))(X_input)
        X = Dense(256, activation="relu", kernel_initializer=tf.random_normal_initializer(stddev=0.01))(X)
        X = Dense(64, activation="relu", kernel_initializer=tf.random_normal_initializer(stddev=0.01))(X)
        output = Dense(self.action_space, activation="tanh")(X)

        self.Actor = Model(inputs = X_input, outputs = output)
        self.Actor.compile(loss=self.ppo_loss_continuous, optimizer=optimizer(lr=lr))

    def ppo_loss_continuous(self, y_true, y_pred):
        advantages, actions, logp_old_ph, = y_true[:, :1], y_true[:, 1:1+self.action_space], y_true[:, 1+self.action_space]
        LOSS_CLIPPING = 0.2
        logp = self.gaussian_likelihood(actions, y_pred)

        ratio = K.exp(logp - logp_old_ph)

        p1 = ratio * advantages
        p2 = tf.where(advantages > 0, (1.0 + LOSS_CLIPPING)*advantages, (1.0 - LOSS_CLIPPING)*advantages) # minimum advantage

        actor_loss = -K.mean(K.minimum(p1, p2))

        return actor_loss

    def gaussian_likelihood(self, actions, pred): # for keras custom loss
        log_std = -0.5 * np.ones(self.action_space, dtype=np.float32)
        pre_sum = -0.5 * (((actions-pred)/(K.exp(log_std)+1e-8))**2 + 2*log_std + K.log(2*np.pi))
        return K.sum(pre_sum, axis=1)

    def predict(self, state):
        return self.Actor.predict(state)

If we would compare our current actor and it's custom continuous loss function with the discrete PPO actor model, we would see two main differences:

  • In a discrete PPO network, our model has "softmax" activation within output in the last layer:
    output = Dense(self.action_space, activation="softmax")(X)
  • In a continuous PPO network, we use TanH instead:
    output = Dense(self.action_space, activation="tanh")(X)
    also, we can remove the activation function at all, but then we don't know what output our model gonna learn to give us, it might give us actions outside of our action space limits.
  • And the main difference is the custom PPO loss function, which is really hard to explain and moreover to understand… But I will try.

Continuous PPO loss

The action space defines the distribution used for action selection. The distribution is used to sample/select the action based on the specific distribution, and for a given action the distribution defines its probability, this probability is plugged into the formula:

PPO probability formula

Actually, the distribution usually gives us the 'log prob' and we plug it into the loss using exp(). So it's the same formula in both cases (PPO discrete and continuous), just the distribution defines the action selection and it's probability.

For discrete action spaces, we have n linear outputs and a categorical distribution (i.e. the outputs are the logits that define the probabilities using softmax, and we sample based on those. In discrete action space case the probability is just the softmax value of the chosen action).

Continuous action spaces use normal/gaussian distribution. In this case, the model output is the mean+std which defines the normal distribution to use for the action selection. The action is sampled from this distribution. The probability is in this case the probability of the action's value under the given normal distribution. I don't understand all the math behind this, or I am not sure if I understand, so I decided not to try explaining it and mislead you into a bad direction while understanding it. Also, I borrowed the idea of how to implement the Gaussian likelihood function into my code from the stable-baselines GitHub page. But, I couldn't understand and find out how to implement the log_std parameter into my TF2 code, so I used a simpler version of it as a constant parameter. If you have an idea of how to implement log_std I am open to your suggestions, maybe in the future, I'll find a better way… I am sure that this parameter is responsible for continuous action space randomness, I'll show this to you later.

The Critic model

The main role of the Critic in continuous action space doesn't change from discrete: model is learning to evaluate if the action taken by the Actor led our environment to be in a better state or not and give its feedback to the Actor.

Same as before, we send the action predicted by the Actor to our environment and observe what happens in the game. If something positive happens as a result of our action, the environment sends back a positive response in the form of a reward and vice versa if we receive a negative reward. These rewards are taken in by training our Critic model:

Critic model (Image by author)

Action picking

When we are going from discrete action space to continuous, we must change the code part where we are choosing an action. First I'll give a code part to make sense what is the difference between them:

def act_continuous(self, state):
    # Use the network to predict the next action to take, using the model
    pred = self.Actor.predict(state)

    low, high = -1.0, 1.0 # -1 and 1 are boundaries of tanh
    action = pred + np.random.uniform(low, high, size=pred.shape) * self.std
    action = np.clip(action, low, high)

    logp_t = self.gaussian_likelihood(action, pred, self.log_std)

    return action, logp_t

def gaussian_likelihood(self, action, pred, log_std):
    pre_sum = -0.5 * (((action-pred)/(np.exp(log_std)+1e-8))**2 + 2*log_std + np.log(2*np.pi)) 
    return np.sum(pre_sum, axis=1)

def act_discrete(self, state):
    # Use the network to predict the next action to take, using the model
    prediction = self.Actor.predict(state)[0]
    action = np.random.choice(self.action_size, p=prediction)
    action_onehot = np.zeros([self.action_size])
    action_onehot[action] = 1
    return action, action_onehot, prediction

So, here is an example, what happens in discrete action space:

pred = np.array([0.05, 0.85, 0.1]) 
action_size = 3 
action = np.random.choice(a, p=pred) 
action = 1

result>>> 1, because it has the highest probability to be taken

while in continuous we receive the following:

std = 0.3 # example number
random_uniform = [0.5, -0.7, 0.1, 0.9]
pred = [0.79471074, -0.06168061, 0.55740988, -0.74866494]
action = pred + random_uniform * std
action = [0.94471074, -0.27168061, 0.58740988, -0.47866494]

So, these are examples for better understanding. Also in continuous action space, we do a clipping, to make sure our actions are between -1 ar 1 boundaries. For training, we calculate log_pt with the Gaussian Likelihood, as I mentioned before.

Model Training

When we have covered most differences from discrete action space, we can finally start the model training. For this, I use mostly known the fit function of Keras in the following code:

def replay(self, states, actions, rewards, dones, next_states, logp_ts):
    # reshape memory to appropriate shape for training
    states = np.vstack(states)
    next_states = np.vstack(next_states)
    actions = np.vstack(actions)
    logp_ts = np.vstack(logp_ts)

    # Get Critic network predictions 
    values = self.Critic.predict(states)
    next_values = self.Critic.predict(next_states)

    # Compute discounted rewards and advantages
    #discounted_r = self.discount_rewards(rewards)
    #advantages = np.vstack(discounted_r - values)
    advantages, target = self.get_gaes(rewards, dones, np.squeeze(values), np.squeeze(next_values))
    pylab.subplots_adjust(left=0.05, right=0.98, top=0.96, bottom=0.06)
    if str(episode)[-2:] == "00": pylab.savefig(self.env_name+"_"+self.episode+".png")
    # stack everything to numpy array
    # pack all advantages, predictions and actions to y_true and when they are received
    # in custom loss function we unpack it
    y_true = np.hstack([advantages, actions, logp_ts])

    # training Actor and Critic networks
    a_loss =, y_true, epochs=self.epochs, verbose=0, shuffle=self.shuffle)
    c_loss =[states, values], target, epochs=self.epochs, verbose=0, shuffle=self.shuffle)

    # calculate loss parameters (should be done in loss, but couldn't find working way how to do that with disabled eager execution)
    pred = self.Actor.predict(states)
    log_std = -0.5 * np.ones(self.action_size, dtype=np.float32)
    logp = self.gaussian_likelihood(actions, pred, log_std)
    approx_kl = np.mean(logp_ts - logp)
    approx_ent = np.mean(-logp)

    self.writer.add_scalar('Data/actor_loss_per_replay', np.sum(a_loss.history['loss']), self.replay_count)
    self.writer.add_scalar('Data/critic_loss_per_replay', np.sum(c_loss.history['loss']), self.replay_count)
    self.writer.add_scalar('Data/approx_kl_per_replay', approx_kl, self.replay_count)
    self.writer.add_scalar('Data/approx_ent_per_replay', approx_ent, self.replay_count)
    self.replay_count += 1

From the above code, you can see, that everything is quite the same, except that instead of collecting predictions as we did in discrete action space, right now we collect logp_ts. We stack advantages, actions, and logp_ts to NumPy array, and we send these stacked variables to our custom PPO loss function while calling a fit function. After a fit function, you can see a little more code, but this is only for graphs, to track our model training performance.

To glue everything to one simple and working code took me a few weeks, with a lot of experiments. But right now, I can proudly announce that to train BipedalWalker-v3 is not hat hard task as it looked for me in the first days.

My code took only 6k training steps, to achieve maximum results, where 50 episodes moving average score was close to 300, running it on test environment was even better! So, here is the training graph:

Matplotlib visualization of our agent learning to play BipedalWalker-v3

As we can see from the above chart, the hardest task was to learn that we need to move forward somehow, this took around 1k training steps. When our agent understood that he needs to move forward as fast as possible, he started learning the walking strategies I mentioned above. After around 4k training steps, our agent was able to get around a maximum 300 score. And as we can see, the 50 moving average is near 300, that's an amazing result!

After training my agents, I decided to check if my agent can complete 100 episodes with an average score above 300:

Matplotlib visualization of our trained agent playing BipedalWalker-v3

From the results we can see that my moving average is above the 300 score line, this is even better than I thought!


For quite a long time, I was unable to understand the differences between discrete and continuous action spaces in PPO. This BipedalWalker-3 tutorial motivated me to get a better understanding of how everything works and how I can implement everything from scratch. By the way, Gym environments are a great place to test how our algorithm implementations work, without gym I think it wouldn't be so easy. This continuous environment still has few places where I can improve it, but depending on the results I obtained I am really proud to give you my working code. I would appreciate your suggestions, how I could implement Gaussian likelihood correctly, but anyway, even without your suggestions, I will try to do it in my spare time.

Anyway, I learned a lot while writing this tutorial, and I think in the future while working on other projects I am sure I will do them based on this code. There is still a BipedalWalkerHardcore-v3 environment that I could finish, so maybe I will play around with it before moving to the next projects.

I hope this tutorial will give you some kind of understanding of the continuous PPO model, and how to implements it. Do not hesitate to clone my GitHub repository and test it by yourself! See you in the next part!