Reinforcement learning Bitcoin trading bot

Posted December 20, 2020 by Rokas Balsys

Creating a Bitcoin trading bot that could beat the market

In this tutorial, we will continue developing a Bitcoin trading bot, but this time, instead of trading randomly, we'll use the power of reinforcement learning.

The purpose of the previous tutorial and this one is to experiment with state-of-the-art deep reinforcement learning technologies to see if we can create a profitable Bitcoin trading bot. Many articles on the internet state that neural networks can't beat the market. However, recent advances in the field have shown that RL agents are often capable of learning much more than supervised learning agents within the same problem domain. For this reason, I am writing these tutorials to test whether it's possible, and if it is, how profitable we can make these trading bots.

While I won't be creating anything quite as impressive as OpenAI's engineers do, trading Bitcoin profitably on a day-to-day basis is still no easy task!

Plan up to this point

Image by author

Before moving forward, we'll cover what we must do to achieve our goal:

If you are not already familiar with my previous tutorials, Create a custom crypto trading environment from scratch — Bitcoin trading bot example and Visualizing elegant Bitcoin RL trading agent chart using Matplotlib, feel free to pause here and read them before continuing with this (third) tutorial.


For this tutorial, I am going to use the same market history data that I used in my previous tutorial; if you missed where I got it, here is the link. The .csv file will also be available on my GitHub repository along with the full tutorial code if you just want to test it out, but before testing I recommend reading this tutorial. Okay, let's get started.

While writing the code for this tutorial, I realized that installing the required libraries might be hard for people who are beginners in Python, so I added a requirements.txt file to my GitHub repository listing the packages you need to install. So, if you clone my code, make sure to run the pip install -r ./requirements.txt command before testing it; this will install all the required packages for this tutorial.


I would say that this (third) tutorial part requires the least custom programming creativity. I have already programmed everything we need in the past; I simply need to merge two pieces of code with small modifications. Actually, if you have been following me, I have already written and tested a Proximal Policy Optimization (PPO) reinforcement learning agent for the Gym LunarLander-v2 environment. So, I will take that code and merge it with my previous tutorial code.

If you are not familiar with PPO, I recommend reading my previous LunarLander-v2 tutorial; it will help you form an idea of what we are doing here. Differently from my previous tutorial, this time I'll define my model architecture in a separate file called model.py. So I simply copy the Actor_Model and Critic_Model classes into that file and, at the beginning of it, add all the imports necessary to build our code:

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Flatten
from tensorflow.keras import backend as K
#tf.config.experimental_run_functions_eagerly(True) # used for debugging and development
tf.compat.v1.disable_eager_execution() # usually using this for best performance

gpus = tf.config.experimental.list_physical_devices('GPU')
if len(gpus) > 0:
    print(f'GPUs {gpus}')
    try:
        tf.config.experimental.set_memory_growth(gpus[0], True)
    except RuntimeError:
        pass

class Actor_Model:
    def __init__(self, input_shape, action_space, lr, optimizer):
        X_input = Input(input_shape)
        self.action_space = action_space

        X = Flatten(input_shape=input_shape)(X_input)
        X = Dense(512, activation="relu")(X)
        X = Dense(256, activation="relu")(X)
        X = Dense(64, activation="relu")(X)
        output = Dense(self.action_space, activation="softmax")(X)

        self.Actor = Model(inputs = X_input, outputs = output)
        self.Actor.compile(loss=self.ppo_loss, optimizer=optimizer(lr=lr))

    def ppo_loss(self, y_true, y_pred):
        # unpack advantages, old action probabilities and one-hot actions from y_true
        advantages, prediction_picks, actions = y_true[:, :1], y_true[:, 1:1+self.action_space], y_true[:, 1+self.action_space:]
        LOSS_CLIPPING = 0.2
        ENTROPY_LOSS = 0.001
        prob = actions * y_pred
        old_prob = actions * prediction_picks

        prob = K.clip(prob, 1e-10, 1.0)
        old_prob = K.clip(old_prob, 1e-10, 1.0)

        ratio = K.exp(K.log(prob) - K.log(old_prob))
        p1 = ratio * advantages
        p2 = K.clip(ratio, min_value=1 - LOSS_CLIPPING, max_value=1 + LOSS_CLIPPING) * advantages

        actor_loss = -K.mean(K.minimum(p1, p2))

        entropy = -(y_pred * K.log(y_pred + 1e-10))
        entropy = ENTROPY_LOSS * K.mean(entropy)
        total_loss = actor_loss - entropy

        return total_loss

    def predict(self, state):
        return self.Actor.predict(state)

class Critic_Model:
    def __init__(self, input_shape, action_space, lr, optimizer):
        X_input = Input(input_shape)

        V = Flatten(input_shape=input_shape)(X_input)
        V = Dense(512, activation="relu")(V)
        V = Dense(256, activation="relu")(V)
        V = Dense(64, activation="relu")(V)
        value = Dense(1, activation=None)(V)

        self.Critic = Model(inputs=X_input, outputs = value)
        self.Critic.compile(loss=self.critic_PPO2_loss, optimizer=optimizer(lr=lr))

    def critic_PPO2_loss(self, y_true, y_pred):
        value_loss = K.mean((y_true - y_pred) ** 2) # standard PPO loss
        return value_loss

    def predict(self, state):
        return self.Critic.predict(state) # the Critic model has a single input, the state

I am not going to explain here how the PPO model works because, as I said, that's already covered. But you might notice one difference in the code: instead of using X = Dense(512, activation="relu")(X_input) as the input layer, I use X = Flatten(input_shape=input_shape)(X_input). I do so because our model input is 3D with shape (1, 50, 10), which our first Dense layer doesn't understand, so I use the Flatten layer, which gives me a concatenated array of shape (1, 500) — this is the value our model will try to learn from.
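To see the shape arithmetic without building a model, here is a quick numpy check of what Flatten does to the state (numpy reshape as a stand-in for the Keras layer):

```python
import numpy as np

# a fake observation batch shaped like the environment state: (batch, lookback, features)
state = np.zeros((1, 50, 10))

# Flatten keeps the batch dimension and concatenates the 50-step lookback window
# and 10 features into one vector per sample, which Dense(512) can consume
flat = state.reshape((state.shape[0], -1))

print(flat.shape)  # (1, 500)
```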

Ok, coming back to my main script, there are also many upgrades. First of all, to the existing imports I add three more:

from tensorboardX import SummaryWriter
from tensorflow.keras.optimizers import Adam, RMSprop
from model import Actor_Model, Critic_Model

TensorboardX will be used for our Tensorboard logs; you may use it or not, it's up to you, but sometimes it's useful. Next, for our experiments I import the Adam and RMSprop optimizers, so we can swap and experiment with different optimizers from the main script. And lastly, I import our newly created Actor and Critic classes.

In the main script, I add the following code to the CustomEnv class init part:

def __init__(self, df, initial_balance=1000, lookback_window_size=50, Render_range = 100):
    # Neural Networks part below
    self.lr = 0.0001
    self.epochs = 1
    self.normalize_value = 100000
    self.optimizer = Adam

    # Create Actor-Critic network model
    self.Actor = Actor_Model(input_shape=self.state_size, action_space = self.action_space.shape[0], lr = self.lr, optimizer = self.optimizer)
    self.Critic = Critic_Model(input_shape=self.state_size, action_space = self.action_space.shape[0], lr = self.lr, optimizer = self.optimizer)

# create tensorboard writer
def create_writer(self):
    self.replay_count = 0
    self.writer = SummaryWriter(comment="Crypto_trader")

Here I define the learning rate, training epochs, and optimizer for our neural network. I also define a normalization/scaling value, which is typically recommended and sometimes very important. Normalization can be crucial for neural networks: when we feed unnormalized inputs into activation functions, we can get stuck in a very flat region of the domain and may not learn at all, or worse, run into numerical issues. I'll need to dig deeper into normalization later, but for now I'll use the value 100000 because I know there are no bigger numbers in my dataset. Ideally, we should normalize between the min and max values, but what if a future market high exceeds anything in our history? Honestly, with normalization I have more questions than answers that I'll need to address in the future; for now I'll use this hardcoded value.
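The trade-off described above can be shown in a few lines. The prices below are made-up sample values, not real dataset rows; they contrast the tutorial's fixed-constant scaling with min-max scaling, whose output range breaks as soon as an unseen high appears:

```python
import numpy as np

prices = np.array([9200.0, 9500.0, 10400.0, 19800.0, 23500.0])  # made-up BTC closes

# fixed-constant scaling, as used in the tutorial:
# stays in (0, 1) as long as no price ever exceeds the constant
fixed = prices / 100000

# min-max scaling maps the *seen* range exactly onto [0, 1],
# but a future all-time high would land outside [0, 1] at test time
mn, mx = prices.min(), prices.max()
minmax = (prices - mn) / (mx - mn)
```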

More importantly, this is where we create our Actor and Critic classes, which will do all the hard work for us! The replay counter and writer are simply used for our Tensorboard logging, nothing too important.

Next, I copied the replay, act, save and load functions. None of them changed, except replay: at the end of it, I added 3 lines used by my Tensorboard writer to log our actor and critic losses. I also forgot to mention that I added a self.episode_orders attribute to the reset function, which I use in the step function: every time our agent places a buy or sell order, I increment self.episode_orders by one. I use this attribute to track how many orders the agent places during one episode.
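The episode_orders bookkeeping boils down to this (a simplified stand-in, not the actual environment code; the real reset() and step() also handle state, balance and reward):

```python
class OrderCounter:
    """Minimal stand-in for the reset/step bookkeeping described above."""

    def reset(self):
        self.episode_orders = 0  # cleared at the start of every episode

    def step(self, action):
        # action 0 = hold, 1 = buy, 2 = sell (matching the 3-action space)
        if action in (1, 2):
            self.episode_orders += 1

env = OrderCounter()
env.reset()
for a in [1, 0, 2, 2, 0]:
    env.step(a)
print(env.episode_orders)  # 3
```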

Training the agent

I think this is the part most of you have been waiting for: how do we actually train this agent to make profitable trades in the market? Usually, newcomers to reinforcement learning don't know how to train their agent to solve a problem in a concrete environment, so I recommend starting with simple problems and taking small steps towards more difficult environments while improving your scores. This is the reason why I implemented random trading in my first tutorial; now we can build on top of it to give our agent some reasonable actions. So here is our code to train the agent:

def train_agent(env, visualize=False, train_episodes = 50, training_batch_size=500):
    env.create_writer() # create TensorBoard writer
    total_average = deque(maxlen=100) # save recent 100 episodes net worth
    best_average = 0 # used to track best average net worth
    for episode in range(train_episodes):
        state = env.reset(env_steps_size = training_batch_size)

        states, actions, rewards, predictions, dones, next_states = [], [], [], [], [], []
        for t in range(training_batch_size):
            action, prediction = env.act(state)
            next_state, reward, done = env.step(action)
            states.append(np.expand_dims(state, axis=0))
            next_states.append(np.expand_dims(next_state, axis=0))
            action_onehot = np.zeros(3)
            action_onehot[action] = 1
            actions.append(action_onehot)
            rewards.append(reward)
            dones.append(done)
            predictions.append(prediction)
            state = next_state
        env.replay(states, actions, rewards, predictions, dones, next_states)
        total_average.append(env.net_worth)
        average = np.average(total_average)
        env.writer.add_scalar('Data/average net_worth', average, episode)
        env.writer.add_scalar('Data/episode_orders', env.episode_orders, episode)
        print("net worth {} {:.2f} {:.2f} {}".format(episode, env.net_worth, average, env.episode_orders))
        if episode > len(total_average):
            if best_average < average:
                best_average = average
                print("Saving model")
                env.save()
I am not going to explain the training part step by step; if you want to understand everything here, check my reinforcement learning tutorials. But I will give short notes. I use the following lines to log our average net worth and how many orders our agent places during each episode:

env.writer.add_scalar('Data/average net_worth', average, episode)
env.writer.add_scalar('Data/episode_orders', env.episode_orders, episode)

These lines will help us see how our agent is learning.

Also, instead of saving our model every step, I track the best average net worth our model achieves over the last 100 episodes and save only the best one. I am not sure whether this is a good way to evaluate our model, but we'll see…
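Stripped of the environment, the save-only-the-best logic is just a rolling average over a fixed-size deque. A minimal sketch with made-up net-worth numbers (the call to env.save() is replaced by recording the episode index):

```python
from collections import deque

import numpy as np

total_average = deque(maxlen=100)  # keeps only the most recent 100 episodes
best_average = 0
saved_at = []                      # episodes where we would have saved the model

net_worths = [990, 1010, 1005, 1030, 995, 1040]  # hypothetical episode outcomes
for episode, net_worth in enumerate(net_worths):
    total_average.append(net_worth)
    average = np.average(total_average)
    if best_average < average:
        best_average = average     # new best rolling average: checkpoint here
        saved_at.append(episode)
```

Note that episode 4 is skipped: the dip to 995 pulls the rolling average below its previous best, so no checkpoint is written.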

Testing the agent

Testing the agent is just as important as training it, if not more so, so we need to know how to do it. The test agent function is very similar to our random games function:

def test_agent(env, visualize=True, test_episodes=10):
    env.load() # load the model
    average_net_worth = 0
    for episode in range(test_episodes):
        state = env.reset()
        while True:
            action, prediction = env.act(state)
            state, reward, done = env.step(action)
            if env.current_step == env.end_step:
                average_net_worth += env.net_worth
                print("net_worth:", episode, env.net_worth, env.episode_orders)
                break # episode finished, start the next one
    print("average {} episodes agent net_worth: {}".format(test_episodes, average_net_worth/test_episodes))

There are only two differences:

  • At the beginning of the function, we load our trained model weights;
  • Second, instead of making random actions (action = np.random.randint(3, size=1)[0]) we use a trained model to predict the action (action, prediction = env.act(state));
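The two action-selection styles can be contrasted in isolation (numpy only; softmax_probs here is a made-up policy output, not a real model prediction):

```python
import numpy as np

rng = np.random.default_rng(0)

# random agent: every action (0 = hold, 1 = buy, 2 = sell) is equally likely
random_action = rng.integers(3)

# trained agent: env.act(state) samples from the actor's softmax output,
# so actions the policy favours are picked more often
softmax_probs = np.array([0.1, 0.7, 0.2])  # hypothetical actor output
policy_action = rng.choice(3, p=softmax_probs)
```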

The fun begins - the training part

One of the biggest mistakes I see others make when writing market prediction scripts is not splitting the data into a training set and a test set. I think it's obvious that a model will perform nicely on data it has already seen. The purpose of splitting the dataset into training and testing sets is to test the accuracy of our final model on fresh data it has never seen before. Since we are using time series data, we don't have many options when it comes to cross-validation.

For example, one common form of cross-validation is called k-fold validation: the data is split into k equal groups, and one by one, a group is singled out as the test group while the rest of the data is used for training. However, time-series data is highly time-dependent, meaning later data depends heavily on earlier data. So k-fold won't work: our agent would learn from future data before having to trade on it, which is an unfair advantage.

So, we are left with simply taking a slice of the full data frame to use as the training set from the beginning of the frame up to some arbitrary index, and using the rest of the data as the test set:

df = pd.read_csv('./pricedata.csv')
df = df.sort_values('Date')
lookback_window_size = 50
train_df = df[:-720-lookback_window_size]
test_df = df[-720-lookback_window_size:] # 30 days
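The extra lookback_window_size rows in the test slice are the warm-up history the environment needs before its first tradable step. A quick check of the slicing on a stand-in list (2000 is a hypothetical dataset length, one integer per hourly candle):

```python
candles = list(range(2000))   # hypothetical number of hourly candles
lookback_window_size = 50

train_rows = candles[:-720 - lookback_window_size]
test_rows = candles[-720 - lookback_window_size:]

# the test slice holds 720 tradable hourly steps (30 days) plus 50 warm-up
# rows that only fill the first lookback window; train and test meet exactly
# at one boundary, so no future data leaks into training
print(len(test_rows))                    # 770
print(train_rows[-1] + 1 == test_rows[0])  # True
```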

Since our environment is only set up to handle a single data frame, we create two environments, one for the training data and one for the test data:

train_env = CustomEnv(train_df, lookback_window_size=lookback_window_size)
test_env = CustomEnv(test_df, lookback_window_size=lookback_window_size)

Now, training our model is as simple as creating an agent with our environment and calling the above-created training function:

train_agent(train_env, visualize=False, train_episodes=20000, training_batch_size=500)

I am not sure how long I should train it, but I chose to train for 20k episodes. Let's see how the training process looks in Tensorboard by running tensorboard --logdir runs in a terminal and opening http://localhost:6006/ in a browser.

Image from Tensorboard by Author

In the above image, you can see my Tensorboard results from training our agent for 20k episodes. We can definitely say that our agent is learning something, but it's quite hard to tell what. As expected, actor loss goes up while critic loss goes down and stabilizes over time. The most interesting charts for me are episode orders and average net worth. We can see that our average net worth goes up, but only by a few percent over the initial balance; still, that's a profit! The graphs also show that our agent decided it's better to place fewer orders rather than more; at one point during training, the agent even thought that the best strategy was simply holding, but we don't want that. What I really wanted to answer from this chart was: can our agent learn something? The answer was YES!

Test with unseen data

First, let's see how the agent performs on data it has never seen before:

GIF by Author

Yes, this is only a short GIF; if you would like to see more, the best option is to watch my YouTube video, where I show and explain everything, or you can simply clone my GitHub repository and test this agent yourself.

Ok, let's evaluate our agent and check whether we can beat a random agent over 1000 episodes with the following two commands:

test_agent(test_env, visualize=False, test_episodes=1000)
Random_games(test_env, visualize=False, train_episodes = 1000)

And here are the results of our trained agent:

average 1000 episodes agent net_worth: 1043.4012850463675

And here are our agent random results:

average 1000 episodes random net_worth: 1024.3194149765457

Considering that this average profit was made in one month, 4% sounds like quite a nice result. The only reason our random agent also made a nice profit, I think, is that the trend was up. But looking at my trading agent's GIF above, its behavior is quite strange: the agent doesn't like holding with zero open orders, and I noticed that right after closing an open position, it quickly opens another buy. This would never lead to good profits, so we'll need to analyze and solve this problem.
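The 4% figure comes straight from the averages above. The arithmetic, plus what that would compound to over a year if (a big if) the edge persisted on future data:

```python
initial_balance = 1000
agent_net_worth = 1043.40   # average over the 1000 test episodes, rounded
random_net_worth = 1024.32  # random agent's average, rounded

monthly_return = agent_net_worth / initial_balance - 1              # ~4.3% in one month
edge_over_random = (agent_net_worth - random_net_worth) / initial_balance  # ~1.9%

# compounded over 12 months, purely hypothetical extrapolation
annualized = (1 + monthly_return) ** 12 - 1
```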


This was quite a long tutorial, so we'll stop here. We achieved our goal: we created a Bitcoin trading agent from scratch, using deep reinforcement learning, that could beat the market!

While our trading agent isn't quite as profitable as we'd hoped, it is definitely getting somewhere. I am not sure what I'll try in the next article, but I am sure I'll test some different reward strategies and maybe do some model optimizations. We all know we must get better profits on unseen data, so I'll work on it!

Thanks for reading! As always, all the code given in this tutorial can be found on my GitHub page and is free to use! See you in the next part, where we’ll try to use more techniques to improve our agent!

All of these tutorials are for educational purposes, and should not be taken as trading advice. You should not trade based on any algorithms or strategies defined in this, previous, or future tutorials, as you are likely to lose your investment.