Reinforcement learning Bitcoin trading bot

Posted January 20, 2021 by Rokas Balsys


Optimizing Bitcoin trading bot model and reward strategy to increase profitability

Let’s improve our deep RL Bitcoin trading agent code to make even more money with a better reward strategy and by testing different model structures.

In the previous tutorial, we used deep reinforcement learning to create a Bitcoin trading agent that could beat the market. Although our agent was profitable compared to random actions, the results weren't all that impressive, so this time we're going to step it up: we'll implement a few more improvements to the reward system and test how profitability depends on the Neural Network model structure. We'll test everything with our code, and of course you'll be able to download everything from my GitHub repository.


Reward Optimization
risk_reward.png

I'd like to mention that while writing my reward strategy it was quite hard to find out what reward strategies others use for reinforcement learning in automated trading. When I could find something, the strategies were poorly documented and quite complicated to understand. I believe there are a lot of interesting and successful solutions out there, but for this tutorial I decided to rely on my own intuition and try my own strategy.

Someone might think that our reward function from the previous article (i.e. calculating the net worth change every step) is the best we can do; however, this is far from the truth. While our simple reward function from last time was able to make small profits, it would often lead to capital losses. To improve on this, we need to consider other metrics to reward besides simple unrealized profit.

The main improvement that comes to mind is that we must not only reward profits from holding BTC while it is increasing in price, but also reward profits from not holding BTC while it is decreasing in price. For example, we could reward our agent for any incremental increase in net worth while it is holding a BTC/USD position, and again reward it for the incremental decrease in the value of BTC/USD while it is not holding any position.
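Before moving to the order-based reward used below, here is a minimal sketch of that step-wise idea, assuming the environment attributes from the previous tutorial (self.crypto_held, self.balance, self.net_worth, self.prev_net_worth); it is only an illustration, not the strategy we end up using:

# A hypothetical step-wise variant of the idea above (not the final strategy used below)
def get_stepwise_reward(self, current_price, prev_price):
    if self.crypto_held > 0:
        # holding BTC: reward the incremental change in net worth
        return self.net_worth - self.prev_net_worth
    else:
        # not holding BTC: reward the drop we avoided, scaled to how much BTC
        # our balance could have bought (negative when the price rises)
        return (prev_price - current_price) * (self.balance / prev_price)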

In practice, however, we'll implement this by calculating our reward only when we sell our held Bitcoin or when we buy Bitcoin back after the price drops. Between orders, while our agent does nothing, we won't hand out any reward, because those intermediate rewards get distributed by the discount function anyway. When I first used this reward strategy, I noticed that my agent usually learned to hold instead of learning to make profitable orders, so I decided to punish it for doing nothing: I subtract a small fraction of the net worth (0.001%, i.e. net_worth * 0.00001, as in the code below) every step. This way our agent learned that it's not the best idea to keep holding Bitcoin or keep positions open forever, and it also understood that sometimes it's better to cut a loss and wait for another opportunity to make an order.

So here is the code of my custom reward function:

# Calculate reward
def get_reward(self):
    # accumulate a small punishment for every step without a new order
    self.punish_value += self.net_worth * 0.00001
    if self.episode_orders > 1 and self.episode_orders > self.prev_episode_orders:
        # a new order was just made, so a reward can be calculated
        self.prev_episode_orders = self.episode_orders
        if self.trades[-1]['type'] == "buy" and self.trades[-2]['type'] == "sell":
            # sold high, bought back lower: reward what we avoided losing while out of the market
            reward = self.trades[-2]['total']*self.trades[-2]['current_price'] - self.trades[-2]['total']*self.trades[-1]['current_price']
            reward -= self.punish_value
            self.punish_value = 0
            self.trades[-1]["Reward"] = reward
            return reward
        elif self.trades[-1]['type'] == "sell" and self.trades[-2]['type'] == "buy":
            # bought low, sold higher: reward the realized profit
            reward = self.trades[-1]['total']*self.trades[-1]['current_price'] - self.trades[-2]['total']*self.trades[-2]['current_price']
            reward -= self.punish_value
            self.trades[-1]["Reward"] = reward
            self.punish_value = 0
            return reward
    else:
        return 0 - self.punish_value

If you are reading this tutorial without being familiar with my previous tutorials or the full code, this snippet will probably be hard to follow, so check the code on GitHub or the previous tutorials first.

As you can see from the above code, the first thing we do is accumulate the punish value, which grows every step; as soon as an order is made, we reset it to zero. Looking at the if statements, there are two of them: when we buy right after a sell, and vice versa. You may ask why I sometimes use self.trades[-2] and sometimes self.trades[-1]. This is done because we want to calculate the reward of positions that aren't really open: we can't make an actual profit by selling our Bitcoin and buying it back later (except with margin trading), but this way we can calculate how much we avoided losing by selling high and buying back low.
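To make both branches concrete, here is a small worked example with made-up prices and amounts (before subtracting the accumulated punish value):

# Hypothetical numbers, only to illustrate the two branches of get_reward()

# 1) sell -> buy: sold 0.05 BTC at 40000, bought back at 38000
#    reward = trades[-2]['total']*trades[-2]['current_price'] - trades[-2]['total']*trades[-1]['current_price']
reward = 0.05*40000 - 0.05*38000   # = 2000 - 1900 = 100 (what we avoided losing)

# 2) buy -> sell: bought 0.05 BTC at 38000, sold at 39000
#    reward = trades[-1]['total']*trades[-1]['current_price'] - trades[-2]['total']*trades[-2]['current_price']
reward = 0.05*39000 - 0.05*38000   # = 1950 - 1900 = 50 (realized profit)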

While I was developing this strategy, it was quite tricky to tell whether I had implemented it correctly, so I decided to improve the render function in utils.py by changing the following code:

# sort sell and buy orders, put arrows in appropriate order positions
for trade in trades:
    trade_date = mpl_dates.date2num([pd.to_datetime(trade['Date'])])[0]
    if trade_date in Date_Render_range:
        if trade['type'] == 'buy':
            high_low = trade['Low']-10
            self.ax1.scatter(trade_date, high_low, c='green', label='green', s = 120, edgecolors='none', marker="^")
        else:
            high_low = trade['High']+10
            self.ax1.scatter(trade_date, high_low, c='red', label='red', s = 120, edgecolors='none', marker="v")

to this more dynamic version:


minimum = np.min(np.array(self.render_data)[:,1:])
maximum = np.max(np.array(self.render_data)[:,1:])
RANGE = maximum - minimum

# sort sell and buy orders, put arrows in appropriate order positions
for trade in trades:
    trade_date = mpl_dates.date2num([pd.to_datetime(trade['Date'])])[0]
    if trade_date in Date_Render_range:
        if trade['type'] == 'buy':
            high_low = trade['Low'] - RANGE*0.02
            ycoords = trade['Low'] - RANGE*0.08
            self.ax1.scatter(trade_date, high_low, c='green', label='green', s = 120, edgecolors='none', marker="^")
        else:
            high_low = trade['High'] + RANGE*0.02
            ycoords = trade['High'] + RANGE*0.06
            self.ax1.scatter(trade_date, high_low, c='red', label='red', s = 120, edgecolors='none', marker="v")

        if self.Show_reward:
            try:
                self.ax1.annotate('{0:.2f}'.format(trade['Reward']), (trade_date-0.02, high_low), xytext=(trade_date-0.02, ycoords),
                                           bbox=dict(boxstyle='round', fc='w', ec='k', lw=1), fontsize="small")
            except: # a trade might not have a 'Reward' key yet
                pass

As you can see, instead of using a hardcoded offset within high_low, I now calculate that range from the rendered data; this way we could use the same rendering code for different trading pairs (not tested yet) without modifying the offset. But most importantly, I am adding the reward number below the buy order arrow and above the sell order arrow. This helped me to check whether I implemented my reward function correctly; take a look:

new_reward_strategy.png

Looking at this chart, it's quite obvious where our agent made profitable orders and where they were really bad; this is the information we'll use to train our agent.


Model modifications

We probably all know that our decisions mostly depend on our knowledge of the good and bad decisions we've made in our life, how fast we learn new things, and so on; all of this depends on how our brain works. It's the same with the model: everything depends on what "brain" it has and how well it is trained. The biggest problem is that we don't know what architecture we should use for our model to beat the market, so the only way left is to try different architectures.

We already use basic Dense layers for our Actor and Critic neural networks:

# Critic model
X = Flatten()(X_input)
V = Dense(512, activation="relu")(X)
V = Dense(256, activation="relu")(V)
V = Dense(64, activation="relu")(V)
value = Dense(1, activation=None)(V)

# Actor model
X = Flatten()(X_input)
A = Dense(512, activation="relu")(X)
A = Dense(256, activation="relu")(A)
A = Dense(64, activation="relu")(A)

This is one of the basic architectures we tried. But somewhere on the internet I read that it's a good idea to have some shared layers between the Actor and the Critic:

# Shared Dense layers:
X = Flatten()(X_input)
X = Dense(512, activation="relu")(X)

# Critic model
V = Dense(512, activation="relu")(X)
V = Dense(256, activation="relu")(V)
V = Dense(64, activation="relu")(V)
value = Dense(1, activation=None)(V)

# Actor model
A = Dense(512, activation="relu")(X)
A = Dense(256, activation="relu")(A)
A = Dense(64, activation="relu")(A)

This means that, for example, the first layer (or first few layers) is used by both networks, and only the following (head) layers are kept separate; here is an example image:

shared_model_structure.png
Critic-Actor architecture with a shared layer
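To make the picture concrete, here is a rough sketch of my own (using the TF2 Keras functional API; the input shape, action count, and variable names are assumptions) of how the shared base from the snippet above could be wired into two separate Keras models that reuse the same layer objects, so the base weights are shared:

from tensorflow.keras.layers import Input, Flatten, Dense
from tensorflow.keras.models import Model

# Assumed state shape: (lookback_window_size, n_features)
X_input = Input(shape=(50, 10))

# Shared base (the same layer objects feed both heads, so their weights are shared)
X = Flatten()(X_input)
X = Dense(512, activation="relu")(X)

# Critic head
V = Dense(512, activation="relu")(X)
V = Dense(256, activation="relu")(V)
V = Dense(64, activation="relu")(V)
value = Dense(1, activation=None)(V)

# Actor head (assuming 3 actions: hold, buy, sell)
A = Dense(512, activation="relu")(X)
A = Dense(256, activation="relu")(A)
A = Dense(64, activation="relu")(A)
output = Dense(3, activation="softmax")(A)

Actor = Model(inputs=X_input, outputs=output)
Critic = Model(inputs=X_input, outputs=value)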

We will also try recurrent neural networks, which are designed for time series, and convolutional neural networks, which are mostly used for image classification and detection.


Recurrent Networks


One of the obvious changes we need to test is updating our model to use a recurrent Long Short-Term Memory (LSTM) network in place of the previous Dense network. Since recurrent networks are capable of maintaining an internal state over time, we no longer strictly require a sliding "look-back" window to capture the motion of the price; it is inherently captured by the recursive nature of the network. At each time step, the input from the dataset is passed into the algorithm along with the output from the previous time step. I am not going to remove the look-back window from my code yet, so it won't mess things up and we'll be able to test which of our models performs better. Also, I am still quite unfamiliar with LSTM networks, so I don't yet know how to tune them properly, but here is a quite informative introduction to them.

LSTM_1.png
Source

This LSTM model structure allows the model to maintain an internal state that gets updated at each time step as the agent "remembers" and "forgets" specific data relationships:

LSTM_2.png
Source

I am not going deep into LSTMs because, as I said, I am not very familiar with them yet, but I have plans for a time-series analysis tutorial series in the future. So, this is how our model will look with shared LSTM layers:

# Shared LSTM layers:
X = LSTM(512, return_sequences=True)(X_input)
X = LSTM(256)(X)

# Critic model
V = Dense(512, activation="relu")(X)
V = Dense(256, activation="relu")(V)
V = Dense(64, activation="relu")(V)
value = Dense(1, activation=None)(V)

# Actor model
A = Dense(512, activation="relu")(X)
A = Dense(256, activation="relu")(A)
A = Dense(64, activation="relu")(A)
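One practical detail worth noting (my own remark, not from the original code): LSTM layers expect a 3-D input of shape (batch, timesteps, features), so unlike the Dense version, X_input must keep the time dimension and must not be flattened first. A quick shape trace, with the window and feature counts assumed:

from tensorflow.keras.layers import Input, LSTM

lookback_window_size, n_features = 50, 10        # assumed values
X_input = Input(shape=(lookback_window_size, n_features))

X = LSTM(512, return_sequences=True)(X_input)    # keeps the time dimension: (None, 50, 512)
X = LSTM(256)(X)                                 # returns only the last step: (None, 256)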

Convolutional Networks


CNNs have been by far the most commonly adopted deep learning models, although the majority of CNN implementations in the literature address computer vision and image analysis challenges. With successful CNN implementations, model error rates have kept dropping over the years as new, more sophisticated CNN architectures are invented. Nowadays, almost all computer vision researchers use CNNs for image classification problems in one way or another.

This paper introduced a study in which the authors propose a novel approach that converts 1-D financial time series into a 2-D image-like representation, in order to utilize the power of deep convolutional neural networks in an algorithmic trading system. The authors wrote an interesting article and achieved quite impressive results: their proposed CNN model trained on time-series images performed quite similar to an LSTM network, sometimes better, sometimes worse. The major advantage is that a CNN doesn't require as much computational power and time to train, so we'll train and test our agent with the following (1-D convolutional) model:

# Shared CNN layers:
X = Conv1D(filters=64, kernel_size=6, padding="same", activation="tanh")(X_input)
X = MaxPooling1D(pool_size=2)(X)
X = Conv1D(filters=32, kernel_size=3, padding="same", activation="tanh")(X)
X = MaxPooling1D(pool_size=2)(X)
X = Flatten()(X)

# Critic model
V = Dense(512, activation="relu")(X)
V = Dense(256, activation="relu")(V)
V = Dense(64, activation="relu")(V)
value = Dense(1, activation=None)(V)

# Actor model
A = Dense(512, activation="relu")(X)
A = Dense(256, activation="relu")(A)
A = Dense(64, activation="relu")(A)
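As with the LSTM variant, the Conv1D/MaxPooling1D stack consumes the raw (timesteps, features) window directly; the comments below trace the output shapes for an assumed 50-step window with 10 features:

from tensorflow.keras.layers import Input, Conv1D, MaxPooling1D, Flatten

X_input = Input(shape=(50, 10))                                                    # (None, 50, 10)
X = Conv1D(filters=64, kernel_size=6, padding="same", activation="tanh")(X_input)  # (None, 50, 64)
X = MaxPooling1D(pool_size=2)(X)                                                   # (None, 25, 64)
X = Conv1D(filters=32, kernel_size=3, padding="same", activation="tanh")(X)        # (None, 25, 32)
X = MaxPooling1D(pool_size=2)(X)                                                   # (None, 12, 32)
X = Flatten()(X)                                                                   # (None, 384)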

Other minor changes

So far we have mainly talked about reward and model improvements, but there are also other ways to make our lives easier when training and testing different models.

First of all, I changed the way we save our models. Because I don't yet know when a model starts over-fitting or what signals it, I decided to save every new best model. So instead of saving the best model on top of the older ones, I create a new folder for each training run and save these models there, using the average reward as part of the file name.
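A rough sketch of how such a per-run folder could be created (the folder-name format mirrors the one you'll see below, e.g. 2021_01_11_13_32_Crypto_trader, but the exact attribute names are my assumption):

import os
from datetime import datetime

# create a per-run folder such as "2021_01_11_13_32_Crypto_trader" and remember it,
# so Parameters.txt, log.txt and every new best model end up in the same place
self.log_name = datetime.now().strftime("%Y_%m_%d_%H_%M") + "_Crypto_trader"
os.makedirs(self.log_name, exist_ok=True)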

Also, I noticed that it's quite hard to remember all the parameters we set while testing/training every new model, so I create a Parameters.txt file in the same model-saving location. When I was testing my own models it got quite messy and I couldn't remember which results came from which settings. So I write the following parameters to that text file:

params.write(f"training start: {current_date}\n")
params.write(f"initial_balance: {initial_balance}\n")
params.write(f"training episodes: {train_episodes}\n")
params.write(f"lookback_window_size: {self.lookback_window_size}\n")
params.write(f"lr: {self.lr}\n")
params.write(f"epochs: {self.epochs}\n")
params.write(f"batch size: {self.batch_size}\n")
params.write(f"normalize_value: {normalize_value}\n")
params.write(f"model: {self.comment}\n")

So now I will know what initial balance I started with, what look-back window I used, what learning rate and how many epochs per episode were used for training, what normalization value I used, and finally what model type it was; this makes keeping track of my test results much easier!

Another important step is collecting the testing results: it's convenient to have them all in one place for easy comparison. So I inserted the following lines into my code:

print("average {} episodes agent net_worth: {}, orders: {}".format(test_episodes, average_net_worth/test_episodes, average_orders/test_episodes))
print("No profit episodes: {}".format(no_profit_episodes))
# save test results to test_results.txt file
with open("test_results.txt", "a+") as results:
    results.write(f'{datetime.now().strftime("%Y-%m-%d %H:%M")}, {name}, test episodes:{test_episodes}')
    results.write(f', net worth:{average_net_worth/(episode+1)}, orders per episode:{average_orders/test_episodes}')
    results.write(f', no profit episodes:{no_profit_episodes}, comment: {comment}\n')

With these results we can compare the average net worth our agent achieved across all testing episodes, how many orders it made per episode on average, and, one of the most important metrics, the number of "no profit episodes", which shows how many episodes ended in the negative across our tests. There are a lot of other metrics we could add, but for now this is enough to compare models and choose the best one. This way we can also test a bunch of models at once, leave them running overnight, and check the results in the morning.
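For context, the counters used above could be accumulated in the test loop roughly like this (a sketch under assumed names, not the exact test_agent code from the repository):

average_net_worth, average_orders, no_profit_episodes = 0, 0, 0
for episode in range(test_episodes):
    # ... run one full episode in test_env with the trained agent ...
    average_net_worth += test_env.net_worth
    average_orders += test_env.episode_orders
    if test_env.net_worth < initial_balance:
        no_profit_episodes += 1      # this episode ended with a loss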

I also made small modifications to the model save function: now, while saving our model, we log the parameters of the currently saved model at that time step with the following function:

def save(self, name="Crypto_trader", score="", args=[]):
    # save keras model weights
    self.Actor.Actor.save_weights(f"{self.log_name}/{score}_{name}_Actor.h5")
    self.Critic.Critic.save_weights(f"{self.log_name}/{score}_{name}_Critic.h5")

    # log saved model arguments to file
    if len(args) > 0:
        with open(f"{self.log_name}/log.txt", "a+") as log:
            current_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            log.write(f"{current_time}, {args[0]}, {args[1]}, {args[2]}, {args[3]}, {args[4]}\n")

Training and testing

Training our models

Right now we have 3 different model architectures (Dense, CNN, and LSTM), so I'll invest my time and my 1080 Ti GPU into making this first comparison between them. As I already mentioned, I will train all models on the same dataset with the same parameters; only the model type changes. After training we'll be able to compare training durations and the Tensorboard training graphs, and of course we'll get the trained models. We'll test all of them on the same testing dataset and see how they perform on unseen market data!

So, I started training with the simplest Dense model, using the following lines:

# pandas and the Adam optimizer need to be imported (assuming the TensorFlow 2 Keras stack
# used throughout these tutorials); CustomAgent, CustomEnv and train_agent come from the
# tutorial's own modules (see the GitHub repository)
import pandas as pd
from tensorflow.keras.optimizers import Adam

if __name__ == "__main__":
    df = pd.read_csv('./pricedata.csv')
    df = df.sort_values('Date')

    lookback_window_size = 50
    test_window = 720 # 30 days (hourly data)
    train_df = df[:-test_window-lookback_window_size]
    test_df = df[-test_window-lookback_window_size:]

    # Create our custom Neural Networks model
    agent = CustomAgent(lookback_window_size=lookback_window_size, lr=0.00001, epochs=5, optimizer=Adam, batch_size = 32, comment="Dense")

    # Create and run a custom training environment with the following lines
    train_env = CustomEnv(train_df, lookback_window_size=lookback_window_size)
    train_agent(train_env, agent, visualize=False, train_episodes=50000, training_batch_size=500)

When the training starts, a new folder named with the current date and time is created, and inside it we can find the Parameters.txt file. If we open this file we can see all the settings used for the current run:

training start: 2021-01-11 13:32
initial_balance: 1000
training episodes: 50000
lookback_window_size: 50
lr: 1e-05
epochs: 5
batch size: 32
normalize_value: 40000
model: Dense
training end: 2021-01-11 18:20

As you can see, this file stores all the parameters we used to train the model, and we can even see how long the training took. This helps later when we are trying to find the best-optimized model, or when we train several models manually; since training takes some time, it's easy to forget which parameters we used...

We can also see that a log.txt file was created; it stores the statistics of every saved model at the moment it was saved, which might help us find the best trained model that is not overfitting.

All the models our agent saves are located in the same directory, so when testing them we'll need to specify the right folder and model name.
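Since the weight files now live in per-run folders and carry the average net worth in their names, a matching load could look roughly like this (a sketch mirroring the save() method above; the exact signature in the repository may differ):

def load(self, folder="", name=""):
    # e.g. folder="2021_01_11_13_32_Crypto_trader", name="1277.39_Crypto_trader"
    self.Actor.Actor.load_weights(f"{folder}/{name}_Actor.h5")
    self.Critic.Critic.load_weights(f"{folder}/{name}_Critic.h5")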

Below is a snip from Tensorboard while training the Dense (orange) and CNN (blue) networks for 50k training episodes. Sadly, by my own mistake I removed the LSTM training graph, and it would have taken too long to train it again; I'll do that in a coming tutorial when our model is a little better:

Tensorboard.png
Tensorboard graph of Dense and CNN model

Now let's compare how the Dense and CNN models trained. First, look at the average_net_worth graph: our CNN model learned to get much higher rewards over time, but we shouldn't trust these results blindly; I think our CNN might be overfitting. First of all, it seems very suspicious how our critic network was training: in the critic_loss_per_replay graph, the CNN numbers are a hundred times bigger than when training the Dense model...

Second, looking at the actor_loss_per_replay graph, the Dense model shows quite a beautiful curve, but the CNN actor loss curve on the same graph shows some training instability; it has been trending upward since around the 20k episode mark. But this is only my preliminary view; we'll test both CNN models: one taken from around the turning point of that curve and the best one by average net worth.

It's also quite interesting to look at the average episode orders graph. Because we added a punish value to our reward strategy, it would be logical for our model to try to avoid the punishment by making as many orders as possible. But instead of placing orders every step, it learned that sometimes it's better to accept the punishment and wait for better conditions where it can get a positive reward!

Testing our models

As you already know, we trained 3 different models (Dense, CNN, LSTM) for 50k training episodes; we can test all of them at once with the following code:

if __name__ == "__main__":            
    df = pd.read_csv('./pricedata.csv')
    df = df.sort_values('Date')

    lookback_window_size = 50
    test_window = 720 # 30 days 
    train_df = df[:-test_window-lookback_window_size]
    test_df = df[-test_window-lookback_window_size:]

    agent = CustomAgent(lookback_window_size=lookback_window_size, lr=0.00001, epochs=1, optimizer=Adam, batch_size = 32, model="Dense")
    test_env = CustomEnv(test_df, lookback_window_size=lookback_window_size, Show_reward=False)
    test_agent(test_env, agent, visualize=False, test_episodes=1000, folder="2021_01_11_13_32_Crypto_trader", name="1277.39_Crypto_trader", comment="")

    agent = CustomAgent(lookback_window_size=lookback_window_size, lr=0.00001, epochs=1, optimizer=Adam, batch_size = 32, model="CNN")
    test_env = CustomEnv(test_df, lookback_window_size=lookback_window_size, Show_reward=False)
    test_agent(test_env, agent, visualize=False, test_episodes=1000, folder="2021_01_11_23_48_Crypto_trader", name="1772.66_Crypto_trader", comment="")
    test_agent(test_env, agent, visualize=False, test_episodes=1000, folder="2021_01_11_23_48_Crypto_trader", name="1377.86_Crypto_trader", comment="")

    agent = CustomAgent(lookback_window_size=lookback_window_size, lr=0.00001, epochs=1, optimizer=Adam, batch_size = 128, model="LSTM")
    test_env = CustomEnv(test_df, lookback_window_size=lookback_window_size, Show_reward=False)
    test_agent(test_env, agent, visualize=False, test_episodes=1000, folder="2021_01_11_23_43_Crypto_trader", name="1076.27_Crypto_trader", comment="")

In my previous tutorial, our simplest Dense network scored an average net worth of $1043.40 over 1000 test episodes. This is the score we want to beat.

Dense

First I trained and tested the Dense network and received the following results:

Model name: 1277.39_Crypto_trader
net worth: 1054.483903083776
orders per episode: 140.566
no profit episodes: 14

As you can see, in the previous tutorial we didn't measure how many orders per episode our model makes, or how many of the 1000 test episodes ended without profit. The orders metric is not as important as the new "no profit episodes" metric, because it's better to have a lower profit but be confident that our model won't lose our money. So it's best to evaluate "net worth" together with "no profit episodes"; anyway, we can see that our current Dense model with the new reward strategy did a little better!

CNN

Next, I trained and tested the Convolutional Neural Network (CNN) model. You can find a lot of articles about CNNs for time series, but that's not the topic here. I took the saved model with the best average reward; let's see the results:

Model name: 1772.66_Crypto_trader
net worth: 1008.5786807732876
orders per episode: 134.177
no profit episodes: 341

As we can see, the model doesn't perform as well as it did during training; it's obvious that we have some kind of overfitting. The results are terrible: 34% of the episodes ended below the starting balance, and the profit was even worse than a random model would achieve... So I decided to test another CNN checkpoint that, judging by the Tensorboard graph, should be less overfit:

Model name: 1124.03_Crypto_trader
net worth: 1034.3430652376387
orders per episode: 70.152
no profit episodes: 55

As we can see, this checkpoint performed much better than the one trained to the very end, but our simplest Dense network still wins on the testing dataset.

LSTM

Finally, I trained the LSTM network; it was created for time-series data, so it should perform well:

Model name: 1076.27_Crypto_trader
net worth: 1027.3897665208062
orders per episode: 323.233
no profit episodes: 303

Training the LSTM network took around 3 times longer than the Dense and CNN networks, so I was really sad to have spent so much time and received such awful results. I thought I was going to get something interesting, but we have what we have.

Conclusion:

I decided to stop this article here because these experiments already took me quite long to run, compare, and write up. But I am glad that at least I was able to improve my Dense network's profitability with the new reward strategy.

I won't rush to say that CNNs and LSTMs are unsuitable for predicting time-series market data; in my opinion, we simply used too little training data for the model to properly learn all the market features.

I won't give up so easily. I still think that our Neural Networks can beat the market, but they need more data. So I have plans to write at least two more tutorials in the near future. First, we'll add indicators to our market data, so our model has more features to learn from. Second, since it's quite obvious that we are using too little training data, we'll write a script to download more historical data from the internet. Finally, we'll adapt our custom trading environment so that it can run in parallel, using multiprocessing to run multiple training environments and speed up the training process.

Thanks for reading! As always, all the code given in this tutorial can be found on my GitHub page and is free to use! See you in the next part; there is still a lot of work to do. Subscribe and like my video on YouTube, share this tutorial, and you'll be notified when the next one sees the daylight!

All of these tutorials are for educational purposes, and should not be taken as trading advice. You should not trade based on any algorithms or strategies defined in this, previous, or future tutorials, as you are likely to lose your investment.