Deep Reinforcement Learning for Automated Stock Trading

Shahar Gino
21 min read · Jul 31, 2024

References:

  1. Yang, Hongyang, et al. “Deep reinforcement learning for automated stock trading: An ensemble strategy.” Proceedings of the First ACM International Conference on AI in Finance. 2020.
  2. https://github.com/theanh97/Deep-Reinforcement-Learning-with-Stock-Trading
  3. https://medium.com/@pta.forwork/deep-reinforcement-learning-for-automated-stock-trading-9d47457707fa

Overview

This project aims to create an automated stock trading system utilizing deep reinforcement learning (DRL). By applying advanced machine learning techniques, the system will be trained to make profitable trading decisions in the stock market. The project includes downloading historical stock data, enhancing it with technical indicators, developing a simulated trading environment, and training several DRL agents such as PPO, A2C, DDPG, SAC, TD3, and an ensemble agent. The performance of these agents will be assessed and compared using various financial metrics.

This project is inspired by the research paper listed above as reference 1. It also builds on references 2 and 3, which served as a technical baseline, with several fixes and enhancements added to the work presented there.

Deep Reinforcement Learning (DRL)

Reinforcement learning involves an agent interacting with an environment to learn optimal behaviors through trial and error; deep reinforcement learning extends this by using neural networks to represent the agent’s policy and value estimates. The main terminologies and concepts in this domain are crucial for understanding how these systems operate:

  • Environment: The external system with which the agent interacts. It provides a setting where actions can be taken, and responses (in the form of observations and rewards) are received. Examples include game worlds, robotic systems, or any simulated scenario where learning takes place.
  • State: A representation of the current situation or configuration of the environment. The state captures all necessary information required for decision-making. In a game, a state could include the positions of all characters and objects.
  • Observations: The data perceived by the agent from the environment. Observations can be partial or complete representations of the state. In many scenarios, the agent does not have access to the full state and must rely on observations to infer it.
  • Actions: The decisions or moves made by the agent in response to the current state or observation. Actions alter the state of the environment. The set of all possible actions an agent can take is called the action space.
  • Step: A single iteration in the interaction cycle between the agent and the environment. During a step, the agent takes an action based on its current policy, receives an observation and a reward from the environment, and transitions to a new state.
  • Policy: A strategy employed by the agent to decide which action to take given a state or observation. Policies can be deterministic (always the same action for a given state) or stochastic (actions are chosen according to a probability distribution).
  • Reward: A scalar feedback signal received by the agent after taking an action. The reward quantifies the immediate benefit of the action and is used to reinforce desirable behaviors. The goal of the agent is to maximize the cumulative reward over time.
  • Episode: A sequence of steps from an initial state to a terminal state. An episode ends when a predefined condition is met, such as reaching a goal or running out of time.

In reinforcement learning, the agent aims to learn a policy that maximizes the cumulative reward by repeatedly interacting with the environment, taking actions, and adjusting its policy based on the received rewards. The combination of these elements allows the agent to develop sophisticated behaviors and improve its performance through experience.
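
To make these terms concrete, here is a minimal sketch of the agent-environment loop, using Gymnasium’s classic CartPole-v1 task and a random policy purely for illustration (any Gym-style environment, including the trading environment built later in this post, follows the same interface):

import gymnasium as gym

# A minimal agent-environment interaction loop (random policy, for illustration only)
env = gym.make("CartPole-v1")
observation, info = env.reset(seed=42)       # start a new episode

total_reward = 0.0
for step in range(200):
    action = env.action_space.sample()       # a "random policy": sample any valid action
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                   # accumulate the reward signal
    if terminated or truncated:              # episode ended (failure or time limit)
        observation, info = env.reset()

env.close()
print(f"Cumulative reward collected: {total_reward}")

A trained agent simply replaces the random sampling with actions chosen by its learned policy.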

The following is a brief overview of common reinforcement learning algorithms (PPO, A2C, DDPG, SAC, and TD3), referred to throughout this post as “DRL agents”:

DRL agents comparison
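
In broad strokes (a high-level summary of widely documented properties; exact behavior depends on the implementation):

  • PPO (Proximal Policy Optimization): on-policy actor-critic that constrains each policy update with a clipped objective; robust and relatively easy to tune.
  • A2C (Advantage Actor-Critic): on-policy, synchronous actor-critic that uses advantage estimates to reduce variance; simple, but typically less sample-efficient than off-policy methods.
  • DDPG (Deep Deterministic Policy Gradient): off-policy actor-critic for continuous actions with a deterministic policy and a replay buffer; sample-efficient but sensitive to hyperparameters.
  • SAC (Soft Actor-Critic): off-policy actor-critic with a stochastic policy and entropy regularization that encourages exploration; generally stable on continuous-control tasks.
  • TD3 (Twin Delayed DDPG): an improved DDPG that adds twin critics, delayed policy updates, and target-policy smoothing to curb value overestimation.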

Implementation

To start our journey into Deep Reinforcement Learning for Automated Stock Trading, we first need to gather the necessary data. This initial step involves loading historical stock data, which serves as the foundation for our trading models.

In this project, we use Python libraries such as numpy, pandas, and yfinance to fetch and manipulate stock market data. Specifically, we focus on the Dow Jones 30, an index of 30 prominent U.S. stocks (two tickers, DOW and UTX, are dropped below because they lack complete price history for the chosen range). We use Yahoo Finance to download historical data for these stocks, spanning from January 1, 2009, to May 8, 2020. This data is essential for training and testing our reinforcement learning models. By storing it in a dictionary keyed by ticker, we ensure efficient and organized access throughout the project. This setup allows us to analyze and preprocess the stock data before feeding it into our trading algorithms.

import numpy as np
import pandas as pd
import yfinance as yf
import gymnasium as gym
from gymnasium import spaces
import matplotlib.pyplot as plt
from IPython.display import display  # built in when running inside Jupyter; imported explicitly otherwise
from stable_baselines3 import PPO, A2C, DDPG, SAC, TD3
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.callbacks import BaseCallback

# List of stocks in the Dow Jones 30
tickers = [
    'MMM', 'AXP', 'AAPL', 'BA', 'CAT', 'CVX', 'CSCO', 'KO', 'DIS', 'DOW',
    'GS', 'HD', 'IBM', 'INTC', 'JNJ', 'JPM', 'MCD', 'MRK', 'MSFT', 'NKE',
    'PFE', 'PG', 'TRV', 'UNH', 'UTX', 'VZ', 'V', 'WBA', 'WMT', 'XOM'
]

# Drop tickers without complete price history for the chosen date range
# (DOW only started trading in 2019; UTX was delisted after the Raytheon merger)
tickers.remove('DOW')
tickers.remove('UTX')

# Get historical data from Yahoo Finance and save it to a dictionary keyed by ticker
def fetch_stock_data(tickers, start_date, end_date):
    stock_data = {}
    for ticker in tickers:
        stock_data[ticker] = yf.download(ticker, start=start_date, end=end_date)
    return stock_data

# Call the function to get data
stock_data = fetch_stock_data(tickers, '2009-01-01', '2020-05-08')

To ensure our models are robust and generalizable, we split the historical stock data into three distinct sets: training, validation, and test datasets. This allows us to train the model on one set of data, validate its performance on a second set, and finally test its effectiveness on a third set, ensuring that our model performs well on unseen data.

For this project, we designate the period from January 1, 2009, to December 31, 2015, as the training dataset. This is the largest dataset and is used to train our reinforcement learning models. The validation dataset, spanning from January 1, 2016, to December 31, 2016, is used to fine-tune the model and prevent overfitting. Finally, the test dataset, from January 1, 2017, to May 8, 2020, is used to evaluate the model’s performance in real-world scenarios.

We proceed by splitting the data for each stock in the Dow Jones 30 accordingly. By plotting the open prices of Apple Inc. (AAPL) across these three periods, we visualize the data distribution and ensure the splits are correctly implemented. This careful partitioning is critical for the development of a reliable automated trading system.

# Date ranges for the training, validation and test sets
training_data_time_range = ('2009-01-01', '2015-12-31')
validation_data_time_range = ('2016-01-01', '2016-12-31')
test_data_time_range = ('2017-01-01', '2020-05-08')

# Split the data into training, validation and test sets
training_data = {}
validation_data = {}
test_data = {}

for ticker, df in stock_data.items():
    training_data[ticker] = df.loc[training_data_time_range[0]:training_data_time_range[1]]
    validation_data[ticker] = df.loc[validation_data_time_range[0]:validation_data_time_range[1]]
    test_data[ticker] = df.loc[test_data_time_range[0]:test_data_time_range[1]]

# Print the shape of the training, validation and test data
ticker = 'AAPL'
print(f'- Training data shape for {ticker}: {training_data[ticker].shape}')
print(f'- Validation data shape for {ticker}: {validation_data[ticker].shape}')
print(f'- Test data shape for {ticker}: {test_data[ticker].shape}\n')

# Display the first 5 rows of the data
display(stock_data['AAPL'].head())
print('\n')

# Plot the open price across the three splits
plt.figure(figsize=(12, 4))
plt.plot(training_data[ticker].index, training_data[ticker]['Open'], label='Training', color='blue')
plt.plot(validation_data[ticker].index, validation_data[ticker]['Open'], label='Validation', color='red')
plt.plot(test_data[ticker].index, test_data[ticker]['Open'], label='Test', color='green')
plt.xlabel('Date')
plt.ylabel('Value')
plt.title(f'{ticker} Stock, Open Price')
plt.legend()
plt.show()

The corresponding results should look as follows:

Data Preparation

Next, we enrich our datasets with various technical indicators that are essential for trading strategy development. We calculate several key metrics:

  • MACD (Moving Average Convergence Divergence): This indicator involves computing the 12-day and 26-day Exponential Moving Averages (EMAs) to determine the MACD line, and then applying a 9-day EMA to the MACD line to generate the Signal line. The MACD and Signal lines help in identifying potential buy or sell signals based on their crossovers.
  • RSI (Relative Strength Index): We calculate the RSI with a 14-day window to gauge the momentum of price movements. This indicator helps to identify overbought or oversold conditions by measuring the speed and change of price movements.
  • CCI (Commodity Channel Index): This indicator assesses the deviation of the price from its average, which helps in identifying new trends or extreme conditions. We use a 20-day window to calculate the CCI.
  • ADX (Average Directional Index): To measure the strength of a trend, we compute the ADX using a 14-day window. This involves calculating the Directional Movement (DM) indicators and the Average True Range (ATR) to determine the ADX value.

By adding these indicators, we transform the raw stock price data into a feature-rich dataset that better captures market trends and price dynamics. This enhanced dataset is then used to train and evaluate our reinforcement learning models.

def add_technical_indicators(df):

    df = df.copy()

    # Calculate EMA 12 and 26 for MACD
    df.loc[:, 'EMA12'] = df['Close'].ewm(span=12, adjust=False).mean()
    df.loc[:, 'EMA26'] = df['Close'].ewm(span=26, adjust=False).mean()
    df.loc[:, 'MACD'] = df['EMA12'] - df['EMA26']
    df.loc[:, 'Signal'] = df['MACD'].ewm(span=9, adjust=False).mean()

    # Calculate RSI 14
    rsi_14_mode = True
    delta = df['Close'].diff()
    if rsi_14_mode:
        gain = (delta.where(delta > 0, 0)).rolling(window=14).mean()
        loss = (-delta.where(delta < 0, 0)).rolling(window=14).mean()
        rs = gain / loss
    else:
        up = delta.where(delta > 0, 0)
        down = -delta.where(delta < 0, 0)
        rs = up.rolling(window=14).mean() / down.rolling(window=14).mean()
    df.loc[:, 'RSI'] = 100 - (100 / (1 + rs))

    # Calculate CCI 20
    tp = (df['High'] + df['Low'] + df['Close']) / 3
    sma_tp = tp.rolling(window=20).mean()
    mean_dev = tp.rolling(window=20).apply(lambda x: np.mean(np.abs(x - x.mean())))
    df.loc[:, 'CCI'] = (tp - sma_tp) / (0.015 * mean_dev)

    # Calculate ADX 14
    high_diff = df['High'].diff()
    low_diff = df['Low'].diff()
    # Wilder's directional movement: an up-move counts only when it exceeds the down-move, and vice versa
    df.loc[:, '+DM'] = np.where((high_diff > -low_diff) & (high_diff > 0), high_diff, 0)
    df.loc[:, '-DM'] = np.where((-low_diff > high_diff) & (-low_diff > 0), -low_diff, 0)
    tr = pd.concat([df['High'] - df['Low'],
                    np.abs(df['High'] - df['Close'].shift(1)),
                    np.abs(df['Low'] - df['Close'].shift(1))], axis=1).max(axis=1)
    atr = tr.ewm(span=14, adjust=False).mean()
    df.loc[:, '+DI'] = 100 * (df['+DM'].ewm(span=14, adjust=False).mean() / atr)
    df.loc[:, '-DI'] = 100 * (df['-DM'].ewm(span=14, adjust=False).mean() / atr)
    dx = 100 * np.abs(df['+DI'] - df['-DI']) / (df['+DI'] + df['-DI'])
    df.loc[:, 'ADX'] = dx.ewm(span=14, adjust=False).mean()

    # Drop NaN values introduced by the rolling windows
    df.dropna(inplace=True)

    # Keep only the required columns
    df = df[['Open', 'High', 'Low', 'Close', 'Volume', 'MACD', 'Signal', 'RSI', 'CCI', 'ADX']]

    return df

# -----------------------------------------------------------------------------

# Add technical indicators to the training data for each stock
for ticker, df in training_data.items():
    training_data[ticker] = add_technical_indicators(df)

# Add technical indicators to the validation data for each stock
for ticker, df in validation_data.items():
    validation_data[ticker] = add_technical_indicators(df)

# Add technical indicators to the test data for each stock
for ticker, df in test_data.items():
    test_data[ticker] = add_technical_indicators(df)

# Print the shapes of the enriched datasets and preview the first rows
print(f'- Training data shape for {ticker}: {training_data[ticker].shape}')
print(f'- Validation data shape for {ticker}: {validation_data[ticker].shape}')
print(f'- Test data shape for {ticker}: {test_data[ticker].shape}\n')

display(test_data[ticker].head())

The corresponding results should look as follows:

Extraction of technical indicators

In the next section, we define a custom trading environment for our reinforcement learning model using the Gymnasium framework (the maintained fork of OpenAI Gym). This environment simulates stock trading and allows the agent to interact with the market through actions like buying, selling, or holding stocks.

Key Features of the Environment:

  • Initialization: The environment is initialized with historical stock data and sets up various parameters, including action and observation spaces, transaction costs, and account variables such as balance, net worth, and shares held.
  • Observation Space: At each step, the environment provides a comprehensive state that includes current stock prices, account balance, shares held, net worth, and other relevant metrics. This observation space is crucial for the agent to make informed decisions.
  • Action Space: The action space is defined as a continuous space where the agent can decide the proportion of the portfolio to buy or sell for each stock. Positive values represent buying actions, while negative values represent selling actions.
  • Step Function: The step function executes the agent's actions, updates the account balance and shares held, calculates the new net worth, and determines the reward. It also manages transaction costs and checks if the episode should end based on the maximum number of steps or if the net worth falls below zero.
  • Rendering: The render function provides a human-readable output of the current state, including the step number, balance, shares held, net worth, and profit.
  • Reset: The reset function reinitializes the environment for a new episode, ensuring that the agent starts with the initial conditions and data.

This custom environment is designed to closely mimic real-world trading scenarios, providing the reinforcement learning agent with the necessary tools to learn and optimize trading strategies.

class StockTradingEnv(gym.Env):

    metadata = {'render_modes': ['human']}

    # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    def __init__(self, stock_data, transaction_cost_percent=0.005):
        """
        Initializes the environment with stock data and sets up necessary variables:
        - Action and Observation Space: Defines the action space (buy/sell/hold) and
          observation space (stock prices, balance, shares held, net worth, etc.).
        - Account Variables: Initializes balance, net worth, shares held, and transaction costs.
        """
        super(StockTradingEnv, self).__init__()

        # Remove any empty DataFrames
        self.stock_data = {ticker: df for ticker, df in stock_data.items() if not df.empty}
        self.tickers = list(self.stock_data.keys())

        if not self.tickers:
            raise ValueError("All provided stock data is empty")

        # Calculate the size of one stock's data
        sample_df = next(iter(self.stock_data.values()))
        self.n_features = len(sample_df.columns)

        # Define action and observation space
        self.action_space = spaces.Box(low=-1, high=1, shape=(len(self.tickers),), dtype=np.float32)

        # Observation space: price data for each stock + balance + shares held + net worth + max net worth + current step
        self.obs_shape = self.n_features * len(self.tickers) + 2 + len(self.tickers) + 2
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(self.obs_shape,), dtype=np.float32)

        # Initialize account balance
        self.initial_balance = 1000
        self.balance = self.initial_balance
        self.net_worth = self.initial_balance
        self.max_net_worth = self.initial_balance
        self.shares_held = {ticker: 0 for ticker in self.tickers}
        self.total_shares_sold = {ticker: 0 for ticker in self.tickers}
        self.total_sales_value = {ticker: 0 for ticker in self.tickers}

        # Set the current step
        self.current_step = 0

        # Maximum number of steps, derived from the shortest stock history
        self.max_steps = max(0, min(len(df) for df in self.stock_data.values()) - 1)

        # Transaction cost
        self.transaction_cost_percent = transaction_cost_percent

        # Short Strategy (when True, trades execute at the open price instead of the close)
        self.short = False

    # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    def reset(self, seed=None, options=None):
        """ Resets the environment to its initial state for a new episode. """
        super().reset(seed=seed)

        # Reset the account balance
        self.balance = self.initial_balance
        self.net_worth = self.initial_balance
        self.max_net_worth = self.initial_balance
        self.shares_held = {ticker: 0 for ticker in self.tickers}
        self.total_shares_sold = {ticker: 0 for ticker in self.tickers}
        self.total_sales_value = {ticker: 0 for ticker in self.tickers}
        self.current_step = 0
        return self._next_observation(), {}

    # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    def _next_observation(self):
        """ Returns the current state of the environment, including stock prices, balance, shares held, net worth, etc. """

        # Initialize the frame
        frame = np.zeros(self.obs_shape)

        # Add stock data for each ticker
        idx = 0
        for ticker in self.tickers:
            # Get the DataFrame for the current ticker
            df = self.stock_data[ticker]
            # If the current step is within the DataFrame, add the price data for the current step
            if self.current_step < len(df):
                frame[idx:idx + self.n_features] = df.iloc[self.current_step].values
            # Otherwise, add the last price data available
            elif len(df) > 0:
                frame[idx:idx + self.n_features] = df.iloc[-1].values
            # Move the index to the next ticker
            idx += self.n_features

        # Add balance, shares held, net worth, max net worth, and current step
        frame[-4 - len(self.tickers)] = self.balance                                                  # Balance
        frame[-3 - len(self.tickers):-3] = [self.shares_held[ticker] for ticker in self.tickers]      # Shares held
        frame[-3] = self.net_worth                                                                    # Net worth
        frame[-2] = self.max_net_worth                                                                # Max net worth
        frame[-1] = self.current_step                                                                 # Current step

        return frame

    # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    def step(self, actions):
        """ Executes an action in the environment, updates the state, calculates rewards, and checks if the episode is done. """

        # Update the current step
        self.current_step += 1

        # Check if we have reached the maximum number of steps
        if self.current_step > self.max_steps:
            return self._next_observation(), 0, True, False, {}

        close_prices = {}

        # Loop through each ticker and perform the action
        for i, ticker in enumerate(self.tickers):

            # Get the current open and close price of the stock
            current_day = self.stock_data[ticker].iloc[self.current_step]
            open_price = current_day['Open']
            close_price = current_day['Close']

            # Record the close price
            close_prices[ticker] = close_price

            # Get the action for the current ticker
            action = actions[i]

            # Trade at the open price in "short" mode, otherwise at the close price
            action_price = open_price if self.short else close_price

            if action > 0:  # Buy
                # Calculate the number of shares to buy
                shares_to_buy = int(self.balance * action / action_price)
                # Calculate the cost of the shares
                cost = shares_to_buy * action_price
                # Transaction cost
                transaction_cost = cost * self.transaction_cost_percent
                # Update the balance and shares held
                self.balance -= (cost + transaction_cost)
                self.shares_held[ticker] += shares_to_buy

            elif action < 0:  # Sell
                # Calculate the number of shares to sell
                shares_to_sell = int(self.shares_held[ticker] * abs(action))
                # Calculate the sale value
                sale = shares_to_sell * action_price
                # Transaction cost
                transaction_cost = sale * self.transaction_cost_percent
                # Update the balance and shares held
                self.balance += (sale - transaction_cost)
                self.shares_held[ticker] -= shares_to_sell
                # Update the total shares sold
                self.total_shares_sold[ticker] += shares_to_sell
                # Update the total sales value
                self.total_sales_value[ticker] += sale

        # Calculate the net worth
        self.net_worth = self.balance + sum(self.shares_held[ticker] * close_prices[ticker] for ticker in self.tickers)

        # Update the max net worth
        self.max_net_worth = max(self.net_worth, self.max_net_worth)

        # Calculate the reward
        reward = self.net_worth - self.initial_balance

        # Check if the episode is done
        done = self.net_worth <= 0 or self.current_step >= self.max_steps

        obs = self._next_observation()

        return obs, reward, done, False, {}

    # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    def render(self, mode='human'):
        """ Displays the current state of the environment in a human-readable format. """

        # Print the current step, balance, shares held, net worth, and profit
        profit = self.net_worth - self.initial_balance
        print(f'Step: {self.current_step}')
        print(f'Balance: {self.balance:.2f}')
        for ticker in self.tickers:
            print(f'{ticker} Shares held: {self.shares_held[ticker]}')
        print(f'Net worth: {self.net_worth:.2f}')
        print(f'Profit: {profit:.2f}')

    def close(self):
        """ Placeholder for any cleanup operations """
        pass
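
Before training any agent, the environment can be exercised directly with random actions as a quick sanity check. The snippet below is an illustrative sketch (not part of the original flow) and assumes the training_data dictionary built in the earlier steps:

# Quick sanity check of the custom environment using random actions (illustrative only)
env = StockTradingEnv(training_data)
obs, info = env.reset()
print("Observation shape:", obs.shape)        # should equal (env.obs_shape,)

for _ in range(5):
    action = env.action_space.sample()        # random buy/sell proportions, one per ticker
    obs, reward, done, truncated, info = env.step(action)
    print(f"step={env.current_step}  reward={reward:.2f}  net_worth={env.net_worth:.2f}")
    if done or truncated:
        break

env.render()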

Next, we set up various reinforcement learning agents to interact with the trading environment defined earlier. Each agent is based on a different reinforcement learning algorithm, and an ensemble agent combines the strengths of these individual models.

PolicyGradientLossCallback Class

  • Purpose: This custom callback logs the policy gradient loss during training for performance monitoring.
  • Functionality:
  • _on_step: Captures and appends the policy gradient loss from the model's logger.
  • _on_training_end: Plots the policy gradient loss after training to visualize the loss progression over time.

Individual Trading Agents

  1. PPOAgent (Proximal Policy Optimization)
  • Initialization: Sets up the PPO model with a specified number of timesteps and a threshold for action decisions.
  • Methods:
  • predict: Returns the action decided by the PPO model for a given observation.
  • action_to_recommendation: Converts the model's action into trading recommendations (buy/sell/hold) based on a threshold.
  • validate: Evaluates the agent's performance by running it in the environment and calculating the total rewards.

2. Other agents:

  • A2CAgent (Advantage Actor-Critic)
  • DDPGAgent (Deep Deterministic Policy Gradient)
  • SACAgent (Soft Actor-Critic)
  • TD3Agent (Twin Delayed Deep Deterministic Policy Gradient)
  • Initialization: Inherits from PPOAgent but uses the A2C, DDPG, SAC, TD3 algorithms, respectively. They also include the PolicyGradientLossCallback to track loss during training.

3. EnsembleAgent

  • Purpose: Combines the predictions from multiple individual models (PPO, A2C, DDPG, SAC, and TD3) to make a final decision. This ensemble approach aims to leverage the strengths of each algorithm.
  • Methods:
  • predict: Averages the actions predicted by each individual model to determine the ensemble action.
  • action_to_recommendation: Translates the ensemble action into buy/sell/hold recommendations based on a threshold.
  • validate: Tests the ensemble agent's performance in the environment and calculates the total rewards.

These agents are designed to handle trading decisions in the environment and are validated to ensure their effectiveness in maximizing rewards and making informed trading choices.

class PolicyGradientLossCallback(BaseCallback):
    """
    A custom callback that logs the policy_gradient_loss during training.
    It extends BaseCallback and is used to capture and store the metrics we want.
    Note: only some algorithms (e.g. PPO) log the 'train/policy_gradient_loss' key,
    so for the other agents the recorded list may remain empty.
    """

    def __init__(self, verbose=0):
        super(PolicyGradientLossCallback, self).__init__(verbose)
        self.losses = []

    def _on_step(self) -> bool:
        if hasattr(self.model, 'logger'):
            logs = self.model.logger.name_to_value
            if 'train/policy_gradient_loss' in logs:
                loss = logs['train/policy_gradient_loss']
                self.losses.append(loss)
        return True

    def _on_training_end(self):
        """ Plot the loss after training ends """
        name = self.model.__class__.__name__
        plt.figure(figsize=(12, 4))
        plt.plot(self.losses, label='Policy Gradient Loss')
        plt.title(f'{name} - Policy Gradient Loss During Training')
        plt.xlabel('Training Steps')
        plt.ylabel('Loss')
        plt.legend()
        plt.show()

# -----------------------------------------------------------------------------

# Define PPO Agent
class PPOAgent:

    def __init__(self, env, total_timesteps, threshold):
        self.model = PPO("MlpPolicy", env, verbose=1)
        self.callback = PolicyGradientLossCallback()
        self.model.learn(total_timesteps=total_timesteps, callback=self.callback)
        self.threshold = threshold

    # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    def predict(self, obs):
        action, _ = self.model.predict(obs, deterministic=True)
        return action

    # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    def action_to_recommendation(self, action):
        recommendations = []
        for a in action:
            if a > self.threshold:
                recommendations.append('buy')
            elif a < -self.threshold:
                recommendations.append('sell')
            else:
                recommendations.append('hold')
        return recommendations

    # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    def validate(self, env):
        obs = env.reset()
        total_rewards = 0
        for _ in range(1000):  # Adjust based on needs
            action, _ = self.model.predict(obs)
            obs, reward, done, _ = env.step(action)
            total_rewards += reward
            if done:
                obs = env.reset()
        print(f'Agent Validation Reward: {total_rewards}')

# -----------------------------------------------------------------------------

# Define A2C Agent
class A2CAgent(PPOAgent):
    def __init__(self, env, total_timesteps, threshold):
        # Note: the parent constructor is intentionally not called here, so that a
        # redundant PPO model is not trained; only the A2C model below is trained.
        self.threshold = threshold
        self.callback = PolicyGradientLossCallback()
        self.model = A2C("MlpPolicy", env, verbose=1)
        self.model.learn(total_timesteps=total_timesteps, callback=self.callback)

# -----------------------------------------------------------------------------

# Define DDPG Agent
class DDPGAgent(PPOAgent):
    def __init__(self, env, total_timesteps, threshold):
        # As with A2CAgent, skip the parent constructor and train only the DDPG model
        self.threshold = threshold
        self.callback = PolicyGradientLossCallback()
        self.model = DDPG("MlpPolicy", env, verbose=1)
        self.model.learn(total_timesteps=total_timesteps, callback=self.callback)

# -----------------------------------------------------------------------------

# Define SAC Agent
class SACAgent(PPOAgent):
    def __init__(self, env, total_timesteps, threshold):
        # As with A2CAgent, skip the parent constructor and train only the SAC model
        self.threshold = threshold
        self.callback = PolicyGradientLossCallback()
        self.model = SAC("MlpPolicy", env, verbose=1)
        self.model.learn(total_timesteps=total_timesteps, callback=self.callback)

# -----------------------------------------------------------------------------

# Define TD3 Agent
class TD3Agent(PPOAgent):
    def __init__(self, env, total_timesteps, threshold):
        # As with A2CAgent, skip the parent constructor and train only the TD3 model
        self.threshold = threshold
        self.callback = PolicyGradientLossCallback()
        self.model = TD3("MlpPolicy", env, verbose=1)
        self.model.learn(total_timesteps=total_timesteps, callback=self.callback)

# -----------------------------------------------------------------------------

# Define Ensemble Agent
class EnsembleAgent:

    def __init__(self, ppo_model, a2c_model, ddpg_model, sac_model, td3_model, threshold):
        self.ppo_model = ppo_model
        self.a2c_model = a2c_model
        self.ddpg_model = ddpg_model
        self.sac_model = sac_model
        self.td3_model = td3_model
        self.threshold = threshold

    # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    def predict(self, obs):
        ppo_action, _ = self.ppo_model.predict(obs, deterministic=True)
        a2c_action, _ = self.a2c_model.predict(obs, deterministic=True)
        ddpg_action, _ = self.ddpg_model.predict(obs, deterministic=True)
        sac_action, _ = self.sac_model.predict(obs, deterministic=True)
        td3_action, _ = self.td3_model.predict(obs, deterministic=True)

        # Average the actions of the individual models
        ensemble_action = np.mean([ppo_action, a2c_action, ddpg_action, sac_action, td3_action], axis=0)
        return ensemble_action

    # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    def action_to_recommendation(self, action):
        recommendations = []
        for a in action:
            if a > self.threshold:
                recommendations.append('buy')
            elif a < -self.threshold:
                recommendations.append('sell')
            else:
                recommendations.append('hold')
        return recommendations

    # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    def validate(self, env):
        obs = env.reset()
        total_rewards = 0
        for _ in range(1000):  # Adjust based on needs
            action = self.predict(obs)
            obs, reward, done, _ = env.step(action)
            total_rewards += reward
            if done:
                obs = env.reset()
        print(f'Agent Validation Reward: {total_rewards}')

Here’s a detailed overview of the auxiliary functions used to train and evaluate the trading agents:

1. Function to Create Environment and Train Agents

create_env_and_train_agents

  • Purpose: Initializes trading environments and trains various trading agents.
  • Functionality:
  • Environments: Creates training (train_env) and validation (val_env) environments using the StockTradingEnv class.
  • Agents: Trains and validates each agent (PPO, A2C, DDPG, SAC, TD3) using their respective classes and validation data.
  • Ensemble: Trains and validates an ensemble agent that combines predictions from all individual models.
  • Returns: Provides the initialized environments and trained agents for further analysis.

2. Visualization Functions

visualize_portfolio

  • Purpose: Plots balance, net worth, and shares held over time.
  • Parameters:
  • steps: List of time steps.
  • balances, net_worths, shares_held: Metrics tracked over time.
  • tickers: List of stock tickers.
  • show_balance, show_net_worth, show_shares_held: Flags to control which plots are displayed.
  • Functionality: Creates a multi-panel plot for balance, net worth, and shares held, allowing visual inspection of portfolio performance over time.

visualize_portfolio_net_worth

  • Purpose: Plots net worth over time.
  • Parameters:
  • steps: List of time steps.
  • net_worths: Net worth tracked over time.
  • Functionality: Creates a single plot for net worth, providing a clear view of the portfolio’s value progression.

visualize_multiple_portfolio_net_worth

  • Purpose: Compares the net worth of multiple portfolios on the same chart.
  • Parameters:
  • steps: List of time steps.
  • net_worths_list: List of net worth series for different agents.
  • labels: Labels for each agent’s net worth series.
  • Functionality: Plots net worth for multiple agents on one chart, facilitating direct comparison.

3. Testing Functions

test_agent

  • Purpose: Tests a single agent’s performance in the environment and tracks key metrics.
  • Parameters:
  • env: Environment to test the agent in.
  • agent: Agent to be tested.
  • stock_data: Data related to stocks for metrics tracking.
  • n_tests: Number of test iterations.
  • visualize: Flag to control rendering of environment during testing.
  • Functionality: Runs the agent in the environment, collects metrics (balance, net worth, shares held), and optionally visualizes the environment.

test_and_visualize_agents

  • Purpose: Tests multiple agents and visualizes their performance.
  • Parameters:
  • env: Environment to test agents.
  • agents: Dictionary of agents to be tested.
  • data: Stock data used for metrics tracking.
  • n_tests: Number of test iterations.
  • Functionality: Tests each agent, collects performance metrics, and generates comparative visualizations of net worth over time.

4. Performance Comparison Function

compare_and_plot_agents

  • Purpose: Compares agents based on their returns, standard deviation, and Sharpe ratio.
  • Parameters:
  • agents_metrics: Metrics collected from testing agents.
  • labels: Labels for each agent.
  • risk_free_rate: Risk-free rate for Sharpe ratio calculation.
  • Functionality:
  • Comparison: Calculates returns, standard deviation, and Sharpe ratio for each agent.
  • Visualization: Displays a sorted dataframe and bar chart comparing the Sharpe ratios of the agents, highlighting which agent performed best relative to risk-adjusted returns.

These functions provide a comprehensive toolkit for training, testing, and evaluating trading agents, allowing for in-depth analysis and comparison of different models.

# Function to create the environment and train the agents
def create_env_and_train_agents(train_data, val_data, total_timesteps, threshold):

    # Create environments for training and validation
    train_env = DummyVecEnv([lambda: StockTradingEnv(train_data)])
    val_env = DummyVecEnv([lambda: StockTradingEnv(val_data)])

    # Train and validate the PPO agent
    ppo_agent = PPOAgent(train_env, total_timesteps, threshold)
    ppo_agent.validate(val_env)

    # Train and validate the A2C agent
    a2c_agent = A2CAgent(train_env, total_timesteps, threshold)
    a2c_agent.validate(val_env)

    # Train and validate the DDPG agent
    ddpg_agent = DDPGAgent(train_env, total_timesteps, threshold)
    ddpg_agent.validate(val_env)

    # Train and validate the SAC agent
    sac_agent = SACAgent(train_env, total_timesteps, threshold)
    sac_agent.validate(val_env)

    # Train and validate the TD3 agent
    td3_agent = TD3Agent(train_env, total_timesteps, threshold)
    td3_agent.validate(val_env)

    # Build and validate the ensemble agent from the trained models
    ensemble_agent = EnsembleAgent(ppo_agent.model, a2c_agent.model, ddpg_agent.model,
                                   sac_agent.model, td3_agent.model, threshold)
    ensemble_agent.validate(val_env)

    return train_env, val_env, ppo_agent, a2c_agent, ddpg_agent, sac_agent, td3_agent, ensemble_agent

# -----------------------------------------------------------------------------

# Function to visualize portfolio changes
def visualize_portfolio(steps, balances, net_worths, shares_held, tickers,
                        show_balance=True, show_net_worth=True, show_shares_held=True):

    fig, axs = plt.subplots(3, figsize=(12, 18))

    # Plot the balance
    if show_balance:
        axs[0].plot(steps, balances, label='Balance')
        axs[0].set_title('Balance Over Time')
        axs[0].set_xlabel('Steps')
        axs[0].set_ylabel('Balance')
        axs[0].legend()

    # Plot the net worth
    if show_net_worth:
        axs[1].plot(steps, net_worths, label='Net Worth', color='orange')
        axs[1].set_title('Net Worth Over Time')
        axs[1].set_xlabel('Steps')
        axs[1].set_ylabel('Net Worth')
        axs[1].legend()

    # Plot the shares held
    if show_shares_held:
        for ticker in tickers:
            axs[2].plot(steps, shares_held[ticker], label=f'Shares Held: {ticker}')
        axs[2].set_title('Shares Held Over Time')
        axs[2].set_xlabel('Steps')
        axs[2].set_ylabel('Shares Held')
        axs[2].legend()

    plt.tight_layout()
    plt.show()

# -----------------------------------------------------------------------------

# Function to visualize the portfolio net worth
def visualize_portfolio_net_worth(steps, net_worths):

    plt.figure(figsize=(12, 6))
    plt.plot(steps, net_worths, label='Net Worth', color='orange')
    plt.title('Net Worth Over Time')
    plt.xlabel('Steps')
    plt.ylabel('Net Worth')
    plt.legend()
    plt.show()

# -----------------------------------------------------------------------------

# Function to visualize multiple portfolio net worths on the same chart
def visualize_multiple_portfolio_net_worth(steps, net_worths_list, labels):

    plt.figure(figsize=(12, 6))
    for i, net_worths in enumerate(net_worths_list):
        plt.plot(steps, net_worths, label=labels[i])
    plt.title('Net Worth Over Time')
    plt.xlabel('Steps')
    plt.ylabel('Net Worth')
    plt.legend()
    plt.show()

# -----------------------------------------------------------------------------

def test_agent(env, agent, stock_data, n_tests=1000, visualize=False):
    """ Test a single agent and track performance metrics, with an option to visualize the results """

    # Initialize metrics tracking
    metrics = {
        'steps': [],
        'balances': [],
        'net_worths': [],
        'shares_held': {ticker: [] for ticker in stock_data.keys()}
    }

    # Reset the environment before starting the tests
    obs = env.reset()

    for i in range(n_tests):

        metrics['steps'].append(i)

        action = agent.predict(obs)

        obs, rewards, dones, infos = env.step(action)

        if visualize:
            env.render()

        # Track metrics (env is a DummyVecEnv, so attributes are read from the wrapped env)
        metrics['balances'].append(env.get_attr('balance')[0])
        metrics['net_worths'].append(env.get_attr('net_worth')[0])
        env_shares_held = env.get_attr('shares_held')[0]

        # Update shares held for each ticker
        for ticker in stock_data.keys():
            if ticker in env_shares_held:
                metrics['shares_held'][ticker].append(env_shares_held[ticker])
            else:
                metrics['shares_held'][ticker].append(0)  # Append 0 if ticker is not found

        if dones:
            obs = env.reset()

    return metrics

# -----------------------------------------------------------------------------

def test_and_visualize_agents(env, agents, data, n_tests=1000):

    metrics = {}
    for agent_name, agent in agents.items():
        print(f"Testing {agent_name}...")
        metrics[agent_name] = test_agent(env, agent, data, n_tests=n_tests, visualize=True)

    # Extract net worths for visualization
    net_worths = [metrics[agent_name]['net_worths'] for agent_name in agents.keys()]
    steps = next(iter(metrics.values()))['steps']  # Assuming all agents have the same step count for simplicity

    # Visualize the performance metrics of multiple agents
    visualize_multiple_portfolio_net_worth(steps, net_worths, list(agents.keys()))

# -----------------------------------------------------------------------------

def compare_and_plot_agents(agents_metrics, labels, risk_free_rate=0.0):

    # Function to compare returns, standard deviation, and Sharpe ratio of agents
    def compare_agents(agents_metrics, labels):
        returns = []
        stds = []
        sharpe_ratios = []

        for metrics in agents_metrics:

            net_worths = metrics['net_worths']

            # Calculate daily returns
            daily_returns = np.diff(net_worths) / net_worths[:-1]
            avg_return = np.mean(daily_returns)
            std_return = np.std(daily_returns)
            # Use np.inf (rather than a string) so that sorting and plotting still work
            sharpe_ratio = ((avg_return - risk_free_rate) / std_return) if std_return != 0 else np.inf

            returns.append(avg_return)
            stds.append(std_return)
            sharpe_ratios.append(sharpe_ratio)

        df = pd.DataFrame({
            'Agent': labels,
            'Return': returns,
            'Standard Deviation': stds,
            'Sharpe Ratio': sharpe_ratios
        })

        return df

    # Compare agents
    df = compare_agents(agents_metrics, labels)

    # Sort the dataframe by Sharpe ratio
    df_sorted = df.sort_values(by='Sharpe Ratio', ascending=False)

    # Display the dataframe
    display(df_sorted)

    # Plot bar chart of the Sharpe ratios
    plt.figure(figsize=(12, 6))
    plt.bar(df_sorted['Agent'], df_sorted['Sharpe Ratio'])
    plt.title('Sharpe Ratio Comparison')
    plt.xlabel('Agent')
    plt.ylabel('Sharpe Ratio')
    plt.show()

Finally, we are able to train the Trading Agents:

Training Parameters Setup:

  • Threshold: The threshold value determines the minimum magnitude of the action that will trigger a buy or sell decision. In this example, it is set to 0.1.
  • Total Timesteps: This parameter specifies the total number of timesteps for which the agents will be trained. Here, it is set to 10,000 timesteps.

Environment Creation and Agent Training:

  • Environment Creation: This step initializes the training and validation environments using the StockTradingEnv class, tailored to the provided stock data.
  • Agent Training: The create_env_and_train_agents function trains various reinforcement learning agents (PPO, A2C, DDPG, SAC, TD3) using the training environment. Each agent is trained for a specified number of timesteps.
  • Ensemble Agent: An ensemble agent, which combines the predictions of all individual models, is also trained. This approach aims to leverage the strengths of each model and potentially improve overall performance.

The returned objects include the trained environments and agents, which are then ready for further evaluation and performance analysis.

# Create the environment and train the agents
threshold = 0.1
total_timesteps = 10000
train_env, val_env, ppo_agent, a2c_agent, ddpg_agent, sac_agent, td3_agent, ensemble_agent = \
create_env_and_train_agents(training_data, validation_data, total_timesteps, threshold)

We can also test & visualize the agents:

n_tests = 1000
agents = {
    'PPO Agent': ppo_agent,
    'A2C Agent': a2c_agent,
    'DDPG Agent': ddpg_agent,
    'SAC Agent': sac_agent,
    'TD3 Agent': td3_agent,
    'Ensemble Agent': ensemble_agent
}

test_and_visualize_agents(train_env, agents, training_data, n_tests=n_tests)

test_env = DummyVecEnv([lambda: StockTradingEnv(test_data)])
test_and_visualize_agents(test_env, agents, test_data, n_tests=n_tests)

The corresponding results should look as follows:

Training and Test sets performance

We also compare the agents’ performance on the test data in terms of returns, standard deviation, and Sharpe ratio.

From the paper:

The higher an agent’s Sharpe ratio, the better its returns have been relative to the amount of investment risk it has taken. Therefore, we pick the trading agent that can maximize the returns adjusted to the increasing risk.
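
Concretely, the Sharpe ratio computed in compare_and_plot_agents is the mean of the step-to-step portfolio returns minus the risk-free rate, divided by the standard deviation of those returns: Sharpe = (mean(returns) - risk_free_rate) / std(returns), with the risk-free rate defaulting to 0.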

test_agents_metrics = [test_agent(test_env, agent, test_data, n_tests=n_tests, visualize=False) for agent in agents.values()]
compare_and_plot_agents(test_agents_metrics, list(agents.keys()))

The corresponding results should look as follows:

Agents Comparison

Lastly, we can also use the model to suggest next-day recommendations:

def prepare_next_day_data(stock_data):
    """ Prepares the observation for the next trading day """

    # Initialize the environment with the current stock data
    env = StockTradingEnv(stock_data)
    env.reset()

    # Prepare the next day's observation
    next_day_observations = env._next_observation()

    return next_day_observations

# -----------------------------------------------------------------------------

def generate_next_day_recommendations(agents, next_day_observation):
    """ Generate recommendations for the next trading day using the trained agents """

    recommendations = {agent_name: [] for agent_name in agents.keys()}

    for agent_name, agent in agents.items():
        action = agent.predict(next_day_observation)
        recs = agent.action_to_recommendation(action)
        recommendations[agent_name] = list(zip(recs, action))

    return recommendations

# -----------------------------------------------------------------------------

# Prepare next day's observation
next_day_observation = prepare_next_day_data(test_data)

# Generate recommendations for the next trading day
recommendations = generate_next_day_recommendations(agents, next_day_observation)

# Print the ensemble agent's recommendation (and raw action value) per ticker
for agent_name, recs in recommendations.items():
    if agent_name == 'Ensemble Agent':
        print(f'\nRecommendations for {agent_name}:')
        for ticker, recommendation in zip(tickers, recs):
            print(f"{ticker}: {recommendation}")

The corresponding results should look as follows:

Next day recommendations

Conclusions

We’ve navigated the intricate process of setting up and training reinforcement learning agents for stock trading using a custom trading environment. We began by designing a comprehensive environment that captures the nuances of stock trading, including transaction costs, state observations, and reward calculations. With this environment in place, we trained a variety of reinforcement learning agents — PPO, A2C, DDPG, SAC, and TD3 — each contributing its unique strengths to the trading strategy. By also implementing an ensemble agent that combines the predictions of all individual models, we aimed to maximize performance and robustness.

Our exploration demonstrates how these advanced algorithms can be applied to real-world trading scenarios, highlighting their potential to adapt and make informed decisions based on market data. The insights gained from this exercise not only showcase the power of reinforcement learning in finance but also emphasize the importance of rigorous evaluation and visualization in assessing agent performance. By continually refining our models and analyzing their results, we can strive towards more effective trading strategies and deeper understanding of market dynamics.
