Deep Reinforcement Learning for Automated Stock Trading

Shahar Gino
21 min read · Jul 31, 2024

References:

  1. Yang, Hongyang, et al. “Deep reinforcement learning for automated stock trading: An ensemble strategy.” Proceedings of the First ACM International Conference on AI in Finance. 2020.
  2. https://github.com/theanh97/Deep-Reinforcement-Learning-with-Stock-Trading
  3. https://medium.com/@pta.forwork/deep-reinforcement-learning-for-automated-stock-trading-9d47457707fa

Overview

This project aims to create an automated stock trading system utilizing deep reinforcement learning (DRL). By applying advanced machine learning techniques, the system will be trained to make profitable trading decisions in the stock market. The project includes downloading historical stock data, enhancing it with technical indicators, developing a simulated trading environment, and training several DRL agents such as PPO, A2C, DDPG, SAC, TD3, and an ensemble agent. The performance of these agents will be assessed and compared using various financial metrics.

This project is inspired by the research paper listed above as reference 1. It also builds on references 2 and 3, which served as a technical baseline, with several fixes and enhancements added to the work presented there.

Deep Reinforcement Learning (DRL)

Reinforcement learning involves an agent interacting with an environment to learn optimal behaviors through trial and error; deep reinforcement learning extends this by using neural networks to represent the agent’s policy and value estimates. The main terminologies and concepts in this domain are crucial for understanding how these systems operate:

  • Environment: The external system with which the agent interacts. It provides a setting where actions can be taken, and responses (in the form of observations and rewards) are received. Examples include game worlds, robotic systems, or any simulated scenario where learning takes place.
  • State: A representation of the current situation or configuration of the environment. The state captures all necessary information required for decision-making. In a game, a state could include the positions of all characters and objects.
  • Observations: The data perceived by the agent from the environment. Observations can be partial or complete representations of the state. In many scenarios, the agent does not have access to the full state and must rely on observations to infer it.
  • Actions: The decisions or moves made by the agent in response to the current state or observation. Actions alter the state of the environment. The set of all possible actions an agent can take is called the action space.
  • Step: A single iteration in the interaction cycle between the agent and the environment. During a step, the agent takes an action based on its current policy, receives an observation and a reward from the environment, and transitions to a new state.
  • Policy: A strategy employed by the agent to decide which action to take given a state or observation. Policies can be deterministic (always the same action for a given state) or stochastic (actions are chosen according to a probability distribution).
  • Reward: A scalar feedback signal received by the agent after taking an action. The reward quantifies the immediate benefit of the action and is used to reinforce desirable behaviors. The goal of the agent is to maximize the cumulative reward over time.
  • Episode: A sequence of steps from an initial state to a terminal state. An episode ends when a predefined condition is met, such as reaching a goal or running out of time.

In reinforcement learning, the agent aims to learn a policy that maximizes the cumulative reward by repeatedly interacting with the environment, taking actions, and adjusting its policy based on the received rewards. The combination of these elements allows the agent to develop sophisticated behaviors and improve its performance through experience.
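
To make these terms concrete, here is a minimal sketch of the agent-environment loop, using Gymnasium’s classic CartPole-v1 task and a random policy purely for illustration (any Gym-style environment, including the trading environment built later in this post, follows the same interface):

import gymnasium as gym

# A minimal agent-environment interaction loop (random policy, for illustration only)
env = gym.make("CartPole-v1")
observation, info = env.reset(seed=42)       # start a new episode

total_reward = 0.0
for step in range(200):
    action = env.action_space.sample()       # a "random policy": sample any valid action
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                   # accumulate the reward signal
    if terminated or truncated:              # episode ended (failure or time limit)
        observation, info = env.reset()

env.close()
print(f"Cumulative reward collected: {total_reward}")

A trained agent simply replaces the random sampling with actions chosen by its learned policy.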

The following is a brief overview of common reinforcement learning algorithms (PPO, A2C, DDPG, SAC, and TD3), referred to throughout this post as “DRL agents”:

DRL agents comparison
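
In broad strokes (a high-level summary of widely documented properties; exact behavior depends on the implementation):

  • PPO (Proximal Policy Optimization): on-policy actor-critic that constrains each policy update with a clipped objective; robust and relatively easy to tune.
  • A2C (Advantage Actor-Critic): on-policy, synchronous actor-critic that uses advantage estimates to reduce variance; simple, but typically less sample-efficient than off-policy methods.
  • DDPG (Deep Deterministic Policy Gradient): off-policy actor-critic for continuous actions with a deterministic policy and a replay buffer; sample-efficient but sensitive to hyperparameters.
  • SAC (Soft Actor-Critic): off-policy actor-critic with a stochastic policy and entropy regularization that encourages exploration; generally stable on continuous-control tasks.
  • TD3 (Twin Delayed DDPG): an improved DDPG that adds twin critics, delayed policy updates, and target-policy smoothing to curb value overestimation.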

Implementation

To start our journey into Deep Reinforcement Learning for Automated Stock Trading, we first need to gather the necessary data. This initial step involves loading historical stock data, which serves as the foundation for our trading models.

In this project, we use Python libraries such as numpy, pandas, and yfinance to fetch and manipulate stock market data. Specifically, we focus on the Dow Jones 30, an index of 30 prominent U.S. stocks (two tickers, DOW and UTX, are dropped below because they lack complete price history for the chosen range). We use Yahoo Finance to download historical data for these stocks, spanning from January 1, 2009, to May 8, 2020. This data is essential for training and testing our reinforcement learning models. By storing it in a dictionary keyed by ticker, we ensure efficient and organized access throughout the project. This setup allows us to analyze and preprocess the stock data before feeding it into our trading algorithms.

import numpy as np
import pandas as pd
import yfinance as yf
import gymnasium as gym
from gymnasium import spaces
import matplotlib.pyplot as plt
from IPython.display import display  # built in when running inside Jupyter; imported explicitly otherwise
from stable_baselines3 import PPO, A2C, DDPG, SAC, TD3
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.callbacks import BaseCallback

# List of stocks in the Dow Jones 30
tickers = [
    'MMM', 'AXP', 'AAPL', 'BA', 'CAT', 'CVX', 'CSCO', 'KO', 'DIS', 'DOW',
    'GS', 'HD', 'IBM', 'INTC', 'JNJ', 'JPM', 'MCD', 'MRK', 'MSFT', 'NKE',
    'PFE', 'PG', 'TRV', 'UNH', 'UTX', 'VZ', 'V', 'WBA', 'WMT', 'XOM'
]

# Drop tickers without complete price history for the chosen date range
# (DOW only started trading in 2019; UTX was delisted after the Raytheon merger)
tickers.remove('DOW')
tickers.remove('UTX')

# Get historical data from Yahoo Finance and save it to a dictionary keyed by ticker
def fetch_stock_data(tickers, start_date, end_date):
    stock_data = {}
    for ticker in tickers:
        stock_data[ticker] = yf.download(ticker, start=start_date, end=end_date)
    return stock_data

# Call the function to get data
stock_data = fetch_stock_data(tickers, '2009-01-01', '2020-05-08')

To ensure our models are robust and generalizable, we split the historical stock data into three distinct sets: training, validation, and test datasets. This allows us to train the model on one set of data, validate its performance on a second set, and finally test its effectiveness on a third set, ensuring that our model performs well on unseen data.

For this project, we designate the period from January 1, 2009, to December 31, 2015, as the training dataset. This is the largest dataset and is used to train our reinforcement learning models. The validation dataset, spanning from January 1, 2016, to December 31, 2016, is used to fine-tune the model and prevent overfitting. Finally, the test dataset, from January 1, 2017, to May 8, 2020, is used to evaluate the model’s performance in real-world scenarios.

We proceed by splitting the data for each stock in the Dow Jones 30 accordingly. By plotting the open prices of Apple Inc. (AAPL) across these three periods, we visualize the data distribution and ensure the splits are correctly implemented. This careful partitioning is critical for the development of a reliable automated trading system.

# Date ranges for the training, validation and test sets
training_data_time_range = ('2009-01-01', '2015-12-31')
validation_data_time_range = ('2016-01-01', '2016-12-31')
test_data_time_range = ('2017-01-01', '2020-05-08')

# Split the data into training, validation and test sets
training_data = {}
validation_data = {}
test_data = {}

for ticker, df in stock_data.items():
    training_data[ticker] = df.loc[training_data_time_range[0]:training_data_time_range[1]]
    validation_data[ticker] = df.loc[validation_data_time_range[0]:validation_data_time_range[1]]
    test_data[ticker] = df.loc[test_data_time_range[0]:test_data_time_range[1]]

# Print the shape of the training, validation and test data
ticker = 'AAPL'
print(f'- Training data shape for {ticker}: {training_data[ticker].shape}')
print(f'- Validation data shape for {ticker}: {validation_data[ticker].shape}')
print(f'- Test data shape for {ticker}: {test_data[ticker].shape}\n')

# Display the first 5 rows of the data
display(stock_data['AAPL'].head())
print('\n')

# Plot the open price across the three splits
plt.figure(figsize=(12, 4))
plt.plot(training_data[ticker].index, training_data[ticker]['Open'], label='Training', color='blue')
plt.plot(validation_data[ticker].index, validation_data[ticker]['Open'], label='Validation', color='red')
plt.plot(test_data[ticker].index, test_data[ticker]['Open'], label='Test', color='green')
plt.xlabel('Date')
plt.ylabel('Value')
plt.title(f'{ticker} Stock, Open Price')
plt.legend()
plt.show()

The corresponding results should look as follows:

Data Preparation

Next, we enrich our datasets with various technical indicators that are essential for trading strategy development. We calculate several key metrics:

  • MACD (Moving Average Convergence Divergence): This indicator involves computing the 12-day and 26-day Exponential Moving Averages (EMAs) to determine the MACD line, and then applying a 9-day EMA to the MACD line to generate the Signal line. The MACD and Signal lines help in identifying potential buy or sell signals based on their crossovers.
  • RSI (Relative Strength Index): We calculate the RSI with a 14-day window to gauge the momentum of price movements. This indicator helps to identify overbought or oversold conditions by measuring the speed and change of price movements.
  • CCI (Commodity Channel Index): This indicator assesses the deviation of the price from its average, which helps in identifying new trends or extreme conditions. We use a 20-day window to calculate the CCI.
  • ADX (Average Directional Index): To measure the strength of a trend, we compute the ADX using a 14-day window. This involves calculating the Directional Movement (DM) indicators and the Average True Range (ATR) to determine the ADX value.

By adding these indicators, we transform the raw stock price data into a feature-rich dataset that better captures market trends and price dynamics. This enhanced dataset is then used to train and evaluate our reinforcement learning models.

def add_technical_indicators(df):

    df = df.copy()

    # Calculate EMA 12 and 26 for MACD
    df.loc[:, 'EMA12'] = df['Close'].ewm(span=12, adjust=False).mean()
    df.loc[:, 'EMA26'] = df['Close'].ewm(span=26, adjust=False).mean()
    df.loc[:, 'MACD'] = df['EMA12'] - df['EMA26']
    df.loc[:, 'Signal'] = df['MACD'].ewm(span=9, adjust=False).mean()

    # Calculate RSI 14
    rsi_14_mode = True
    delta = df['Close'].diff()
    if rsi_14_mode:
        gain = (delta.where(delta > 0, 0)).rolling(window=14).mean()
        loss = (-delta.where(delta < 0, 0)).rolling(window=14).mean()
        rs = gain / loss
    else:
        up = delta.where(delta > 0, 0)
        down = -delta.where(delta < 0, 0)
        rs = up.rolling(window=14).mean() / down.rolling(window=14).mean()
    df.loc[:, 'RSI'] = 100 - (100 / (1 + rs))

    # Calculate CCI 20
    tp = (df['High'] + df['Low'] + df['Close']) / 3
    sma_tp = tp.rolling(window=20).mean()
    mean_dev = tp.rolling(window=20).apply(lambda x: np.mean(np.abs(x - x.mean())))
    df.loc[:, 'CCI'] = (tp - sma_tp) / (0.015 * mean_dev)

    # Calculate ADX 14
    high_diff = df['High'].diff()
    low_diff = df['Low'].diff()
    # Wilder's directional movement: an up-move counts only when it exceeds the down-move, and vice versa
    df.loc[:, '+DM'] = np.where((high_diff > -low_diff) & (high_diff > 0), high_diff, 0)
    df.loc[:, '-DM'] = np.where((-low_diff > high_diff) & (-low_diff > 0), -low_diff, 0)
    tr = pd.concat([df['High'] - df['Low'],
                    np.abs(df['High'] - df['Close'].shift(1)),
                    np.abs(df['Low'] - df['Close'].shift(1))], axis=1).max(axis=1)
    atr = tr.ewm(span=14, adjust=False).mean()
    df.loc[:, '+DI'] = 100 * (df['+DM'].ewm(span=14, adjust=False).mean() / atr)
    df.loc[:, '-DI'] = 100 * (df['-DM'].ewm(span=14, adjust=False).mean() / atr)
    dx = 100 * np.abs(df['+DI'] - df['-DI']) / (df['+DI'] + df['-DI'])
    df.loc[:, 'ADX'] = dx.ewm(span=14, adjust=False).mean()

    # Drop NaN values introduced by the rolling windows
    df.dropna(inplace=True)

    # Keep only the required columns
    df = df[['Open', 'High', 'Low', 'Close', 'Volume', 'MACD', 'Signal', 'RSI', 'CCI', 'ADX']]

    return df

# -----------------------------------------------------------------------------

# Add technical indicators to the training data for each stock
for ticker, df in training_data.items():
    training_data[ticker] = add_technical_indicators(df)

# Add technical indicators to the validation data for each stock
for ticker, df in validation_data.items():
    validation_data[ticker] = add_technical_indicators(df)

# Add technical indicators to the test data for each stock
for ticker, df in test_data.items():
    test_data[ticker] = add_technical_indicators(df)

# Print the shapes of the enriched datasets and preview the first rows
print(f'- Training data shape for {ticker}: {training_data[ticker].shape}')
print(f'- Validation data shape for {ticker}: {validation_data[ticker].shape}')
print(f'- Test data shape for {ticker}: {test_data[ticker].shape}\n')

display(test_data[ticker].head())

The corresponding results should look as follows:

Extraction of technical indicators

In the next section, we define a custom trading environment for our reinforcement learning model using the Gymnasium framework (the maintained fork of OpenAI Gym). This environment simulates stock trading and allows the agent to interact with the market through actions like buying, selling, or holding stocks.

Key Features of the Environment:

  • Initialization: The environment is initialized with historical stock data and sets up various parameters, including action and observation spaces, transaction costs, and account variables such as balance, net worth, and shares held.
  • Observation Space: At each step, the environment provides a comprehensive state that includes current stock prices, account balance, shares held, net worth, and other relevant metrics. This observation space is crucial for the agent to make informed decisions.
  • Action Space: The action space is defined as a continuous space where the agent can decide the proportion of the portfolio to buy or sell for each stock. Positive values represent buying actions, while negative values represent selling actions.
  • Step Function: The step function executes the agent's actions, updates the account balance and shares held, calculates the new net worth, and determines the reward. It also manages transaction costs and checks if the episode should end based on the maximum number of steps or if the net worth falls below zero.
  • Rendering: The render function provides a human-readable output of the current state, including the step number, balance, shares held, net worth, and profit.
  • Reset: The reset function reinitializes the environment for a new episode, ensuring that the agent starts with the initial conditions and data.

This custom environment is designed to closely mimic real-world trading scenarios, providing the reinforcement learning agent with the necessary tools to learn and optimize trading strategies.

class StockTradingEnv(gym.Env):

    metadata = {'render_modes': ['human']}

    # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    def __init__(self, stock_data, transaction_cost_percent=0.005):
        """
        Initializes the environment with stock data and sets up necessary variables:
        - Action and Observation Space: Defines the action space (buy/sell/hold) and
          observation space (stock prices, balance, shares held, net worth, etc.).
        - Account Variables: Initializes balance, net worth, shares held, and transaction costs.
        """
        super(StockTradingEnv, self).__init__()

        # Remove any empty DataFrames
        self.stock_data = {ticker: df for ticker, df in stock_data.items() if not df.empty}
        self.tickers = list(self.stock_data.keys())

        if not self.tickers:
            raise ValueError("All provided stock data is empty")

        # Calculate the size of one stock's data
        sample_df = next(iter(self.stock_data.values()))
        self.n_features = len(sample_df.columns)

        # Define action and observation space
        self.action_space = spaces.Box(low=-1, high=1, shape=(len(self.tickers),), dtype=np.float32)

        # Observation space: price data for each stock + balance + shares held + net worth + max net worth + current step
        self.obs_shape = self.n_features * len(self.tickers) + 2 + len(self.tickers) + 2
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(self.obs_shape,), dtype=np.float32)

        # Initialize account balance
        self.initial_balance = 1000
        self.balance = self.initial_balance
        self.net_worth = self.initial_balance
        self.max_net_worth = self.initial_balance
        self.shares_held = {ticker: 0 for ticker in self.tickers}
        self.total_shares_sold = {ticker: 0 for ticker in self.tickers}
        self.total_sales_value = {ticker: 0 for ticker in self.tickers}

        # Set the current step
        self.current_step = 0

        # Maximum number of steps, derived from the shortest stock history
        self.max_steps = max(0, min(len(df) for df in self.stock_data.values()) - 1)

        # Transaction cost
        self.transaction_cost_percent = transaction_cost_percent

        # Short Strategy (when True, trades execute at the open price instead of the close)
        self.short = False

    # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    def reset(self, seed=None, options=None):
        """ Resets the environment to its initial state for a new episode. """
        super().reset(seed=seed)

        # Reset the account balance
        self.balance = self.initial_balance
        self.net_worth = self.initial_balance
        self.max_net_worth = self.initial_balance
        self.shares_held = {ticker: 0 for ticker in self.tickers}
        self.total_shares_sold = {ticker: 0 for ticker in self.tickers}
        self.total_sales_value = {ticker: 0 for ticker in self.tickers}
        self.current_step = 0
        return self._next_observation(), {}

    # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    def _next_observation(self):
        """ Returns the current state of the environment, including stock prices, balance, shares held, net worth, etc. """

        # Initialize the frame
        frame = np.zeros(self.obs_shape)

        # Add stock data for each ticker
        idx = 0
        for ticker in self.tickers:
            # Get the DataFrame for the current ticker
            df = self.stock_data[ticker]
            # If the current step is within the DataFrame, add the price data for the current step
            if self.current_step < len(df):
                frame[idx:idx + self.n_features] = df.iloc[self.current_step].values
            # Otherwise, add the last price data available
            elif len(df) > 0:
                frame[idx:idx + self.n_features] = df.iloc[-1].values
            # Move the index to the next ticker
            idx += self.n_features

        # Add balance, shares held, net worth, max net worth, and current step
        frame[-4 - len(self.tickers)] = self.balance                                                  # Balance
        frame[-3 - len(self.tickers):-3] = [self.shares_held[ticker] for ticker in self.tickers]      # Shares held
        frame[-3] = self.net_worth                                                                    # Net worth
        frame[-2] = self.max_net_worth                                                                # Max net worth
        frame[-1] = self.current_step                                                                 # Current step

        return frame

    # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    def step(self, actions):
        """ Executes an action in the environment, updates the state, calculates rewards, and checks if the episode is done. """

        # Update the current step
        self.current_step += 1

        # Check if we have reached the maximum number of steps
        if self.current_step > self.max_steps:
            return self._next_observation(), 0, True, False, {}

        close_prices = {}

        # Loop through each ticker and perform the action
        for i, ticker in enumerate(self.tickers):

            # Get the current open and close price of the stock
            current_day = self.stock_data[ticker].iloc[self.current_step]
            open_price = current_day['Open']
            close_price = current_day['Close']

            # Record the close price
            close_prices[ticker] = close_price

            # Get the action for the current ticker
            action = actions[i]

            # Trade at the open price in "short" mode, otherwise at the close price
            action_price = open_price if self.short else close_price

            if action > 0:  # Buy
                # Calculate the number of shares to buy
                shares_to_buy = int(self.balance * action / action_price)
                # Calculate the cost of the shares
                cost = shares_to_buy * action_price
                # Transaction cost
                transaction_cost = cost * self.transaction_cost_percent
                # Update the balance and shares held
                self.balance -= (cost + transaction_cost)
                self.shares_held[ticker] += shares_to_buy

            elif action < 0:  # Sell
                # Calculate the number of shares to sell
                shares_to_sell = int(self.shares_held[ticker] * abs(action))
                # Calculate the sale value
                sale = shares_to_sell * action_price
                # Transaction cost
                transaction_cost = sale * self.transaction_cost_percent
                # Update the balance and shares held
                self.balance += (sale - transaction_cost)
                self.shares_held[ticker] -= shares_to_sell
                # Update the total shares sold
                self.total_shares_sold[ticker] += shares_to_sell
                # Update the total sales value
                self.total_sales_value[ticker] += sale

        # Calculate the net worth
        self.net_worth = self.balance + sum(self.shares_held[ticker] * close_prices[ticker] for ticker in self.tickers)

        # Update the max net worth
        self.max_net_worth = max(self.net_worth, self.max_net_worth)

        # Calculate the reward
        reward = self.net_worth - self.initial_balance

        # Check if the episode is done
        done = self.net_worth <= 0 or self.current_step >= self.max_steps

        obs = self._next_observation()

        return obs, reward, done, False, {}

    # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    def render(self, mode='human'):
        """ Displays the current state of the environment in a human-readable format. """

        # Print the current step, balance, shares held, net worth, and profit
        profit = self.net_worth - self.initial_balance
        print(f'Step: {self.current_step}')
        print(f'Balance: {self.balance:.2f}')
        for ticker in self.tickers:
            print(f'{ticker} Shares held: {self.shares_held[ticker]}')
        print(f'Net worth: {self.net_worth:.2f}')
        print(f'Profit: {profit:.2f}')

    def close(self):
        """ Placeholder for any cleanup operations """
        pass
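
Before training any agent, the environment can be exercised directly with random actions as a quick sanity check. The snippet below is an illustrative sketch (not part of the original flow) and assumes the training_data dictionary built in the earlier steps:

# Quick sanity check of the custom environment using random actions (illustrative only)
env = StockTradingEnv(training_data)
obs, info = env.reset()
print("Observation shape:", obs.shape)        # should equal (env.obs_shape,)

for _ in range(5):
    action = env.action_space.sample()        # random buy/sell proportions, one per ticker
    obs, reward, done, truncated, info = env.step(action)
    print(f"step={env.current_step}  reward={reward:.2f}  net_worth={env.net_worth:.2f}")
    if done or truncated:
        break

env.render()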

Next, we set up various reinforcement learning agents to interact with the trading environment defined earlier. Each agent is based on a different reinforcement learning algorithm, and an ensemble agent combines the strengths of these individual models.

PolicyGradientLossCallback Class

  • Purpose: This custom callback logs the policy gradient loss during training for performance monitoring.
  • Functionality:
  • _on_step: Captures and appends the policy gradient loss from the model's logger.
  • _on_training_end: Plots the policy gradient loss after training to visualize the loss progression over time.

Individual Trading Agents

  1. PPOAgent (Proximal Policy Optimization)
  • Initialization: Sets up the PPO model with a specified number of timesteps and a threshold for action decisions.
  • Methods:
  • predict: Returns the action decided by the PPO model for a given observation.
  • action_to_recommendation: Converts the model's action into trading recommendations (buy/sell/hold) based on a threshold.
  • validate: Evaluates the agent's performance by running it in the environment and calculating the total rewards.

2. Other agents:

  • A2CAgent (Advantage Actor-Critic)
  • DDPGAgent (Deep Deterministic Policy Gradient)
  • SACAgent (Soft Actor-Critic)
  • TD3Agent (Twin Delayed Deep Deterministic Policy Gradient)
  • Initialization: Inherits from PPOAgent but uses the A2C, DDPG, SAC, TD3 algorithms, respectively. They also include the PolicyGradientLossCallback to track loss during training.

3. EnsembleAgent

  • Purpose: Combines the predictions from multiple individual models (PPO, A2C, DDPG, SAC, and TD3) to make a final decision. This ensemble approach aims to leverage the strengths of each algorithm.
  • Methods:
  • predict: Averages the actions predicted by each individual model to determine the ensemble action.
  • action_to_recommendation: Translates the ensemble action into buy/sell/hold recommendations based on a threshold.
  • validate: Tests the ensemble agent's performance in the environment and calculates the total rewards.

These agents are designed to handle trading decisions in the environment and are validated to ensure their effectiveness in maximizing rewards and making informed trading choices.

class PolicyGradientLossCallback(BaseCallback):
    """
    A custom callback that logs the policy_gradient_loss during training.
    It extends BaseCallback and is used to capture and store the metrics we want.
    Note: only some algorithms (e.g. PPO) log the 'train/policy_gradient_loss' key,
    so for the other agents the recorded list may remain empty.
    """

    def __init__(self, verbose=0):
        super(PolicyGradientLossCallback, self).__init__(verbose)
        self.losses = []

    def _on_step(self) -> bool:
        if hasattr(self.model, 'logger'):
            logs = self.model.logger.name_to_value
            if 'train/policy_gradient_loss' in logs:
                loss = logs['train/policy_gradient_loss']
                self.losses.append(loss)
        return True

    def _on_training_end(self):
        """ Plot the loss after training ends """
        name = self.model.__class__.__name__
        plt.figure(figsize=(12, 4))
        plt.plot(self.losses, label='Policy Gradient Loss')
        plt.title(f'{name} - Policy Gradient Loss During Training')
        plt.xlabel('Training Steps')
        plt.ylabel('Loss')
        plt.legend()
        plt.show()

# -----------------------------------------------------------------------------

# Define PPO Agent
class PPOAgent:

    def __init__(self, env, total_timesteps, threshold):
        self.model = PPO("MlpPolicy", env, verbose=1)
        self.callback = PolicyGradientLossCallback()
        self.model.learn(total_timesteps=total_timesteps, callback=self.callback)
        self.threshold = threshold

    # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    def predict(self, obs):
        action, _ = self.model.predict(obs, deterministic=True)
        return action

    # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    def action_to_recommendation(self, action):
        recommendations = []
        for a in action:
            if a > self.threshold:
                recommendations.append('buy')
            elif a < -self.threshold:
                recommendations.append('sell')
            else:
                recommendations.append('hold')
        return recommendations

    # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    def validate(self, env):
        obs = env.reset()
        total_rewards = 0
        for _ in range(1000):  # Adjust based on needs
            action, _ = self.model.predict(obs)
            obs, reward, done, _ = env.step(action)
            total_rewards += reward
            if done:
                obs = env.reset()
        print(f'Agent Validation Reward: {total_rewards}')

# -----------------------------------------------------------------------------

# Define A2C Agent
class A2CAgent(PPOAgent):
    def __init__(self, env, total_timesteps, threshold):
        # Note: the parent constructor is intentionally not called here, so that a
        # redundant PPO model is not trained; only the A2C model below is trained.
        self.threshold = threshold
        self.callback = PolicyGradientLossCallback()
        self.model = A2C("MlpPolicy", env, verbose=1)
        self.model.learn(total_timesteps=total_timesteps, callback=self.callback)

# -----------------------------------------------------------------------------

# Define DDPG Agent
class DDPGAgent(PPOAgent):
    def __init__(self, env, total_timesteps, threshold):
        # As with A2CAgent, skip the parent constructor and train only the DDPG model
        self.threshold = threshold
        self.callback = PolicyGradientLossCallback()
        self.model = DDPG("MlpPolicy", env, verbose=1)
        self.model.learn(total_timesteps=total_timesteps, callback=self.callback)

# -----------------------------------------------------------------------------

# Define SAC Agent
class SACAgent(PPOAgent):
    def __init__(self, env, total_timesteps, threshold):
        # As with A2CAgent, skip the parent constructor and train only the SAC model
        self.threshold = threshold
        self.callback = PolicyGradientLossCallback()
        self.model = SAC("MlpPolicy", env, verbose=1)
        self.model.learn(total_timesteps=total_timesteps, callback=self.callback)

# -----------------------------------------------------------------------------

# Define TD3 Agent
class TD3Agent(PPOAgent):
    def __init__(self, env, total_timesteps, threshold):
        # As with A2CAgent, skip the parent constructor and train only the TD3 model
        self.threshold = threshold
        self.callback = PolicyGradientLossCallback()
        self.model = TD3("MlpPolicy", env, verbose=1)
        self.model.learn(total_timesteps=total_timesteps, callback=self.callback)

# -----------------------------------------------------------------------------

# Define Ensemble Agent
class EnsembleAgent:

    def __init__(self, ppo_model, a2c_model, ddpg_model, sac_model, td3_model, threshold):
        self.ppo_model = ppo_model
        self.a2c_model = a2c_model
        self.ddpg_model = ddpg_model
        self.sac_model = sac_model
        self.td3_model = td3_model
        self.threshold = threshold

    # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    def predict(self, obs):
        ppo_action, _ = self.ppo_model.predict(obs, deterministic=True)
        a2c_action, _ = self.a2c_model.predict(obs, deterministic=True)
        ddpg_action, _ = self.ddpg_model.predict(obs, deterministic=True)
        sac_action, _ = self.sac_model.predict(obs, deterministic=True)
        td3_action, _ = self.td3_model.predict(obs, deterministic=True)

        # Average the actions of the individual models
        ensemble_action = np.mean([ppo_action, a2c_action, ddpg_action, sac_action, td3_action], axis=0)
        return ensemble_action

    # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    def action_to_recommendation(self, action):
        recommendations = []
        for a in action:
            if a > self.threshold:
                recommendations.append('buy')
            elif a < -self.threshold:
                recommendations.append('sell')
            else:
                recommendations.append('hold')
        return recommendations

    # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    def validate(self, env):
        obs = env.reset()
        total_rewards = 0
        for _ in range(1000):  # Adjust based on needs
            action = self.predict(obs)
            obs, reward, done, _ = env.step(action)
            total_rewards += reward
            if done:
                obs = env.reset()
        print(f'Agent Validation Reward: {total_rewards}')

Here’s a detailed overview of the auxiliary functions used to train and evaluate the trading agents:

1. Function to Create Environment and Train Agents

create_env_and_train_agents

  • Purpose: Initializes trading environments and trains various trading agents.
  • Functionality:
  • Environments: Creates training (train_env) and validation (val_env) environments using the StockTradingEnv class.
  • Agents: Trains and validates each agent (PPO, A2C, DDPG, SAC, TD3) using their respective classes and validation data.
  • Ensemble: Trains and validates an ensemble agent that combines predictions from all individual models.
  • Returns: Provides the initialized environments and trained agents for further analysis.

2. Visualization Functions

visualize_portfolio

  • Purpose: Plots balance, net worth, and shares held over time.
  • Parameters:
  • steps: List of time steps.
  • balances, net_worths, shares_held: Metrics tracked over time.
  • tickers: List of stock tickers.
  • show_balance, show_net_worth, show_shares_held: Flags to control which plots are displayed.
  • Functionality: Creates a multi-panel plot for balance, net worth, and shares held, allowing visual inspection of portfolio performance over time.

visualize_portfolio_net_worth

  • Purpose: Plots net worth over time.
  • Parameters:
  • steps: List of time steps.
  • net_worths: Net worth tracked over time.
  • Functionality: Creates a single plot for net worth, providing a clear view of the portfolio’s value progression.

visualize_multiple_portfolio_net_worth

  • Purpose: Compares the net worth of multiple portfolios on the same chart.
  • Parameters:
  • steps: List of time steps.
  • net_worths_list: List of net worth series for different agents.
  • labels: Labels for each agent’s net worth series.
  • Functionality: Plots net worth for multiple agents on one chart, facilitating direct comparison.

3. Testing Functions

test_agent

  • Purpose: Tests a single agent’s performance in the environment and tracks key metrics.
  • Parameters:
  • env: Environment to test the agent in.
  • agent: Agent to be tested.
  • stock_data: Data related to stocks for metrics tracking.
  • n_tests: Number of test iterations.
  • visualize: Flag to control rendering of environment during testing.
  • Functionality: Runs the agent in the environment, collects metrics (balance, net worth, shares held), and optionally visualizes the environment.

test_and_visualize_agents

  • Purpose: Tests multiple agents and visualizes their performance.
  • Parameters:
  • env: Environment to test agents.
  • agents: Dictionary of agents to be tested.
  • data: Stock data used for metrics tracking.
  • n_tests: Number of test iterations.
  • Functionality: Tests each agent, collects performance metrics, and generates comparative visualizations of net worth over time.

4. Performance Comparison Function

compare_and_plot_agents

  • Purpose: Compares agents based on their returns, standard deviation, and Sharpe ratio.
  • Parameters:
  • agents_metrics: Metrics collected from testing agents.
  • labels: Labels for each agent.
  • risk_free_rate: Risk-free rate for Sharpe ratio calculation.
  • Functionality:
  • Comparison: Calculates returns, standard deviation, and Sharpe ratio for each agent.
  • Visualization: Displays a sorted dataframe and bar chart comparing the Sharpe ratios of the agents, highlighting which agent performed best relative to risk-adjusted returns.

These functions provide a comprehensive toolkit for training, testing, and evaluating trading agents, allowing for in-depth analysis and comparison of different models.

# Function to create the environment and train the agents
def create_env_and_train_agents(train_data, val_data, total_timesteps, threshold):

    # Create environments for training and validation
    train_env = DummyVecEnv([lambda: StockTradingEnv(train_data)])
    val_env = DummyVecEnv([lambda: StockTradingEnv(val_data)])

    # Train and validate the PPO agent
    ppo_agent = PPOAgent(train_env, total_timesteps, threshold)
    ppo_agent.validate(val_env)

    # Train and validate the A2C agent
    a2c_agent = A2CAgent(train_env, total_timesteps, threshold)
    a2c_agent.validate(val_env)

    # Train and validate the DDPG agent
    ddpg_agent = DDPGAgent(train_env, total_timesteps, threshold)
    ddpg_agent.validate(val_env)

    # Train and validate the SAC agent
    sac_agent = SACAgent(train_env, total_timesteps, threshold)
    sac_agent.validate(val_env)

    # Train and validate the TD3 agent
    td3_agent = TD3Agent(train_env, total_timesteps, threshold)
    td3_agent.validate(val_env)

    # Build and validate the ensemble agent from the trained models
    ensemble_agent = EnsembleAgent(ppo_agent.model, a2c_agent.model, ddpg_agent.model,
                                   sac_agent.model, td3_agent.model, threshold)
    ensemble_agent.validate(val_env)

    return train_env, val_env, ppo_agent, a2c_agent, ddpg_agent, sac_agent, td3_agent, ensemble_agent

# -----------------------------------------------------------------------------

# Function to visualize portfolio changes
def visualize_portfolio(steps, balances, net_worths, shares_held, tickers,
                        show_balance=True, show_net_worth=True, show_shares_held=True):

    fig, axs = plt.subplots(3, figsize=(12, 18))

    # Plot the balance
    if show_balance:
        axs[0].plot(steps, balances, label='Balance')
        axs[0].set_title('Balance Over Time')
        axs[0].set_xlabel('Steps')
        axs[0].set_ylabel('Balance')
        axs[0].legend()

    # Plot the net worth
    if show_net_worth:
        axs[1].plot(steps, net_worths, label='Net Worth', color='orange')
        axs[1].set_title('Net Worth Over Time')
        axs[1].set_xlabel('Steps')
        axs[1].set_ylabel('Net Worth')
        axs[1].legend()

    # Plot the shares held
    if show_shares_held:
        for ticker in tickers:
            axs[2].plot(steps, shares_held[ticker], label=f'Shares Held: {ticker}')
        axs[2].set_title('Shares Held Over Time')
        axs[2].set_xlabel('Steps')
        axs[2].set_ylabel('Shares Held')
        axs[2].legend()

    plt.tight_layout()
    plt.show()

# -----------------------------------------------------------------------------

# Function to visualize the portfolio net worth
def visualize_portfolio_net_worth(steps, net_worths):

    plt.figure(figsize=(12, 6))
    plt.plot(steps, net_worths, label='Net Worth', color='orange')
    plt.title('Net Worth Over Time')
    plt.xlabel('Steps')
    plt.ylabel('Net Worth')
    plt.legend()
    plt.show()

# -----------------------------------------------------------------------------

# Function to visualize multiple portfolio net worths on the same chart
def visualize_multiple_portfolio_net_worth(steps, net_worths_list, labels):

    plt.figure(figsize=(12, 6))
    for i, net_worths in enumerate(net_worths_list):
        plt.plot(steps, net_worths, label=labels[i])
    plt.title('Net Worth Over Time')
    plt.xlabel('Steps')
    plt.ylabel('Net Worth')
    plt.legend()
    plt.show()

# -----------------------------------------------------------------------------

def test_agent(env, agent, stock_data, n_tests=1000, visualize=False):
    """ Test a single agent and track performance metrics, with an option to visualize the results """

    # Initialize metrics tracking
    metrics = {
        'steps': [],
        'balances': [],
        'net_worths': [],
        'shares_held': {ticker: [] for ticker in stock_data.keys()}
    }

    # Reset the environment before starting the tests
    obs = env.reset()

    for i in range(n_tests):

        metrics['steps'].append(i)

        action = agent.predict(obs)

        obs, rewards, dones, infos = env.step(action)

        if visualize:
            env.render()

        # Track metrics (env is a DummyVecEnv, so attributes are read from the wrapped env)
        metrics['balances'].append(env.get_attr('balance')[0])
        metrics['net_worths'].append(env.get_attr('net_worth')[0])
        env_shares_held = env.get_attr('shares_held')[0]

        # Update shares held for each ticker
        for ticker in stock_data.keys():
            if ticker in env_shares_held:
                metrics['shares_held'][ticker].append(env_shares_held[ticker])
            else:
                metrics['shares_held'][ticker].append(0)  # Append 0 if ticker is not found

        if dones:
            obs = env.reset()

    return metrics

# -----------------------------------------------------------------------------

def test_and_visualize_agents(env, agents, data, n_tests=1000):

    metrics = {}
    for agent_name, agent in agents.items():
        print(f"Testing {agent_name}...")
        metrics[agent_name] = test_agent(env, agent, data, n_tests=n_tests, visualize=True)

    # Extract net worths for visualization
    net_worths = [metrics[agent_name]['net_worths'] for agent_name in agents.keys()]
    steps = next(iter(metrics.values()))['steps']  # Assuming all agents have the same step count for simplicity

    # Visualize the performance metrics of multiple agents
    visualize_multiple_portfolio_net_worth(steps, net_worths, list(agents.keys()))

# -----------------------------------------------------------------------------

def compare_and_plot_agents(agents_metrics, labels, risk_free_rate=0.0):

    # Function to compare returns, standard deviation, and Sharpe ratio of agents
    def compare_agents(agents_metrics, labels):
        returns = []
        stds = []
        sharpe_ratios = []

        for metrics in agents_metrics:

            net_worths = metrics['net_worths']

            # Calculate daily returns
            daily_returns = np.diff(net_worths) / net_worths[:-1]
            avg_return = np.mean(daily_returns)
            std_return = np.std(daily_returns)
            # Use np.inf (rather than a string) so that sorting and plotting still work
            sharpe_ratio = ((avg_return - risk_free_rate) / std_return) if std_return != 0 else np.inf

            returns.append(avg_return)
            stds.append(std_return)
            sharpe_ratios.append(sharpe_ratio)

        df = pd.DataFrame({
            'Agent': labels,
            'Return': returns,
            'Standard Deviation': stds,
            'Sharpe Ratio': sharpe_ratios
        })

        return df

    # Compare agents
    df = compare_agents(agents_metrics, labels)

    # Sort the dataframe by Sharpe ratio
    df_sorted = df.sort_values(by='Sharpe Ratio', ascending=False)

    # Display the dataframe
    display(df_sorted)

    # Plot bar chart of the Sharpe ratios
    plt.figure(figsize=(12, 6))
    plt.bar(df_sorted['Agent'], df_sorted['Sharpe Ratio'])
    plt.title('Sharpe Ratio Comparison')
    plt.xlabel('Agent')
    plt.ylabel('Sharpe Ratio')
    plt.show()

Finally, we are able to train the Trading Agents:

Training Parameters Setup:

  • Threshold: The threshold value determines the minimum magnitude of the action that will trigger a buy or sell decision. In this example, it is set to 0.1.
  • Total Timesteps: This parameter specifies the total number of timesteps for which the agents will be trained. Here, it is set to 10,000 timesteps.

Environment Creation and Agent Training:

  • Environment Creation: This step initializes the training and validation environments using the StockTradingEnv class, tailored to the provided stock data.
  • Agent Training: The create_env_and_train_agents function trains various reinforcement learning agents (PPO, A2C, DDPG, SAC, TD3) using the training environment. Each agent is trained for a specified number of timesteps.
  • Ensemble Agent: An ensemble agent, which combines the predictions of all individual models, is also trained. This approach aims to leverage the strengths of each model and potentially improve overall performance.

The returned objects include the trained environments and agents, which are then ready for further evaluation and performance analysis.

# Create the environment and train the agents
threshold = 0.1
total_timesteps = 10000
train_env, val_env, ppo_agent, a2c_agent, ddpg_agent, sac_agent, td3_agent, ensemble_agent = \
create_env_and_train_agents(training_data, validation_data, total_timesteps, threshold)

We can also test & visualize the agents:

n_tests = 1000
agents = {
    'PPO Agent': ppo_agent,
    'A2C Agent': a2c_agent,
    'DDPG Agent': ddpg_agent,
    'SAC Agent': sac_agent,
    'TD3 Agent': td3_agent,
    'Ensemble Agent': ensemble_agent
}

test_and_visualize_agents(train_env, agents, training_data, n_tests=n_tests)

test_env = DummyVecEnv([lambda: StockTradingEnv(test_data)])
test_and_visualize_agents(test_env, agents, test_data, n_tests=n_tests)

The corresponding results should look as follows:

Training and Test sets performance

We also compare the agents’ performance on the test data in terms of returns, standard deviation, and Sharpe ratio.

From the paper:

The higher an agent’s Sharpe ratio, the better its returns have been relative to the amount of investment risk it has taken. Therefore, we pick the trading agent that can maximize the returns adjusted to the increasing risk.
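
Concretely, the Sharpe ratio computed in compare_and_plot_agents is the mean of the step-to-step portfolio returns minus the risk-free rate, divided by the standard deviation of those returns: Sharpe = (mean(returns) - risk_free_rate) / std(returns), with the risk-free rate defaulting to 0.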

test_agents_metrics = [test_agent(test_env, agent, test_data, n_tests=n_tests, visualize=False) for agent in agents.values()]
compare_and_plot_agents(test_agents_metrics, list(agents.keys()))

The corresponding results should look as follows:

Agents Comparison

Lastly, we can also use the model to suggest next-day recommendations:

def prepare_next_day_data(stock_data):
    """ Prepares the observation for the next trading day """

    # Initialize the environment with the current stock data
    env = StockTradingEnv(stock_data)
    env.reset()

    # Prepare the next day's observation
    next_day_observations = env._next_observation()

    return next_day_observations

# -----------------------------------------------------------------------------

def generate_next_day_recommendations(agents, next_day_observation):
    """ Generate recommendations for the next trading day using the trained agents """

    recommendations = {agent_name: [] for agent_name in agents.keys()}

    for agent_name, agent in agents.items():
        action = agent.predict(next_day_observation)
        recs = agent.action_to_recommendation(action)
        recommendations[agent_name] = list(zip(recs, action))

    return recommendations

# -----------------------------------------------------------------------------

# Prepare next day's observation
next_day_observation = prepare_next_day_data(test_data)

# Generate recommendations for the next trading day
recommendations = generate_next_day_recommendations(agents, next_day_observation)

# Print the ensemble agent's recommendation (and raw action value) per ticker
for agent_name, recs in recommendations.items():
    if agent_name == 'Ensemble Agent':
        print(f'\nRecommendations for {agent_name}:')
        for ticker, recommendation in zip(tickers, recs):
            print(f"{ticker}: {recommendation}")

The corresponding results should look as follows:

Next day recommendations

Conclusions

We’ve navigated the intricate process of setting up and training reinforcement learning agents for stock trading using a custom trading environment. We began by designing a comprehensive environment that captures the nuances of stock trading, including transaction costs, state observations, and reward calculations. With this environment in place, we trained a variety of reinforcement learning agents — PPO, A2C, DDPG, SAC, and TD3 — each contributing its unique strengths to the trading strategy. By also implementing an ensemble agent that combines the predictions of all individual models, we aimed to maximize performance and robustness.

Our exploration demonstrates how these advanced algorithms can be applied to real-world trading scenarios, highlighting their potential to adapt and make informed decisions based on market data. The insights gained from this exercise not only showcase the power of reinforcement learning in finance but also emphasize the importance of rigorous evaluation and visualization in assessing agent performance. By continually refining our models and analyzing their results, we can strive towards more effective trading strategies and deeper understanding of market dynamics.
