Evaluation of Deep Reinforcement Learning Algorithms for Portfolio Optimisation
ArXiv ID: 2307.07694
Authors: Unknown
Abstract
We evaluate benchmark deep reinforcement learning algorithms on the task of portfolio optimisation using simulated data. The data-generating simulator is based on correlated geometric Brownian motion with the Bertsimas-Lo market impact model. Using the Kelly criterion (log utility) as the objective, we can analytically derive the optimal policy without market impact, which serves as an upper bound for measuring performance when market impact is included. We find that the off-policy algorithms DDPG, TD3 and SAC are unable to learn the right $Q$-function due to the noisy rewards and therefore perform poorly. The on-policy algorithms PPO and A2C, with the use of generalised advantage estimation, are able to deal with the noise and derive a close-to-optimal policy. The clipping variant of PPO was found to be important in preventing the policy from deviating from the optimum once converged. In a more challenging environment with regime changes in the GBM parameters, we find that PPO, combined with a hidden Markov model to learn and predict the regime context, is able to learn different policies adapted to each regime. Overall, we find that the sample complexity of these algorithms is too high for applications using real data: more than 2 million steps are needed to learn a good policy in the simplest setting, equivalent to almost 8,000 years of daily prices.
Keywords: Reinforcement learning, PPO, Portfolio optimization, Market impact, Geometric Brownian motion, Equities
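The data-generating process described in the abstract can be sketched in a few lines. The snippet below simulates correlated GBM price paths only; the Bertsimas-Lo impact term, which the paper applies on top of the GBM prices, is omitted, and all parameter values (`mu`, `sigma`, `corr`, the 252-day trading year) are illustrative rather than taken from the paper. The final comment spells out the arithmetic behind the "almost 8,000 years" sample-complexity figure.

```python
import numpy as np

def simulate_correlated_gbm(s0, mu, sigma, corr, n_steps, dt=1 / 252, seed=0):
    """Simulate correlated geometric Brownian motion price paths.

    s0, mu, sigma: arrays of shape (n_assets,) with initial prices, drifts
    and volatilities; corr is the (n_assets, n_assets) correlation matrix
    of the driving Brownian motions.
    """
    rng = np.random.default_rng(seed)
    n_assets = len(s0)
    chol = np.linalg.cholesky(corr)            # correlate the Gaussian shocks
    prices = np.empty((n_steps + 1, n_assets))
    prices[0] = s0
    for t in range(n_steps):
        z = chol @ rng.standard_normal(n_assets)
        # Exact GBM step: S_{t+1} = S_t * exp((mu - 0.5*sigma^2) dt + sigma sqrt(dt) z)
        prices[t + 1] = prices[t] * np.exp(
            (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
        )
    return prices

# The abstract's sample-complexity figure: 2 million daily steps is roughly
# 2_000_000 / 252 ≈ 7,900 years of trading days.
paths = simulate_correlated_gbm(
    s0=np.array([100.0, 100.0]),
    mu=np.array([0.05, 0.08]),
    sigma=np.array([0.15, 0.25]),
    corr=np.array([[1.0, 0.3], [0.3, 1.0]]),
    n_steps=1_000,
)
```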
Complexity vs Empirical Score
- Math Complexity: 7.5/10
- Empirical Rigor: 6.0/10
- Quadrant: Holy Grail
- Why: The paper employs advanced mathematics, including stochastic calculus (Itô’s lemma), SDEs, and regime-switching models, and uses a well-defined simulated environment with analytical benchmarks to rigorously evaluate algorithm performance, though it lacks backtesting on real-world data.
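For reference, the analytical benchmark is the classical log-utility (Kelly) optimum under frictionless GBM. The notation below ($\mu$, $\Sigma$, $r$, $w$) is ours and the derivation is the standard one via Itô's lemma, shown as a sketch rather than a reproduction of the paper's exact statement.

```latex
% Frictionless GBM: drift vector \mu, covariance \Sigma, risk-free rate r,
% portfolio weights w. Wealth dynamics:
\frac{dW_t}{W_t} = \bigl(r + w^\top(\mu - r\mathbf{1})\bigr)\,dt
                   + w^\top \operatorname{diag}(\sigma)\, dB_t
% Ito's lemma gives the expected log-growth rate
g(w) = r + w^\top(\mu - r\mathbf{1}) - \tfrac{1}{2}\, w^\top \Sigma\, w,
% which is concave in w and maximised at the Kelly (log-utility) optimum
w^{*} = \Sigma^{-1}(\mu - r\mathbf{1}).
```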
flowchart TD
A["Research Goal: Evaluate DRL Algorithms<br>for Portfolio Optimisation"] --> B{"Methodology"}
B --> C["Simulation: GBM +<br>Bertsimas-Lo Market Impact"]
B --> D["Objective: Kelly Criterion<br>Log Utility"]
B --> E["Benchmark: Analytical<br>Optimal Policy"]
C & D --> F["Algorithm Evaluation"]
F --> G["Off-Policy Failure<br>DDPG/TD3/SAC<br>Noisy Rewards"]
F --> H["On-Policy Success<br>PPO/A2C + GAE<br>Close to Optimal"]
H --> I["Key Findings"]
I --> J["PPO with HMM adapts<br>to regime changes"]
I --> K["High sample complexity<br>>2M steps equivalent<br>to ~8000 years"]
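Below is a minimal sketch of the two ingredients the abstract credits with handling the noisy rewards: generalised advantage estimation and PPO's clipped surrogate loss. Shapes, hyperparameter values and function names are illustrative, and episode-termination handling is omitted for brevity; this is not the paper's implementation.

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalised advantage estimation.

    rewards: shape (T,); values: V(s_t) for t = 0..T-1; last_value: bootstrap
    V(s_T). Discounting TD errors by gamma*lam trades the variance of noisy
    rewards against bias.
    """
    values = np.append(values, last_value)
    advantages = np.zeros_like(rewards)
    gae_t = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        gae_t = delta + gamma * lam * gae_t
        advantages[t] = gae_t
    return advantages

def ppo_clip_loss(ratio, advantages, eps=0.2):
    """PPO clipped surrogate: -E[min(r*A, clip(r, 1-eps, 1+eps)*A)].

    ratio = pi_new(a|s) / pi_old(a|s). Clipping removes the incentive to move
    the policy far from the data-collecting policy, which the abstract credits
    with keeping the converged policy near the optimum.
    """
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))
```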
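Likewise, a hedged sketch of the regime-context idea from the abstract: fit a hidden Markov model to returns and append the posterior regime probabilities to the state the policy sees. The choice of `hmmlearn`, the two-regime setting and the placeholder returns are assumptions for illustration, not necessarily the paper's implementation.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM  # assumption: the paper does not name its HMM library here

# Fit a 2-state Gaussian HMM to (placeholder) log returns and use the
# posterior regime probabilities as extra context for the PPO policy.
rng = np.random.default_rng(0)
log_returns = rng.normal(0.0003, 0.01, size=(1000, 2))  # illustrative returns, not real data

hmm = GaussianHMM(n_components=2, covariance_type="full", n_iter=100, random_state=0)
hmm.fit(log_returns)

regime_probs = hmm.predict_proba(log_returns)            # (T, n_regimes) posterior per step
context_state = np.concatenate([log_returns, regime_probs], axis=1)
```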