Deep reinforcement learning for optimal trading with partial information

ArXiv ID: 2511.00190

Authors: Andrea Macrì, Sebastian Jaimungal, Fabrizio Lillo

Abstract

Reinforcement Learning (RL) applied to financial problems has been a lively area of research. However, the use of RL for optimal trading strategies that exploit latent information in the market has, to the best of our knowledge, not been widely tackled. In this paper we study an optimal trading problem where a trading signal follows an Ornstein-Uhlenbeck process with regime-switching dynamics. We employ a blend of RL and Recurrent Neural Networks (RNNs) to extract as much of the underlying information as possible from the trading signal with latent parameters. The latent parameters driving the signal's mean reversion, speed, and volatility are filtered from observations of the signal, and trading strategies are derived via RL. To address this problem, we propose three Deep Deterministic Policy Gradient (DDPG)-based algorithms that integrate Gated Recurrent Unit (GRU) networks to capture temporal dependencies in the signal. The first, a one-step approach (hid-DDPG), directly encodes hidden states from the GRU into the RL trader. The second and third are two-step methods: one (prob-DDPG) uses posterior regime probability estimates, while the other (reg-DDPG) relies on forecasts of the next signal value. Through extensive simulations with increasingly complex Markovian regime dynamics for the trading signal's parameters, as well as an empirical application to equity pair trading, we find that prob-DDPG achieves superior cumulative rewards and exhibits more interpretable strategies. By contrast, reg-DDPG provides limited benefits, while hid-DDPG offers intermediate performance with less interpretable strategies. Our results show that the quality and structure of the information supplied to the agent are crucial: embedding probabilistic insights into latent regimes substantially improves both the profitability and robustness of reinforcement learning-based trading strategies.
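
As a rough illustration of the trading signal described above, the Python sketch below simulates an Ornstein-Uhlenbeck process whose speed, long-run level, and volatility switch according to a hidden two-state Markov chain. The two-regime setup, parameter values, and Euler-Maruyama discretization are illustrative assumptions, not the paper's calibration.

```python
import numpy as np

def simulate_regime_switching_ou(
    n_steps=1_000,
    dt=1.0,
    theta=(0.5, 0.1),        # assumed mean-reversion speeds per regime
    mu=(0.05, -0.05),        # assumed long-run levels per regime
    sigma=(0.02, 0.06),      # assumed volatilities per regime
    transition=np.array([[0.99, 0.01],
                         [0.02, 0.98]]),  # assumed Markov transition matrix
    seed=0,
):
    """Simulate an OU signal whose (theta, mu, sigma) switch with a hidden
    two-state Markov chain, using an Euler-Maruyama discretization."""
    rng = np.random.default_rng(seed)
    regimes = np.empty(n_steps, dtype=int)
    signal = np.empty(n_steps)
    regimes[0], signal[0] = 0, mu[0]
    for t in range(1, n_steps):
        # latent regime evolves as a Markov chain (unobserved by the trader)
        regimes[t] = rng.choice(2, p=transition[regimes[t - 1]])
        k = regimes[t]
        # OU increment: dX = theta * (mu - X) dt + sigma dW
        drift = theta[k] * (mu[k] - signal[t - 1]) * dt
        diffusion = sigma[k] * np.sqrt(dt) * rng.standard_normal()
        signal[t] = signal[t - 1] + drift + diffusion
    return signal, regimes

signal, regimes = simulate_regime_switching_ou()
print(signal[:5], regimes[:5])
```

Only the signal path would be visible to the agent; the regime path is the latent information that the GRU-based variants try to recover.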

Keywords: Reinforcement Learning, trading strategies, Ornstein-Uhlenbeck process, GRU, DDPG, pair trading

Complexity vs Empirical Score

  • Math Complexity: 7.5/10
  • Empirical Rigor: 6.0/10
  • Quadrant: Holy Grail
  • Why: The paper involves advanced stochastic calculus (Ornstein-Uhlenbeck SDEs, regime-switching Markov chains) and complex RL algorithms (DDPG with GRUs), indicating high math density. It is grounded in empirical testing via simulations and real equity pair trading data, though the backtest-ready implementation details are not fully specified in the summary.
```mermaid
flowchart TD
    A["Research Goal: Optimal Trading<br>with Partial/Latent Info"] --> B["Data & Environment"]
    B --> C{"RL Agent Architecture"}
    
    subgraph B ["Inputs"]
        B1["Ornstein-Uhlenbeck Signal<br>with Regime-Switching"]
        B2["Market Observations"]
    end

    subgraph C ["Core Methodology: DDPG + GRU"]
        direction TB
        C1["GRU: Extract Temporal<br>Latent Features"]
        C2["RL Actor-Critic<br>DDPG Policy Optimization"]
        C1 --> C2
    end

    C --> D{"Algorithm Variants"}
    
    D --> E["hid-DDPG<br>Direct Hidden States"]
    D --> F["prob-DDPG<br>Posterior Regime<br>Probabilities"]
    D --> G["reg-DDPG<br>Signal Forecast"]

    E & F & G --> H["Simulation &<br>Equity Pair Trading"]
    
    H --> I{"Key Findings / Outcomes"}
    I --> J["prob-DDPG: Superior<br>Profit & Interpretability"]
    I --> K["reg-DDPG: Limited Benefit"]
    I --> L["hid-DDPG: Intermediate<br>Performance"]
```
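
The practical difference between hid-DDPG and prob-DDPG is what the GRU hands to the actor: the raw hidden state versus a posterior over latent regimes. The PyTorch sketch below illustrates that interface only; the layer sizes, the regime head standing in for the paper's filtering step, and the choice of state features are all hypothetical assumptions.

```python
import torch
import torch.nn as nn

class GRUEncoder(nn.Module):
    """Encodes a window of signal observations into a hidden state and a
    softmax over latent regimes (an illustrative stand-in for the filter)."""
    def __init__(self, hidden_dim=32, n_regimes=2):
        super().__init__()
        self.gru = nn.GRU(input_size=1, hidden_size=hidden_dim, batch_first=True)
        self.regime_head = nn.Linear(hidden_dim, n_regimes)

    def forward(self, signal_window):               # (batch, T, 1)
        _, h = self.gru(signal_window)               # h: (1, batch, hidden_dim)
        h = h.squeeze(0)
        regime_probs = torch.softmax(self.regime_head(h), dim=-1)
        return h, regime_probs

class DDPGActor(nn.Module):
    """Deterministic policy mapping (market state, extra features) to a position."""
    def __init__(self, state_dim, feature_dim, max_position=1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + feature_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Tanh(),
        )
        self.max_position = max_position

    def forward(self, state, features):
        x = torch.cat([state, features], dim=-1)
        return self.max_position * self.net(x)

# hid-DDPG feeds the GRU hidden state; prob-DDPG feeds regime probabilities.
encoder = GRUEncoder()
window = torch.randn(8, 20, 1)                       # batch of 8 signal windows of length 20
state = torch.randn(8, 3)                            # e.g. current signal, inventory, time (assumed)
hidden, probs = encoder(window)
hid_actor = DDPGActor(state_dim=3, feature_dim=32)
prob_actor = DDPGActor(state_dim=3, feature_dim=2)
print(hid_actor(state, hidden).shape, prob_actor(state, probs).shape)
```

Under this reading, reg-DDPG would instead feed a scalar forecast of the next signal value as the extra feature, which is consistent with the paper's finding that the richer probabilistic input (prob-DDPG) is the more informative choice.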