Myopic Optimality: why reinforcement learning portfolio management strategies lose money
ArXiv ID: 2509.12764
Authors: Yuming Ma
Abstract
Myopic optimization (MO) outperforms reinforcement learning (RL) in portfolio management: RL yields lower or negative returns, higher variance, larger costs, heavier CVaR, lower profitability, and greater model risk. We model execution/liquidation frictions with mark-to-market accounting. Using Malliavin calculus (Clark-Ocone/BEL), we derive policy gradients and the risk shadow price, unifying HJB and KKT. This yields duality-gap and convergence results: geometric convergence for MO versus convergence floors for RL. We quantify phantom profit in RL via a Malliavin policy-gradient contamination analysis and define a control-affects-dynamics (CAD) premium for RL, which is plausibly positive.
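The abstract contrasts myopic optimization with RL but does not spell out the MO objective. As a hedged illustration only, the sketch below assumes one common textbook form: at each step, choose a single-asset weight w that maximizes μw − (γ/2)σ²w² − c|w − w_prev|, then re-solve the same one-period problem next period. The functions `myopic_weight` and `myopic_path` and the parameters `mu`, `sigma`, `gamma`, and `cost` are hypothetical names that do not come from the paper. The paper's friction model and multi-asset setting may differ.

```python
from typing import List, Sequence


def myopic_weight(mu: float, sigma: float, gamma: float,
                  cost: float, w_prev: float) -> float:
    """One-period mean-variance weight for one asset with proportional costs.

    Maximizes  mu*w - 0.5*gamma*sigma**2*w**2 - cost*|w - w_prev|  over w.
    Hypothetical illustration; not the paper's exact objective.
    """
    if gamma <= 0 or sigma <= 0 or cost < 0:
        raise ValueError("need gamma > 0, sigma > 0, cost >= 0")
    curvature = gamma * sigma ** 2          # curvature of the risk penalty
    marginal = mu - curvature * w_prev      # marginal utility of trading away from w_prev
    if marginal > cost:                     # buying more is worth its cost
        return (mu - cost) / curvature
    if marginal < -cost:                    # selling is worth its cost
        return (mu + cost) / curvature
    return w_prev                           # inside the no-trade band: hold


def myopic_path(mus: Sequence[float], sigma: float, gamma: float,
                cost: float, w0: float = 0.0) -> List[float]:
    """Greedy (myopic) rebalancing: re-solve the one-period problem each step."""
    weights: List[float] = []
    w = w0
    for mu in mus:
        w = myopic_weight(mu, sigma, gamma, cost, w)
        weights.append(w)
    return weights
```

With zero cost, the rule reduces to the classical one-period weight μ/(γσ²). A positive cost creates a no-trade band around w_prev. This is one concrete way that frictions enter a myopic policy without any lookahead.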
Keywords: myopic optimization, portfolio management, Malliavin calculus, policy gradients, execution frictions
Complexity vs Empirical Score
- Math Complexity: 9.5/10
- Empirical Rigor: 4.0/10
- Quadrant: Lab Rats
- Why: The paper is intensely mathematical. It features advanced Malliavin calculus, stochastic differential equations (SDEs), HJB/KKT unification, and duality-gap analysis, which together indicate high complexity. However, the empirical evidence is largely conceptual and comparative, citing backtest failures reported in other works. It lacks original datasets, code, and detailed implementation metrics, which lowers its rigor score.
flowchart TD
A["Research Goal:<br>Why RL loses money in<br>Portfolio Management vs. Myopic Optimization"] --> B
subgraph B ["Methodology & Inputs"]
direction LR
B1["Model Execution/Liquidation<br>Frictions via Mark-to-Market"]
B2["Malliavin Calculus<br>Policy Gradients & Risk Shadow Price"]
B3["HJB & KKT Unification<br>Duality Gap & Convergence"]
end
B --> C["Computational Process:<br>Myopic vs. RL Strategy Comparison"]
subgraph D ["Key Findings / Outcomes"]
D1["Myopic Optimization:<br>Higher Returns, Lower Variance"]
D2["Reinforcement Learning:<br>Lower or Negative Returns, Heavier CVaR, Model Risk"]
D3["Phantom Profit Analysis:<br>Control-Affects-Dynamics (CAD) Premium"]
end
C --> D
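The findings attribute heavier CVaR and phantom profit to RL. The sketch below is a simple accounting illustration, not the paper's Malliavin policy-gradient contamination measure. It assumes a hypothetical friction model with a per-unit half-spread plus a quadratic temporary-impact cost. Under that model, it separates mark-to-market value from the cash that a one-shot liquidation would realize, and it computes a common historical CVaR estimator. All function and parameter names (`liquidation_value`, `phantom_profit`, `empirical_cvar`, `half_spread`, `impact`) are assumptions made for illustration.

```python
import math
from typing import Sequence


def mark_to_market(cash: float, qty: float, mid: float) -> float:
    """Book value with the position priced at the mid, ignoring frictions."""
    return cash + qty * mid


def liquidation_value(cash: float, qty: float, mid: float,
                      half_spread: float, impact: float) -> float:
    """Cash realized if the whole position is unwound in one trade.

    Hypothetical friction model: each unit traded pays `half_spread`, plus a
    temporary-impact cost `impact * qty**2` that grows with trade size.
    Works for long (qty > 0) and short (qty < 0) positions.
    """
    frictions = half_spread * abs(qty) + impact * qty ** 2
    return cash + qty * mid - frictions


def phantom_profit(cash: float, qty: float, mid: float,
                   half_spread: float, impact: float) -> float:
    """Part of the mark-to-market value that cannot be realized on exit."""
    return mark_to_market(cash, qty, mid) - liquidation_value(
        cash, qty, mid, half_spread, impact)


def empirical_cvar(returns: Sequence[float], alpha: float = 0.05) -> float:
    """Historical CVaR: mean loss over the worst ceil(alpha * n) returns.

    Losses are reported as positive numbers, so a larger value means a
    heavier left tail.
    """
    if not returns or not 0.0 < alpha <= 1.0:
        raise ValueError("need non-empty returns and 0 < alpha <= 1")
    losses = sorted((-r for r in returns), reverse=True)  # worst loss first
    # round() guards against float noise such as 0.07 * 100 = 7.000000000000001
    k = max(1, math.ceil(round(alpha * len(losses), 9)))
    return sum(losses[:k]) / k
```

For example, consider 100 shares marked at a mid of 10 with `half_spread=0.01` and `impact=0.0001`. The phantom profit is 0.01·100 + 0.0001·100² = 2 currency units, which a mark-to-market P&L would count but a liquidation would not deliver. With ten returns and `alpha=0.2`, `empirical_cvar` averages the two largest losses.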