Law-Strength Frontiers and a No-Free-Lunch Result for Law-Seeking Reinforcement Learning on Volatility Law Manifolds

ArXiv ID: 2511.17304

Authors: Jian’an Zhang

Abstract

We study reinforcement learning (RL) on volatility surfaces through the lens of Scientific AI. We ask whether axiomatic no-arbitrage laws, imposed as soft penalties on a learned world model, can reliably align high-capacity RL agents, or mainly create Goodhart-style incentives to exploit model errors. From classical static no-arbitrage conditions we build a finite-dimensional convex volatility law manifold of admissible total-variance surfaces, together with a metric law-penalty functional and a Graceful Failure Index (GFI) that normalizes law degradation under shocks. A synthetic generator produces law-consistent trajectories, while a recurrent neural world model trained without law regularization exhibits structured off-manifold errors. On this testbed we define a Goodhart decomposition \(r = r^{\mathcal{M}} + r^{\perp}\), where \(r^{\perp}\) is ghost arbitrage from off-manifold prediction error. We prove a ghost-arbitrage incentive theorem for PPO-type agents, a law-strength trade-off theorem showing that stronger penalties eventually worsen P&L, and a no-free-lunch theorem: under a law-consistent world model and law-aligned strategy class, unconstrained law-seeking RL cannot Pareto-dominate structural baselines on P&L, penalties, and GFI. In experiments on an SPX/VIX-like world model, simple structural strategies form the empirical law-strength frontier, while all law-seeking RL variants underperform and move into high-penalty, high-GFI regions. Volatility thus provides a concrete case where reward shaping with verifiable penalties is insufficient for robust law alignment.
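To make the abstract's objects concrete, the following is a minimal sketch, not the paper's code, of a discrete law-penalty functional built from the two classical static no-arbitrage checks on a total-variance grid (calendar-spread monotonicity in maturity and a convexity-in-strike proxy for the butterfly condition), together with one plausible normalization of a Graceful Failure Index. The function names, the grid representation, and the exact GFI normalization are assumptions, not the paper's definitions.

```python
# Minimal sketch (illustrative names, not the paper's code) of a law-penalty
# functional on a discrete total-variance grid w[i, j] ~ w(k_i, T_j), plus one
# plausible Graceful Failure Index normalization.
import numpy as np


def law_penalty(w: np.ndarray) -> float:
    """Sum of squared violations of two static no-arbitrage proxies.

    w: total-variance grid of shape (num_strikes, num_maturities), with
       log-moneyness increasing along axis 0 and maturity along axis 1.
    """
    # Calendar condition: total variance should be non-decreasing in maturity.
    cal_viol = np.clip(w[:, :-1] - w[:, 1:], 0.0, None)

    # Butterfly proxy: penalize concavity in strike via negative discrete
    # second differences (a stand-in for the full Durrleman-type condition).
    fly_viol = np.clip(-np.diff(w, n=2, axis=0), 0.0, None)

    return float((cal_viol ** 2).sum() + (fly_viol ** 2).sum())


def graceful_failure_index(w_base: np.ndarray, w_shocked: np.ndarray,
                           shock_size: float) -> float:
    """Law degradation under a shock, normalized by shock magnitude.

    This is only one way to normalize; the paper's GFI definition may differ.
    """
    degradation = law_penalty(w_shocked) - law_penalty(w_base)
    return max(degradation, 0.0) / max(shock_size, 1e-12)


if __name__ == "__main__":
    k = np.linspace(-0.5, 0.5, 21)                          # log-moneyness grid
    T = np.array([0.1, 0.25, 0.5, 1.0])                     # maturity grid
    w = 0.04 * T[None, :] * (1.0 + 0.5 * k[:, None] ** 2)   # toy admissible surface
    w_shocked = w.copy()
    w_shocked[10, 1] *= 0.5                                  # knock one point off the manifold
    print(law_penalty(w), graceful_failure_index(w, w_shocked, shock_size=0.5))
```

In this picture, the law manifold is the set of grids with zero penalty, and the abstract's "ghost arbitrage" corresponds to apparent P&L that only exists because the world model's predicted surface sits off that set.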

Keywords: Reinforcement Learning (RL), Volatility surface, No-arbitrage laws, Proximal Policy Optimization (PPO), Scientific AI, Derivatives (Options)

Complexity vs Empirical Score

  • Math Complexity: 9.5/10
  • Empirical Rigor: 4.0/10
  • Quadrant: Lab Rats
  • Why: The paper is dense with advanced theory, including incentive and trade-off theorems with proofs, a convex law-manifold construction, and a Goodhart decomposition, indicating very high mathematical complexity. Empirically, while it uses a synthetic testbed and specific metrics (e.g., GFI), the work is primarily theoretical and diagnostic, with no real-world data, backtests, or code, placing it lower on empirical rigor.

```mermaid
flowchart TD
    A["Research Goal"] --> B["Data: Law-Consistent Trajectories"]
    B --> C["Methodology: RL on World Model"]
    subgraph D ["Computations"]
        D1["Train RNN World Model"]
        D2["Apply Law Penalty"]
        D3["Run PPO Agent"]
    end
    C --> D
    D --> E{"Outcomes"}
    E --> F["Ghost Arbitrage Incentive"]
    E --> G["Law-Strength Trade-off"]
    E --> H["No Free Lunch on Volatility Frontier"]
```
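As a complement to the flowchart, here is a minimal, hedged sketch of the reward shaping the pipeline implies: the PPO agent optimizes model-implied P&L minus a scaled law penalty, which is the structure under which the abstract's ghost-arbitrage incentive arises. The function and parameter names are illustrative assumptions; the paper's actual training loop and penalty weights are not reproduced.

```python
# Illustrative sketch of penalty-shaped reward for a law-seeking PPO agent.
# `pnl_model` is the P&L implied by the learned world model; because that
# model has off-manifold errors, this reward contains a "ghost arbitrage"
# component that the soft penalty only partially removes.
from typing import Callable
import numpy as np


def shaped_reward(pnl_model: float,
                  predicted_surface: np.ndarray,
                  law_penalty: Callable[[np.ndarray], float],
                  lam: float) -> float:
    """Reward fed to the PPO update: r_shaped = r_model - lam * penalty."""
    return pnl_model - lam * law_penalty(predicted_surface)
```

Raising `lam` pushes the agent toward the law manifold but, per the abstract's law-strength trade-off theorem, eventually worsens P&L; the no-free-lunch result states that no such shaped objective lets unconstrained law-seeking RL Pareto-dominate the structural baselines on P&L, penalties, and GFI.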