Tail-Safe Hedging: Explainable Risk-Sensitive Reinforcement Learning with a White-Box CBF–QP Safety Layer in Arbitrage-Free Markets
ArXiv ID: 2510.04555
Authors: Jian’an Zhang
Abstract
We introduce Tail-Safe, a deployability-oriented framework for derivatives hedging that unifies distributional, risk-sensitive reinforcement learning with a white-box control-barrier-function (CBF) quadratic-program (QP) safety layer tailored to financial constraints. The learning component combines an IQN-based distributional critic with a CVaR objective (IQN–CVaR–PPO) and a Tail-Coverage Controller that regulates quantile sampling through temperature tilting and tail boosting to stabilize small-$\alpha$ estimation. The safety component enforces discrete-time CBF inequalities together with domain-specific constraints – ellipsoidal no-trade bands, box and rate limits, and a sign-consistency gate – solved as a convex QP whose telemetry (active sets, tightness, rate utilization, gate scores, slack, and solver status) forms an auditable trail for governance. We provide guarantees of robust forward invariance of the safe set under bounded model mismatch, a minimal-deviation projection interpretation of the QP, a KL-to-DRO upper bound linking per-state KL regularization to worst-case CVaR, concentration and sample-complexity results for the temperature-tilted CVaR estimator, and a CVaR trust-region improvement inequality under KL limits, together with feasibility persistence under expiry-aware tightening. Empirically, in arbitrage-free, microstructure-aware synthetic markets (SSVI $\to$ Dupire $\to$ VIX with ABIDES/MockLOB execution), Tail-Safe improves left-tail risk without degrading central performance and yields zero hard-constraint violations whenever the QP is feasible with zero slack. Telemetry is mapped to governance dashboards and incident workflows to support explainability and auditability. Limitations include reliance on synthetic data and simplified execution to isolate methodological contributions.
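To make the small-$\alpha$ estimation problem concrete, the sketch below computes a left-tail CVaR in two ways: directly from P&L samples, and from a quantile function whose levels are oversampled in the tail and reweighted. It is an illustration only, not the paper's Tail-Coverage Controller: the function names, the mixture sampler, and the `boost` parameter are assumptions introduced here, and the paper's temperature tilting is not reproduced.

```python
import math
import random


def empirical_cvar(pnl, alpha):
    """Mean of the worst ceil(alpha * n) P&L outcomes (left tail)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    ordered = sorted(pnl)
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:k]) / k


def tail_boosted_cvar(quantile_fn, alpha, n, boost=0.5, seed=0):
    """Monte Carlo CVaR from a quantile function with oversampled tail levels.

    Hypothetical sampler (an assumption, not the paper's controller):
    with probability `boost` draw tau from U(0, alpha), otherwise from U(0, 1).
    Importance weights 1 / q(tau) remove the bias this introduces, so the
    estimate targets (1 / alpha) * integral_0^alpha Q(tau) dtau.
    """
    rng = random.Random(seed)
    # Density of the sampling mixture on the interval [0, alpha].
    tail_density = boost / alpha + (1.0 - boost)
    total = 0.0
    for _ in range(n):
        if rng.random() < boost:
            tau = rng.uniform(0.0, alpha)
        else:
            tau = rng.random()
        if tau <= alpha:
            total += quantile_fn(tau) / tail_density
    return total / (alpha * n)


pnl = [-5.0, -1.0, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0]
cvar_20 = empirical_cvar(pnl, 0.2)  # mean of the two worst outcomes: -3.0

# P&L uniform on [0, 1] has quantile function Q(tau) = tau, so the exact
# CVaR at level 0.05 is 0.025; the estimate should land close to that.
cvar_mc = tail_boosted_cvar(lambda tau: tau, alpha=0.05, n=20000)
```

With uniform sampling of quantile levels, only about an `alpha` fraction of draws would fall in the tail; the mixture raises that share to `boost + (1 - boost) * alpha`, which is the intuition behind boosting tail coverage when $\alpha$ is small.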
Keywords: Control barrier functions, Distributional RL, CVaR hedging, IQN, Risk-sensitive reinforcement learning, Derivatives
Complexity vs Empirical Score
- Math Complexity: 9.2/10
- Empirical Rigor: 4.5/10
- Quadrant: Lab Rats
- Why: The paper is mathematically dense with advanced control theory, distributional RL, and DRO proofs, but the empirical evaluation is limited to synthetic markets and lacks real-world data or full backtesting pipelines.
flowchart TD
A["Research Goal<br>Deployable, explainable<br>tail-safe hedging"] --> B["Data & Inputs"]
B --> C["Key Methodology"]
C --> D["Computational Process"]
D --> E["Key Findings & Outcomes"]
subgraph B ["Data & Inputs"]
B1["SSVI → Dupire → VIX<br>Market Simulator"]
B2["Arbitrage-free,<br>microstructure-aware"]
end
subgraph C ["Key Methodology"]
C1["Distributional RL<br>IQN-CVaR-PPO"]
C2["White-Box Safety Layer<br>CBF-QP"]
C3["Tail-Coverage Controller<br>Quantile Sampling"]
end
subgraph D ["Computational Process"]
D1["Learn Policy & Critic<br>(IQN)"]
D2["Enforce Constraints<br>Ellipsoidal/Box/Rate Limits"]
D3["Solve Convex QP<br>Project to Safe Set"]
D4["Generate Audit Trail<br>Active Sets, Slack, Status"]
end
subgraph E ["Key Findings & Outcomes"]
E1["Robust Forward Invariance<br>Bounded Model Mismatch"]
E2["Zero Hard Violations<br>When QP Feasible"]
E3["KL-to-DRO Upper Bound<br>Per-state CVaR Guarantees"]
E4["Governance Dashboards<br>Explainability & Auditability"]
end
B1 --> C1 & C2 & C3
C1 & C2 & C3 --> D1 & D2 & D3 & D4
D4 --> E1 & E2 & E3 & E4
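The "Solve Convex QP / Project to Safe Set" step can be illustrated with a deliberately reduced sketch. Assuming an identity cost weighting and only the box and rate limits, the minimal-deviation QP separates per coordinate and is solved exactly by clipping. The CBF inequalities, ellipsoidal no-trade bands, sign-consistency gate, and slack variables from the abstract are omitted here and would need a general QP solver. All names and the telemetry fields are hypothetical.

```python
import math


def project_action(u_rl, u_prev, box_lo, box_hi, rate_max, tol=1e-9):
    """Solve min ||u - u_rl||^2 s.t. box_lo <= u_i <= box_hi and
    |u_i - u_prev_i| <= rate_max, and report simple telemetry."""
    u, active = [], []
    for i, (proposed, prev) in enumerate(zip(u_rl, u_prev)):
        # Intersect the box limit with the rate limit around the previous action.
        lo = max(box_lo, prev - rate_max)
        hi = min(box_hi, prev + rate_max)
        if lo > hi + tol:
            return {"status": "infeasible", "u": None,
                    "active": [], "deviation": None}
        x = min(max(proposed, lo), hi)
        if x - proposed > tol:
            # Clipped upward: the lower bound binds.
            kind = "box" if box_lo >= prev - rate_max else "rate"
            active.append((i, kind, "lower"))
        elif proposed - x > tol:
            # Clipped downward: the upper bound binds.
            kind = "box" if box_hi <= prev + rate_max else "rate"
            active.append((i, kind, "upper"))
        u.append(x)
    deviation = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, u_rl)))
    return {"status": "optimal", "u": u,
            "active": active, "deviation": deviation}


# The policy proposes a large jump in the first hedge position.
result = project_action(
    u_rl=[0.9, -0.2], u_prev=[0.1, 0.0],
    box_lo=-0.5, box_hi=0.5, rate_max=0.3,
)
# The first coordinate is cut back to about 0.4 by the rate limit;
# the second is already admissible and passes through unchanged.
```

The returned dictionary stands in for the audit trail: `active` records which limit bound each coordinate, `deviation` measures how far the safety layer moved the proposed action, and `status` distinguishes a solved projection from an empty feasible set.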