RL-Exec: Impact-Aware Reinforcement Learning for Opportunistic Optimal Liquidation, Outperforms TWAP and a Book-Liquidity VWAP on BTC-USD Replays
ArXiv ID: 2511.07434
Authors: Enzo Duflot, Stanislas Robineau
Abstract
We study opportunistic optimal liquidation over fixed deadlines on BTC-USD limit order books (LOBs). We present RL-Exec, a PPO agent trained on historical replays augmented with endogenous transient impact (resilience), partial fills, maker/taker fees, and latency. The policy observes depth-20 LOB features plus microstructure indicators and acts under a sell-only inventory constraint to reach a residual target. Evaluation follows a strict time split (train: Jan-2020; test: Feb-2020) and a per-day protocol: for each test day we run ten independent start times and aggregate them into a single daily score, avoiding pseudo-replication. We compare the agent to (i) TWAP and (ii) a VWAP-like baseline that allocates according to opposite-side order-book liquidity (top-20 levels), both executed at identical timestamps and under identical costs. Statistical inference uses one-sided Wilcoxon signed-rank tests on daily RL-minus-baseline differences with Benjamini-Hochberg FDR correction and bootstrap confidence intervals. On the Feb-2020 test set, RL-Exec significantly outperforms both baselines, and the gap increases with the execution horizon (+2-3 bps at 30 min, +7-8 bps at 60 min, +23 bps at 120 min). Code: github.com/Giafferri/RL-Exec
Keywords: PPO agent, limit order book, transient impact, market microstructure, TWAP/VWAP, Cryptocurrency (BTC-USD)
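To make the two baselines in the abstract concrete, here is a minimal Python sketch of how a TWAP schedule and a book-liquidity-weighted schedule could be built. The function names are hypothetical, and the sketch assumes the opposite-side (bid) depth at each decision step is supplied as a plain list; the paper's baseline may compute its weights differently, for example step by step from the live top-20 levels.

```python
from typing import List, Sequence


def twap_schedule(total_qty: float, n_steps: int) -> List[float]:
    """Split the parent order into equal child orders, one per decision step."""
    if n_steps <= 0:
        raise ValueError("n_steps must be positive")
    return [total_qty / n_steps] * n_steps


def liquidity_weighted_schedule(
    total_qty: float, bid_depth_per_step: Sequence[float]
) -> List[float]:
    """Allocate the parent order in proportion to the bid-side depth per step.

    Assumption (not from the paper): depth is summed over the visible levels
    and known for every step; a zero-depth profile falls back to TWAP.
    """
    total_depth = sum(bid_depth_per_step)
    if total_depth <= 0:
        return twap_schedule(total_qty, len(bid_depth_per_step))
    return [total_qty * depth / total_depth for depth in bid_depth_per_step]
```

Both functions return child-order sizes that sum to the parent order, so the two baselines liquidate the same inventory over the same steps and differ only in how the volume is distributed.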
Complexity vs Empirical Score
- Math Complexity: 6.0/10
- Empirical Rigor: 8.5/10
- Quadrant: Holy Grail
- Why: The paper employs advanced reinforcement learning (PPO) and impact models from optimal execution theory (e.g., Almgren-Chriss, Obizhaeva-Wang), indicating moderate-to-high mathematical density. It is highly empirically rigorous with real BTC-USD LOB data, backtested on historical replays with explicit transaction costs and impact, and uses robust statistical inference (Wilcoxon tests, FDR correction, bootstrapping).
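The inference pipeline named above (per-day aggregation, one-sided Wilcoxon signed-rank test, Benjamini-Hochberg correction, bootstrap intervals) can be sketched with the Python standard library alone. This is an illustrative sketch, not the authors' code: all function names are hypothetical, the daily score is assumed to be the mean of that day's runs, the Wilcoxon p-value is the exact one and assumes no zero or tied absolute differences, and the bootstrap is a plain percentile interval on the mean.

```python
import random
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def daily_scores(runs_by_day: Dict[str, Sequence[float]]) -> List[float]:
    """Collapse each day's runs into one score (mean, by assumption), in day order."""
    return [mean(runs_by_day[day]) for day in sorted(runs_by_day)]


def wilcoxon_greater(diffs: Sequence[float]) -> float:
    """Exact one-sided p-value for H1: differences tend to be positive.

    Assumes no zero differences and no ties among absolute differences.
    """
    ranked = sorted(diffs, key=abs)
    n = len(ranked)
    w_plus = sum(rank for rank, d in enumerate(ranked, start=1) if d > 0)
    max_sum = n * (n + 1) // 2
    # counts[s] = number of sign patterns whose positive-rank sum equals s
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        for s in range(max_sum, rank - 1, -1):
            counts[s] += counts[s - rank]
    return sum(counts[w_plus:]) / 2 ** n


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for pos in range(m - 1, -1, -1):  # walk from the largest p-value down
        i = order[pos]
        running_min = min(running_min, p_values[i] * m / (pos + 1))
        adjusted[i] = running_min
    return adjusted


def bootstrap_mean_ci(
    diffs: Sequence[float], n_boot: int = 2000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean daily difference."""
    rng = random.Random(seed)
    data = list(diffs)
    means = sorted(mean(rng.choices(data, k=len(data))) for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

In this sketch, one p-value would be computed per (baseline, horizon) pair from the daily RL-minus-baseline differences, and the whole family of p-values would then be passed to `benjamini_hochberg` together. Testing on daily scores rather than individual runs keeps the sample size equal to the number of test days, which is the point of the per-day protocol.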
flowchart TD
A["Research Goal: Find optimal liquidation strategy<br>for BTC-USD LOBs under fixed deadlines"] --> B["Data & Simulation Engine"]
B --> C["Historical BTC-USD LOB Replays<br>Jan 2020 (Train) | Feb 2020 (Test)"]
B --> D["Realistic Execution Environment<br>Includes: Partial fills, fees, latency, transient impact"]
C & D --> E["Model Architecture"]
subgraph E ["Computational Process"]
E1["PPO Agent"] --> E2["Observation: Depth-20 LOB + Microstructure"]
E3["Action: Sell-only inventory control"] --> E4["Output: Execution Schedule"]
end
E --> F["Comparison & Inference"]
subgraph F ["Evaluation"]
F1["RL-Exec"]
F2["TWAP Baseline"]
F3["VWAP-like Baseline<br>(Order-book Liquidity Alloc.)"]
end
F --> G["Key Findings / Outcomes"]
subgraph G ["Feb 2020 Test Set"]
G1["RL-Exec significantly outperforms TWAP & VWAP"]
G2["Performance gap widens with horizon"]
G3["Stats: 1-sided Wilcoxon + Benjamini-Hochberg FDR"]
end
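The "transient impact" component of the execution environment can be illustrated with a small state machine: each sell pushes the price down, and the displacement decays as the book refills (resilience). The sketch below assumes a linear impact per unit sold and exponential decay with a fixed half-life, in the spirit of Obizhaeva-Wang-style models; the class name, parameters, and functional form are illustrative assumptions, not the paper's calibrated simulator.

```python
from dataclasses import dataclass


@dataclass
class TransientImpact:
    """Toy transient-impact state (all parameters are hypothetical)."""

    kappa_bps_per_btc: float  # price displacement added per BTC sold, in bps
    half_life_s: float        # seconds for the displacement to halve
    impact_bps: float = 0.0   # current displacement below the unimpacted mid

    def decay(self, dt_s: float) -> None:
        """Let the book recover for dt_s seconds (exponential resilience)."""
        self.impact_bps *= 0.5 ** (dt_s / self.half_life_s)

    def sell(self, qty_btc: float, mid_price: float) -> float:
        """Apply a sell of qty_btc and return the impacted execution price."""
        self.impact_bps += self.kappa_bps_per_btc * qty_btc
        return mid_price * (1.0 - self.impact_bps / 1e4)


# Usage: two 1 BTC sells, 60 seconds apart, against a flat 10,000 USD mid.
model = TransientImpact(kappa_bps_per_btc=2.0, half_life_s=60.0)
first_price = model.sell(1.0, 10_000.0)
model.decay(60.0)
second_price = model.sell(1.0, 10_000.0)
```

Because the second sell lands on a partially recovered book, it pays less impact than it would immediately after the first. That trade-off between waiting for resilience and meeting the deadline is what makes the agent's timing decisions matter in an impact-aware replay.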