RL-Exec: Impact-Aware Reinforcement Learning for Opportunistic Optimal Liquidation, Outperforms TWAP and a Book-Liquidity VWAP on BTC-USD Replays

ArXiv ID: 2511.07434

Authors: Enzo Duflot, Stanislas Robineau

Abstract

We study opportunistic optimal liquidation over fixed deadlines on BTC-USD limit-order books (LOB). We present RL-Exec, a PPO agent trained on historical replays augmented with endogenous transient impact (resilience), partial fills, maker/taker fees, and latency. The policy observes depth-20 LOB features plus microstructure indicators and acts under a sell-only inventory constraint to reach a residual target. Evaluation follows a strict time split (train: Jan-2020; test: Feb-2020) and a per-day protocol: for each test day we run ten independent start times and aggregate to a single daily score, avoiding pseudo-replication. We compare the agent to (i) TWAP and (ii) a VWAP-like baseline allocating using opposite-side order-book liquidity (top-20 levels), both executed on identical timestamps and costs. Statistical inference uses one-sided Wilcoxon signed-rank tests on daily RL-baseline differences with Benjamini-Hochberg FDR correction and bootstrap confidence intervals. On the Feb-2020 test set, RL-Exec significantly outperforms both baselines and the gap increases with the execution horizon (+2-3 bps at 30 min, +7-8 bps at 60 min, +23 bps at 120 min). Code: github.com/Giafferri/RL-Exec
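The two baselines in the abstract can be pictured as allocation rules over the decision steps of one episode. The sketch below is one plausible reading, not the authors' implementation: the function names and the `bid_depths` input (summed size of the top-20 bid levels at each step) are assumptions made for illustration. It also weights by the full depth path of the episode, which looks ahead; a causal baseline would need a running or forecast estimate instead.

```python
def twap_schedule(total_qty, n_steps):
    """Split total_qty evenly across n_steps decision points (TWAP)."""
    if n_steps <= 0:
        raise ValueError("n_steps must be positive")
    return [total_qty / n_steps] * n_steps


def liquidity_weighted_schedule(total_qty, bid_depths):
    """Allocate total_qty in proportion to opposite-side (bid) depth.

    bid_depths[t] is a hypothetical input: the summed size of the
    top-20 bid levels observed at decision step t.
    """
    total_depth = sum(bid_depths)
    if total_depth <= 0:
        # No visible liquidity at all: fall back to an even split.
        return twap_schedule(total_qty, len(bid_depths))
    return [total_qty * depth / total_depth for depth in bid_depths]


if __name__ == "__main__":
    # Sell 10 BTC over three steps; the middle step shows the deepest book.
    print(twap_schedule(10.0, 3))
    print(liquidity_weighted_schedule(10.0, [1.0, 3.0, 1.0]))
```

Both rules return child-order sizes that sum to the parent quantity, so they can be executed on the same timestamps and under the same cost model as the agent.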

Keywords: PPO agent, limit order book, transient impact, market microstructure, TWAP/VWAP, Cryptocurrency (BTC-USD)

Complexity vs Empirical Score

  • Math Complexity: 6.0/10
  • Empirical Rigor: 8.5/10
  • Quadrant: Holy Grail
  • Why: The paper employs advanced reinforcement learning (PPO) and impact models from optimal execution theory (e.g., Almgren-Chriss, Obizhaeva-Wang), indicating moderate-to-high mathematical density. It is highly empirically rigorous with real BTC-USD LOB data, backtested on historical replays with explicit transaction costs and impact, and uses robust statistical inference (Wilcoxon tests, FDR correction, bootstrapping).
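To make the transient-impact idea concrete, here is a minimal sketch in the spirit of Obizhaeva-Wang: each sale displaces the price, and the displacement decays exponentially between decisions (resilience). The linear impact coefficient `kappa`, the resilience rate `rho`, and the function name are illustrative assumptions; the summary does not give the exact impact specification used in RL-Exec.

```python
import math


def transient_impact_costs(child_sizes, dt, kappa, rho):
    """Impact cost of each child order under exponential resilience.

    kappa: hypothetical price displacement per unit sold.
    rho:   hypothetical resilience rate (per second).
    dt:    seconds between consecutive decisions.
    """
    decay = math.exp(-rho * dt)
    displacement = 0.0  # current price displacement caused by past sales
    costs = []
    for qty in child_sizes:
        displacement *= decay  # the book recovers since the last decision
        # The order pays the existing displacement plus, on average,
        # half of the displacement it creates itself.
        costs.append(qty * (displacement + 0.5 * kappa * qty))
        displacement += kappa * qty
    return costs


if __name__ == "__main__":
    block = transient_impact_costs([4.0, 0.0, 0.0, 0.0], dt=60.0, kappa=1.0, rho=0.01)
    even = transient_impact_costs([1.0, 1.0, 1.0, 1.0], dt=60.0, kappa=1.0, rho=0.01)
    print(sum(block), sum(even))
```

In this toy model, spreading the order out is cheaper whenever `rho > 0`, because the book partly refills between child orders. With `rho = 0` the impact is permanent and the total cost is the same for any schedule.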
  flowchart TD
    A["Research Goal: Find optimal liquidation strategy<br>for BTC-USD LOBs under fixed deadlines"] --> B["Data & Simulation Engine"]

    B --> C["Historical BTC-USD LOB Replays<br>Jan 2020 (Train) | Feb 2020 (Test)"]
    B --> D["Realistic Execution Environment<br>Includes: Partial fills, fees, latency, transient impact"]

    C & D --> E["Model Architecture"]
    subgraph E ["Computational Process"]
        E1["PPO Agent"] --> E2["Observation: Depth-20 LOB + Microstructure"]
        E3["Action: Sell-only inventory control"] --> E4["Output: Execution Schedule"]
    end

    E --> F["Comparison & Inference"]
    subgraph F ["Evaluation"]
        F1["RL-Exec"]
        F2["TWAP Baseline"]
        F3["VWAP-like Baseline<br>(Order-book Liquidity Alloc.)"]
    end

    F --> G["Key Findings / Outcomes"]
    subgraph G ["Feb 2020 Test Set"]
        G1["RL-Exec significantly outperforms TWAP & VWAP"]
        G2["Performance gap widens with horizon"]
        G3["Stats: 1-sided Wilcoxon + Benjamini-Hochberg FDR"]
    end
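
The evaluation steps in the flowchart (one score per test day, FDR correction across comparisons, bootstrap confidence intervals) can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not the authors' code: all function names are hypothetical, the percentile bootstrap is one common choice among several, and the raw one-sided p-values are assumed to come from a signed-rank routine such as `scipy.stats.wilcoxon(diffs, alternative="greater")` applied to the daily RL-minus-baseline differences.

```python
import random
from statistics import mean


def daily_scores(runs_by_day):
    """Collapse the runs of each test day into a single mean score,
    so that days (not individual runs) are the unit of inference."""
    return {day: mean(runs) for day, runs in runs_by_day.items()}


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def bootstrap_mean_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean daily difference."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


if __name__ == "__main__":
    # Made-up numbers, for illustration only (bps of RL minus baseline).
    runs = {"day1": [1.0, 3.0], "day2": [0.5, 1.5], "day3": [4.0, 2.0]}
    diffs = list(daily_scores(runs).values())
    print(bootstrap_mean_ci(diffs))
    print(benjamini_hochberg([0.01, 0.04, 0.03]))
```

Averaging within each day before testing is what keeps the ten start times per day from being counted as ten independent observations.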