Trade-R1: Bridging Verifiable Rewards to Stochastic Environments via Process-Level Reasoning Verification
arXiv ID: 2601.03948
Authors: Rui Sun, Yifan Sun, Sheng Xu, Li Zhao, Jing Li, Daxin Jiang, Cheng Hua, Zuo Bai
Abstract
Reinforcement Learning (RL) has enabled Large Language Models (LLMs) to achieve remarkable reasoning performance in domains like mathematics and coding, where verifiable rewards provide clear signals. However, extending this paradigm to financial decision-making is challenged by the market's stochastic nature: rewards are verifiable but inherently noisy, causing standard RL to degenerate into reward hacking. To address this, we propose Trade-R1, a model training framework that bridges verifiable rewards to stochastic environments via process-level reasoning verification. Our key innovation is a verification method that transforms the problem of evaluating reasoning over lengthy financial documents into a structured Retrieval-Augmented Generation (RAG) task. We construct a triangular consistency metric that assesses pairwise alignment among retrieved evidence, reasoning chains, and decisions, serving as a validity filter for noisy market returns. We explore two reward integration strategies: Fixed-effect Semantic Reward (FSR) for stable alignment signals, and Dynamic-effect Semantic Reward (DSR) for coupled magnitude optimization. Experiments on asset selection across markets in different countries demonstrate that our paradigm reduces reward hacking, with DSR achieving superior cross-market generalization while maintaining the highest reasoning consistency.
Keywords: Reinforcement Learning (RL), Large Language Models (LLMs), Verifiable Rewards, Retrieval-Augmented Generation (RAG), Stochastic Environments
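The core verification idea in the abstract is the triangular consistency metric: pairwise alignment among the retrieved evidence, the reasoning chain, and the final decision, used as a validity filter before the noisy market return is admitted as a reward. Below is a minimal Python sketch of that idea; the embedding model, the cosine-similarity scorer, the `triangular_consistency` / `is_valid_sample` names, and the 0.6 threshold are illustrative assumptions, not the paper's implementation (which evaluates alignment via a structured RAG-based verifier).

```python
# Minimal sketch of a triangular consistency check (illustrative only).
# Assumes pairwise alignment is scored by cosine similarity of sentence
# embeddings from an off-the-shelf model; the paper's actual verifier,
# thresholds, and aggregation may differ.
import numpy as np
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model

def _align(a: str, b: str) -> float:
    """Cosine similarity between two text spans, clipped to [0, 1]."""
    ea, eb = _model.encode([a, b], normalize_embeddings=True)
    return float(np.clip(ea @ eb, 0.0, 1.0))

def triangular_consistency(evidence: str, reasoning: str, decision: str) -> float:
    """Average pairwise alignment among evidence, reasoning chain, and decision."""
    pairs = [(evidence, reasoning), (reasoning, decision), (evidence, decision)]
    return float(np.mean([_align(a, b) for a, b in pairs]))

def is_valid_sample(evidence: str, reasoning: str, decision: str,
                    threshold: float = 0.6) -> bool:
    """Use the consistency score as a validity filter on noisy market returns."""
    return triangular_consistency(evidence, reasoning, decision) >= threshold
```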
Complexity vs Empirical Score
- Math Complexity: 6.0/10
- Empirical Rigor: 8.0/10
- Quadrant: Holy Grail
- Why: The paper employs advanced concepts from reinforcement learning (RL), large language models (LLMs), and structured verification metrics, indicating substantial mathematical sophistication. It also demonstrates strong empirical rigor through multi-market experiments (A-Share and US), backtest-style portfolio construction, and the proposal of specific, implementable frameworks like the Triangular Verification Protocol and semantic reward strategies.
```mermaid
flowchart TD
A["Research Goal:<br>Extend LLM RL to<br>Stochastic Financial Markets"] --> B["Input: Financial Docs &<br>Market Data"]
B --> C{"Process: Trade-R1 Framework<br>via Process-Level Verification"}
C --> D["Step 1: RAG-based<br>Evidence Retrieval"]
D --> E["Step 2: Triangular Consistency<br>Verification Metric"]
E --> F["Step 3: Reward Integration<br>FSR / DSR"]
F --> G["Outcome: Trade-R1 Model<br>Reduced Reward Hacking<br>DSR: Superior Generalization"]
```