LiveTradeBench: Seeking Real-World Alpha with Large Language Models

ArXiv ID: 2511.03628

Authors: Haofei Yu, Fenghai Li, Jiaxuan You

Abstract

Large language models (LLMs) achieve strong performance across benchmarks–from knowledge quizzes and math reasoning to web-agent tasks–but these tests occur in static settings, lacking real dynamics and uncertainty. Consequently, they evaluate isolated reasoning or problem-solving rather than decision-making under uncertainty. To address this, we introduce LiveTradeBench, a live trading environment for evaluating LLM agents in realistic and evolving markets. LiveTradeBench follows three design principles: (i) Live data streaming of market prices and news, eliminating dependence on offline backtesting and preventing information leakage while capturing real-time uncertainty; (ii) a portfolio-management abstraction that extends control from single-asset actions to multi-asset allocation, integrating risk management and cross-asset reasoning; and (iii) multi-market evaluation across structurally distinct environments–U.S. stocks and Polymarket prediction markets–differing in volatility, liquidity, and information flow. At each step, an agent observes prices, news, and its portfolio, then outputs percentage allocations that balance risk and return. Using LiveTradeBench, we run 50-day live evaluations of 21 LLMs across families. Results show that (1) high LMArena scores do not imply superior trading outcomes; (2) models display distinct portfolio styles reflecting risk appetite and reasoning dynamics; and (3) some LLMs effectively leverage live signals to adapt decisions. These findings expose a gap between static evaluation and real-world competence, motivating benchmarks that test sequential decision making and consistency under live uncertainty.
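
The per-step interface described above (observe prices, news, and the current portfolio, then output percentage allocations) can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the `Observation` fields, the `"CASH"` key, the function names, and the frictionless rebalancing (no fees, no slippage, fractional units allowed) are all assumptions made for the example, and the asset names and prices are made up.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Observation:
    """What the agent sees at one step (field names are illustrative)."""
    prices: dict[str, float]    # latest price per asset
    news: list[str]             # recent headlines
    holdings: dict[str, float]  # units currently held per asset
    cash: float                 # uninvested cash


def portfolio_value(obs: Observation) -> float:
    """Mark to market: cash plus holdings valued at current prices."""
    return obs.cash + sum(
        obs.holdings.get(asset, 0.0) * price for asset, price in obs.prices.items()
    )


def normalize_allocation(raw: dict[str, float]) -> dict[str, float]:
    """Clip negative weights to zero and rescale so the weights sum to 1.

    If every weight is zero after clipping, fall back to holding only cash
    under the hypothetical "CASH" key.
    """
    clipped = {name: max(weight, 0.0) for name, weight in raw.items()}
    total = sum(clipped.values())
    if total == 0:
        return {"CASH": 1.0}
    return {name: weight / total for name, weight in clipped.items()}


def rebalance(obs: Observation, weights: dict[str, float]) -> Observation:
    """Move the portfolio to the target weights at current prices.

    Assumes frictionless execution. Any weight not assigned to a priced
    asset (including "CASH") stays in cash, so total value is conserved.
    """
    value = portfolio_value(obs)
    holdings = {
        asset: value * weights.get(asset, 0.0) / price
        for asset, price in obs.prices.items()
    }
    invested = sum(holdings[asset] * obs.prices[asset] for asset in holdings)
    return Observation(obs.prices, obs.news, holdings, value - invested)


if __name__ == "__main__":
    # Made-up assets and prices; the raw allocation stands in for an LLM's output.
    obs = Observation(
        prices={"AAA": 200.0, "BBB": 400.0}, news=[], holdings={}, cash=10_000.0
    )
    target = normalize_allocation({"AAA": 50, "BBB": 30, "CASH": 20})
    print(rebalance(obs, target))
```

Expressing the action as weights rather than per-asset buy and sell orders keeps the action space the same size no matter how much capital the agent holds, which fits the portfolio-management abstraction in design principle (ii).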

Keywords: LLM agents, Portfolio management, Live trading, Reinforcement learning, Multi-market evaluation, Equities

Complexity vs Empirical Score

  • Math Complexity: 3.5/10
  • Empirical Rigor: 8.0/10
  • Quadrant: Street Traders
  • Why: The paper focuses on building and evaluating a live trading environment with real-world data streams and a portfolio-management abstraction, and it relies on little advanced mathematical theory. Its empirical rigor is high because it uses live data, evaluates across two markets, and deploys 21 LLMs over 50 days.

flowchart TD
    A["Research Goal: Evaluate LLMs in dynamic, real-world decision-making\nunder uncertainty to bridge the gap between static benchmarks\nand live trading competence."] --> B["LiveTradeBench Design Principles:\n1. Live Data Streaming (Prices/News)\n2. Portfolio-Management Abstraction (Multi-Asset/Risk)\n3. Multi-Market Evaluation (Stocks/Prediction Markets)"]
    
    B --> C["Methodology: Live Evaluation\n- 21 LLMs (across families)\n- 50-Day Live Trading\n- U.S. Stocks & Polymarket"]
    
    C --> D["Computational Process:\n1. Agent observes live market state\n2. Outputs percentage allocations\n3. Executes multi-asset risk-return balancing"]
    
    D --> E["Key Findings:\n1. High LMArena scores ≠ superior trading outcomes\n2. Distinct portfolio styles (risk appetite, reasoning dynamics)\n3. Some LLMs adapt decisions to live signals\nImplication: a gap between static evaluation and real-world competence"]