When Reasoning Fails: Evaluating ‘Thinking’ LLMs for Stock Prediction
arXiv ID: 2511.08608
Authors: Rakeshkumar H Sodha
Abstract
Problem. “Thinking” LLMs (TLLMs) expose explicit or hidden reasoning traces and are widely believed to generalize better on complex tasks than direct LLMs. Whether this promise carries over to noisy, heavy-tailed, regime-switching financial data remains unclear.
Approach. Using Indian equities (NIFTY constituents), we run a rolling 48m/1m walk-forward evaluation at horizon k = 1 day and dial cross-sectional complexity via the universe size U ∈ {5, 11, 21, 36} while keeping the TLLM’s reasoning budget fixed at B = 512 tokens. We compare a direct LLM (gpt-4o-mini), a TLLM (gpt-5), and classical learners (ridge, random forest) on the cross-sectional ranking loss 1 - IC, MSE, and long/short backtests with realistic costs. Statistical confidence is assessed with Diebold-Mariano, Pesaran-Timmermann, and SPA tests.
Main findings. (i) As U grows under a fixed budget B, the TLLM’s ranking quality deteriorates, whereas the direct LLM remains flat and the classical baselines are stable. (ii) TLLM variance is higher, requiring ex-post calibration (winsorization and blending) for stability. (iii) Portfolio results under transaction costs do not support a net advantage for the TLLM.
Hypotheses. Our results are consistent with the following testable hypotheses. H1 (Capacity-Complexity Mismatch): for fixed B, TLLM accuracy degrades superlinearly in cross-sectional complexity. H2 (Reasoning Variance): TLLM outputs exhibit higher date-by-date dispersion than direct LLMs, increasing error bars and turnover. H3 (Domain Misfit): next-token prediction objectives and token-budgeted inference are poorly aligned with heavy-tailed, weakly predictable stock returns.
Implication. In our setting, “thinking” LLMs are not yet ready to replace classical or direct methods for short-horizon stock ranking; scaling the reasoning budget and/or re-aligning objectives appears necessary.
Keywords: Financial forecasting, Large language models, Cross-sectional ranking, Performance evaluation, Statistical testing, Equities
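The evaluation protocol in the abstract (rolling 48m/1m walk-forward, cross-sectional ranking loss 1 - IC) can be summarized in a few lines of code. The sketch below is not the paper's code: the panel layout, the column names (`month`, `ret_fwd`), and the `fit_predict` callback are illustrative assumptions, and it scores one IC per test window rather than averaging daily ICs as a horizon-k = 1 evaluation would.

```python
# Minimal walk-forward sketch under the assumptions stated above.
import numpy as np
import pandas as pd
from scipy.stats import spearmanr


def ranking_loss(y_true, y_pred):
    """1 - IC, where IC is the Spearman rank correlation across the cross-section."""
    ic, _ = spearmanr(y_true, y_pred)
    return 1.0 - ic


def walk_forward(panel, fit_predict, train_months=48, test_months=1):
    """panel: DataFrame indexed by (month, ticker) with feature columns and a 'ret_fwd' target.
    fit_predict(train_df, test_df) -> array of predictions aligned with test_df rows."""
    months = panel.index.get_level_values("month").unique().sort_values()
    losses = {}
    for start in range(0, len(months) - train_months - test_months + 1, test_months):
        train_m = months[start : start + train_months]
        test_m = months[start + train_months : start + train_months + test_months]
        train_df = panel[panel.index.get_level_values("month").isin(train_m)]
        test_df = panel[panel.index.get_level_values("month").isin(test_m)]
        preds = fit_predict(train_df, test_df)
        losses[test_m[0]] = ranking_loss(test_df["ret_fwd"].to_numpy(), np.asarray(preds))
    return pd.Series(losses)
```

Each element of the returned series is the test-window loss for one walk-forward step; stacking these series per model gives the per-period loss differentials fed to the Diebold-Mariano and SPA tests.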
Complexity vs Empirical Score
- Math Complexity: 6.5/10
- Empirical Rigor: 8.0/10
- Quadrant: Holy Grail
- Why: The paper employs formal statistical tests (Diebold-Mariano, Pesaran-Timmermann, SPA) and structured hypothesis testing (H1-H3), indicating moderate-to-high math complexity, and its walk-forward evaluation with realistic transaction costs and robustness checks gives it high empirical rigor, placing it in the Holy Grail quadrant.
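As a concrete reference for the first of the tests named above, here is a minimal Diebold-Mariano sketch (not the paper's implementation). It compares two per-period loss series, e.g. daily 1 - IC or squared errors, under the null of equal predictive accuracy; with the one-day horizon used in the paper (h = 1), the long-run-variance correction reduces to a plain variance.

```python
# Diebold-Mariano test sketch; asymptotically N(0, 1) under the null.
import numpy as np
from scipy.stats import norm


def diebold_mariano(loss_a, loss_b, h: int = 1):
    """DM test for equal predictive accuracy of two forecasters.
    loss_a, loss_b: per-period losses (same length); h: forecast horizon.
    Returns (DM statistic, two-sided asymptotic p-value)."""
    d = np.asarray(loss_a, dtype=float) - np.asarray(loss_b, dtype=float)
    T = d.size
    d_bar = d.mean()
    # Long-run variance of the loss differential: variance plus twice the
    # autocovariances up to lag h - 1 (the usual truncation for h-step forecasts).
    lrv = np.var(d, ddof=0)
    for k in range(1, h):
        lrv += 2.0 * np.cov(d[k:], d[:-k], ddof=0)[0, 1]
    dm_stat = d_bar / np.sqrt(lrv / T)
    return dm_stat, 2.0 * (1.0 - norm.cdf(abs(dm_stat)))
```

A negative statistic with a small p-value would indicate that the first model's losses are significantly lower. The Pesaran-Timmermann and SPA tests address directional accuracy and data-snooping across multiple comparisons, respectively, and are not sketched here.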
flowchart TD
A["Research Goal<br/>Test TLLMs vs. Direct LLMs on<br/>noisy financial data"] --> B["Methodology: Rolling Walk-Forward"]
B --> C["Data: NIFTY Constituents<br/>Complexity U = {"5, 11, 21, 36"}"]
C --> D["Models & Process<br/>TLLM (gpt-5) vs. Direct LLM (gpt-4o-mini)<br/>vs. Classical (Ridge, RF)<br/>Fixed Budget B = 512 tokens"]
D --> E{"Evaluation Metrics"}
E --> F["Ranking Loss 1-IC<br/>MSE<br/>Long/Short Backtests"]
E --> G["Statistical Tests<br/>Diebold-Mariano, SPA"]
F --> H["Key Findings"]
G --> H
H --> I(("1. TLLM accuracy drops as U grows"))
H --> J(("2. TLLM variance higher<br/>requires calibration"))
H --> K(("3. No net advantage under costs"))
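Finding 2 above refers to the ex-post calibration step described in the abstract as winsorization and blending. The sketch below shows one plausible form of that step for a single date's cross-section; the clip quantiles, the z-scoring, and the 50/50 blend weight are illustrative assumptions, not values taken from the paper.

```python
# Ex-post calibration sketch: winsorize the TLLM scores, then blend with a baseline.
import numpy as np


def winsorize(scores, lower_q=0.05, upper_q=0.95):
    """Clip a date's cross-sectional scores at the given quantiles to tame outliers."""
    lo, hi = np.quantile(scores, [lower_q, upper_q])
    return np.clip(scores, lo, hi)


def calibrate(tllm_scores, baseline_scores, w=0.5):
    """Convex blend of winsorized, z-scored TLLM predictions with a more stable
    baseline (e.g. the ridge model's predictions for the same date)."""
    def zscore(x):
        x = np.asarray(x, dtype=float)
        return (x - x.mean()) / (x.std() + 1e-12)
    return w * zscore(winsorize(np.asarray(tllm_scores, dtype=float))) + (1.0 - w) * zscore(baseline_scores)
```

A smaller w leans more heavily on the baseline; per the abstract, without this kind of stabilization the TLLM's date-by-date dispersion inflates error bars and turnover (hypothesis H2).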