All That Glisters Is Not Gold: A Benchmark for Reference-Free Counterfactual Financial Misinformation Detection
ArXiv ID: 2601.04160
Authors: Yuechen Jiang, Zhiwei Liu, Yupeng Cao, Yueru He, Ziyang Xu, Chen Xu, Zhiyang Deng, Prayag Tiwari, Xi Chen, Alejandro Lopez-Lira, Jimin Huang, Junichi Tsujii, Sophia Ananiadou
Abstract
We introduce RFC Bench, a benchmark for evaluating large language models on financial misinformation in realistic news settings. RFC Bench operates at the paragraph level and captures the contextual complexity of financial news, where meaning emerges from dispersed cues. The benchmark defines two complementary tasks: reference-free misinformation detection and comparison-based diagnosis using paired original/perturbed inputs. Experiments reveal a consistent pattern: performance is substantially stronger when comparative context is available, while reference-free settings expose significant weaknesses, including unstable predictions and elevated rates of invalid outputs. These results indicate that current models struggle to maintain coherent belief states without external grounding. By highlighting this gap, RFC Bench provides a structured testbed for studying reference-free reasoning and advancing more reliable financial misinformation detection in real-world settings.
Keywords: misinformation detection, large language models, benchmarking, financial news analysis, reference-free reasoning, General Finance
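The two tasks named in the abstract differ mainly in what the model is shown. The sketch below illustrates that difference only. The `Pair` type, the function names, and the prompt wording are all hypothetical and are not taken from the paper or its released code.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    """One benchmark item (hypothetical layout, not the paper's schema)."""
    original: str   # paragraph as published
    perturbed: str  # counterfactually edited version of the same paragraph


def reference_free_input(paragraph: str) -> str:
    """Task 1 sketch: the model sees one paragraph and no reference text."""
    return (
        "Does the following financial news paragraph contain misinformation? "
        "Answer 'yes' or 'no'.\n\n"
        f"Paragraph:\n{paragraph}"
    )


def comparison_input(pair: Pair) -> str:
    """Task 2 sketch: the model sees both versions side by side."""
    return (
        "Compare the two versions of a financial news paragraph and describe "
        "how the second differs from the first.\n\n"
        f"Original:\n{pair.original}\n\n"
        f"Perturbed:\n{pair.perturbed}"
    )


# Illustrative item; the text is made up for this example.
example = Pair(
    original="The company reported a 5% rise in quarterly revenue.",
    perturbed="The company reported a 5% fall in quarterly revenue.",
)
task1_prompt = reference_free_input(example.perturbed)
task2_prompt = comparison_input(example)
```

In the reference-free case the original paragraph never appears in the prompt, which is what makes the task harder: the model has to judge plausibility from the paragraph's internal cues alone.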
Complexity vs Empirical Score
- Math Complexity: 2.5/10
- Empirical Rigor: 7.0/10
- Quadrant: Street Traders
- Why: The paper is a benchmark construction and evaluation paper with minimal advanced mathematical formalisms, relying primarily on statistical metrics (accuracy, F1, AUROC) and LLM-based evaluation. It is highly empirical, featuring a curated dataset (RFC-Bench), explicit code/data release (GitHub link), and extensive implementation details for data curation and model experiments, making it highly backtest-ready.
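As a rough illustration of how the metrics mentioned above interact with invalid outputs, here is a minimal sketch of a scorer for binary detection. The label vocabulary, the `parse_label` and `score` names, and the choice to count an unparseable response as an error are assumptions made for this example, not the paper's evaluation protocol. AUROC is omitted because it needs per-item confidence scores rather than hard labels.

```python
from typing import Optional, Sequence


def parse_label(raw: str) -> Optional[int]:
    """Map a raw response to 1 (misinformation), 0 (clean), or None (invalid).

    The accepted strings are an assumed vocabulary for illustration.
    """
    text = raw.strip().lower()
    if text in {"yes", "true", "1", "misinformation"}:
        return 1
    if text in {"no", "false", "0", "clean"}:
        return 0
    return None


def score(gold: Sequence[int], raw_outputs: Sequence[str]) -> dict:
    """Compute accuracy, F1 on the positive class, and the invalid-output rate."""
    if not gold or len(gold) != len(raw_outputs):
        raise ValueError("gold and raw_outputs must be non-empty and equal in length")

    preds = [parse_label(r) for r in raw_outputs]
    n = len(gold)
    invalid = sum(p is None for p in preds)

    # Assumption: an invalid output never matches a gold label, so it is an error.
    correct = sum(p == g for p, g in zip(preds, gold))
    tp = sum(p == 1 and g == 1 for p, g in zip(preds, gold))
    fp = sum(p == 1 and g == 0 for p, g in zip(preds, gold))
    fn = sum(p != 1 and g == 1 for p, g in zip(preds, gold))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    return {"accuracy": correct / n, "f1": f1, "invalid_rate": invalid / n}


# Toy data: the third response cannot be parsed and is counted as invalid.
result = score([1, 0, 1, 0], ["yes", "no", "maybe", "yes"])
```

Reporting the invalid rate separately from accuracy keeps two failure modes apart: a model that answers wrongly and a model that fails to answer in the expected format at all.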
flowchart TD
A["Research Goal<br>Create benchmark for reference-free<br>financial misinformation detection"] --> B["Methodology<br>Benchmark Design: RFC Bench"]
B --> C["Data & Inputs<br>Paired Original & Perturbed<br>Financial News Paragraphs"]
C --> D{"Computational Processes<br>LLM Evaluation"}
D --> E["Task 1: Reference-Free Detection<br>No external context provided"]
D --> F["Task 2: Comparative Diagnosis<br>Paired original/perturbed context provided"]
E --> G["Key Findings & Outcomes"]
F --> G
G --> H["1. Comparative context significantly<br>boosts performance vs reference-free"]
G --> I["2. Reference-free settings expose<br>weaknesses: unstable predictions,<br>elevated invalid outputs"]
G --> J["3. Models struggle to maintain<br>coherent belief states without<br>external grounding"]
G --> K["4. RFC Bench provides structured<br>testbed for advancing reliable<br>financial misinformation detection"]