
All That Glisters Is Not Gold: A Benchmark for Reference-Free Counterfactual Financial Misinformation Detection

All That Glisters Is Not Gold: A Benchmark for Reference-Free Counterfactual Financial Misinformation Detection ArXiv ID: 2601.04160 “View on arXiv” Authors: Yuechen Jiang, Zhiwei Liu, Yupeng Cao, Yueru He, Ziyang Xu, Chen Xu, Zhiyang Deng, Prayag Tiwari, Xi Chen, Alejandro Lopez-Lira, Jimin Huang, Junichi Tsujii, Sophia Ananiadou Abstract We introduce RFC Bench, a benchmark for evaluating large language models on financial misinformation in realistic news. RFC Bench operates at the paragraph level and captures the contextual complexity of financial news, where meaning emerges from dispersed cues. The benchmark defines two complementary tasks: reference-free misinformation detection and comparison-based diagnosis using paired original-perturbed inputs. Experiments reveal a consistent pattern: performance is substantially stronger when comparative context is available, while reference-free settings expose significant weaknesses, including unstable predictions and elevated invalid outputs. These results indicate that current models struggle to maintain coherent belief states without external grounding. By highlighting this gap, RFC Bench provides a structured testbed for studying reference-free reasoning and advancing more reliable financial misinformation detection in real-world settings. ...
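The two task settings above translate directly into two prompting modes. Below is a minimal sketch in Python; the prompt wording, answer set, and Example container are illustrative assumptions, not RFC Bench's actual format.

```python
# Hypothetical sketch of the two RFC Bench task settings; the real benchmark's
# prompt format, labels, and data schema are not specified in this summary.
from dataclasses import dataclass

@dataclass
class Example:
    original: str   # original financial news paragraph
    perturbed: str  # counterfactually perturbed version of the same paragraph

def reference_free_prompt(paragraph: str) -> str:
    """Task 1: judge a single paragraph with no external grounding."""
    return ("Does the following financial news paragraph contain misinformation? "
            "Answer 'yes' or 'no'.\n\n" + paragraph)

def comparison_based_prompt(ex: Example) -> str:
    """Task 2: diagnose the perturbation given the paired original paragraph."""
    return ("Compare the two paragraphs and state which factual claim was altered.\n\n"
            f"ORIGINAL:\n{ex.original}\n\nCANDIDATE:\n{ex.perturbed}")

def is_valid(answer: str) -> bool:
    # The abstract reports elevated invalid outputs in the reference-free
    # setting, so scoring needs an explicit validity check like this one.
    return answer.strip().lower() in {"yes", "no"}
```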

January 7, 2026 · 2 min · Research Team

EDINET-Bench: Evaluating LLMs on Complex Financial Tasks using Japanese Financial Statements

EDINET-Bench: Evaluating LLMs on Complex Financial Tasks using Japanese Financial Statements ArXiv ID: 2506.08762 “View on arXiv” Authors: Issa Sugiura, Takashi Ishida, Taro Makino, Chieko Tazuke, Takanori Nakagawa, Kosuke Nakago, David Ha Abstract Financial analysis presents complex challenges that could leverage large language model (LLM) capabilities. However, the scarcity of challenging financial datasets, particularly for Japanese financial data, impedes academic innovation in financial analytics. As LLMs advance, this lack of accessible research resources increasingly hinders their development and evaluation in this specialized domain. To address this gap, we introduce EDINET-Bench, an open-source Japanese financial benchmark designed to evaluate the performance of LLMs on challenging financial tasks, including accounting fraud detection, earnings forecasting, and industry prediction. EDINET-Bench is constructed by downloading annual reports from the past 10 years from Japan’s Electronic Disclosure for Investors’ NETwork (EDINET) and automatically assigning labels corresponding to each evaluation task. Our experiments reveal that even state-of-the-art LLMs struggle, performing only slightly better than logistic regression in binary classification for fraud detection and earnings forecasting. These results highlight significant challenges in applying LLMs to real-world financial applications and underscore the need for domain-specific adaptation. Our dataset, benchmark construction code, and evaluation code are publicly available to facilitate future research in finance with LLMs. ...
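Since the abstract's headline comparison is against a logistic-regression baseline, here is a minimal sketch of such a baseline for the binary fraud-detection task; the features are synthetic placeholders, not EDINET-Bench's actual inputs.

```python
# Logistic-regression baseline sketch for binary fraud detection; synthetic
# features stand in for whatever EDINET-Bench derives from annual reports.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))     # placeholder report-derived features
y = rng.integers(0, 2, size=500)  # 1 = filing labelled as accounting fraud

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```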

June 10, 2025 · 2 min · Research Team

Towards Competent AI for Fundamental Analysis in Finance: A Benchmark Dataset and Evaluation

Towards Competent AI for Fundamental Analysis in Finance: A Benchmark Dataset and Evaluation ArXiv ID: 2506.07315 “View on arXiv” Authors: Zonghan Wu, Congyuan Zou, Junlin Wang, Chenhan Wang, Hangjing Yang, Yilei Shao Abstract Generative AI, particularly large language models (LLMs), is beginning to transform the financial industry by automating tasks and helping to make sense of complex financial information. One especially promising use case is the automatic creation of fundamental analysis reports, which are essential for making informed investment decisions, evaluating credit risks, guiding corporate mergers, etc. While LLMs attempt to generate these reports from a single prompt, the risks of inaccuracy are significant. Poor analysis can lead to misguided investments, regulatory issues, and loss of trust. Existing financial benchmarks mainly evaluate how well LLMs answer financial questions but do not reflect performance in real-world tasks like generating financial analysis reports. In this paper, we propose FinAR-Bench, a solid benchmark dataset focusing on financial statement analysis, a core competence of fundamental analysis. To make the evaluation more precise and reliable, we break this task into three measurable steps: extracting key information, calculating financial indicators, and applying logical reasoning. This structured approach allows us to objectively assess how well LLMs perform each step of the process. Our findings offer a clear understanding of LLMs' current strengths and limitations in fundamental analysis and provide a more practical way to benchmark their performance in real-world financial settings. ...
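To make the three-step decomposition concrete, here is a small sketch of the middle step, computing indicators from extracted line items; the specific ratios and field names are illustrative, not FinAR-Bench's actual indicator set.

```python
# Illustrative step 2 of the pipeline: turn extracted statement line items
# into financial indicators that step 3 can reason over.
def financial_indicators(items: dict) -> dict:
    return {
        "current_ratio": items["current_assets"] / items["current_liabilities"],
        "net_margin": items["net_income"] / items["revenue"],
        "roe": items["net_income"] / items["shareholders_equity"],
        "debt_to_equity": items["total_liabilities"] / items["shareholders_equity"],
    }

extracted = {  # step 1 would pull these from the financial statements
    "current_assets": 120.0, "current_liabilities": 80.0,
    "net_income": 15.0, "revenue": 200.0,
    "shareholders_equity": 90.0, "total_liabilities": 110.0,
}
print(financial_indicators(extracted))
```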

May 22, 2025 · 2 min · Research Team

FinTSBridge: A New Evaluation Suite for Real-world Financial Prediction with Advanced Time Series Models

FinTSBridge: A New Evaluation Suite for Real-world Financial Prediction with Advanced Time Series Models ArXiv ID: 2503.06928 “View on arXiv” Authors: Unknown Abstract With the growing attention to time series forecasting in recent years, many studies have proposed various solutions to address the challenges encountered in time series prediction, aiming to improve forecasting performance. However, effectively applying these time series forecasting models to the field of financial asset pricing remains a challenging issue. There is still a need for a bridge to connect cutting-edge time series forecasting models with financial asset pricing. To bridge this gap, we have undertaken the following efforts: 1) We constructed three datasets from the financial domain; 2) We selected over ten time series forecasting models from recent studies and validated their performance in financial time series; 3) We developed new metrics, msIC and msIR, in addition to MSE and MAE, to showcase the time series correlation captured by the models; 4) We designed financial-specific tasks for these three datasets and assessed the practical performance and application potential of these forecasting models in important financial problems. We hope the developed new evaluation suite, FinTSBridge, can provide valuable insights into the effectiveness and robustness of advanced forecasting models in financial domains. ...
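The abstract names msIC and msIR without defining them here, so the sketch below only pairs the standard MSE/MAE with a generic information-coefficient-style rank correlation; it is an assumption about the flavor of metric, not the paper's formulas.

```python
# Standard error metrics plus an IC-style rank correlation between forecasts
# and realised values; msIC/msIR as defined in the paper may differ.
import numpy as np
from scipy.stats import spearmanr

def mse(pred, true):
    return float(np.mean((pred - true) ** 2))

def mae(pred, true):
    return float(np.mean(np.abs(pred - true)))

def ic(pred, true):
    return float(spearmanr(pred, true).correlation)

pred = np.array([0.02, -0.01, 0.005, 0.03])
true = np.array([0.015, -0.02, 0.0, 0.025])
print(mse(pred, true), mae(pred, true), ic(pred, true))
```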

March 10, 2025 · 2 min · Research Team

LOB-Bench: Benchmarking Generative AI for Finance -- an Application to Limit Order Book Data

LOB-Bench: Benchmarking Generative AI for Finance – an Application to Limit Order Book Data ArXiv ID: 2502.09172 “View on arXiv” Authors: Unknown Abstract While financial data presents one of the most challenging and interesting sequence modelling tasks due to high noise, heavy tails, and strategic interactions, progress in this area has been hindered by the lack of consensus on quantitative evaluation paradigms. To address this, we present LOB-Bench, a benchmark, implemented in Python, designed to evaluate the quality and realism of generative message-by-order data for limit order books (LOB) in the LOBSTER format. Our framework measures distributional differences in conditional and unconditional statistics between generated and real LOB data, supporting flexible multivariate statistical evaluation. The benchmark also includes commonly used LOB statistics such as spread, order book volumes, order imbalance, and message inter-arrival times, along with scores from a trained discriminator network. Lastly, LOB-Bench contains “market impact metrics”, i.e., the cross-correlations and price response functions for specific events in the data. We benchmark generative autoregressive state-space models, a (C)GAN, as well as a parametric LOB model, and find that the autoregressive GenAI approach beats traditional model classes. ...
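The core idea, comparing distributions of LOB statistics between real and generated data, can be sketched in a few lines; the statistic definitions below are standard, but the data and layout are placeholders rather than the LOBSTER schema.

```python
# Compute simple LOB statistics and measure the distributional gap between
# real and generated samples, as LOB-Bench does; data here is simulated.
import numpy as np
from scipy.stats import wasserstein_distance

def spread_and_imbalance(best_ask, best_bid, ask_vol, bid_vol):
    spread = best_ask - best_bid
    imbalance = (bid_vol - ask_vol) / (bid_vol + ask_vol)
    return spread, imbalance

print(spread_and_imbalance(best_ask=100.05, best_bid=100.00, ask_vol=300, bid_vol=500))

rng = np.random.default_rng(1)
real_spread = rng.gamma(2.0, 0.5, size=10_000)  # stand-in for real spreads
gen_spread = rng.gamma(2.2, 0.5, size=10_000)   # stand-in for generated spreads
print("spread W1 distance:", wasserstein_distance(real_spread, gen_spread))
```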

February 13, 2025 · 2 min · Research Team

INVESTORBENCH: A Benchmark for Financial Decision-Making Tasks with LLM-based Agent

INVESTORBENCH: A Benchmark for Financial Decision-Making Tasks with LLM-based Agent ArXiv ID: 2412.18174 “View on arXiv” Authors: Unknown Abstract Recent advancements have underscored the potential of large language model (LLM)-based agents in financial decision-making. Despite this progress, the field currently encounters two main challenges: (1) the lack of a comprehensive LLM agent framework adaptable to a variety of financial tasks, and (2) the absence of standardized benchmarks and consistent datasets for assessing agent performance. To tackle these issues, we introduce InvestorBench, the first benchmark specifically designed for evaluating LLM-based agents in diverse financial decision-making contexts. InvestorBench enhances the versatility of LLM-enabled agents by providing a comprehensive suite of tasks applicable to different financial products, including single equities (stocks), cryptocurrencies, and exchange-traded funds (ETFs). Additionally, we assess the reasoning and decision-making capabilities of our agent framework using thirteen different LLMs as backbone models, across various market environments and tasks. Furthermore, we have curated a diverse collection of open-source, multi-modal datasets and developed a comprehensive suite of environments for financial decision-making. This establishes a highly accessible platform for evaluating financial agents’ performance across various scenarios. ...
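An agent benchmark of this kind boils down to an observe-decide-act loop around an LLM backbone. The loop below is a hypothetical sketch; InvestorBench's real environments, observation format, and action space are not described in this summary.

```python
# Hypothetical agent loop: feed each market observation to an LLM backbone and
# map its reply onto a small action set, with a safe fallback for bad outputs.
from typing import Callable

ACTIONS = {"buy", "hold", "sell"}

def run_episode(observations: list[str], llm: Callable[[str], str]) -> list[str]:
    actions = []
    for obs in observations:
        reply = llm(f"Market update: {obs}\nRespond with one of buy/hold/sell.")
        action = reply.strip().lower()
        actions.append(action if action in ACTIONS else "hold")
    return actions

# Usage with a stub standing in for a real LLM backbone:
print(run_episode(["AAPL beats earnings", "ETF outflows rise"], lambda _: "hold"))
```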

December 24, 2024 · 2 min · Research Team

FinPT: Financial Risk Prediction with Profile Tuning on Pretrained Foundation Models

FinPT: Financial Risk Prediction with Profile Tuning on Pretrained Foundation Models ArXiv ID: 2308.00065 “View on arXiv” Authors: Unknown Abstract Financial risk prediction plays a crucial role in the financial sector. Machine learning methods have been widely applied for automatically detecting potential risks and thus saving the cost of labor. However, development in this field has lagged behind in recent years due to two facts: 1) the algorithms used are somewhat outdated, especially in the context of the fast advance of generative AI and large language models (LLMs); 2) the lack of a unified and open-sourced financial benchmark has impeded related research for years. To tackle these issues, we propose FinPT and FinBench: the former is a novel approach for financial risk prediction that conducts Profile Tuning on large pretrained foundation models, and the latter is a set of high-quality datasets on financial risks such as default, fraud, and churn. In FinPT, we fill the financial tabular data into a pre-defined instruction template, obtain natural-language customer profiles by prompting LLMs, and fine-tune large foundation models with the profile text to make predictions. We demonstrate the effectiveness of the proposed FinPT by experimenting with a range of representative strong baselines on FinBench. The analytical studies further deepen the understanding of LLMs for financial risk prediction. ...
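The profile-tuning recipe (tabular row, then instruction template, then LLM-written profile, then fine-tuning) can be illustrated with a toy template; the wording and field names are assumptions, not FinPT's published template.

```python
# Toy version of the FinPT-style input construction: fill tabular customer
# features into an instruction template for an LLM to turn into a profile.
def profile_prompt(row: dict) -> str:
    fields = "; ".join(f"{k.replace('_', ' ')}: {v}" for k, v in row.items())
    return ("Write a short natural-language profile of this customer for "
            f"financial risk assessment. Attributes: {fields}")

row = {"age": 42, "income": 55_000, "num_late_payments": 3, "credit_limit": 12_000}
print(profile_prompt(row))
# The generated profile text would then be used to fine-tune a foundation
# model to predict labels such as default, fraud, or churn.
```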

July 22, 2023 · 2 min · Research Team

Fundamental Indexation

Fundamental Indexation ArXiv ID: ssrn-713865 “View on arXiv” Authors: Unknown
Abstract A trillion-dollar industry is based on investing in or benchmarking to capitalization-weighted indexes, even though the finance literature rejects the mean-variance ...
Keywords: capitalization-weighted indexes, mean-variance, passive investing, benchmarking, portfolio optimization, Equities
Complexity vs Empirical Score: Math Complexity 2.0/10 · Empirical Rigor 8.0/10 · Quadrant: Street Traders
Why: The paper presents a straightforward, intuitive strategy (fundamental indexing) with minimal mathematical derivations, but heavily relies on empirical backtests, real-world benchmark comparisons, and data analysis to challenge capitalization-weighted norms.
Flowchart (Mermaid):
flowchart TD
    A["Research Goal:<br/>Test if capitalization-weighted indexes<br/>are truly optimal"] --> B["Methodology:<br/>Compare Cap-Weighted vs.<br/>Fundamental Indexation"]
    B --> C["Data: Equities &<br/>Fundamental Metrics"]
    C --> D["Computation:<br/>Mean-Variance Optimization<br/>& Portfolio Simulation"]
    D --> E["Key Finding:<br/>Fundamental Indexation<br/>Outperforms Cap-Weighting"]
    E --> F["Outcome:<br/>Rejection of passive indexing<br/>as mean-variance efficient"]
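The contrast the paper draws between capitalization weights and fundamental weights reduces to which firm measure you normalize. A toy sketch, using book value as one example of a fundamental measure and made-up numbers:

```python
# Toy comparison of cap weights vs. fundamental (here book-value) weights.
def normalize(values: dict) -> dict:
    total = sum(values.values())
    return {k: v / total for k, v in values.items()}

market_cap = {"A": 900.0, "B": 300.0, "C": 100.0}  # illustrative, $bn
book_value = {"A": 200.0, "B": 180.0, "C": 120.0}  # illustrative, $bn

print(normalize(market_cap))  # cap weights: ~0.69, 0.23, 0.08
print(normalize(book_value))  # fundamental weights: 0.40, 0.36, 0.24
```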

May 5, 2005 · 1 min · Research Team

Fundamental Indexation

Fundamental Indexation ArXiv ID: ssrn-604842 “View on arXiv” Authors: Unknown
Abstract A trillion-dollar industry is based on investing in or benchmarking to capitalization-weighted indexes, even though the finance literature rejects the mean-variance ...
Keywords: capitalization-weighted indexes, mean-variance, passive investing, benchmarking, portfolio optimization, Equities
Complexity vs Empirical Score: Math Complexity 4.0/10 · Empirical Rigor 8.0/10 · Quadrant: Street Traders
Why: The paper involves moderate mathematical finance concepts like portfolio optimization and benchmark analysis, but it is heavily data-driven, featuring extensive backtesting, real-world index performance comparisons, and discussion of implementation for a trillion-dollar industry.
Flowchart (Mermaid):
flowchart TD
    A["Research Goal<br>Test: Does capitalization weighting<br>violate mean-variance efficiency?"] --> B["Methodology<br>Constrained Optimization<br>vs. Capitalization Weighting"]
    B --> C["Input: Historical Returns<br>U.S. Large Cap Equities"]
    C --> D["Computational Process<br>Maximize Sharpe Ratio<br>Under Optimization Constraints"]
    D --> E{"Key Finding 1: Efficiency<br>Optimal Portfolio Sharpe Ratio<br>> Cap-Weighted Portfolio?"}
    E -- Yes --> F["Outcome: Cap-weighting is<br>Mean-Variance Inefficient"]
    E -- No --> G["Outcome: Cap-weighting is<br>Mean-Variance Efficient"]
    F --> H["Key Finding 2: Performance<br>Fundamental Indexation<br>Outperforms Cap-Weighting"]
    G --> H
    H --> I["Key Takeaway<br>Trillion-dollar cap-weighted industry<br>is suboptimal vs. optimized portfolios"]
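The Sharpe-ratio comparison in the flowchart above can be sketched directly: if an alternative weighting earns a higher Sharpe ratio than the cap-weighted portfolio, cap-weighting is not mean-variance efficient. Returns below are simulated, not the paper's data.

```python
# Compare annualized Sharpe ratios of two simulated monthly return series.
import numpy as np

def sharpe(returns: np.ndarray, rf: float = 0.0, periods: int = 12) -> float:
    excess = returns - rf / periods
    return float(np.sqrt(periods) * excess.mean() / excess.std(ddof=1))

rng = np.random.default_rng(7)
cap_weighted = rng.normal(0.006, 0.04, size=240)  # toy cap-weighted returns
fundamental = rng.normal(0.008, 0.04, size=240)   # toy fundamental-index returns

print("cap-weighted Sharpe:", round(sharpe(cap_weighted), 2))
print("fundamental Sharpe: ", round(sharpe(fundamental), 2))
```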

October 15, 2004 · 1 min · Research Team