Towards Competent AI for Fundamental Analysis in Finance: A Benchmark Dataset and Evaluation
ArXiv ID: 2506.07315
Authors: Zonghan Wu, Congyuan Zou, Junlin Wang, Chenhan Wang, Hangjing Yang, Yilei Shao
Abstract
Generative AI, particularly large language models (LLMs), is beginning to transform the financial industry by automating tasks and helping to make sense of complex financial information. One especially promising use case is the automatic creation of fundamental analysis reports, which are essential for making informed investment decisions, evaluating credit risk, and guiding corporate mergers. While LLMs can attempt to generate these reports from a single prompt, the risk of inaccuracy is significant: poor analysis can lead to misguided investments, regulatory issues, and loss of trust. Existing financial benchmarks mainly evaluate how well LLMs answer financial questions but do not reflect performance on real-world tasks such as generating financial analysis reports. In this paper, we propose FinAR-Bench, a rigorous benchmark dataset focused on financial statement analysis, a core competence of fundamental analysis. To make the evaluation more precise and reliable, we break this task into three measurable steps: extracting key information, calculating financial indicators, and applying logical reasoning. This structured approach allows us to objectively assess how well LLMs perform each step of the process. Our findings offer a clear picture of LLMs' current strengths and limitations in fundamental analysis and provide a more practical way to benchmark their performance in real-world financial settings.
Keywords: Large Language Models (LLMs), Fundamental Analysis, Financial Statement Analysis, Benchmarking, Natural Language Processing, Equities
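To make the abstract's second step concrete, the "calculating financial indicators" stage reduces to applying standard ratio formulas to line items extracted in step one. Below is a minimal Python sketch under illustrative assumptions: the field names and the indicator set are hypothetical, not the paper's actual schema or indicator list.

```python
# Hypothetical sketch of the "calculate financial indicators" step.
# Field names and the indicator set are illustrative assumptions,
# not FinAR-Bench's actual schema.

def compute_indicators(statement: dict) -> dict:
    """Derive common ratios from extracted financial-statement line items."""
    return {
        # Liquidity: can current assets cover current liabilities?
        "current_ratio": statement["current_assets"] / statement["current_liabilities"],
        # Profitability: net income relative to shareholders' equity.
        "return_on_equity": statement["net_income"] / statement["shareholders_equity"],
        # Margin: share of revenue left after cost of goods sold.
        "gross_margin": (statement["revenue"] - statement["cogs"]) / statement["revenue"],
    }

# Values extracted in step 1 feed the calculation in step 2; the resulting
# indicators would then ground the logical-reasoning step 3.
extracted = {
    "current_assets": 1_200.0,
    "current_liabilities": 800.0,
    "net_income": 150.0,
    "shareholders_equity": 1_000.0,
    "revenue": 2_000.0,
    "cogs": 1_300.0,
}
print(compute_indicators(extracted))
# {'current_ratio': 1.5, 'return_on_equity': 0.15, 'gross_margin': 0.35}
```

Because each indicator is a deterministic function of the extracted values, this step can be scored objectively, which is what makes the decomposition useful for benchmarking.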
Complexity vs Empirical Score
- Math Complexity: 2.0/10
- Empirical Rigor: 8.5/10
- Quadrant: Street Traders
- Why: The paper introduces a benchmark dataset (FinAR-Bench) for evaluating LLMs on financial statement analysis, focusing on implementing and measuring specific subtasks (extraction, calculation, reasoning) with real-world financial data. The work is highly empirical and implementation-heavy, while the mathematical content is limited to basic financial formulas with no advanced derivations.
flowchart TD
A["Research Goal<br/>'Measure LLM Competence in<br/>Fundamental Financial Analysis'"] --> B["Propose FinAR-Bench<br/>Benchmark Dataset"]
B --> C{"Decompose Task into<br/>3 Measurable Steps"}
C --> D["Step 1: Extract Key<br/>Financial Information"]
C --> E["Step 2: Calculate<br/>Financial Indicators"]
C --> F["Step 3: Apply Logical<br/>Reasoning & Synthesis"]
D & E & F --> G["Compute Step-wise<br/>& Overall Accuracy Metrics"]
G --> H["Key Findings<br/>- LLMs excel at extraction<br/>- Struggle with calculation accuracy<br/>- Limited logical reasoning<br/>- Provides practical<br/>benchmarking methodology"]