Evaluating Large Language Models (LLMs) in Financial NLP: A Comparative Study on Financial Report Analysis

ArXiv ID: 2507.22936

Author: Md Talha Mohsin

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide variety of Financial Natural Language Processing (FinNLP) tasks. However, systematic comparisons among widely used LLMs remain underexplored. Given the rapid advancement and growing influence of LLMs in financial analysis, this study conducts a thorough comparative evaluation of five leading LLMs (GPT, Claude, Perplexity, Gemini, and DeepSeek) using 10-K filings from the ‘Magnificent Seven’ technology companies. We create a set of domain-specific prompts and then evaluate model performance with three methodologies: human annotation, automated lexical-semantic metrics (ROUGE, cosine similarity, Jaccard), and model behavior diagnostics (prompt-level variance and across-model similarity). The results show that GPT gives the most coherent, semantically aligned, and contextually relevant answers, followed by Claude and Perplexity. Gemini and DeepSeek, by contrast, show greater output variability and lower agreement. The similarity and stability of outputs also vary across companies and over time, indicating that results are sensitive to prompt wording and source material.
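The paper does not publish code, but the automated metrics it names are standard enough to sketch. The Python snippet below is a minimal illustration of scoring a model answer against a human-annotated reference; the library choices (rouge-score, scikit-learn) and the use of TF-IDF vectors for cosine similarity are assumptions, since the paper does not specify its tooling.

```python
# Minimal sketch of the paper's automated metrics: ROUGE, cosine
# similarity, and Jaccard overlap between a reference answer and a
# model answer. Library choices (rouge-score, scikit-learn) are
# assumptions; the paper does not name its tooling.
from rouge_score import rouge_scorer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercased token sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def score_answer(reference: str, candidate: str) -> dict:
    # ROUGE-1 and ROUGE-L F1 measure n-gram / subsequence overlap.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    rouge = scorer.score(reference, candidate)
    # TF-IDF cosine similarity as a cheap semantic-alignment proxy.
    tfidf = TfidfVectorizer().fit_transform([reference, candidate])
    cos = float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])
    return {
        "rouge1_f": rouge["rouge1"].fmeasure,
        "rougeL_f": rouge["rougeL"].fmeasure,
        "cosine": cos,
        "jaccard": jaccard(reference, candidate),
    }


print(score_answer(
    "Revenue grew 8% year over year, driven by cloud and advertising.",
    "Cloud and advertising drove an 8% year-over-year revenue increase.",
))
```

ROUGE and Jaccard reward surface overlap, while cosine similarity tolerates some rewording, which is presumably why the study pairs all three with human annotation.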

Keywords: Financial Natural Language Processing (FinNLP), Model Comparative Evaluation, ROUGE/Cosine Similarity Metrics, 10-K Filings Analysis, Large Language Models (LLMs), Equity (Technology Sector)

Complexity vs Empirical Score

  • Math Complexity: 2.5/10
  • Empirical Rigor: 8.0/10
  • Quadrant: Street Traders
  • Why: The paper’s mathematics is minimal, relying on standard lexical-overlap and similarity metrics such as ROUGE and cosine similarity, but its empirical rigor is high: it systematically compares five LLMs on real financial data (10-K filings) with domain-specific prompting and multiple evaluation methodologies.
Paper overview (Mermaid flowchart):

```mermaid
flowchart TD
  A["Research Goal:<br>Evaluate LLMs on Financial NLP<br>for 10-K Report Analysis"] --> B["Data Source:<br>10-K Filings from<br>'Magnificent Seven' Tech Firms"]
  B --> C["Methodology:<br>3-Pronged Evaluation"]
  C --> D["Human Annotation<br>(Qualitative Assessment)"]
  C --> E["Automated Metrics<br>ROUGE / Cosine / Jaccard"]
  C --> F["Diagnostics<br>Variance & Similarity"]
  D & E & F --> G["Key Findings"]
  G --> H1["GPT: Best Coherence & Alignment"]
  G --> H2["Claude/Perplexity: Moderate Performance"]
  G --> H3["Gemini/DeepSeek: High Variability"]
  G --> H4["Performance Sensitive to Prompting & Context"]
```